Optimization and Performance Engineering for Scientific Applications (x86-64)
ByoungDo Kim, TACC, June 12, 2009
The University of Texas at Austin / Texas Advanced Computing Center

Outline
- Introduction
- Compiler Options
- Performance Libraries
- Code Optimizations

General Optimization Procedure
- Optimization starts in code design and development; it requires an understanding of common architecture features and a sense of how compilers map code to instructions.
- Optimization is an iterative process: profile the code, work on the most time-intensive blocks, and repeat.
[Flowchart: profile -> tune the most time-intensive section -> re-evaluate; stop when performance is sufficient or the time commitment grows too large, otherwise increase the effort and repeat.]

Compiler Options
Three important categories:
- Optimization level
- Architecture specification
- Interprocedural optimization
You should always have at least one option from each category.

Compilers and Optimization
- Compilers can perform significant optimization, but the compiler follows your lead: structure code so that what the compiler should do is apparent, and so that both compilers and other people can understand it. Use simple language constructs (e.g., avoid pointers and object-oriented constructs in hot loops).
- Use the latest compilers, and always check the compiler options: <compiler_command> -help lists and explains the options.
- Look for architecture options for your system: the User Guides usually list the best-practice options, and cat /proc/cpuinfo shows CPU information.
- Experiment with different options; some routines may need routine-specific options (use -ipo).

Optimization Level (-On)
- -O0: no optimization; fast compilation, optimization disabled.
- -O1: optimize for speed, but disable optimizations that increase code size.
- -O2: default optimization.
- -O3: aggressive optimization; rearrangement of code (e.g., scalar replacement, loop transformations). Compile-time/space intensive and/or of marginal effectiveness; may change code semantics and results, and sometimes even breaks codes.

Optimization Levels (what they enable)
- At the default optimization level: instruction rescheduling, copy propagation, software pipelining, common subexpression elimination, prefetching, some loop transformations.
- At aggressive optimization levels (usually enabled by -O3): more aggressive prefetching and loop transformations.

Architecture Specification
- x87 instructions have been replaced by the SSE vector instruction sets (SSSE = Supplemental Streaming SIMD Extension).
- SSE instruction sets are chip dependent.
- SSE instructions pipeline and simultaneously execute independent operations to get multiple results per clock period.
- The -x<code> options (codes W, P, T, O, S) direct the compiler to use the most advanced SSE instruction set for the target hardware.
- Intel SSSE is for Intel chips only.
- Processor-specific optimization options (all enable SSE and SSE2):
  - -xT: includes SSE3 and SSSE3 instructions for EM64T (Lonestar, compiler v10.1)
  - -xW: no supplemental instructions (Ranger, v10.1)
  - -xO: includes SSE3 instructions (Ranger, v10.1)
- PGI: -tp barcelona-64 uses the instruction set for the Barcelona chip.

Interprocedural Optimization (IP)
- Most compilers will handle IP within a single file (option -ip).
- The Intel -ipo option does more: it adds additional information to each object file, and during linking the code is recompiled and IP among ALL objects is performed. This may take much more time, since the code is recompiled during linking.
- It is important to include the options on the link command (-ipo, -O3, -xW, etc.); the special Intel xild loader replaces ld.
- When archiving in a library, you must use xiar instead of ar.
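Because -ipo defers the real work to link time, its main benefit is optimization (for example, inlining) across source files. The sketch below, with hypothetical file names and a toy helper function, illustrates the situation -ipo is meant for; the build commands in the trailing comment simply combine the flags discussed above.

  /* dist.c -- small helper that main.c cannot inline without IPO */
  double dist2(const double *x, const double *x0, int n)
  {
      double r = 0.0;
      int j;
      for (j = 0; j < n; j++)
          r += (x[j] - x0[j]) * (x[j] - x0[j]);
      return r;
  }

  /* main.c -- hot loop that calls the helper many times */
  #include <stdio.h>
  double dist2(const double *x, const double *x0, int n);

  int main(void)
  {
      double x[2] = {1.0, 2.0}, x0[2] = {0.0, 0.0}, r = 0.0;
      long i;
      for (i = 0; i < 10000000L; i++)
          r += dist2(x, x0, 2);   /* candidate for cross-file inlining */
      printf("%f\n", r);          /* keep r live so the loop is not removed */
      return 0;
  }

  /* Hypothetical build: carry -ipo (and the other options) on both the compile and link lines:
   *   icc -O3 -xW -ipo -c dist.c main.c
   *   icc -O3 -xW -ipo dist.o main.o -o prog
   */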
Interprocedural Optimization (IP) options
- Intel:
  - -ip: enable single-file interprocedural (IP) optimizations (within files); line numbers are produced for debugging.
  - -ipo: enable multi-file IP optimizations (between files).
- PGI:
  - -Mipa=fast,inline: interprocedural optimization.

Other Intel Compiler Options
- -g: debugging information; generates a symbol table.
- -vec-report[0-5]: controls vector diagnostic reporting.
- -C: enable extensive runtime error checking (-CA, -CB, -CS, -CU, -CV).
- -convert <kwd>: specify file format; keywords: big_endian, cray, ibm, little_endian, native, vaxd.
- -openmp: enable the parallelizer to generate multi-threaded code based on the OpenMP directives.
- -openmp-report[0-2]: controls the level of OpenMP diagnostic reporting.
- -static: create a static executable (serial applications only; MPI applications compiled on Lonestar cannot be built statically).

Other PGI Compiler Options
- -fast: equivalent to -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz.
- -mp: thread generation for OpenMP directives.
- -Minfo=mp,ipa: OpenMP/interprocedural optimization reporting.

Compilers: Best Practice
Normal compiling for Ranger:
- Intel: icc / ifort -O3 -ipo -xW prog.c (.f90)
- PGI: pgcc / pgcpp / pgf95 -fast -tp barcelona-64 -Mipa=fast,inline prog.c (.f90)
- GNU: gcc -O3 -ffast-math -mtune=barcelona -march=barcelona prog.c
Notes:
- -O2 is the default optimization level; compile with -O0 if optimization breaks the code (very rare).
- The effects of the -xW and -xO options may vary.
- Don't include debug options in a production compile (e.g., ifort -O2 -g ... test.c).

Performance Libraries
- Optimized for specific architectures.
- Use library routines instead of hand-coding your own; in hot spots, never write library functions by hand.
- Offered by different vendors: ESSL/PESSL on IBM systems, Intel MKL for x86-64, AMD ACML, Cray libsci for Cray systems, SCSL for SGI.
- The Numerical Recipes books do NOT provide optimized code; libraries can be 100x faster.

Linux x86-64 (Lonestar/Ranger) Libraries and 3rd-Party Applications
- Performance tools: gprof, TAU, PAPI, DDT
- I/O: NetCDF, HDF4/5, parallel I/O, GridFTP
- Math libraries: MKL, GotoBLAS, GSL, ScaLAPACK, FFTW2/3, SPRNG, Metis/ParMetis
- Method libraries: PETSc, PLAPACK, SLEPc
- Applications: Amber, NAMD, GROMACS, Gamess, NWchem

Intel MKL 10.0 (Math Kernel Library)
- Optimized for the IA-32, x86-64, and IA-64 architectures; supports both Fortran and C interfaces.
- Includes functions in the following areas: BLAS (levels 1-3), LAPACK, FFT routines, the Vector Math Library (VML), and others.

Enabling MKL:
  module load mkl
  module help mkl
Example compile:
  mpicc -I$TACC_MKL_INC mkl_test.c -L$TACC_MKL_LIB -lmkl <...>
  mpif90 mkl_test.f90 -L$TACC_MKL_LIB -lmkl <...>
(A small C sketch of calling an MKL routine appears after the memory-hierarchy notes below.)

Code Optimization
- Always minimize stride length. Stride length 1 is optimal for vectorizable code; it increases cache efficiency and sets up hardware and software prefetching. Strides that are powers of two are typically the worst-case scenario, leading to cache misses.
- Strive to write vectorizable loops: they can be sent to a SIMD unit, can be unrolled and pipelined, can be parallelized through OpenMP directives, and can be automatically parallelized (be careful).
- SIMD hardware: G4/G5 Velocity Engine, Intel/AMD MMX/SSE/SSE2/SSE3, Cray vector units.
- Write loops with independent iterations so that SSE instructions can be employed. SIMD = Single Instruction, Multiple Data; SSE (Streaming SIMD Extensions) instructions operate on multiple data arguments simultaneously.

Approximate Memory Bandwidths and Sizes
[Diagram of relative memory sizes and bandwidths: functional units and registers on the processor; L1 cache (16-32 KB) and L2 cache (~1 MB) on die with latencies of roughly 5-15 CP; off-die L3 cache; local memory (~1 GB and up); bandwidths on the order of 8-12 GB/s are shown for the outer levels.]
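As a concrete example of using a performance library instead of hand-coded loops, here is a minimal sketch that calls MKL's level-3 BLAS routine dgemm through the CBLAS interface (the matrix size is arbitrary, and the exact header and link line depend on the MKL installation; see the module help above):

  #include <stdio.h>
  #include <mkl_cblas.h>               /* CBLAS prototypes shipped with MKL */

  #define N 512

  static double A[N*N], B[N*N], C[N*N];

  int main(void)
  {
      int i;
      for (i = 0; i < N*N; i++) {      /* simple, verifiable fill */
          A[i] = 1.0;
          B[i] = 2.0;
          C[i] = 0.0;
      }

      /* C = 1.0*A*B + 0.0*C, row-major storage */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  N, N, N, 1.0, A, N, B, N, 0.0, C, N);

      printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0 * N);
      return 0;
  }

The single cblas_dgemm call replaces a hand-written triple loop; the tuned library version will almost always be substantially faster.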
Code Optimization: Inlining
When is inlining important?
- When the function is a hot spot.
- When the call-overhead-to-work ratio is high.
- When it can benefit from interprocedural optimization.
The C inline keyword provides inlining within a source file. As you develop, think about inlining, and use -ip or -ipo to allow the compiler to inline.

Example: procedure inlining.

  program main
    integer, parameter :: ndim = 2, niter = 10000000
    real*8  :: x(ndim), x0(ndim), r
    integer :: i
    do i = 1, niter
      r = dist(x, x0, ndim)
    end do
  end program

  real*8 function dist(x, x0, n)
    integer :: j, n
    real*8  :: x(n), x0(n), r
    r = 0.0
    do j = 1, n
      r = r + (x(j) - x0(j))**2
    end do
    dist = r
  end function
  ! function dist is called niter times

After inlining:

  program main
    integer, parameter :: ndim = 2, niter = 10000000
    real*8  :: x(ndim), x0(ndim), r
    integer :: i, j
    do i = 1, niter
      r = 0.0
      do j = 1, ndim
        r = r + (x(j) - x0(j))**2
      end do
    end do
  end program
  ! function dist is expanded inline inside the i loop; the j loop body now runs niter times with no call overhead

Code Optimization: Stride-1 Access
The following snippets illustrate the correct way to access contiguous elements (i.e., stride 1) for a matrix in both Fortran and C.

Fortran example (column-major: the first index should vary fastest):

  real*8 :: a(m,n), b(m,n), c(m,n)
  do i = 1, n
    do j = 1, m
      a(j,i) = b(j,i) * c(j,i)
    end do
  end do

C example (row-major: the last index should vary fastest):

  double a[m][n], b[m][n], c[m][n];
  for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
      a[i][j] = b[i][j] * c[i][j];

Also, for large and small arrays, always try to arrange the data so that structures are arrays with a unit (1) stride.
[Plot: performance of strided access (effective bandwidth versus stride); performance drops sharply as the stride grows beyond 1.]

Code Optimization: Loop Interchange
Loop interchange can help, for example by turning a matrix-vector product with a strided inner loop into a DAXPY-style, unit-stride inner loop:

  integer, parameter :: nkb = 16, kb = 1024, n = nkb*kb/8
  real*8 :: x(n), a(n,n), y(n)
  do i = 1, n
    s = 0.0
    do j = 1, n
      s = s + a(i,j) * x(j)
    end do
    y(i) = s
  end do

becomes

  integer, parameter :: nkb = 16, kb = 1024, n = nkb*kb/8
  real*8 :: x(n), a(n,n), y(n)
  do j = 1, n
    do i = 1, n
      y(i) = y(i) + a(i,j) * x(j)
    end do
  end do

Code Optimization: Array Blocking
The objective of array blocking is to work with small array blocks when expressions contain mixed-stride operations. It uses complete cache lines when they are brought in from memory and hence avoids the evictions that would otherwise occur without blocking.

  do i = 1, n
    do j = 1, n
      a(j,i) = b(i,j)
    end do
  end do

becomes

  do i = 1, n, 2
    do j = 1, n, 2
      a(j  ,i  ) = b(i  ,j  )
      a(j+1,i  ) = b(i  ,j+1)
      a(j  ,i+1) = b(i+1,j  )
      a(j+1,i+1) = b(i+1,j+1)
    end do
  end do

Array blocking applied to matrix multiplication:

  real*8 :: a(n,n), b(n,n), c(n,n)
  do ii = 1, n, nb
    do jj = 1, n, nb
      do kk = 1, n, nb
        do i = ii, min(n, ii+nb-1)
          do j = jj, min(n, jj+nb-1)
            do k = kk, min(n, kk+nb-1)
              c(i,j) = c(i,j) + a(j,k) * b(k,i)
            end do
          end do
        end do
      end do
    end do
  end do

[Plot: performance of nb x nb blocked matrix multiplication versus matrix dimension (results from an older system).]
Much more efficient implementations exist in HPC scientific libraries (ESSL, MKL, ACML).

Even low-stride access is effective when the data being accessed are in cache.
[Plot: performance of strided access for in-cache data (loop: sum = sum + data(i)); bandwidth remains high for small strides.]

Code Optimization: Vector Math Functions
In some cases an entire loop can be replaced with a single call to a vector function. For example, the loops below can be written as calls to vdInvSqrt and vdSinCos in the Intel VML:

  for (i = 0; i < n; i++)                 ->  vdInvSqrt(n, x, y);
    y[i] = 1.0 / sqrt(x[i]);

  for (i = 0; i < n; i++)                 ->  vdSinCos(n, x, s, c);
    y[i] = a*sin(x[i]) + b*cos(x[i]);         for (i = 0; i < n; i++)
                                                y[i] = a*s[i] + b*c[i];

But how do you make something like this portable? Use #ifdef in C and Fortran 90:

  program main
    integer, parameter :: n = 100, nn = 2*n, nap = nn*(nn+1)/2
    real*8,  parameter :: xmax = 200.0, xmin = -xmax
  #if defined(IBM)
    integer, parameter :: iopt = 20, naux = 3*nn
    real*8  :: ap(nap), eval(nn), evec(nn,nn), work(naux)
  #elif defined(IA32)
    real*8  :: ap(nap), eval(nn), evec(nn,nn), work(3*nn)
    integer :: info
  #endif
    ! ... set up the packed matrix ap ...
  #if defined(IBM)
    call DSPEV(iopt, ap, eval, evec, nn, nn, work, naux)      ! ESSL
  #elif defined(IA32)
    call DSPEV('N', 'U', nn, ap, eval, evec, nn, work, info)  ! LAPACK
  #endif
  end program
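Along the same lines, here is a minimal C sketch of making a VML call portable with a preprocessor guard: when a (hypothetical) USE_MKL_VML macro is defined, the code calls MKL's vdInvSqrt, and otherwise it falls back to a plain loop. The macro name and the small driver are illustrative only.

  #include <math.h>
  #include <stdio.h>
  #ifdef USE_MKL_VML
  #include <mkl_vml.h>                 /* vdInvSqrt is part of MKL's VML */
  #endif

  /* y[i] = 1.0 / sqrt(x[i]) for i = 0 .. n-1 */
  static void inv_sqrt(int n, const double *x, double *y)
  {
  #ifdef USE_MKL_VML
      vdInvSqrt(n, x, y);              /* one vectorized library call */
  #else
      int i;
      for (i = 0; i < n; i++)          /* portable fallback */
          y[i] = 1.0 / sqrt(x[i]);
  #endif
  }

  int main(void)
  {
      double x[4] = {1.0, 4.0, 9.0, 16.0}, y[4];
      inv_sqrt(4, x, y);
      printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);   /* 1.0 0.5 0.333... 0.25 */
      return 0;
  }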
Code Optimization: Loop Fusion
Loop fusion combines two or more loops over the same iteration space (loop length) into a single loop:

  for (i = 0; i < n; i++)
    a[i] = x[i] + y[i];
  for (i = 0; i < n; i++)
    b[i] = 1.0/x[i] + z[i];

becomes

  for (i = 0; i < n; i++) {
    a[i] = x[i] + y[i];
    b[i] = 1.0/x[i] + z[i];
  }

- Only n memory accesses for the x array (instead of 2n).
- Five streams are created (x, y, z, a, b).
- The division may not be pipelined; it is costly, at least ~30 CP.

Code Optimization: Loop Fission
The opposite of loop fusion is loop distribution, or fission. Fission splits a single loop with independent operations into multiple loops:

  do i = 1, n
    a(i) = b(i)*c(i) + d(i)
    e(i) = f(i)*g(i) + h(i)*p(i)
    o(i) = r(i)*s(i)
  end do

becomes

  do i = 1, n
    a(i) = b(i)*c(i) + d(i)
  end do
  do i = 1, n
    e(i) = f(i)*g(i) + h(i)*p(i)
  end do
  do i = 1, n
    o(i) = r(i)*s(i)
  end do

References

Books:
- "High Performance Computing", Kevin Dowd and Charles Severance, O'Reilly. A general study of high performance computing.
- "Performance Optimization of Numerically Intensive Codes", Stefan Goedecker and Adolfy Hoisie, SIAM (Society for Industrial and Applied Mathematics).

TACC User Guides:
- www.tacc.utexas.edu/services/userguides/ranger
- www.tacc.utexas.edu/services/userguides/lonestar

Compilers:
- www.intel.com/cd/software/products/asmo-na/eng/compilers/278607.htm
- www.intel.com/cd/software/products/asmo-na/eng/compilers/279831.htm
- www.pgroup.com/doc/pgiug.pdf

Optimization:
- http://cache-www.intel.com/cd/00/00/21/92/219281_compiler_optimization.pdf

Libraries:
- GotoBLAS (dense and band matrix software): www.tacc.utexas.edu/resources/software
- ScaLAPACK: www.netlib.org/scalapack
- Large sparse eigenvalue software (PARPACK and ARPACK): www.caam.rice.edu/software/ARPACK