These 83 pages of class notes were uploaded by Haylie Satterfield on Saturday, September 19, 2015. The notes belong to CISC 879 at the University of Delaware, taught by Staff in Fall. For similar materials see /class/207175/cisc879-university-of-delaware in Computer Information Technology at the University of Delaware.
CISC 879 Software Support for Multicore Architectures, Spring 2007
Lecture 4: February 21
Lecturer: Xiaoming Li
Scribes: Scott Grauer-Gray, Pujan Kafle

4.1 Intro to CUDA and GPU Programming

Professor Xiaoming Li of the Electrical and Computer Engineering department presented an introduction to CUDA, a language used for GPU programming. He discussed the CUDA programming model, the CUDA memory model, the CUDA API, and then went on to give an example CUDA program and an application of GPU processing and CUDA in his own research with GRAVIT.

4.1.1 CUDA Programming Model

Ideas of the CUDA programming model:
- CUDA is an abbreviation for Compute Unified Device Architecture
- makes general-purpose programming for the GPU easier
- don't need to worry about graphics processing
- it is a bad idea to compute all instructions on the GPU; branching is not good on the GPU

The GPU (graphics processing unit) is a compute device with its own memory (DRAM), used to execute the portion of an application that:
- must be executed many times
- can be isolated as a function, called a kernel
- works independently on different data

Threads that execute the kernel are organized as a grid of thread blocks.

Thread block:
- the number of threads in a block is limited to 512
- threads have access to high-speed shared memory
- execution can be synchronized between threads in a block
- if threads are not in the same block, communication must go through slower global memory
- threads in a block have IDs and can be indexed in 1, 2, or 3 dimensions; the choice does not make a difference in execution time but can simplify programming when processing multidimensional data in applications like image processing, solving PDEs on volumes, etc.

Thread grid:
- consists of multiple thread blocks of the same dimensions and size
- all threads in a grid execute the same kernel
- threads in different blocks cannot communicate and synchronize with each other
- blocks have IDs and can be indexed in 1 or 2 dimensions; the choice does not make a difference in execution time but can simplify programming

4.1.2 CUDA
Memory Model

Figure 4.1: Illustration of the CUDA memory model.

Ideas of the memory model:
- the memory hierarchy of the GPU is different from that of the CPU
- need to be careful about memory use to speed up programs
- global memory is slow; constant and texture memory are read-only for the device

Shared memory:
- shared between threads in the same block
- much faster than local/global memory; as fast as registers when there are no bank conflicts
- divided into memory banks of equal size
- best cases: a one-to-one mapping between threads and memory banks (see Figure 4.2), or all threads accessing the same bank
- worst case: two threads accessing every memory bank
- not all details are given as to how memory accesses are serialized

Figure 4.2: One-to-one mapping between threads and memory banks.

- blocks processed by a multiprocessor in one batch are called active
- active blocks are split into SIMD groups of threads called warps; the warp size is 32 threads
- the amount of shared memory per multiprocessor is 16 KB, organized into 16 banks
- a memory request therefore requires two cycles per warp

4.1.3 CUDA API

CUDA is an extension to C.

Function type qualifiers specify the location of execution of a function:
- __device__: function executed on the device, only callable from the device
- __global__: function executed on the device, only callable from the host
- __host__: function executed on the host, only callable from the host

Variable type qualifiers specify the memory location on the device:
- __device__: variable in global memory space; has the lifetime of the application; accessible from all threads in the grid and from the host
- __constant__: variable in constant memory space; has the lifetime of the application; accessible from all threads in the grid and from the host
- __shared__: variable in the shared memory space of a thread block; has the lifetime of the block; only accessible from threads within the block

Execution configuration:
- must be given for calls to __global__ functions
- defines the dimensions of the grid and blocks
- specified between the function name and the argument list

Example: the function __global__ void Func(float* parameter) must be called like Func<<< Dg, Db, Ns
>>>(parameter), where:
- Dg gives the dimension and size of the grid
- Db gives the dimension and size of each block
- Ns is an optional argument that gives the number of bytes dynamically allocated in shared memory (default is 0)

Built-in variables:
- gridDim (type dim3): dimensions of the grid
- blockIdx (type uint3): block index within the grid
- blockDim (type dim3): dimensions of the block
- threadIdx (type uint3): thread index within the block

4.1.4 Example CUDA Program

Calculation of the scalar product of 32 vector pairs of 4096 elements each:
- the calculation is organized as a grid of 32 blocks with 256 threads/block, so 4096/256 = 16 slices/vector
- data is given to the device as two data arrays, and results are saved in a result array
- each product of a vector pair is calculated in slices that are added up for the final result

Host program: Figure 4.3 shows the scalar-product calculation using CUDA. The scanned listing is largely illegible, but it allocates the device arrays with cudaMalloc, copies the inputs to the device with cudaMemcpy(..., cudaMemcpyHostToDevice), launches the kernel, and copies the result back with cudaMemcpy(..., cudaMemcpyDeviceToHost).

Kernel (the function that is executed on the GPU), reconstructed from the partially legible listing:

    __global__ void scalarProdGPU(float *dC, float *dA, float *dB) {
        const int I = threadIdx.x;
        const int vecn = blockIdx.x;     // one block per vector pair
        __shared__ float t[THREADN];     // products of THREADN threads for current slice
        __shared__ float r[SLICEN];      // stores result of each slice

        int base = ELEMENTN * vecn;

        // run on every slice of the input vectors
        for (int slice = 0; slice < SLICEN; slice++, base += THREADN) {
            // each thread calculates one product
            t[I] = dA[base + I] * dB[base + I];
            __syncthreads();

            // calculate the result for the slice by summing all the values
            // in the float array t; the result will be stored in t[0]
            for (int stride = THREADN / 2; stride > 0; stride /= 2) {
                if (I < stride)
                    t[I] += t[stride + I];
                __syncthreads();
            }
            if (I == 0)
                r[slice] = t[0];    // save partial result of slice
        }

        // add up the results for all slices
        for (int stride = SLICEN / 2; stride > 0; stride /= 2) {
            if (I < stride)
                r[I] += r[stride + I];
            __syncthreads();
        }

        // save result to device memory
        if (I == 0)
            dC[vecn] = r[0];
    }

4.1.5 GRAVIT
Example application of GPU processing and CUDA in Dr. Li's lab: GRAVIT.
- used in physics simulations
- each thread calculates the forces on a single particle
- n^2 algorithm

GRAVIT on the GPU:
- an ideal algorithm for the GPU
- divide global memory between position/mass and velocity because of different access patterns (see Figure 4.4)
- sets of particles are easily divided into blocks
- each block steps through all particles in slices and mirrors them into shared memory (see Figure 4.4)
- GPU baseline speedup: approximately 60X
- the random nature of the initial particles causes unpredictable memory access patterns, causing an unpredictable degree of speedup

Figure 4.4: Memory use in GRAVIT (Block 1 and Block 2 stage positions and masses in shared memory).

Optimization:
- optimization using spatial subdivision; the idea is to improve the locality of memory accesses
- problems with spatial subdivision include recursion, inter-block communication, unknown sizes of result sets, and memory usage
- approaches to spatial subdivision show potential, but optimizations are hard to predict in advance; one problem is handling dynamic results in global memory
- CUDA is easy to pick up, but hard to push to the limit

CISC 879 Software Support for Multicore Architectures, Spring 2007
Lecture 13: March 25
Lecturer: John Cavazos
Scribe: Ying Yu

13.1 Bryan Youse: Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms

13.1.1 Basic Idea

There are two kinds of matrix: sparse matrices and dense matrices. Sparse matrix-vector multiplication (SpMV) is tested on several emerging multicore designs.

13.1.2 SpMV Overview

Data structure: the compressed sparse row (CSR) format. A naive SpMV implementation (y += A*x, where A is in CSR), reconstructed from the garbled listing:

    for (i = 0; i < m; i++) {
        double y0 = y[i];
        for (k = ptr[i]; k < ptr[i+1]; k++)
            y0 += val[k] * x[ind[k]];
        y[i] = y0;
    }

13.1.3 SpMV Optimizations

Goal: as much autotuning as possible.

Optimizations:

Thread blocking. The first phase in the optimization process is exploiting thread-level parallelism. In this paper we only exploit row partitioning. The matrix is partitioned into NThreads
thread blocks, which may in turn be individually optimized. There are three approaches to partitioning the matrix: by row blocks, by column blocks, and into segments. In both row and column parallelization, the matrix is explicitly blocked to exploit NUMA systems.

Cache and local-store blocking. For sufficiently large matrices, we first quantify the number of cache lines available for blocking and span enough columns such that the number of source-vector cache lines touched is equal to those available for cache blocking. Using this approach allows each cache block to touch the same number of cache lines, even though the blocks span vastly different numbers of columns.

TLB blocking. TLB misses can vary by an order of magnitude depending on the blocking strategy, highlighting the importance of TLB blocking.

Register blocking and format selection. The next phase of the implementation is to optimize the SpMV data format. Register blocking groups adjacent nonzeros into rectangular tiles with only one coordinate index per tile. Key point: for memory-bound multicore applications, we believe that minimizing the memory footprint is more effective than improving single-thread performance.

Index size selection: 16-bit integers to reduce memory traffic.

Architecture-specific kernels through autotuning.

SIMDization: the xlc static timing analyzer provides information on whether cycles are spent in instruction issue, double-precision issue stalls, or stalls for data hazards, thus simplifying the process of kernel optimization.

Loop optimizations: CSR data storage means that the column and value arrays are accessed in a streaming, unit-stride fashion. We can explicitly software-pipeline the code to hide any further instruction latency. The code can be further optimized using a branchless implementation, which is in effect a segmented scan of vector length equal to one.

13.1.4 Test Results

Figure: Evaluated sparse matrices (dimensions, nonzeros per row, and descriptions; the table entries are illegible in the scan).
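To make the register-blocking idea above concrete, here is a minimal C sketch of SpMV over a 2x2 block-CSR (BCSR) format, where each stored tile is a dense 2x2 block with a single column index, and the two output rows of a block row are accumulated in registers. The struct layout and names (bcsr2x2, bptr, bind, bval) are illustrative assumptions, not the paper's actual code.

```c
#include <stdio.h>

/* Hypothetical 2x2 BCSR layout: one column index per 2x2 tile
 * instead of one per nonzero, shrinking the index footprint. */
typedef struct {
    int nbrows;          /* number of block rows (matrix rows / 2)   */
    const int *bptr;     /* block-row pointers, length nbrows + 1    */
    const int *bind;     /* block column index (block units) per tile */
    const double *bval;  /* tiles, 4 doubles each, row-major          */
} bcsr2x2;

void spmv_bcsr2x2(const bcsr2x2 *A, const double *x, double *y) {
    for (int bi = 0; bi < A->nbrows; bi++) {
        double y0 = 0.0, y1 = 0.0;   /* two output rows kept in registers */
        for (int k = A->bptr[bi]; k < A->bptr[bi + 1]; k++) {
            const double *t = &A->bval[4 * k];
            double x0 = x[2 * A->bind[k]];
            double x1 = x[2 * A->bind[k] + 1];
            y0 += t[0] * x0 + t[1] * x1;
            y1 += t[2] * x0 + t[3] * x1;
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}
```

The trade-off the notes describe is visible here: explicit zeros may be stored inside a tile, but each tile costs one index instead of up to four, which is what reduces memory traffic for a memory-bound kernel.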
[Remaining rows of the matrix table: Wind Tunnel, QCD, FEM-Ship, and others; entries illegible in the scan.]

- The dense matrix provides the performance upper bound; SpMV is limited by memory throughput.
- The dense case supports arbitrary register blocks with no added zeros.
- Loops are long-running, so more time goes to CPU time versus memory-fetch time.

Peak effective bandwidth results: Table: bandwidth and computational rates (one core vs. all cores) for a dense matrix stored in sparse format on AMD X2, Clovertown, Niagara2, and Cell, measured in GB/s and as a percentage of peak. Only the fully optimized version approaches the theoretical potential; outside of this project we typically expect only 10% of peak performance.

Figure: Effective SpMV performance on the multicore platforms (per-matrix GFlop/s bars for AMD X2, Clovertown, Niagara2, and the Cell Broadband Engine).

The results show that the optimized SpMV implementation outperforms OSKI using a single core; the median-matrix comparison across systems is shown in the (illegible) table and bar chart.

13.2 Dimitrij Krepis: POSH, A TLS Compiler that Exploits Program Structure

13.2.1 Introduction to Thread-Level Speculation (TLS)

The speedup of TLS comes from two effects: task parallelism and data prefetching. TLS benefits from parallelism when two tasks run in parallel. TLS benefits from prefetching when a task suffers a cache miss on datum A; the task is then squashed, and later a second task that will not be squashed obtains A from the cache. Figure 3(c) illustrates
this effect, where the task that benefits from prefetching is the one that was squashed.

Figure: The two main benefits of TLS, parallelism and prefetching, compared against sequential execution.

13.2.2 POSH

Hardware assumptions: shared-memory CMP; no register transfer between tasks; write-through L1 caches.

Compiler phases: Task Selection, Spawn Hoisting, and Task Refinement.
- Task Selection: candidate tasks are subroutines, subroutine continuations, loop iterations, and loop continuations. For each task, identify its beginning and end points. Value prediction reduces data-dependency violations (e.g., loop-induction-variable prediction). Spawn statements are inserted at the beginning of tasks.
- Spawn Hoisting: hoists spawns as early as possible, which improves parallelism and prefetching; spawns cannot move above the last definition of a variable used (except with value prediction) or violate control-flow constraints; spawning proceeds in reverse order.
- Task Refinement: makes the final decisions on which tasks to keep; the refinement phase includes a parallelism model and a profiler. The profiler models sequential execution with a simple cache model, estimated cache misses, per-task execution time, spawn-point overhead, and dependence violations.

Experiments: since there is no machine that supports TLS, POSH targets SESC, a simulator, modeling a 4-core chip multiprocessor in which each core has a private L1 cache that buffers speculative state, connected through a crossbar to an on-chip shared L2 cache.

Table 1: Architecture parameters and cycle counts for the simulated processor (entries illegible in the scan).

The evaluation examines task-selection choices (subroutines vs. loop iterations), memory behavior, and the effectiveness of the profiler and value prediction.

Figure: Speedup of the TLS execution over sequential execution for different task-selection algorithms. The best speedups are obtained when both types of tasks are selected (Sub+Loop).
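The squash-and-re-execute behavior that both the TLS introduction and the POSH evaluation rely on can be sketched in plain C. This is a toy software model under stated assumptions (a snapshot stands in for speculative state, and commit happens strictly in order); real TLS hardware tracks violations per cache line, as the notes describe. The names run_tls and task are hypothetical.

```c
#include <string.h>

/* Toy TLS model: N loop iterations run "in parallel" against a
 * snapshot of memory, then commit in order; a task whose read
 * location was written by an earlier commit is squashed and
 * re-executed with up-to-date values. */
#define N 4

static int task(const int *mem, int i) {
    /* iteration i reads mem[i] and produces a value for mem[i+1] */
    return mem[i] + 1;
}

int run_tls(int *mem, int *squashes) {
    int snapshot[N + 1], result[N];
    memcpy(snapshot, mem, sizeof(snapshot));
    for (int i = 0; i < N; i++)          /* speculative (parallel) run */
        result[i] = task(snapshot, i);
    *squashes = 0;
    for (int i = 0; i < N; i++) {        /* in-order commit with checks */
        if (mem[i] != snapshot[i]) {     /* dependence violation detected */
            result[i] = task(mem, i);    /* squash and re-execute */
            (*squashes)++;
        }
        mem[i + 1] = result[i];          /* commit this task's write */
    }
    return mem[N];
}
```

Because every iteration here depends on its predecessor, every task after the first is squashed, yet the final result matches sequential execution; in hardware, those squashed tasks would still have warmed the cache, which is exactly the prefetching benefit measured above.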
Canmbnnanaanfemhng m TLS Spednp m prefemhmg effecmfsqushzd mks canmhmzsm m spednp afTLS execman I pm I W m I I I I u g m m mew Figum m Speedup 01 TLL v nPrryv39tch and TLS mcr w qucnhal Cxccmmn bem 39s m mnsnsgap mm mm m wr M W W um mew Fig l u Speedup m TLS mm and wthmn he pm hng pass am a M quununl Newman spednp o Effecnvemss afVahIz Premman an my mex m a Mam gum 12 Speedup M TLS mm and wnhoul vahk pmdm lmn ovvr um szqucmml mmmon Optimization of Sparse Matrix Vector Multiplication on Emerging Multicore Platforms Samuel Williams Leonid Oliker Richard Vuduc John Shalf Katherine Yelick James Demmel Presented by Bryan Youse Department of Computer amp Information Sciences University of Dela ware Sparsity Dense Matrix quotcommonquot matrix Sparse Matrix few nonzero entries Sparsity typically expressed as the number 0 nonzero entries per row Why care pMV one o i the most heavily used kernels ir scientific computing Other applications Economic modeling Information retrieval Network theory Historically currently terrible performance Current algorithms run at 10 or less of machine peak performance on a singlecore cache based processor Dense matrix kernels gtgtgt similar Sparse kernels What39s the hold up Problems with sparse kernels data structure woes needs to both exploit properties of the sparse matrix fit machine architecture well runtime information needed this is the major disadvantage to the dense lgernels I I IILIILiII il III Many optimization tricks can be used to exploit t format r SpM V Optimizations 3 categories of optimizations Lowlevel code optimizations Data structure optimizations this includes the requisite code changes Parallelization optimizations Note that the first two largely affect single core performance Goal As much auto tuning as possible Optimizations Blocking Thread Blocking Threadlevel parallelism split matrix up by rows cols or blocks Cache Blocking Problem Very large matrices cannot fit the entire source or answer vectors in cache Solution 
Split the matrix up into cache-sized tiles (1K x 1K is common).

Optimizations: Blocking (2)
- TLB blocking: TLB misses can vary by an order of magnitude depending on the blocking strategy.
- Register blocking: group adjacent nonzeros into rectangular tiles.
- Key point to take from blocking: SpMV is a memory-bound application; reducing the memory footprint is more important than anything else.

More optimizations:
- Index size selection: 16-bit integers to reduce memory traffic.
- SIMDization.
- Software prefetching: get the data we know we'll need soon into cache.
- Architecture-specific kernels through autotuning.

Last optimization, I swear: loop optimizations, of course. The CSR format enables the core kernel loop to go from "old & busted" (a nested loop with two loop variables) to "new hotness" (remove the inner loop variable).

Table: Which optimizations apply on each platform (x86, Niagara2, Cell): SIMDization, row threading, software pipelining, branchless code, pointer arithmetic, 16-bit and 32-bit indices, register blocking (BCSR/BCOO), cache blocking, TLB blocking, process and memory affinity, prefetch/DMA of values, indices, pointers, and vectors, and inter-SpMV data-structure caching.

Table: Evaluated machines (figures partially recovered from the scan): Intel Clovertown (2 chips x 4 cores, 4-issue SSE3, 2.3 GHz, 74.6 peak GFLOPS), AMD Opteron (2 x 2 cores, 2.2 GHz, 17.6 GF), IBM Cell (1 x 8 SPEs, SIMD + RAM, 3.2 GHz, 14.6 GF), Sun Niagara 2 (1 x 8 cores, multithreaded with cache, 1.4 GHz, 11.2 GF); peak memory bandwidths are illegible.

Testing suite: Table of sparse matrices with dimensions, nonzeros per row, and descriptions: Dense (2K x 2K, 4.0M nonzeros, a dense matrix in sparse format), Protein (36K x 36K, protein data bank), FEM/Spheres, FEM/Cantilever, Wind Tunnel, QCD, FEM/Ship (141K x 141K),
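The "remove the inner loop variable" slide can be sketched as follows: instead of the nested row/nonzero loop, a single loop walks the nonzero arrays and consults the row-pointer array to decide when a row is finished. This is a sketch of the idea on the slide, not the paper's tuned, software-pipelined kernel; the function name is hypothetical.

```c
/* Single-loop CSR SpMV ("new hotness"): one induction variable k
 * over all nonzeros; the row index i advances when k reaches the
 * start of the next row. */
void spmv_csr_single_loop(int m, const int *ptr, const int *ind,
                          const double *val, const double *x, double *y) {
    int i = 0;
    double sum = 0.0;
    for (int k = 0; k < ptr[m]; k++) {
        while (k == ptr[i + 1]) {       /* row i is done (also skips empty rows) */
            y[i++] = sum;
            sum = 0.0;
        }
        sum += val[k] * x[ind[k]];
    }
    while (i < m) {                     /* flush the last (and any empty) rows */
        y[i++] = sum;
        sum = 0.0;
    }
}
```

With one induction variable and no per-row inner-loop bookkeeping, the loop body becomes nearly branchless, which is what makes the segmented-scan/software-pipelined variants mentioned above possible.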
Economics (207K x 207K, macroeconomic model), Epidemiology, Accelerator (FEM), Circuit (171K x 171K, Motorola circuit simulation), Webbase (1M x 1M, web connectivity matrix), and LP (constraint matrix); remaining entries are illegible in the scan.

Promising initial results:
- Sustained memory bandwidth in GB/s: up to 92% of a configuration's peak bandwidth.
- Sustained performance in GFlop/s: up to 12.3% of a configuration's peak computation.

Table 3: Sustained bandwidth and computational rate for a dense matrix stored in sparse format (GB/s and percentage of configuration peak bandwidth; GFlop/s and percentage of configuration peak performance) for one core/one thread, one socket/all threads, and all sockets/all threads on each machine (numeric entries largely illegible in the scan). Remember: outside of this project we typically expect only 10% of peak performance.

More results: per-matrix GFlop/s bar charts for the AMD X2, Intel Clovertown, and Sun Niagara2, comparing 1 core naive, 1 core + PF (prefetching), 1 core + PF/RB (register blocking), 1 core + PF/RB/CB (cache blocking), OSKI/PETSc, and the full multicore
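The sustained rates in the tables above follow from the operation count: each stored nonzero contributes one multiply and one add, so an SpMV performs 2*nnz flops. A small helper illustrating the arithmetic (the counts and time below are made-up example values, not measurements from the paper):

```c
/* GFlop/s for one SpMV: 2 flops per stored nonzero. */
double spmv_gflops(long long nnz, double seconds) {
    return (2.0 * (double)nnz) / seconds / 1e9;
}
```

For example, the 4.0M-nonzero dense-in-sparse-format matrix finishing in 4 ms would sustain about 2 GFlop/s; comparing that figure against a platform's peak is how percentages like "12.3% of peak" are obtained.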
configurations (e.g., on Niagara2, 1 to 8 cores x 8 threads each).

Figure: IBM Cell Broadband Engine results (PS3 and dual-socket QS20 blades with 1 to 16 SPEs, + RB/DMA/CB).

Figure: Power efficiency (MFlop/s per watt) across architectures: AMD X2, Clovertown, Niagara2, PS3, and the Cell blade. Cell's SpMV times are dramatically faster. All architectures showed good results, motivating the need for these optimizations.

CISC 879 Software Support for Multicore Architectures, Spring 2008
Lecture 9: March 11
Lecturer: John Cavazos
Scribe: Brice Dobry

Cell Programming Tutorial

Outline:
I. Cell Basics
II. Programming Models
III. Programming Details
IV. Example Code (see slides for example code)

I. Cell Basics

Heterogeneous architecture, 9 cores:
- 1 PPE: a general-purpose processor
  - in the picture you can see that the PPE takes up a large amount of space on the die
  - basically just a slightly modified PowerPC
  - good for control-plane or "branchy" code
- 8 SPEs: SIMD processors
  - good for computational code with few branches
  - on the PS3, one is disabled for yield reasons and another is used by the game OS
  - we noted that it is strange that the game OS can run well on an SPE, since you would think that OS code would be very branchy

Program structure:
- PPE code
  - a regular Linux process (main thread)
  - can spawn SPE threads
- SPE code
  - can be embedded in the PPE code
  - can also be standalone (an "SPUlet")

SPE details:
- all instructions are SIMD
- 128 128-bit registers for each SPE
- 256 KB Local Store (LS)
  - basically a software-managed cache
  - accessed with load/store instructions, 16 bytes at a time
  - contains data and instructions
  - allows overlaying functions, so that if there is not enough room to fit two mutually exclusive functions, they can
occupy the same space in the LS and be brought in from main memory when needed.
- DMA transfers are used to move data and instructions to/from main memory
  - high bandwidth: 128 bytes per cycle
  - need to verify whether or not DMA transfers can be used to go directly to/from one LS to another thread's LS
- nondeterministic features of standard general-purpose processors are removed:
  - out-of-order execution
  - HW-managed caches
  - HW branch prediction
  - this allows for a smaller logic size for the SPEs
- register layout: each register has a "preferred slot" for scalar values
- it is inefficient to operate on less than 16-byte units; smaller units should be packed and operated on with one instruction
- for example, 4 32-bit values can be added with one instruction on vector registers: add VC, VA, VB (Reg VA + Reg VB -> Reg VC)
- another example of the SIMD instructions provided is the shuffle on vector registers: shuffle VT, VA, VB, VC
  - this operation is byte-oriented
  - VC is the control register: for each byte, the first bit says which register to use (0 = VA, 1 = VB), and the second portion says which index of that register to use
  - results are placed into VT
- input/output methods of the SPE:
  - program arguments and return code
  - DMA transfers (put/get)
  - mailboxes (small messages to/from other SPEs)
  - signals (used for interrupts, etc.)

II. Programming Models

A partitioned system-memory model is often used, so we need to consider how to DMA data in and out efficiently. Possible models:
- Data parallel: a large array of data in system memory is fed through the SPEs, which do the same calculation on each data segment.
- Task parallel: pipeline style, where each SPE does a computation and then passes its output to the next SPE. This model seems to be too inefficient in practice and can usually be morphed into one of the other models.
- Job queue: code and data are packaged together in a queue in system memory and processed by the SPEs. Good for
when the data or tasks are not evenly distributed; built-in load balancing.

III. Programming Details

Terminology:
- SPE context: holds the information about a logical SPE thread
- SPE gang context: a group of SPE threads with the same attributes
- SPE events: asynchronous events caused by the SPEs (stopped, mailbox read/written, DMA completed)

Using an SPE, the basics:
1. Create an SPE context.
2. Load the executable object into the SPE context's local store.
3. Run the SPE context. This transfers control to the OS, requesting scheduling of the context to a physical SPE in the system.
4. Destroy the SPE context.

Note that the PPE thread that calls spe_context_run will block until the SPE has completed. The common practice is to spawn a pthread on the PPE for each SPE thread that will be spawned.

Communication mechanisms (reviewed):
- DMA transfers: LS to/from main memory (maybe to/from other LS)
- mailboxes: 32-bit messages; 2 for sending, 1 for receiving
- signals

mfc_put(lsaddr, ea, size, tag, tid, rid): copy memory from my LS to main memory. lsaddr is the address in my local store, ea is the address in main memory, and size is the size; tag is used by later calls to determine when the transfer has completed.

mfc_get(lsaddr, ea, size, tag, tid, rid): copy memory from main memory to my LS. lsaddr is the address in my local store, ea is the address in main memory, and size is the size; tag is used by later calls to determine when the transfer has completed.

Double-buffering can be used to help hide the DMA latency: while doing operation n, put the results of operation n-1 to main memory and get the input for operation n+1.

CISC 879 Software Support for Multicore Architectures, Spring 2008
Student Presentation 5: March 27
Presenters: Kishen Maloor, Ying Yu
Scribe: Yuanyuan Ding

Two papers are presented:
- Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, David I. August, "Revisiting the Sequential Programming Model for the Multicore Era," IEEE Micro, January 2008.
- K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S.
Williams, K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," Technical report, EECS Department, University of California at Berkeley, UCB/EECS-2006-183, December 2006.

5.1 Revisiting the Sequential Programming Model for the Multicore Era

5.1.1 Motivation
- The move to multithreaded programming is costly: parallel programming models are costly to adopt.
- There is a need for automatic parallelization: a large number of existing single-threaded applications.
- Past attempts have been insufficient to keep many cores busy.

5.1.2 Key Idea

The paper argues that the current lack of parallelism is not an intrinsic limitation of the sequential programming model, but rather occurs for the following two reasons:
- There exists no framework for automatic thread extraction that brings together key existing state-of-the-art compiler and hardware techniques.
- Existing sequential programming languages force programmers to define a single legal program outcome, rather than allowing for a range of legal outcomes.

The main contribution of this paper handles the two aspects above:
- build a framework for automatic thread extraction
- tweak the sequential programming model a small bit

As a result, all of the C benchmarks in the SPEC CINT2000 suite were parallelizable by automatic thread extraction and yielded a speedup of 454% on these applications, within the limits of modern compilers.

Figure 5.1: The left side shows a manual choice of parallelism; the right side shows the usage and semantics of the Y-branch. (Both versions read characters until EOF and compress them into a dictionary; on the left, the dictionary is restarted only when compression becomes unprofitable, while on the right a Y-branch with probability 0.0001, together with a counter compared against CUTOFF = 100000, allows the dictionary to be restarted at a range of legal points.)

5.1.3 A Framework for Automatic Parallelization

Issues with
previous techniques for parallelization:
- Thread-Level Speculation (TLS), attempting to extract DOALL parallelism and executing loop iterations in parallel, must rely on the synchronization of some dependencies, rather than speculation, to avoid mis-speculation.
- Decoupled Software Pipelining (DSWP), partitioning each loop iteration into a series of stages, must support speculation and leverage TLS-like memory subsystems to privatize memory.
- Both TLS and DSWP require judicious use of speculation to break infrequent or easily predictable dependencies inhibiting parallelization.

5.1.3.1 Compilation Scope
- Significant parallelism exists in the outermost loop.
- Need to leverage parallelism at any loop level; a non-trivial problem.
- Use of whole-program optimization.

5.1.3.2 Extend the sequential programming model
1. Y-branch
2. Commutative

Figure 5.2: Motivating example for Commutative. The code contains an internal dependence (a recurrence) on the seed variable. The Commutative annotation informs the compiler that the calls to Y_acm_random can occur in any order. Reconstructed from the figure:

    static int seed;
    @Commutative
    int Y_acm_random() {
        int temp = seed / 127773L;
        seed = 16807L * (seed - temp * 127773L) - temp * 2836L;
        if (seed < 0)
            seed += 2147483647L;
        return seed;
    }

5.1.4 Experimentation Approach

Parallelizing large fragments of an application and targeting many cores renders traditional simulation techniques impractical. The paper proposes a parallelization paradigm that decomposes application loops into three phases. Construct an application execution plan:
- decompose loops into 3 phases
- phase 1 tasks depend on prior phase 1 tasks
- phase 2 tasks depend on the corresponding phase 1 task
- phase 3 tasks depend on the corresponding phase 2 task and prior phase 3 tasks

Note that here, phases are statically selected "regions" in the code; "tasks" are dynamic instances of these regions. Task dependences are used to model speculation, and the code is instrumented to use hardware performance counters. In the experiments, code is compiled using GCC -O3, and the
simulator reports the total parallel execution time.

5.1.5 Special Notes

The framework does not do everything automatically. It simply provides some notation that is very easy to use, with which the programmer can help the compiler perform his or her desired optimization.

5.2 The Landscape of Parallel Computing Research: A View from Berkeley

5.2.1 Motivation
- The comparison of the outdated conventional wisdoms and their new replacements.

Figure 5.3: (b) illustrates the static phase dependency graph for the code example in (a); (c) illustrates an execution that follows the parallel paradigm.

Figure 5.4: Uniprocessor performance (SPECint): VAX, 25%/year from 1978 to 1986; RISC + x86, 52%/year from 1986 to 2002; RISC + x86 from 2002 to present at a much lower rate (the figure's exact percentage is illegible in the scan).

- Moore's Law continues: thousands of processors can be put on a single economical chip.
- Communication between these processors within a chip can have very low latency and high bandwidth.
- The open-source software movement means that the software stack can evolve much more quickly than in the past.

5.2.2 Old & New Conventional Wisdom

The old conventional wisdom:
- Power is free, but transistors are expensive.
- Multiplies are slow, but loads and stores are fast.
- We can reveal more ILP via compiler and architecture innovation.
- 2X CPU performance every 18 months.

The new conventional wisdom:
- Power wall: power is expensive, but transistors are free.
- Memory wall: loads and stores are slow, but multiplies are fast.
- ILP wall: diminishing returns on finding more ILP.
- Power wall + memory wall + ILP wall = brick wall.

5.2.3 Dwarfs
- Here, a "dwarf" means a special case that can be easily parallelized in general, i.e., the most suitable parallelizable unit.
- The first seven dwarfs
are: Dense Linear Algebra, Sparse Linear Algebra, Spectral Methods, N-Body Methods, Structured Grids, Unstructured Grids, and Monte Carlo.
- Later on, another 6 dwarfs were added: Combinational Logic, Graph Traversal, Dynamic Programming, Backtrack and Branch-and-Bound, Construct Graphical Models, and Finite State Machine.

5.2.4 Hardware

- Small is beautiful:
  - An energy-efficient way to achieve performance.
  - An economical element that is easy to shut down in the face of catastrophic defects, and easier to reconfigure in the face of large parametric variation.
  - A simple architecture is easier to design and functionally verify, more power efficient, and easier to predict.
- Memory: Given the current slow increase in memory capacity, the MIPS explosion suggests that a much larger fraction of total system silicon should be dedicated to memory in the future.
- Switch:
  1. For interconnection networks:
     - On-chip topologies to prevent the complexity of the interconnect from dominating the cost of manycore systems.
     - Augment the packet switches with simple circuit switches that reconfigure the wiring topology between the switches to meet the application's communication requirements.
     - Use less complex circuit switches.
  2. For communication primitives:
     - Synchronization using transactional memory.
     - Synchronization using Full/Empty Bits in memory.
     - Synchronization using message passing.

5.3 Programming Model

- How to evaluate a programming model:
  - The programming model must allow the programmer to balance the competing goals of productivity and implementation efficiency.
  - Programming model efforts should be inspired by psychological research.
  - Models must be independent of the number of processors.
  - Models must support a rich set of data sizes and types.
  - Models must support proven styles of parallelism.
- Human-centric programming model:
  - The success of models is strongly affected by the human beings who use them.
  - Efficiency, productivity, and correctness are important, but research from psychology has so far had little impact.
  - Transactional Memory helps with human errors.
  - Integrating research on human psychology
and problem solving is critical to developing successful parallel programming models and environments.

5.3.1 Systems Software

- Instead of completely re-engineering compilers for parallelism, rely more on autotuners that search to yield efficient parallel code.
- Instead of relying on conventional large, monolithic operating systems, rely more on virtual machines and system libraries that include only those functions needed by the applications.

5.3.2 Metrics of Success

The most intuitive metric is how easy it is to write programs that execute efficiently on manycore computing systems:
- Maximizing programmer productivity
- Maximizing application performance and energy efficiency

Challenges are:
- conventional serial performance issues
- minimizing remote access
- balancing load
- granularity of data movement and synchronization

CISC 879 Software Support for Multicore Architectures, Spring 2007
Lecture 5: February 26
Lecturer: John Cavazos    Scribe: John Tully

Part 2 of IBM Cell Lecture

I. Compiler Tools
A. Sony GNU-based compiler: for PPU code use ppu-gcc or ppu-g++; for SPU code use spu-gcc or spu-g++. The gdb debugger supports both types of code.
B. XLC/C++ compiler (only available for the next 2 months): for PPU code, ppuxlc / ppuxlc++; for SPU code, spuxlc / spuxlc++. Includes optimization levels -O0 through -O5; it is recommended that you experiment with them.
C. IBM single-source compiler: removes the need for separating PPU/SPU code.

II. Performance Tools
A. OProfile: GNU system-level, PPU-only profiler.
B. Gprof: GNU profiler; recommended for finding the important functions inside code.
C. spu_timing: IBM static analysis tool; annotates assembly with scheduling/instruction-issue estimates.
D. Cell BE system simulator: allows you to run and analyze code on x86.

III. Makefiles (tips for compiling)
Makefiles used:
- README_build_env.txt: details about file structures and variables (read this).
- make.footer: build rules.
- make.header: required definitions.
- make.env: compilers/tools to be used by make.
Recommended procedure: don't change make.header and
make.footer; change variables in make.env via the command line.

Important Makefile variables:
- LIBRARY_embed / LIBRARY_embed64: creates a linked library from an SPU program to be embedded into a PPU program.
- CC_OPT_LEVEL, CFLAGS_gcc, CFLAGS_xlc: optimization level and additional gcc/xlc flags.
- TARGET_INSTALL_DIR: location where built targets get installed.
- DIRS: list of subdirectories to build first.
- PROGRAM_ppu, PROGRAMS_ppu, PROGRAM_ppu64, PROGRAMS_ppu64, PROGRAM_spu, PROGRAMS_spu: PPU (32- or 64-bit) and SPU programs to build.

IV. Sample Project
PPU and SPU files in different directories.
[Slide: sample project layout — a top-level Makefile with DIRS and an include of the make.footer under $(CELL_TOP); an spu/ Makefile building the SPU program; a ppu/ Makefile building the PPU program with IMPORTS of the embedded SPU library. The slide text is otherwise illegible.]

Lecture 5: Patterns for Parallel Programming

I. Parallelizing a Program: Steps Involved

General steps:
1. Study the code/problem
2. Look for parallelism
3. Keep all cores busy

Formal definition of steps:
1. Decomposition: Identify concurrency and break up the computation into tasks. Requires an understanding of the algorithm, and may even require inventing a new one. Tasks can be dynamic and may vary with time; you need enough of them to keep processors busy.
2. Assignment: Divide the work; balance computation and reduce communication. Structured approaches (i.e., using design patterns) work well. Granularity is the ratio of computation to communication: fine-grain is easier to load balance, but coarse-grain provides more opportunity for performance increase. Efficiency depends on the hardware and the algorithm.
3. Orchestration: Structure the communication and synchronization of processors; preserve locality of data.
4. Mapping: Schedule tasks to processors to satisfy dependencies, maximize utilization, and minimize communication overhead.

[Figure: the four phases — partitioning a sequential computation into tasks, assigning tasks to processes, orchestrating the processes, and mapping processes to processors.]

II. Design Patterns for Parallel Programs

Design Pattern: a reusable solution to a commonly occurring
problem, i.e., a description of how to solve a problem that can be used in many situations.
Examples:
- Architecture: city planning, room layouts — first manifested by C. Alexander
- OOP software design — book written by Gamma et al. with common solutions
- Parallelization — book written by Mattson et al.

Four design spaces:
1. Finding concurrency: exposing concurrent tasks/data
2. Algorithm structure: mapping tasks to processes
3. Support structures: code/data structuring patterns
4. Implementation mechanisms: low-level mechanisms for programming

This Lecture: Finding Concurrency

1. Decomposition
Three things to consider: flexibility, efficiency, simplicity. Typically you can get two at the cost of the other; i.e., being highly flexible and efficient typically means simplicity is sacrificed.

A. Data (Domain) Decomposition: break data up into independent units.
- Good when the main computation involves manipulating a large data structure.
- Good when similar operations are applied to different parts of a data structure.
- Example: finding the largest element of an array. Each CPU determines the largest element of a segment of the array; then the largest of each CPU's largest values is determined.

B. Task (Functional) Decomposition: break the problem up into parallel tasks.
- Often a natural method: think of functions, distinct loop iterations.
- Example: GUI event handler — divide functions among cores (CPU 0, CPU 1).

B1. Pipeline Decomposition: a special type of task decomposition where multiple data sets are processed in parallel by multiple CPUs; referred to as "assembly line" parallelism (data sets 0, 1, 2, ... flowing through the stages).

2. Dependency Analysis
Determine control and data dependencies.

Example — domain decomposition possible:

    for (i = 0; i < 3; i++)
        a[i] = b[i] + 20;

(Slide source: Intel Software College, Intel Corp.)

Another example — domain decomposition not possible:

    for (i = 1; i < 4; i++)
        a[i] = a[i-1] + b[i];

(Slide source: Intel Software College, Intel Corp.)
(The slides illustrate the decomposition with 3 CPUs.)

3. Design Evaluation — Considerations
- Suitability to target platform: shouldn't depend on the underlying architecture.
- Design quality: simplicity, flexibility, efficiency — pick two.
- Preparation for the next phase: to help Algorithm Structure, which is the next design space.

Next: Algorithm Structure

CISC 879 Software Support for Multicore Architectures, Spring 2008
Student Presentation 6: April 8
Presenters: Pujan Kafle, Deephan Mohan    Scribe: Kanik Sem

The following two papers were presented.

A Synchronous Mode MPI Implementation on the Cell BE Architecture
Presented by Pujan Kafle

MPI Design
In implementing MPI, it is tempting to view the main memory as the shared memory of an SMP and hide the local store from the application. This alleviates the challenges of programming the Cell. However, there are several challenges to overcome; some of these require new features in the compiler and in the Linux implementation on the Cell, which have recently become available.

MPI Communication Modes
MPI provides different options for the communication mode chosen for the basic blocking point-to-point operations, MPI_Send and MPI_Recv. Implementations can use either the buffered mode or the synchronous mode; a safe application should not make any assumption about the choice made by the implementation. In the buffered mode, the message to be sent is copied into a buffer, and then the call can return. Thus the send operation is local and can complete before the matching receive operation has been posted. Implementations often use this for small messages. For large messages, they avoid the extra buffer-copy overhead by using the synchronous mode; here the send can complete only after the matching receive has been posted.

MPI Initialization
A user can run an MPI application (provided it uses only features that are currently implemented) by compiling the application for the SPE and executing the following command on the PPE:

    mpirun -n <N> executable arguments

where <N> is the number of SPEs on which
the code is to be run. The mpirun process spawns the desired number of threads on the SPEs. Note that only one thread can be spawned per SPE, and so <N> cannot exceed eight on a single processor, or sixteen for a blade.

Communication Architecture
Associated with each message is metadata that contains the following information about the message: the address of the memory location that contains the data, the sender's rank, the tag, the message size, the datatype ID, the MPI communicator ID, and an error field. For each pair of SPE threads, we allocate space for two metadata entries, one in each of the two SPEs' local stores, for a total of N(N-1) entries, with N-1 entries in each SPE local store. Entry B_ij is used to store metadata for a message from process i to process j.

Send Protocol
The send operation from Pi to Pj proceeds as follows. The send operation first puts the metadata entry into buffer B_ij through a DMA operation. The send operation then waits for a signal from Pj notifying that Pj has copied the message. The signal obtained from Pj contains the error value: it is set to MPI_SUCCESS on successful completion of the receive operation, and to the appropriate error value on failure. In the synchronous mode, an SPE is waiting for an acknowledgment for exactly one send at any given point in time, and so all the bits of the receive signal register can be used.

Receive Protocol
The receive operation has four flavors: (i) it can receive a message with a specific tag from a specific source; (ii) it can receive a message with any tag (MPI_ANY_TAG) from a specific source; (iii) it can receive a message with a specific tag from any source (MPI_ANY_SOURCE); or (iv) it can receive a message with any tag from any source. In the first case, the metadata entry in B_ij is continuously polled until the flag field is set. The tag value in the metadata entry is then checked; if the application truly did not assume any particular communication mode, then this tag should match and the check is superfluous. The receive call then transfers data from the source SPE's application
buffer to its own buffer, and signals Pi's signal register to indicate that the data has been copied. The second case is handled in a manner similar to the first, except that any tag matches. The third and fourth cases are similar to the first two, respectively, as far as tags are concerned; however, messages from any source can match, so the receive operation checks the metadata entry flags for each sender repeatedly in a round-robin fashion to avoid starvation (even though the MPI standard does not guarantee against starvation).

Performance Evaluation
Latency results were shown for point-to-point communication on the ping-pong test, in the presence and absence of congestion; the congested test involved dividing the SPEs into pairs and having each pair exchange messages. Throughput results were shown for the same tests with large messages, where the overhead of exchanging metadata and signals can be expected to be relatively insignificant. The maximum throughput observed is around 6 GB/s.

Conclusion
The authors have described an efficient implementation of synchronous mode MPI communication on the Cell processor. It is part of an MPI implementation that demonstrates that an efficient MPI implementation is possible on the Cell processor using just the SPEs, even though they are not full-featured cores.

Limitations of PlayStation 3 for High Performance Cluster Computing
Presented by Deephan Mohan

Cell Evaluation
Cell addresses the following two main problems:
- Memory Wall: The Cell Broadband Engine's SPEs use two mechanisms to deal with long main-memory latencies: (a) a 3-level memory structure (main storage, local stores in each SPE, and large register files in each SPE), and (b) asynchronous DMA transfers between main storage and the local stores. These features allow programmers to schedule simultaneous data and code transfers to cover long latencies effectively.
- Frequency Wall: By specializing the PPE and the SPEs for control-intensive and compute-intensive tasks respectively, the Cell Broadband Engine Architecture, on which the
Cell Broadband Engine is based, allows both the PPE and the SPEs to be designed for high frequency without excessive overhead. The PPE achieves efficiency primarily by executing two threads simultaneously, rather than by optimizing single-thread performance.

Cell Application Features
- Vectorize: the SPEs are vector units. This means that in code that is not vectorized, every scalar operation must be promoted to a vector operation, which results in a considerable performance loss.
- Keep data aligned: to achieve the best transfer rates, data accesses must be aligned, both in main memory and in the SPEs' local memories. Alignment provides better exploitation of the memory banks and better performance of the DMA engine.
- Implement double buffering: to hide the cost of latencies and memory transfers, DMA transfers can be overlapped with SPE local computations. If these local computations are more expensive than a single data transfer, the communication phase can be completely hidden. This technique is known as double buffering.
- Improve data reuse: to reduce the number of memory transfers, it is important to arrange the instructions so as to maximize the reuse of data once it has been brought into the SPEs' local memories.
- Explicitly unroll: due to the high number of registers on the SPEs and the simplicity of the SPE architecture (no register renaming, no speculative execution, no dynamic branch prediction, etc.), explicit unrolling provides considerable improvements in performance.
- Reduce branches in the code: SPEs can only do static branch prediction. Since these prediction schemes are rather inefficient on programs that have a complex execution flow, reducing the number of branches in the code usually provides performance improvements.

A PlayStation 3 Cluster: Hardware/Software Details
Hardware:
- Cell BE processor: the Cell processor on each node has only six SPEs accessible to the user, out of eight.
- Dual-channel Rambus Extreme Data Rate memory: for all practical purposes, the memory can
provide a bandwidth of 25.6 GB/s to the SPEs through the EIB, provided that accesses are distributed evenly across all 16 banks.
- Built-in Gigabit Ethernet network card: the network card has a dedicated DMA unit, which permits data transfer without the PPE's intervention.

Software:
The Linux operating system runs on the PS3 on top of a virtualization layer (also called the hypervisor) that Sony refers to as Game OS. This means that all the hardware is accessible only through hypervisor calls. The hardware signals the kernel through virtualized interrupts, and the interrupts are used to implement callbacks for non-blocking system calls.

Parallel Matrix-Matrix Product
The matrix-matrix product is one of the most important linear algebra kernels, since it represents the most computationally intensive operation in many applications. The Cannon, PUMMA, and SUMMA algorithms are still extensively adopted in many high-performance linear algebra applications run on parallel architectures.

Algorithm SUMMA:

    for i = 1 to n/nb do
        if I own A(i) then
            copy A(i) into buf1
            broadcast buf1 to my processor row
        end if
        if I own B(i) then
            copy B(i) into buf2
            broadcast buf2 to my processor column
        end if
        C = C + buf1 * buf2
    end for

Performance Evaluation — Double Buffering
- Performance can be improved by communication/computation overlap.
- Computations are offloaded to the SPEs; communications are taken care of by the PPE.
- Data for step k+1 is broadcast at step k.

Performance Results
- The cost of local computations is small.
- A surface-to-volume effect props up performance for large problem sizes.
- Memory limitations inhibit the best performance.

Conclusion — Limitations
- Main memory access rate: floating-point execution is faster than the peak transfer rate.
- Network interconnect speed: the capacity of the interconnect is out of balance.
- Main memory size: a serious limitation.
- Double-precision performance: lower than single precision.
- Programming paradigm: one must write low-level code.

Porting Financial Market Applications to the Cell Broadband Engine Architecture
John Easton, Ingo Meents, Olaf Stephen, Horst Zisgen, Sei Kato
Presented by
Kanik Sem, Dept. of Computer & Information Sciences, University of Delaware
CISC 879 Software Support for Multicore Architectures

Outline
- Why Cell BE for financial markets
- Porting strategies for the Cell BE platform
- Performance results
- Mixed-precision workloads
- Tying it all together
- Conclusions

Highlights
- Potential for dramatic impact on financial applications
- Application codes ported to the Cell
- Codes optimized to fully exploit the Cell
- Performance improvements of almost 40X

The application
- Code used to price a European Option
- Model based on the Monte Carlo simulation technique
- Needs to generate a large number (200,000,000 in this case) of uniform pseudo-random numbers
- Using the random numbers generated, execute the financial model

Porting steps
- Recompile the existing code for Cell (XLC better than gcc)
- Make structural changes: a framework to start separate threads on each SPU; split the RNG across all cores
- Make functional changes to the code: re-engineer functions to exploit vectorization on the SPU cores

Initial profile (% time / calls / function name):
- 62.70% / 200,000,000 calls / getRandom
- 37.18% / 1 call / simulateEuropeanOptionValue
- 0.14% / 1 call / hpcMonteCarlo_random
- 0.00% / 2 calls / hpcBlackScholes

The SDK for Cell provides an optimized RNG that can run 64 number generators at once on a Cell blade. Timing was measured with the gettimeofday function.

Test parameters
- Compiler used: spuxlc, ppuxlc
- Compiler optimization setting: -O3 -qstrict
- Random-number generation method: SDK
- Precision: single
- Number of evaluations: 200,000,000

Performance by number of SPUs (single precision), times in seconds:

SPUs:                          1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
2.4 GHz Cell BE (measured):  65.7  32.9  21.9  16.4  13.18 10.9  9.4   8.2   7.3   6.6   6.0   5.5   5.1   4.7   4.4   4.1
3.2 GHz Cell BE (estimated): 49.27 24.6  16.42 12.3  9.88  8.17  7.05  6.15  5.4   4.95  4.5   4.12  3.8   3.52  3.3   3.07

The speedup scales nearly linearly, reaching 16.02 at 16 SPUs.

[Chart: elapsed time vs. number of SPUs for the 2.4 GHz Cell (measured) and 3.2 GHz Cell (estimated), single precision.]

Double precision
- Organizations in financial markets require double-precision calculations; the initial target marketplace for Cell did not need this.
- The initial implementation of Cell provides limited double-precision support in hardware: single precision is fully pipelined, double precision only partially pipelined.

Performance by number of SPUs (double precision), times in seconds:

SPUs:                           1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16
2.4 GHz Cell BE (measured):  157.3  78.6  52.4  39.3  31.49 26.25 22.5  19.7  17.5  15.78 14.3  13.1  12.1  11.3  10.5  9.9
3.2 GHz Cell BE (estimated): 117.9  58.9  39.3  29.47 23.61 19.68 16.8  14.7  13.12 11.8  10.7  9.82  9.1   8.47  7.87  7.42

The speedup reaches 15.89 at 16 SPUs.

Random-number generators
- Run time with the Mersenne Twister without optimization: 5 sec; run time with the Cell BE SDK RNG: 4.1 sec.
- Mechanisms to improve the performance still further: optimize the Mersenne Twister code for the threading framework; rewrite the code to utilize the SIMD capabilities of the SPUs.

Performance comparison between the Cell BE SDK and Mersenne Twister random-number generators (runtime in seconds):

Precision | SDK RNG, 2.4 GHz | Mersenne Twister RNG, 2.4 GHz | Mersenne Twister RNG, 3.2 GHz (estimated)
Single    | 4.1              | 10.2                          | 7.6
Double    | 9.9              | 24.7                          | 18.5

Mixed precision
- Only those parts that actually need double precision are calculated using double precision.
- Disadvantage: a slight increase in the programming effort needed — identify the parts of the code which need this precision and make the appropriate changes.
- Advantage: performance improvement.

The two
methods of applying mixed precision to our code are:
1. Concatenating two single-precision random variables.
2. Generating one single-precision random variable and then doing a double-precision division.

Mixed-precision run times by number of SPUs (seconds, 1 to 16 SPUs):
- CCDPMT (concatenation, double precision, Mersenne Twister): from 40.33 down to 2.52
- CCDPSDK (concatenation, double precision, SDK): from 40.33 down to 2.53
- MDPMT (division, double precision, Mersenne Twister): from 45.76 down to 2.88
- SPMT (single precision, Mersenne Twister): from 12.01 down to 0.77
- SPSDK (single precision, SDK): from 11.16 down to 0.73

[Chart: mixed-precision workload run times vs. number of SPUs (1-16) for the five variants above.]

Additional optimization techniques
- Unrolling more parts of the Mersenne Twister RNG
- Additional software pipelining by parallelizing computation
- Introducing new variables to eliminate dependencies
- Precalculating some items, e.g. (slide fragment):

      a0 = <something>;
      for (i = 0; i < N; i++)
          ... sinf(4*a0) ... sinf(4*a[i+1]) ...

OpenMP comparison
- A master thread forks slave threads to perform RNG: the master thread is the part of the Cell BE code that runs on the PPU; the slave threads are the parts that run on the SPUs.
- Difference: work scheduled by the OpenMP runtime shares the same cores as the OS threads. The SPUs in the Cell BE version are not running the operating system, which enables them to be used entirely to run the application code.

OpenMP runtimes (seconds) on conventional systems:

System (CPU speed) | OS            | Compiler   | 1 thread | 2 threads | 4 threads/cores
x3550 (3.0 GHz)    | Red Hat Linux | Intel ICPC | 31.76    | 15.9      | 8.46
x336 (2.8 GHz)     | Red Hat Linux | Intel ICPC | 43.27    | 30.02     | 22.62
JS21 (2.33 GHz)    | Fedora Core 6 | gcc        | 43.38    | 21.74     | 10.88

[Chart: run times for the original code, the first and latest single-precision Cell ports, the double-precision Cell port, and the OpenMP versions.]

Tying it all together
- The results were achieved on a platform that many viewed as being unsuitable for Financial Markets users (e.g., because of the double-precision performance of the Cell Broadband Engine).
- Systems based on Cell BE technology are an excellent platform for Financial Markets applications.

Lessons learned
- Offload as much of the computation onto the SPUs as possible.
- Write the SIMD code yourself rather than relying on the compiler to do it (XLC provides auto-SIMDization, but this may not be a good approximation).
- In certain situations you might find that starting from scratch is a much quicker way to implement application code.

Reasons general-purpose processors make up the majority of computational infrastructures:
1. Huge numbers of systems based on these processors.
2. A large supply of skilled professionals, which leads to lower skills costs.
3. A lot of application development tooling.
4. The relatively easy porting of code to these platforms.

Esoteric technologies offer high performance for their chip area and consume much less power per computation. Disadvantages:
1. The skills to program them are rare and hence expensive.
2. Lack of application development tooling.
3. The porting process is generally both slow and costly.

Advantages of Cell BE technology:
1. Consumes less power, space, and cooling.
2. High computational power.
3. Better data movement and manipulation abilities.
4. A number of strong customer proof points.
5. Support from key Independent Software Vendors.
6. Results of experiments such as this one.

Questions / Comments / Caveats

Lecture 6: 28th
February

CISC 879 Software Support for Multicore Architectures, Spring 2008
Lecture 6: February 28, 2008
Lecturer: John Cavazos    Scribe: Karthik Ravindra

This class gave us an overview of the decomposition models (functional and pipeline decomposition), dependency analysis, and an overview of design patterns for parallel programs. The corresponding presentations are part of the 5th and 6th lectures.

Task (Functional) Decomposition
- Programs decompose into tasks: functions and loop iterations.
- It is always better to work with too many chunks of data than too few. Example: merging 1000 functions into 100 is better than splitting 10 functions into 100.
- In the real world, the software outlives the hardware. Example: the data handler for a GUI.

Consider the following example. Suppose the tasks for a particular process are divided into chunks f, g, h, q, r, s, and the dependencies are as shown in the picture. The execution would be as follows: f executes on CPU0; g and r execute on CPU1 and CPU0, respectively; h and q execute on CPU1 and CPU2, respectively; s executes on CPU0.