Popular in Course
Popular in Computer Information Technology
This 58 page Class Notes was uploaded by Darrion Bednar on Saturday September 19, 2015. The Class Notes belongs to CISC879 at University of Delaware taught by Staff in Fall. Since its upload, it has received 32 views. For similar materials see /class/207175/cisc879-university-of-delaware in Computer Information Technology at University of Delaware.
Reviews for ADVANCEDPARALLELPROGRAMMING
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 09/19/15
Software Pipelining on MultiCore Architectures Alban Douillet Guang R Gao Tom St John Dept of Electrical and Computer Engineering University of Delaware 8779 Siofhware aparter Multic qre Architectures Introduction SingleDimension Software Pipelining Multi Threaded Software Pipelining Experiments CISC 879 Software Suppoirtfor Multicoire Architectures Software pipelining is among most successful optimizations Can it be applied to multicore chips What extensions are required CISC 879 Software Su ppoirttor Multicoire Architectures Does not simply pipeline innermost loop Pipelines most profitable loop level Loop levels enclosing selected loop left to global scheduler Selected loop seen as outermost loop Inner loops executed sequentially Able to take advantage of lLPdata locality properties present in other loops CISC 879 Software Support for Multicore Architectures FOR lFOS no om FOR use no DP Up on END FOR END FOR 1 Double mp Next T L 12 b Kamel FOR 1103 no on FOR 110 no Up Up END FOR END FOR a Double Lamp Nest smae V 2 I a mix u 2 2 a b Kernel CIS Q 879 a 9 r m mam nhhe kamel my mama onnnamm kemnl c Ideal Schedule 5110quot03 mm Penman no 017 mm 2 00 Up on END FOR END ran a Double Lon Nest smae A 0 Ex u u u u 12 2 n b Kernel m mum unne kernel my mum uruummw kernel c Ideal Schedl e mums ammsav Wm j W pm on k r mm 1 mp mm um up 1 f DLPJ k V gmup nfSn m oumnnosl Hamlinquot w x m lt x L cpilog lt 1 Final Schedule CC 879 39 Software Support fbr Multicorg Afchite ure39s Several Obstacles Exist Dependencesresource constraints must be respected Operation cannot be scheduled before all dependences are satisfied Memory dependences may exist between thread units Synchronization is required CISC 879 Software Support for Multicore Architectures quotquoti39h eadd Schedule each group of Sn iterations on a thread unit using roundrobin approach Workload balance is fair Sn is max number of iterations that can be executed in parallel without resource conflict Thread units may share same instruction cache CISC 879 Software Support for Multicore Architectures mm wail uumlctmn Inlunnnxl 39 InapPanzm w 1 Epilag a4 x m m x E m u d e h c S I a m F CISQ 8795 Data dependencies may exist between outermost iterations Synchronization points are chosen to minimize code duplication during code generation WAIT is placed before each repeating pattern SIGNAL is placed after each pattern Synchronization delay guarantees the correctness of the schedule CISC 879 Software Support for Multicore Architectures a No Slall me dependence is my specled um m nnmmmmd unhede it No Sun the dependence is re b Stall without Symmm la on Der speded lay me dependence is no expected Inlqum Mman meme no a NO Stall the dependence is Ex I Stall Vilhom Synchronila on Der C Stall with Synchronization Delay pected lay the dependence is um mpecned me dependence xs respected 39GISG 879 Software Supportzfo r MultiebrerArcHitectures Each thread has two counters Synchronization counter counts number of synchronization Signals received Clock counter incremented after each WAIT When thread reaches a WAIT execution continues only if synchronization counter greater or equal to clock counter WAIT implemented with an active loop SIGNAL is a nonblocking atomic addinmemory instruction CISC 879 Software Support for Multicore Architectures Allows for coarsergrain synchronization Execution of Nn 1 instances of the innermost loop pattern is tiled Into tiles of G iterations WAIT and SIGNAL are issued at the entrance and exit of each tile Gmin value of G that minimizes final schedule length can be approximated at compile time up a NR in a t Egg 1mg 1 l T 3n l 1 CISC 879 Software Support for Multicore Architectures t 1min A CrossIteration Assume thread units do not share registers Insert copy operations to copy value from one thread unit to next Register dependence transformed into memory dependence lssue memory spill instruction to copy from register to scratchpad memory of destination thread Value restored using local memory load CISC 879 Software Support for Multicore Architectures CrossIteration Memory spi instructions only need to be issued by the last iteration of an iteration group Memory restore instructions only need to be issued by the first iteration If distance of dependence is greater than 1 cascading copies and memory spillsrestores will bring value to target iteration CISC 879 Software Support for Multicore Architectures k m mm Alwwn amen W W n cmum I Wampum g Wm qu 59 WWW e s emu g 51AM Tquot Tquot m CISC 879 Software Support for Multibote Architectures The multicore final schedule represented by the schedule function is correct The multithreaded final schedule is deadlockfree The synchronization signal guarantees that the memory accesses preceding it on the same thread unit have been committed CISC 879 Software Support for Multicore Architectures The MTS method has been implemented on the Open64 compiler retargeted for the IBM Cyclops64 architecture Loop nests from the Livermore Suite SPECZOOO and NAS were used CISC 879 Software Support for Multicore Architectures Execmtoll Tum Abso me Spemnp mam rm Absohue Spw np us As 32 a t 1 6 5 Number of mm Umu Nonmltzed Pmblem Em name names it Execu an Time Absolute Speedup b Scalnbility With Ptoblem Size MTS schedules showed very good scalability with relative speedup between 575 and 81 for 99 threads use 379 s ottware Support or Multioore Architectures mama Tm Sprung a Imp mm mm is is Loop Tiling Factor Timin results using tiling factor Gm match results using est empirical tiling factor use 379 s oi iware Support i or Muii ioore Architectures 1 Optimizing the Fast Fourier Transform on a Multicore Architecture Long Chen Ziang Hu Junmin Lin Guang R Gao IEEE International Parallel and Distributed Processing Symposium 2007 Presentation by Yuanyuan Ding Dept of Computer amp Information Sciences University I Dela ware ClSjC 83979 Software Support for Multicore Architectures J FFT Introduction IBM Cyclops64 C64 Architecture 1D FFT Optimization StepbyStep 2D FFT Optimization Conclusion ClSjC i879 Software Support for Multicore Architectures Radix2 CooleyTukey algorithm divide and conquer approach 1 I u mgr Recursiver defined l Xlc F103 w Fgm 0 g k l 1 Xk F1k w F2k 03k 1 l wlzwlz ka twiddle factors Fi the N2point DFTs of fin Recursive overhead are not favored iterative Implementation are used C39ISC 879 Software Support for Multicore Architectures Bit reversal permutation First stage Second stage Third stage 10 X0 11 9 gtlt X391 IQ 13 2 wt gtlt mg W 13 A 3 l 1 W 114 g X1 15 393 g X 5 1 1 391 16 mg g X6 1 17 quot3 g 83 Xm Bitreversal permutatioh befcle butterfly computations CISTC 5879 Software Support for Multicore Architectures Consisting thousands of64 chips connected by 3D mesh network with every C64 chip 8O 64bit processors each processor 1 floating point unit FPU 2 thread units TUs 64 64bit registers and 32 KB SRAM 16 shared instruction caches le 4 offchip DRAM controllers Crossbar network with 9696 ports 4GBs bandwidth per port 384GBs in total Memory scratchpad SP memory onchip global interleaved memory GM and offchip DRAM GigaBit Ethernet controllerrnd other lO devices Etc CISC 879 Software Support for Multicore Architectures ClSjC 879 Software Support for Multicore Architectures i Processot 1 80 Base Parallel Implementation Optimal Work Unit Special Handling of the First Stages Unnecessary Memory Operations Loop Unrolling Register Renaming and Instruction Scheduling Memory Hierarchy Aware Compilation CIS C 879 Software Support for Multicore Architectures Work Unit smallest ur of concurrency Intuitive work unit considers a butterfly operation Read 2 point data and the twiddle factor from GM Perform a butterfly operation upon them Write the 2 point results back to GM l a u My 1 a tuxb Work units are assigned in a roundrobin way 654 Gflops are achievid in this implementation CISC 879 Software Support for Multicore Architectures 1 Butterfly Operation 1 H a w b 4 Butterfly Operation wt M 1 wN Bil reversal permulalilm First slage Sccund stage mini stage WE 951 1 r 0 12 8 13 3 his 1 1 14 393 1 15 3 gtlt V gtltgtlt 1 V1 1 w Log 1 17 3 gtlt 45 w 1 1 1 6186 879 Software Support for Multicore Architectures X0 X1 X392 X3 X4 X5 X6 XT Finegrained work units imply large synchronization overhead Number of floating point operations cannot be reduced defined by the FFT algorithm itself Using bigger point work units the number of load and store operations are efficiently reduced the number of stages number of barriers are reduced I CIS C 879 Software Support for Multicore Architectures Number of cycles per b erfly operation VS the the size of work unit 8 point is the best an O N O I O O I 0 0 l amp O I m D l 2point 4point 8point 16point 32point 64point Register spilling for largelWU Need 112 for 16point 3879 Software Support for Multicore Architectures N 0 Average number of cycles per bufferfly operation 8 O l Theoretically a work unit of Npoint data can get rid of logN1 barriers 5mg N Percentage of FP operations IS 6N1g2 137 For C64 architecture 8point work unit is the best choice without serious register spilling Reach a performance 1317 Gflops CISiC 879 Software Support for Multicore Architectures Optimizations GFLOPS Speedup Over Incremental Base Version Speedup Base 654 100 0 IOptimal WU 1317 202 1015 Special App 1692 259 284 Eli MEM Ops 1797 275 62 Loop Unroll 1823 279 14 Reg amp Inst 2072 317 137 0130 879 Software Support for Multicore Architectures In the first logM stages for Mpoint work units all points in the same work unit are consecutive The ith stage of a complete FFT computation 2 1 distinct twiddle factors are needed Thus apply 16point work unit for the first 4 stages reaching 1694Gflops Half twiddle factors used in a later stage are the same as those twiddle factors in the previous stage Thus reduce the computation for the indices of twiddle factors and memory opelations CISC 879 Software Support for Multicore Architectures Optimizations GFLOPS Speedup Over Incremental Base Version Speedup Base 654 100 0 Optimal WU 1317 202 1015 Special App 1692 259 284 kEli MEM Ops 1797 275 62 Loop Unroll 1823 279 14 Reg amp Inst 2072 317 137 0180 879 Software Support for Multicore Architectures Focus on bitreversal permutation part 57 of total execution time C64 ISA bit gather instruction used to do fast indices computation Unroll kernel loop 4 times 0 hide the memory latency 25 improvement for permutation part 14 improvement on the overall performance Further apply manual renaming and rescheduling achieve 137 improvement 2072 Gflops CISC 879 Software Support for Multicore Architectures Optimizations GFLOPS Speedup Over Incremental Base Version Speedup Base 654 100 0 Optimal WU 1317 202 1015 Special App 1692 259 284 Eli MEM Ops 1797 275 62 Loop Unroll 1823 279 14 k Reg amp Inst 2072 317 1377 CISTC 879 Software Support for Multicore Architectures Entire process is tedious and errorprone Smart compiler identify the segments where variables reside apply corresponding latencies when scheduling the instructions 1984Gflops using tailored compiler on loop unrolled code CISiC 879 Software Support for Multicore Architectures Perform 1D FFT alternativew on each dimension of the data interleaved with data transpose steps One rowcolumn FFT as a work unit Every rowcolumn are independent to each other work units are distributed to threads in the roundrobin way 1511Gflops achieved Processor 1 2 3 4 5 Some threads remain ide 19 180 rows 160 threads CleC 879 Software Support for Multicore Architectures Processor 1 2 3 4 5 Base parallel implementation straightforward but not necessarily efficient Not fine enough grained using smaller work unit instead Small task 8point work unit 8 inputltgt 8 output it needs more barriers to synchronize threads working on the same rowcolumn FT CleC 879 Software Support for Multicore Architectures Exploit the nature of 2D FFT exact the same operations and twiddle factors are applied on each rowcolumn FFT This character favours data reuse which can reduce indices computation and memory operations Majorreversal work distribution scheme to exploit this opportunity 1937Gflops achieved CIS C 879 Software Support for Multicore Architectures 1D FFT 2M6 points and 2D FFT 256256 25 20 G ops E n 0 0 20 40 60 80 100 120 Number of threads CleiC S879 Software Support for Multicore Architectures Conclusion Consider both the architecture features and application characteristics A set of optimization techniques are proposed Essentiality reduce memory operation Challenges to multicore system software smart compiler Achieve 20Gflops on both 1D and 2D FFT which is about 4 times of Intel Xeon Pentium processor about 5Gflops Future work Fast scratchpad memory on thread unit may be used as larger register file Larger point work unit may be exploited Larger FFT problem sizerhen data cannot be fully stored CIS C 879 Software Support for Multicore Architectures 1 Questions Thanks for 30m time 3879 Software Support for Multicore Architectures POSH A TLS Compiler that Exploits Program Structure Dimitrij Krepis Dept of Computer amp Information Sciences University of Delaware 879 Software Support forMuIticore Architectures J Introduction to Thread Level Speculation POSH TLS Compiler Implementation Experimental Platform Results CISC 879 Software Support for Multicore Architectures Break up computation into threads Assign threads to processors Thread2 Thread1 Program Thread3 Sourcewwyvcscmueduafscscmueduacademiccl 15740f01pubic docdISCUSSIonsmuItIprocessorstlsppt CISC 879 Software Support for Multicore Architectures J No data Dependency Nith Data Dependency Original Original i1 i1 i2 j2 a3 a3 bc bj M Thread i Thread i1 i1 i1 a3 j2 j2 bc Thread i1 a3 b Source wwwcscmueduafscscmueduacademicc 15740f01pubi docdISCUSSIonsmuItIprocessorstlsppt 39 39 39 CISC 879 Software Support for Multicore Architectures j I Compiler can not fully exploit potential parallelism while continuecondition hashindex1 hashindex2y Thread i hash3 hash10 Thread i1 Source wwwcscmueduafscscmueduacademiccl docdiscussionsmultiprocessorstlsppt hash19 hash21 115740f01public CISC 879 Software Support for Multicore Architectures What is ThreadLevel Speculation Enables the compiler to create parallel threads despite the existence of ambiguous data dependence Run Time Compile Time Commit Parallelize without N0 Modification Detect detection of Violation dependency Yes Squash And Reexecution Sourcewwyvcscmueduafscscmueduacademiccl 15740f01public docdiscu55ionsmultiprocessorstlsppt CISC 879 Software Support for Multicore Architectures eve Pecu a quot load Task 1 l time Task 1 Spawn mTask 2 Task 1 LD A Task 2 spawn quotgt S xx x Task 2 O g squashl I x 3 l N C LD A 1 I I I I Task 2 L initial execution re execution a Sequential Execution b Parallelism c Prefetching 879 Software Support forMuIticore Architectures J hrLevel Speculation Speculative Parallel Execution Dynamically Detect Dependency in RunTime Time Processor 0 Processor 1 Processor 2 Thread 1 Thread 2 Thread 3 haSh3 hash19 4 hash10 Hash 10 quot 1 Hash21 Hash21 Source wwwcscmueduafscscmueduacademicc 15740f01pubic docdISCUSSIonsmuItIprocessorstlsppt CISC 879 Software Support for Multicore Architectures I eLevel Speculation Source wwwcscmueduafscscmueduacac docdISCUSSIonsmuItIprocessorstlsppt I Squash and Reexecute if conflict detected Time Processor 0 Processor 1 Processor 2 Thread 1 M hash3 h h19 M u I as I 4ha8h10 H h 10 7 as 1 Hash21 Hash21 I Redo M M Thread 3 hash30 h h5o Hash40 Hash60 39pubiC Hash21 CISC 879 Software Support for Multicore Architectures