### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# 605 Note for ECE 49500 with Professor Eigenmann at Purdue

### View Full Document

## 23

## 0

## Popular in Course

## Popular in Department

This 49 page Class Notes was uploaded by an elite notetaker on Friday February 6, 2015. The Class Notes belongs to a course at Purdue University taught by a professor in Fall. Since its upload, it has received 23 views.

## Similar to Course at Purdue

## Reviews for 605 Note for ECE 49500 with Professor Eigenmann at Purdue

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 02/06/15

Section 7 Loop Parallelization Techniques DataDependence Analysis DependenceRemoving Techniques Parallelizing Transformations Performanceenchancing Techniques Loop Parallelization ECE 4958 Fall 2008 Some motivating examples Doi1 n ls itlegal to ai bi S1 Run the i loop in parallel ci ai1 S2 Put 82 first in the loop End do Is it legal to DO I 1 n Fuse the two i loops ai bi End do In general it is desirable to determine if two references access the same memory DO I 1 n location and the order they execute so that we can determine if the references 00 304 might execute in a different order after End do some transformation Dependence an example Do i 1 n 30 W 31 Indicates dependences ie CO 8104 2 the statement at the head of the arc is somehow End do dependent on the statement at the tail i1 i2 i3 i4 i5 i6 blt1gt blt2gt blt3gt blt4gt blt5gt b6 a1 a2 a3 a4 a5 a6 aoa1a2gt a3a4ia5gt cm C2gt co C4gt C5gt C65 Loop Parallelization ECE 4958 Fall 2008 Can this loop be run in parallel Assume 1 iteration per processor then if for Do i 1 n some reason some iterations execute ai bi S1 out of lockstep bad things can happen Ci ai1 S2 In this case read of a2 in i3 will get an End do invalid value i 3 b3 i1 i2 a3 i4 i5 i6 W W a2b4 b5 b6 a1 a2 03 a4 a5 a6 aoa1 a3a4 a5gt 01 02 04 05 06 V time Loop Parallelization ECE 4958 Fall 2008 Can we change the order of the statements Doi1n Doi1n ai bi S1 ci ai1 S2 ci ai1 S2 ai bi S1 End do End do No problem with a serial execution Access order before statement reordering bm a1aoc1 b2a2a1 C2 H W as a2 C3 b4 a4 as C4 i1 i2 i3 i4 Access order after statement reordering aoc1 b1a1 N am e 2 b2 a2 a2 C3 b3 a3 II a3 C4 b4 a4 i i2 i3 i4 5 Loop Parallelization ECE 4958 Fall 2008 Types of dependence a2 Flow or true dependence data for a read comes from C a2 a previous write writeread hazard in hardware terms a2 Antidependence write to a location cannot occur C before a previous read is finished a2 a2 Output dependence write a location must wait for C a previous write to finish a2 Dependences always go from earlier in a program execution to later in the execu on Anti and output dependences can be eliminated by using more storage6 Loop Parallelization ECE 4958 Fall 2008 Eliminating antidependence a2 Antidependence write to a location cannot occur C before a previous read is finished a2 Let the Program in bei The new program is a2 Create additional storage a2 a2 to eliminate the anti dependence quot a a2 2 a2 aa2 No more antidependence Loop Parallelization ECE 4958 Fall 2008 Getting rid of output dependences a2 Output dependence write a location must wait for C a previous write to finish a2 Let the program be The new program is Again by creating new 32 storage we can a2 32 eliminate the output 32 a2 dependence aa2 a2 aa2 Loop Parallelization ECE 4958 Fall 2008 Eliminating dependences In theory can always get rid of anti and output dependences Only flow dependences are inherent ie must exist thus the name true dependence In practice it can be complicated to figure out how to create the new storage Storage is not free cost of creating new variables may be greater than the benefit of eliminating the dependence Loop Parallelization ECE 4958 Fall 2008 An example of when it is messy to Doi1n a3i1 a2i End do Loop Parallelization create new storage A3i writes locations 2 5 8 11 14 17 20 23 A2i writes locations 2 4 6 8 10 12 14 16 18 20 22 Ai reads from outside the of loop when i 1 3 7 9 13 15 19 21 Ai reads from a3i1 when l 5 11 17 23 Ai reads from a2i when l 2 4 6 8 1012 14 1618 20 22 1O ECE 4958 Fall 2008 Can we fuse the loop Do i 1 n Do i 1 n In original execution of ai W 31 ai 30 31 the unfused loops End do Ci ai1 2 1 Ai1 gets value 0 i End do assigned in ai C0 304 32 2 Can t overwrite End do value assigned to aI or c 3 Bi value comes from outside the loop 1 Is ok after fusing because get ai1 from the value assigned in the previous iteration 2 No output dependence on ai or ci not overwritten 3 No input flow or true dependence on a bi so value comes from outside of the loop nest 11 Loop Parallelization ECE 4958 Fall 2008 Data Dependence Tests Other Motivating Examples Statement Reordering Loop Parallelization can these two statements be Can the iterations of this swapped loop be run concurrently DO i11002 DO i11002 B2i B2i B3i B2i B3i ENDDO ENDDO An array data dependence exists between two data references iff both references access the same storage location at least one of them is a write access Loop Parallelization ECE 4958 Fall 2008 Dependence sources and sinks The Sink of a for i1 i lt n i dependence is the K statement at the ai1 head of the dependence arrow am The source is the am am statement at the tail a1 of the dependence am am arrow a4 a3 13 Loop Parallelization ECE 4958 Fall 2008 Data Dependence Tests Concepts Terms for data dependences between statements of loop iterations Distance vector indicates how many iterations apart are source and sink of dependence Direction vector is basically the sign of the distance There are different notations ltgt or 101 meaning dependence from earlier to later within the same from later to earlier iteration Loopcarried or crossiteration dependence and nonloopcarried or loopindependent dependence indicates whether or not a dependence exists within one iteration or across iterations For detecting parallel loops only crossiteration dependences matter equal dependences are relevant for optimizations such as statement reordering and loop distribution Iteration space graphs the unabstracted form of a dependence graph with one node per statement instance 14 Loop Parallelization ECE 4958 Fall 2008 Data Dependence Tests Iteration space graphs Iteration space graphs the unabstracted form of a dependence graph with one node per statement instance d0 i1 1 n d0 i2 1 n 3015 alti12i23 I2 4 end do end do 392 3 This is an iteration 9 I 2 Space graph 2 or diagram Loop Parallelization 150E 49582Fall 2008 15 Data Dependence Tests Distance Vectors Distance vector indicates how many iterations apart are the source and sink of dependence i25 o o o o o o do i1 1 n d i21 24 o o o o o end do end do i23 391 439quot3 1 i 2 o o o o o 23 2 i21 o o o o o 16 Loop Parallelization 1 150E 49582Fall 2008 Data Dependence Tests Direction Vectors Direction vector is basically the sign of the distance There are different notations ltgt or 101 meaning dependence from earlier to later within the same from later to earlier iteration i2 5 do i1 1 n do391 24 o o o o o a391392 3l13923923 end do end 0390 i2 3 o o o o o o lzll is l l3l l ll2d2l23l 22 o Direction ltgt or sign2sign3 11 in some works 392 1 17 I1 4 5 6 Loop Parallelization 150E 49582Fall 2008 Data Dependence Tests Loop Carried Loopcarried or crossiteration dependence and nonloopcarried or loopindependent dependence indicates whether or not a dependence exists with n one itgration o5 acrosg iteratiogs l25 doi11n gt gt doi21n a391wi2 392 4 gt gt o o Dependence on o o o o NUUU K a UUU w the i2 loop is 392 2 loop carried Dependence on I 1 2 the I1 loop IS 8 L93 Parallelization 391 150E 49582Fall 2008 5 Data Dependence Tests Loop Carried Loopcarried or crossiteration dependence and nonloopcarried or loopindependent dependence indicates whether or not a dependence exists within one iteration or across iterations do i1 1 n dopar i2 1 n I 2 5 8l1l2 24 end I 2 3 9 o o 9 dopari 21 n 22 o o ai1i 21 21 g g g g g end do 392 39 d do i2 4 G 9 en i2 o o o o o This is legal since loop 9 9 splitting enforces 392 39 G o the loop carried i1 1 2 3 4 5 dependences Loop Parallelization ECE 4958 Fall 2008 A quick aside Lower bound Stride A 399 p do i o n131 1 do I 4 n 3 Can be always be a3 I4 ai normalized to end do end do Upper bound the loop This makes discussing the datadependence problem easier since we only worry about loops from 0 n 1 More precisely do i lower upper stride ai becomes do i 0 upper lower increstride 1 1 ai stride lower 20 Loop Parallelization ECE 4958 Fall 2008 Data Dependence Tests Formulation of the Datadependence problem DO the question to answer a4 I 39 39 39 can 4i ever be equal to 2i1 within iE1n ENb39DO a2 H1 If so what is the relation of the i s when they are equal In general given two subscript functions fI and go and loop bounds lower upper Does fi1 gi2 have an integer solution such that lower 5 i1 i2 5 upper 21 Loop Parallelization ECE 4958 Fall 2008 Diophantine equations An equation whose coefficients and solutions are all integers is a Diophantine equa on Determining if a Diophantine equation has a solution requires a slight detour into elementary number theory Let fi a1i1 c1 and gi b1i1 c2 then gt 39 b1i 1 C2 39 C1 t fits general form of Diophantine equation of a1i1 a2i2 c 22 Loop Parallelization ECE 4958 Fall 2008 Does fi gi have a solution The Diophantine equation a1i1 32i2 c has a solution iffgcda1b1 evenly divides c Examples 15i 6j 9k 12 has a solution gcd3 2i 739 3 has a solution gcd1 9i 3 6k 5 has no solution gcd3 Loop Parallelization ECE 4958 Fall 2008 Finding GCDS a16b6 a16mod6 b4a6 a6mod4 b2a4 a4mod2 a2bO Loop Parallelization ECE 4958 Fall 2008 24 Determining if a Diophantine equation has a solution Let g gcda1az can rewrite the equation as ga 1i1 quot39 ga 2i2 C 9 ga 1i1a 2i2 C Because a 1 and a 2 are relatively prime all integers can be expressed as a linear combination of a 1 and a 2 a 1i1a 2i2 is just such a linear combination and therefore a 1i1 a 2i2 generates all integers i Ed assuming a 1a 2 can range over the integers If remaindercg 0 c is a solution since c gc and ga 1i1 a 2i2 generates all multiples of g If remaindercg 0 c cannot be a solution since all values generated by ga 1i1 a 2i2 are trivially divisible by g and cannot equal any c that is not divisible by g 25 Loop Parallelization ECE 4958 Fall 2008 More information on gcd s and dependence analysis General books on number theory Books by Utpal Banerjee Kluwer Academic Publishers Illinois now Intel who developed the GCD test in late 70 s Mike Wolfe Illinois now Portland Group High Performance Compilers for Parallel Computing Randy Allen s thesis Rice University Work by Eigenman amp Blume Purdue range test Work by Pugh Omega test Maryland Work by Hoeflinger etc Illinois LMAD 26 Loop Parallelization ECE 4958 Fall 2008 Other DD Tests The GCD test is simple but not very useful Most subscript coefficients are 1 gcd1i 1 Other tests Banerjee test accurate stateoftheart test takes direction and loop bounds into account Omega test precise test most accurate for linear subscripts See Bill Pugh publications for details Worst case complexity is bad Range test handles nonlinear and symbolic subscripts Blume and Eigenmann many variants of these tests Compilers tend to perform simple to complex tests in an attempt to disprove dependence 27 Loop Parallelization ECE 4958 Fall 2008 What do dependence tests do Some tests and Banerjee s in some situations affine subscripts rectangular loops are precise Definitively proves existence or lack of a dependence Most of the time tests are conservative Always indicate a dependence if one may exist May indicate a dependence if it does not exist In the case of may dependence runtime test or speculation can prove or disprove the existence of a dependence Short answer tests disprove dependences for some dependences 28 Loop Parallelization ECE 4958 Fall 2008 Banerjee s Inequalities If ai1 bi 1 c has a solution does it have a solution within the loop bounds and for a given direction vector By the mean value theorem c can be a solution to the equation fi c i Ebub ifi doi1100 flbltc Xi fub gt 0 Xi1 assumes fi is monotonically increasing over the end do range bub NOte there IS a lt The idea behind Banerjee s Inequalities is to find dependence the maximum and minimum values the Let s teSt for and dependence equation can take on for a given lt dependence direction vector and see if these bound 0 This is done in the real domain since integer solution requires integer programming in NP 29 Loop Parallelization ECE 4958 Fall 2008 Example of where the direction vector makes a difference do i 1 100 xi xi1 end do Note there is a lt dependence Let s test for and lt dependence Loop Parallelization Dependence equation is ii Ifi i then ii 0 V i i Ifi lt i then ii 75 O and when i i1 the equation has a solution 30 ECE 4958 Fall 2008 Banerjee test If ai1 bi 1 c has a solution does it have a solution within the loop bounds for a given direction vector lt or in this case For our problem does i1 i 1 1 have a solution For i1 i 1 then it does not no dependence For i1 lt i 1 then it does lt dependence 31 Loop Parallelization ECE 4958 Fall 2008 Program Transformations Applying data dependence tests to untransformed loops would determine that most loops are not parallel Reason 1 there are many anti and output dependences DO i1n anti t aibl flow dependence t dependence crossiter ENDDO loopIndeP 0 Solution scalar and array privatization 32 Loop Parallelization ECE 4958 Fall 2008 Scalar ExpansionPrivatization privatization PARALLEL DO i 1n DO i 1n Privatet t ai bi t ai bi Ci t Ci t ENDDO ENDDO expansion Private creates one PARALLEL DO i 1n copy per parallel t1i ai bi loop iteration Ci t1i ENDDO Loop Parallelization ECE 4958 Fall 2008 Analysis and Transformation for Scalar ExpansionPrivatization Loop Analysis Transformation variable 2 find variables that are Privatizationi I used in the loop body but put ton private list Mark as dead on entry ie the lastvalue if necessary variables are written on EXPaHSiOHi all paths through the declare an array t0n with loop before being used nOOPiteFati0nS determine if the variables replace all occurrences oft in are live out of the loop the loop body with t0i where i make sure the variable 395 the loop Yar39ab39e39 is de ned in the last loop liveout variables create the assignment tt0n after the Iteration loop 34 Loop Parallelization ECE 4958 Fall 2008 Parallelization of Reduction Operations PARALLEL DO i 1n DO I 1n ATOMIC sum sum al sum sum ai ENDDO ENDDO PARALLEL DO i 1n Private 3 O ssam POSTAMBLE Lock sum sum 5 Unlock ENDDO Loop Parallelization DIMENSION sprocessors D0 1processors Si 0 ENDDO PARALLEL DO i 1nprocessors smyproc smyproc ai ENDDO D0 1processors sum sum sj ENDDO ECE 4958 Fall 2008 35 Analysis and Transformation for Sum Reduction Operations Loop Analysis Transformation find reduction statements as shown on previous slide 0f the form 3 S eXPr create private or expanded Where eXPr does 0quot use variable and replace all 8 occurrences of reduction discard s as a reduction variable in loop body Variable is used in non update original variable reduction statements with sum over all partial sums using a critical section in a loop postamble or a summation after the loop respectively 36 Loop Parallelization ECE 4958 Fall 2008 Induction Variable Substitution ind indO ind indO DOi1n PARALLEL DOj1n and b0 aind0kj1 W ind indk ENDDO ENDDO Gives k39 k indo This is of the form aj c which is good for dependence analysis j is the loop canonical induction variable 37 Loop Parallelization ECE 4958 Fall 2008 Induction Variable Analysis and Transformation Loop Analyis find induction statements of the s s expr wher form e expr is a loopinvariant term or another induction variable discard variables that are modified in no n induction statements Loop Parallelization ECE 4958 Transformation find the following increments outer loop loop IV use 4 4 V start of outer loop body 3 start of inner loop start of inner loop body end of inner loop body end of inner loop for each use of IV compute the increment inc with respect to the start of the outermost loop in which it an induction sequence in Replace IV by incind0 38 Fall 2008 Induction Variable Analysis and Transformation Transformation find the following increments outer loop Ttart of outer loop body start of inner loop l loop 4 k start of inner loop body IV use 1 end of inner loop body 4 V for each use of IV compute the increment inc with respect to the start of the outermost loop in which it is an induction sequence in Replace IV by incind0 end of inner loop Loop Parallelization ECE 4958 ind indO PARALLEL DOj1n aind0kj1 bj ENDDO Thus in the above indO is obvious inc is kj1 oinc is an induction sequence within the loop DO I The inner loop body is empty 39 Fall 2008 Loop Fusion and Distribution poi 1n 30 W fusion ENDDO gt DO J 1n 8J b0 DO k1n CO 8039 ck ak distribution ENDDO ENDDO necessary form for vectorization less parallel loop startup overhead can provide synchronization can increase af nity better locality of necessary for fonNard dependences reference can create perfectly nested loops 4O Loop Parallelization ECE 4958 Fall 2008 Loop Fusion and Distribution 3j bj fusion ENDDO DOJ 1n 8039 bJ39 DO k1n 0039 81 ck ak distribution ENDDO Dependence analysis needed Determine usesdef and defuse chains across unfused loops Every def gtuse link should have a flow dependence in the fused loop Every use gtdef link should have an antidependence in the fused loop No dependence not associated with a use gtdef or def gtuse should be present in the fused loop 41 Loop Parallelization ECE 4958 Fall 2008 Loop Fusion a nd Distribution D0 1n 81 301 bj fUSllon gt 1 n 81 301 bj Do k1n distribution 82 00 aj s2 Ck ak ENDDO ENDDO o o 9 9 J k 1 2 3 4 J 1 2 3 4 Flow dependence from S1 to 82 Loop Parallelization Antidependence from 82 to 82 42 ECE 4958 Fall 2008 Dependence graphs D0 in 1 a 391 b 39 ESDD J J fusuon gt DOj1 n S1 aj1 bj DO k1n distribution 82 00 aj s2 Ck ak ENDDO ENDDO 8 6altltgt 9 9 Loop Parallelization ECE 4958 Fall 2008 Loop Interchange PDOi1n PDOj1m DOj1m DOi1n aii biJ aii biJ ENDDO ENDDO ENDDO ENDDO loop interchanging alters the data reference order gt significantly affects localityof reference gt data dependences determine the legality of the transformation dependence structure should stay the same loop interchanging may also impact the granularity of the parallel computation inner loop may become parallel instead of outer Loop Parallelization ECE 4958 Fall 2008 44 Loop Blocking aka Stripmining PDOj1nblock Pgat kf Doroum4 ENDDO 30H0b0H0 ENDDO ENDDO block EmEmmm mn 1 n Many variants adjustment if n is not a multiple of block number of blocks number of processors or vector register length gg glig zor blockcyclic spin Loo anon Stripmining a parallel loop is always legal 45 958FaH2008 General Iteration Space Partitioning Given an iteration dependence graph the graph is partitioned such that no data dependences cross partitions locality is maximized subsumes the techniques tiling stripmining interchange skewing wavefront Loop Parallelization ECE 4958 Fall 2008 Criteria for Parallelization Vectorization no backward dependence If we allow statement reordering no dependence cycles Parallelization no loopcarried dependence Note loops inside a dependencecarrying loop are dependence free wrt a given reference pair Loop Parallelization ECE 4958 Fall 2008 47 Parallel ExecutionScheme Most widely used Microtasking scheme Main Helper task tasks lt Main task creates helpers lt Wake up helpers Pare e o p lt Barrier helpers go back to sleep lt Wake up helpers Pare e o p Barrier helpers go back to sleep 7 V V V V 48 Loop Parallelization ECE 4958 Fall 2008 Program Translation for Microtasking Scheme Loop Parallelization Subroutine x call scheduler1naboopsub END Subroutine loopsubbubab integer bub DO jjbub aJJbJJ ENDDO END ECE 4958 Fall 2008 49

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "When you're taking detailed notes and trying to help everyone else out in the class, it really helps you learn and understand the material...plus I made $280 on my first study guide!"

#### "I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

#### "Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.