# Introduction to Parallel Processing CS 442

UNM

This 43-page set of class notes was uploaded by Trent Dare on Wednesday, September 23, 2015, and has received 8 views since its upload. The notes belong to CS 442 at the University of New Mexico, taught by staff in Fall. For similar materials see /class/212211/cs-442-university-of-new-mexico in Computer Science at the University of New Mexico.

## Analytical Modeling of Parallel Programs

CS 442 / EECE 432, Brian T. Smith, UNM CS Dept., Spring 2003 (lecture dated 4/10/2003)

### Introduction

Problem: how can we model the behavior of a parallel program in order to predict its execution time, using the size of the problem, the number of nodes/processors, and the communication network parameters t_s and t_w?

- Clearly we must consider the total parallel system, that is, the algorithm *and* the architecture on which it runs.

Issues:

- A serial program is measured by total execution time, which typically depends mainly on the size of its input.
- A parallel program raises several new issues:
  - The execution time of the program.
  - The speedup relative to the algorithm running serially.
    - However, is it speedup compared with the best serial algorithm, or speedup relative to the serialization of the parallel algorithm used?

### Outline of This Topic

- Sources of overhead in a parallel program
- Performance metrics for parallel systems
- The effect of granularity on performance
- Scalability of parallel systems
- Minimum execution time and minimum cost-optimal execution time
- Asymptotic analysis of parallel programs
- Other scalability metrics

### Typical Parallel Execution Profile

[Figure: execution profile for processes P0 through P7, with each process's time divided into essential/excess computation, interprocessor communication, and idling.]

### Sources of Overhead

A profile illustrates the kinds of activities in a program execution. Some of these are overheads, i.e., time spent not directly computing what is needed, such as:

- interprocess interaction,
- idling,
- excess computation.

These are all activities the serial program does not perform. Thus the time spent in these activities compared to the serial execution, and the way this wasted effort grows with the size of the problem and the number of processors used, provides a measure of the effectiveness of the parallelism. An efficient parallel program attempts to reduce these overheads to zero, but of course this is not always possible.
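The profile-based view of overhead can be sketched numerically. The snippet below uses made-up per-process activity breakdowns (all numbers are hypothetical, chosen only to illustrate the bookkeeping) and computes the total time the processors spend beyond the essential work a serial run would do:

```python
# Hypothetical per-process activity profile, in seconds, mirroring the
# execution-profile categories above. All numbers are invented for
# illustration; only the accounting is the point.
profiles = [
    # (essential_compute, communication, idling, excess_compute)
    (8.0, 1.0, 0.5, 0.5),
    (7.0, 1.5, 1.0, 0.5),
    (9.0, 0.5, 0.0, 0.5),
    (6.0, 1.0, 2.0, 1.0),
]

p = len(profiles)
T_P = max(sum(row) for row in profiles)   # parallel time: last process to finish
T_S = sum(row[0] for row in profiles)     # a serial run does only the essential work
overhead = p * T_P - T_S                  # total time not spent on essential work

print(f"p={p}  T_P={T_P}  T_S={T_S}  total overhead={overhead}")
```

Here the overhead lumps together exactly the three categories named above: communication, idling, and excess computation.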
### Interprocess Interaction and Idling

Interprocess interaction:

- usually the most significant overhead;
- sometimes reduced by performing redundant computation.

Idling is caused by:

- load imbalance,
- synchronization (waiting for collaborating processes to reach the synchronization point),
- serial computation that cannot be avoided.

### Excess Computation

The fastest known serial algorithm may not be easy to parallelize, especially for a large number of processes.

- A different serial algorithm may be necessary, one which is not optimal when implemented serially.
  - Such programs may perform more work.
  - A good example is the fast Fourier transform: it may be faster for all processes to compute common intermediate results, such as twiddle factors, than to compute them once and broadcast them to all processes that need them.
- What we are thus faced with is how to measure the performance of a parallel program, to tell whether the parallel program is worth using. Aspects are:
  - How does performance compare with a serial implementation?
  - How does performance scale with adding more processes?
  - How does performance scale with increasing problem size?

### Performance Metrics

- Execution time:
  - For a serial program: wall-clock time from start to completion.
  - For a parallel program: wall-clock time from the time parallel processing starts to the time the last process completes.
- Total parallel overhead: the difference between the sum of the times the processes are reserved and the total computational effort of the serial program.
- Speedup: the ratio of the execution time of the best serial program to the parallel execution time.
- Efficiency: the ratio of the speedup to the number of processors used.
- Parallel cost: the parallel execution time times the number of processes used.

### Notation

- T_P: parallel runtime.
- T_S: serial runtime, run on the same processor used by the parallel program.
- T_O: total parallel overhead, given by the overhead function T_O = p T_P − T_S.

### Speedup S

How well is the parallel program performing? Answer: the speedup S = T_S / T_P.

- Expect the speedup S to be near p, the number of processors used.
  - Speedup equal to p is called linear speedup.
  - Most often S is less than p; in practice, 80% of p is generally very good.
- Speedup larger than p is called superlinear speedup.

### An Example: Summing n Numbers in Parallel with n Processors

Using input decomposition, place each number on a different processor. The sum is performed in log n phases. For the purpose of illustration, assume n is a power of 2 and the processors are arranged in a linear array numbered from 0.

- Phase 1: odd-numbered processors send their input item (addend) to the processor one position to the left, and that even-numbered processor adds the two addends.
- Phase 2: every second processor sends its intermediate sum to the processor two positions to the left, and this processor adds the partial sum it has to the received partial sum.
- Continuing for log n phases, process 0 ends up with the sum.

### A Picture of the Parallel Sum Process

[Figure: (a) initial data distribution and the first communication phase; (b) second communication step; (c) third communication step; (d) fourth communication step; (e) accumulation of the sum at processor 0 after the final communication.]

### Example Continued

Let's model the execution times.

- The serial time is T_S = t_c (n − 1) ≈ t_c n, where t_c is the time to add two numbers.
- The parallel time consists of computation time plus communication time: T_P = t_c log n + (t_s + t_w) log n.
- The speedup is S = T_S / T_P = t_c n / ((t_c + t_s + t_w) log n) = Θ(n / log n).
- The overhead (with p = n) is T_O = p T_P − T_S = n (t_c + t_s + t_w) log n − t_c n = Θ(n log n).
- The overhead is large. Why? This parallel algorithm does considerably more total work than the serial algorithm.

### Misleading Speedups: Faster Than One Deserves

- Consider a parallel implementation of bubble sort, called the odd-even sort, of 10^5 elements on 4 processors, that completes in 40 seconds.
- The serial version of bubble sort takes 150 seconds.
- But serial quicksort on the same set of elements takes 30 seconds.
  - The misleading speedup is T_S / T_P = 150/40 = 3.75.
  - The correct speedup is T_S / T_P = 30/40 = 0.75.
  - The required comparison is between quicksort, the fastest serial algorithm, and the particular parallel sort algorithm.

But parallel code can also legitimately run faster than expected, for example:

- because of cache effects,
- because of exploratory decomposition.

### Misleading Speedups Continued: Cache Effects

Consider a problem of size W words, large, but small enough to fit in the memory of a single processor.

- Suppose on a single processor an 80% cache hit ratio is observed for this problem.
  - Suppose cache latency to the CPU is 2 ns and the DRAM latency is 100 ns.
  - The average access time per data item is then 0.8 × 2 + 0.2 × 100 = 21.6 ns.
  - Assume the program performs 1 floating-point operation per memory access.
  - Then the performance is 1000/21.6 Mflops, or 46.3 MFLOPS.

### Cache Effects Continued

Consider solving the same problem on two processors by data decomposition of all data, so that two subproblems of size W/2 are solved.

- The amount of data per processor is less, so we might expect the cache hit ratio to be higher, say 90%.
- Of the remaining 10% of the accesses, assume 8% are to each processor's own memory and 2% are to the other processor's memory.
- Suppose the latencies are 2 ns, 100 ns, and 400 ns respectively (400 ns for access to the DRAM of the other processor's memory).
- The average access time per data item is then 0.9 × 2 + 0.08 × 100 + 0.02 × 400 = 17.8 ns per processor.
- Assume, as before, 1 floating-point operation per memory access.
- Then the performance is 1000/17.8 Mflops, or 56.2 MFLOPS per processor, for a total rate of 112.4 MFLOPS with the 2 processors.
- The speedup is then 112.4/46.3, or 2.43: faster than we "deserve."

### Faster Solutions via Exploratory Decomposition

Suppose the search tree looks like the following graph, with the solution at the rightmost node, and processing element 0 begins searching at the root.

- Serial search is depth-first, left-first.
- Parallel search uses 2 processors, both running depth-first, left-first algorithms.
- The serial algorithm finds the solution in 14 steps.
- The parallel algorithm takes 5 steps.
- The speedup is 14/5 = 2.8 > 2: a superlinear speedup results.

### Efficiency

Efficiency is defined as E = S/p, where S is the speedup and p is the number of processors.

- It in essence measures the average utilization of all processors.
- Ideal efficiency is 1; superlinear speedup gives an efficiency greater than 1.
- Very good efficiencies in practice are 0.80 or larger.
- Example: for the parallel summation of n numbers with n processors, E = S/p = (n / log n)/n = 1/log n.
  - Notice the efficiency decreases with the size of the problem.
  - That is, there is no point in using a large number of processors for computations with speedups like this.

### A More Satisfactory Example

Edge detection on an n × n pixel image.

- Applies a 3 × 3 multiplicative template with summation (a convolution) to each pixel.
- The serial computation time is T_S = 9 t_c n², where t_c is the average time for a multiply-add arithmetic operation.
- The parallel algorithm using p processors divides the image into p column slices, with n/p columns per slice.
- The computation for the pixels on each processor is local except for the left and right edges, which require the pixel values from the edge column of neighboring processes.
  - Thus the parallel time is T_P = 9 t_c n²/p + 2 (t_s + t_w n).
- The speedup and efficiency are:
  - S = T_S / T_P = 9 t_c n² / (9 t_c n²/p + 2 (t_s + t_w n))
  - E = S/p = 1 / (1 + 2 p (t_s + t_w n) / (9 t_c n²))
- E increases with increasing n; E decreases with increasing p.

### Convolution of a Pixel Image

[Figure: an n × n pixel image partitioned in column slices among processes P0 through P3; two different convolution templates for detecting edges; the data sharing required for process 1 with its neighbors.]
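The edge-detection model can be evaluated directly. The sketch below checks the two qualitative claims numerically (the machine constants t_c, t_s, t_w are illustrative placeholders, not measured values):

```python
# Efficiency model for the n x n edge-detection example:
#   T_S = 9 t_c n^2,   T_P = 9 t_c n^2 / p + 2 (t_s + t_w n)
# The constants below are illustrative guesses, not measurements.
T_C, T_STARTUP, T_W = 1.0, 10.0, 0.1

def efficiency(n, p):
    t_serial = 9 * T_C * n * n
    t_parallel = 9 * T_C * n * n / p + 2 * (T_STARTUP + T_W * n)
    return (t_serial / t_parallel) / p    # E = S / p

print(f"E(n=256, p=16) = {efficiency(256, 16):.4f}")
print(f"E(n=512, p=16) = {efficiency(512, 16):.4f}")
print(f"E(n=512, p=32) = {efficiency(512, 32):.4f}")
```

With these numbers the efficiency indeed rises as n grows at fixed p and falls as p grows at fixed n, as the formula E = 1/(1 + 2p(t_s + t_w n)/(9 t_c n²)) predicts.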
### Parallel Cost and Cost-Optimality

Parallel cost is defined as the number of processors p times the parallel computing time T_P. The term cost-optimal refers to a parallel system with the following property:

- The parallel system has a cost with the same asymptotic growth as the fastest serial algorithm for the same problem; that is, it has an efficiency of Θ(1).
- Example: adding n numbers in parallel costs Θ(n log n).
  - Adding the numbers serially is Θ(n).
  - This parallel addition algorithm is NOT cost-optimal.
- Questions:
  - Can this algorithm, or any algorithm, be made cost-optimal by increasing the granularity, that is, decreasing p and increasing the work of each processor? Sometimes, but in general no; see the examples below.
  - Is there a cost-optimal parallel algorithm for summing n numbers? Yes, but it is different from the algorithms so far; it uses the idea of increased granularity.

### The Importance of the Cost-Optimal Metric

Consider a sorting algorithm that sorts n elements using n processors and is not cost-optimal. What does this mean in practice? Scalability performance is very poor. Let's see why.

- Suppose this algorithm takes (log n)² time to sort the list.
- The best serial time is known to be n log n.
  - The parallel algorithm has a speedup of n/log n and an efficiency of 1/log n.
  - The parallel algorithm is an improvement, but not cost-optimal, because the efficiency is not Θ(1).
- Now consider increasing the granularity by decreasing the number of processors from n to p < n.
  - The new parallel algorithm will take no more than n (log n)²/p time.
  - Its speedup is n log n / (n (log n)²/p) = p/log n, and its efficiency is 1/log n as before.
  - For example, with 32 processors, for n = 1024 and n = 10^6 the speedups are 3.2 and 1.6 respectively: very poor scalability.

### Two Different Parallel Summation Algorithms

The second sum algorithm uses the idea of increasing the granularity of the previous parallel algorithm, also called scaling down (illustrated below for n = 16 and p = 4):

- Instead of using n processors, use p < n and place more of the addends, namely n/p, on each processor.
- Add the corresponding numbers on each processor as with the first algorithm (with communication) until all partial sums are on one processor.
- Add the partial sums in pairs, using the approach of the first parallel algorithm, but on one processor.

The third sum algorithm is a modification of the above:

- Add up all the numbers on each processor first, using the usual serial algorithm, and then apply the first parallel algorithm to the p partial sums on p processors.

[Figures: the second algorithm for n = 16, p = 4, showing the data distribution and communication substeps of each phase through the final result; and the third, cost-optimal algorithm, showing the initial data distribution, the distributions after each step, and the final result.]

### The Effect of Granularity

Increasing the granularity of a cost-optimal algorithm maintains cost-optimality.

- Suppose we have a cost-optimal algorithm using p processors, and we increase its granularity by reducing the number of processors to q and increasing the work per processor.
  - The work per processor increases by a factor p/q.
  - The communication per processor should also grow by no more than a factor p/q, provided the mapping is done carefully.
  - Thus the parallel time increases by a factor of at most p/q.
  - The parallel cost of the new algorithm is then q T_P,new = q (p/q) T_P,old = p T_P,old.
  - Thus changing granularity has not changed the cost of a cost-optimal algorithm.
- However, to produce a cost-optimal parallel code from a non-cost-optimal parallel code, you may have to do more than increase granularity.
  - The two new sum algorithms illustrate this point.
  - The argument above shows only that increasing granularity preserves cost-optimality; it does NOT show that increasing granularity can make a non-cost-optimal algorithm cost-optimal.

### Analysis of the Sum Algorithms

- The serial sum algorithm costs Θ(n).
- The first parallel algorithm costs Θ(n log n).
- The second parallel algorithm: the first phase is n/p rounds, each of log p substeps, taking Θ((n/p) log p); then we add the n/p partial sums on one processor, taking Θ(n/p). The total time is Θ((n/p) log p), and the cost is p × (n/p) log p = Θ(n log p). This is asymptotically higher than the serial algorithm, so it is still not cost-optimal.
- The third parallel algorithm: the first step is n/p additions; the second step consists of log p substeps, each consisting of an addition and a communication. Thus the time is Θ(n/p + log p) and the cost is Θ(n + p log p). As long as p is not too large, that is, n = Ω(p log p), the cost is Θ(n), which means this algorithm is cost-optimal.

### Scalability of Parallel Systems

We typically develop parallel programs from small test cases.

- It is very difficult to predict scalability performance for large problems from small test cases unless you have done the analysis first.
- We now study some tools to help in the prediction process. See the FFT case study below, based on observation of performance in the small and its relation to performance for large problem sizes.

Topics:

- Scaling characteristics of parallel programs
- Isoefficiency metric of scalability
  - Problem size
  - Isoefficiency function
- Cost-optimality and the isoefficiency function
- A lower bound on the isoefficiency function
- Degree of concurrency and the isoefficiency function

### FFT Case Study

Three algorithms for performing FFTs (described in detail in Chapter 13):

- binary exchange,
- 2-D transpose,
- 3-D transpose.

Observations:

- Speedup data are given for 64 processors, with the FFT size n varying from 1 to 18K elements.
- For small n (up to 7000 or so), 3-D transpose and binary exchange are best (it takes a lot of testing to see this).
- For large n, 2-D transpose outperforms the others and continues faster for n > 14000. Can you believe this remains true for even larger n? Not unless you have done the analysis to support the conjectured asymptotic behavior.

### Scaling Characteristics of Parallel Programs

The efficiency is E = S/p = T_S/(p T_P). Using the expression involving the overhead T_O (see Notation above):

E = 1 / (1 + T_O/T_S)

- Unfortunately, the overhead is at least linear in p unless the algorithm is completely parallel.
  - Say the parallel algorithm has a serial (non-parallelizable) time of T_serial.
  - Then all but one of the processors are idle while one processor performs that serial computation.
  - Thus the overhead is at least (p − 1) T_serial.
- Therefore the efficiency is bounded above by E ≤ T_S / (T_S + (p − 1) T_serial).

### Scaling Characteristics Continued

From this expression for E:

- The efficiency E decreases with the number of processors for a given problem size.
- The efficiency E increases with larger problems (T_S increases).

Consider the cost-optimal summation algorithm, assuming unit time for addition and for communication:

- n/p is the time for adding n/p items;
- 2 log p is the time for the additions and communications of the second phase.

Thus T_P = n/p + 2 log p, and

E = S/p = 1 / (1 + 2 p log p / n)

- See the disappointing results for large p below; note how an efficiency level of, say, 80% can be maintained by increasing n for each p.

### Speedup Curves for the Cost-Optimal Addition Algorithm

[Figure: plots of S = n/(n/p + 2 log p) versus p for n = 64, 192, 512, together with the linear-speedup line.]

### Efficiency Tables

[Table: values of E = 1/(1 + 2 p log p / n) for different p and n, with the entries at 80% efficiency highlighted. The function of p representing the work n that keeps the efficiency fixed with increasing p is the isoefficiency function.]

### Scalability

Overhead varies with the serial time, the amount of work, and the number of processors.

- Clearly overhead (typically communication) increases with the number of processors.
- It often increases with the amount of work to be done, usually indicated by the sequential time T_S.
  - However, as the problem size increases, overhead usually increases sublinearly as a percentage of the work.
  - This means the efficiency increases with the problem size even when the number of processors is fixed; for example, look at the columns of the efficiency table.
  - Also, an efficiency level can be maintained by increasing both the number of processors p and the amount of work.
  - A parallel system that is able to maintain a specific efficiency in this manner is called a scalable parallel system.

Scalability is a measure of a system's capacity to increase speedup in proportion to the number of processing elements.

### Scalability and Cost-Optimality

Recall: cost-optimal algorithms have an efficiency of Θ(1).

- Scalable parallel systems can always be made cost-optimal; cost-optimal algorithms are scalable.
- Example: the cost-optimal algorithm for adding n numbers.
  - Its efficiency is E = 1 / (1 + 2 p log p / n).
  - Setting E equal to a constant, say K, means that n and p must vary as n = (2K/(1 − K)) p log p.
  - Thus for any p, the size n can be selected to maintain efficiency K. For example, for K = 0.80 with 32 processors, a problem size of n = 1280 must be used.
- Recall that this efficiency formula was based on assuming that adding 2 numbers and communicating 1 number each take unit time — not a very realistic assumption with current hardware.

### Isoefficiency Metric of Scalability

Two observations:

- Efficiency always decreases with an increase in the number of processors, approaching 0 for a large number of processors.
- Efficiency often increases with the amount of work, or size of the problem, approaching 1.
  - It can sometimes then decrease for very large work sizes (running out of memory, for example).
- A scalable system is one for which the efficiency can be held constant by increasing both the work and the number of processors.

The rate at which, for a fixed efficiency, the work (the problem size) must increase with respect to the number of processors is called the degree of scalability of the parallel system.

- This definition depends on a clear definition of problem size.
- Once problem size is defined, the function determining problem size for a varying number of processors at fixed efficiency is called the isoefficiency function.

### Problem Size

Define problem size W as the number of basic operations (arithmetic operations, data stores and loads, etc.).

- It needs to be a measure such that if the problem size is doubled, the computation time is doubled.
- Measures such as the size (order) of a matrix are misleading:
  - Doubling the order of a matrix causes a computation dominated by matrix-matrix multiplication to increase by a factor of 8.
  - Doubling the order of a matrix causes a computation dominated by matrix-vector multiplication to increase by a factor of 4.
  - Doubling the size of vectors causes a computation dominated by a vector dot product to increase by a factor of 2.
- In the formulas that follow, we assume the basic arithmetic operation takes 1 unit of time.
  - Thus the problem size W is the same as the serial time T_S of the fastest known serial algorithm.

### The Isoefficiency Function: The General Case

The parallel execution time T_P is a function of the problem size W, the overhead function T_O, and the number p of processors:

T_P = (W + T_O(W, p)) / p

Let's fix the efficiency at E and solve for W in terms of p. With S = W/T_P:

S = W p / (W + T_O(W, p))
E = S/p = W / (W + T_O(W, p)) = 1 / (1 + T_O(W, p)/W)

### Development Continued

Let K = E/(1 − E). Now solving for W gives:

1 + T_O(W, p)/W = 1/E
T_O(W, p)/W = (1 − E)/E
W = (E/(1 − E)) T_O(W, p) = K T_O(W, p)

- In this equation K is a constant.
- For each choice of W we can solve for p; this can usually be done algebraically.
- Or, for each choice of p we can solve for W, possibly a nonlinear equation that has to be solved numerically or approximated somehow.
- The resulting function giving W in terms of p is called the isoefficiency function.

### Analysis of the Isoefficiency Function

Suppose this function has the property that:

- small changes in p give small changes in W: then the system is highly scalable;
- small changes in p result in large changes in W: then the system is not very scalable.

Caveats:

- The isoefficiency function may be difficult to find: you may be able to solve the above equation only for specific finite values, and not in general.
- The isoefficiency function may not exist; such systems are not scalable.

### Two Examples

Consider the cost-optimal addition algorithm.

- The overhead function is T_O = 2 p log p.
- Thus the isoefficiency function, from W = K T_O, is W = 2 K p log p.
  - That is, if the number of processors is doubled, the size of the problem must be increased by a factor of 2 (1 + log p)/log p.
- In this example the overhead is a function only of the number of processors, not of the work W. This is unusual.

Suppose instead we had a parallel system with the following overhead function:

T_O = p^(3/2) + p^(3/4) W^(3/4)

The equation to solve for W in terms of p is:

W = K p^(3/2) + K p^(3/4) W^(3/4)

- For fixed K this is a polynomial equation of degree 4 in W^(1/4) with multiple roots.
- Although this case can be solved analytically, in general such equations cannot be solved analytically.
- An approximate solution that provides the relevant asymptotic growth is developed next.
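Both examples can be checked numerically. The sketch below verifies the closed-form isoefficiency of the addition algorithm, and solves the implicit equation of the second example by fixed-point iteration (K = 4, i.e. E = 0.8; the iteration count is chosen generously) to observe the Θ(p³) growth before deriving it analytically:

```python
import math

# K = E/(1-E); K = 4 corresponds to a target efficiency of E = 0.8.
K = 4.0

# Example 1: addition algorithm, T_O = 2 p log2(p), so W = 2 K p log2(p)
# in closed form. Check that this W pins E = 1/(1 + T_O/W) at exactly 0.8.
def eff(W, T_O):
    return 1.0 / (1.0 + T_O / W)

for p in (4, 16, 64):
    T_O = 2 * p * math.log2(p)
    print(f"p={p:2d}: W={2 * K * T_O / 2:10.1f}, E={eff(2 * K * p * math.log2(p), T_O):.2f}")

# Example 2: T_O = p^(3/2) + p^(3/4) W^(3/4). Solve W = K * T_O(W, p) by
# fixed-point iteration; the map is a contraction near the fixed point.
def isoeff_W(p, iters=200):
    W = 1.0
    for _ in range(iters):
        W = K * (p ** 1.5 + p ** 0.75 * W ** 0.75)
    return W

ratio = isoeff_W(128) / isoeff_W(64)
print(f"W(128)/W(64) = {ratio:.2f}   (Theta(p^3) predicts ~8)")
```

Doubling p multiplies the required work by roughly 8, which is the W = Θ(p³) behavior derived analytically below.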
### Second Example Continued: Finding an Approximate Solution

The function W is the sum of two positive terms, each of which increases with increasing p.

- Take each term separately, find the growth rate consistent with it, and take the maximum growth rate as the asymptotic growth rate.
  - For just the first term: W = K p^(3/2), so W = Θ(p^(3/2)).
  - For just the second term: W = K p^(3/4) W^(3/4); solving for W gives W^(1/4) = K p^(3/4), so W = K^4 p^3 = Θ(p^3).
  - The faster growth rate is Θ(p^3).
- Thus for this parallel system the work must grow like p^3 in order to maintain a constant efficiency.
- Thus the isoefficiency function is Θ(p^3).

### Your Project

Perform the corresponding analysis for your implementation and determine the isoefficiency function, if it exists.

- It is the analysis that allows you to test the performance of your parallel code on a few test cases and then predict the performance in the large.
- The test cases have to be selected so that your results do not reflect initial-condition or small-case effects.

### Cost-Optimality and the Isoefficiency Function

Consider a cost-optimal parallel algorithm.

- An algorithm is cost-optimal iff the efficiency is Θ(1).
- That is, E = S/p = W/(p T_P) = Θ(1). But p T_P = W + T_O(W, p), so this requires T_O(W, p) = O(W).
- Thus an algorithm is cost-optimal iff its overhead does not exceed its problem size asymptotically.
- In addition, if there exists an isoefficiency function f(p), then the relation W = Ω(f(p)) must be satisfied in order to ensure the cost-optimality of the parallel system.

### A Lower Bound on the Isoefficiency Function

We desire the smallest isoefficiency function (recall degrees of scalability, above). How small can the isoefficiency function be?

- The smallest possible function is Θ(p). Argument:
  - For W work, the maximum number of usable processors is Θ(W), because processors in excess of that number will be idle (no work to do).
  - For a problem size growing more slowly than Θ(p), if the number of processors grows like Θ(p), then eventually there are more processors than work.
  - Thus such a system will not be scalable.
- Thus the ideal isoefficiency function is Θ(p).
  - This is hard to achieve: even the cost-optimal add algorithm has an isoefficiency function of Θ(p log p).

### The Degree of Concurrency and the Isoefficiency Function

The degree of concurrency C(W) is the maximum number of tasks that can be executed concurrently for a computation with work W. This degree of concurrency must also limit the isoefficiency function. That is:

- For a problem of size W with degree of concurrency C(W), at most C(W) processors can be used effectively.
- Thus the isoefficiency function can be no better than the function implied by p = C(W).

### An Example

Consider solving Ax = b, a linear system of size n, via Gaussian elimination.

- The total computation is Θ(n³) time.
- We eliminate one variable at a time, in a serial fashion, each elimination taking Θ(n²) time.
- Thus at most Θ(n²) processing elements can be used.
- Thus the degree of concurrency is Θ(n²) = Θ(W^(2/3)).
  - From p = O(W^(2/3)) we get W = Ω(p^(3/2)), which is a lower bound on the isoefficiency function.
- Thus this algorithm cannot reach the ideal (optimal) isoefficiency function Θ(p).

### Minimum Execution Time and Minimum Cost-Optimal Execution Time

The parallel processing time T_P often decreases as the number p of processors increases, until either T_P approaches a minimum asymptotically or begins to increase.

- The question now is: what is that minimum, and is it useful to know?
- We can find this minimum by taking the derivative of the parallel-time function with respect to the number of processors, setting this derivative to 0, and solving for the p that satisfies the resulting equation.
  - Let p_0 be the value of p for which the minimum is attained.
  - Let T_P^min be the minimum parallel time.
- Let's do this for the parallel summation system we have been working with.

### An Example

Consider the cost-optimal algorithm for adding n numbers.

- Its parallel time (from above) is T_P = n/p + 2 log p.
- The equation from setting the first derivative to 0 is −n/p² + 2/p = 0.
- The solution is p_0 = n/2.
- The minimum parallel time is T_P^min = 2 log n.
- The processor-time product (work) is p_0 T_P^min = Θ(n log n).
  - This is larger than the serial time, which is Θ(n).
  - Thus at this minimum time the problem is not being solved cost-optimally.

### The Cost-Optimal Minimum Time T_P^cost_opt

Let's characterize and find the minimum time when the computation is performed cost-optimally. Cost-optimality can be related to the isoefficiency function, and vice versa (see above).

- If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally iff W = Ω(f(p)).
  - That is, a cost-optimal solution requires p = O(f⁻¹(W)).
- The parallel runtime is T_P = Θ(W/p), because p T_P = Θ(W).
- Thus a lower bound on the parallel runtime for solving a problem of size W cost-optimally is T_P^cost_opt = Ω(W / f⁻¹(W)).

### The Example Continued

Estimating T_P^cost_opt for the cost-optimal addition algorithm:

- After some algebra we get T_P^cost_opt = 2 log n − log log n.
- Notice that T_P^min and T_P^cost_opt are the same asymptotically: both are Θ(log n).
- This is typical for most systems, but it is not true in general; we can have the situation T_P^cost_opt > T_P^min.

### An Example of T_P^cost_opt > T_P^min

Consider the hypothetical system above with T_O = p^(3/2) + p^(3/4) W^(3/4).

- The parallel runtime is T_P = (W + T_O)/p = W/p + p^(1/2) + W^(3/4)/p^(1/4).
- Taking the derivative to find T_P^min gives p_0 = Θ(W); substituting back gives T_P^min = Θ(W^(1/2)).
- The isoefficiency function is W = Θ(p³), so f⁻¹(W) = Θ(W^(1/3)) and p = O(W^(1/3)).
- Substituting into the lower bound gives T_P^cost_opt = Ω(W / W^(1/3)) = Ω(W^(2/3)).
- Thus T_P^cost_opt > T_P^min. This does happen often.

### Limitation by Degree of Concurrency C(W)

Beware: the study of asymptotic behavior is valuable and interesting, but increasing p without bound is unrealistic.

- For example, p_0 larger than C(W) is meaningless.
- For such cases the true minimum is T_P^min = (W + T_O(W, C(W))) / C(W).
- Needless to say, for problems where W grows without bound, C(W) may also grow without bound, so that considering large p is reasonable.
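The calculus above can be sanity-checked by brute force. The sketch below scans all p for T_P(p) = n/p + 2 log2 p (unit-time adds and transfers, as before); the discrete optimum lands at p = Θ(n) with T_P ≈ 2 log2 n, matching the asymptotic conclusion (the exact continuous minimizer differs from n/2 by a constant factor that depends on the logarithm base):

```python
import math

# T_P for the cost-optimal addition algorithm (unit-time adds/transfers).
def T_P(n, p):
    return n / p + 2 * math.log2(p)

n = 1024
best_p = min(range(1, n + 1), key=lambda p: T_P(n, p))
print(f"n={n}: best p={best_p}, "
      f"T_P={T_P(n, best_p):.3f}, 2*log2(n)={2 * math.log2(n):.3f}")
```

The processor-time product at this minimum, best_p × T_P, is Θ(n log n), larger than the Θ(n) serial time, confirming that running at minimum time is not cost-optimal.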
54 Asymptotic Analysis Of Parallel Programs Table For 4 Parallel Sort Programs of 11 Numbers A1 A2 A3 A4 2 n 1 n Vn Nn n n n Recall the serial best time is n log 11 Question Which is the best 4102003 models 55 Comments On The Table Comparison by speed T P 0 Al is best followed by A3 A4 and A2 But Al is not practical for large n o It requires n2 processors Let s compare Via efficiency E 0 A2 and A4 are best followed by A3 and Al 0 Look at the costs pTP now 7 A2 and A4 are costoptimal where A3 andAl are not Overall then A2 is the best 0 if least number of processors is important Overall then A4 is the best 0 if least parallel time is important 4102003 models 56 Other Scalability Metrics Other metrics to handle less general cases have been developed 0 For example 7 Metrics that deal With problems that must be solved in a speci ed time real time problems 7 Metrics that deal With the fact that memory may be the limiting factor and scaling of the number of processors may be necessary not for increased performance but because of increased memory gtgt That is memory scales linearly With the number of processors p Scaled speedup Serial fraction 4102003 models 57 Scaled Speedup Analyze the speedup increasing the problem size linearly with the number of processors This analysis can be done by constraining either time or memory in the analysis To see this consider the following two examples 7 an parallel algorithm for matrixvector products 7 an parallel algorithm for matrixmatrix products 4102003 models 58 Scaled Speedup For Matrix Vector Products The serial time T S performing matrixvector product for a matrix of size nxn is to n2 0 where I is the time for a multiplyadd operation Suppose the parallel time ITP for a s1mple parallel algorithm Section 821 is n2 TP tc t5 log 127th Then the speedup is t 172 52 71 tC ts logptwn P 4102003 models 59 Scaled Speedup For Matrix Vector Products Continued Consider a memory scaling constraint 0 Require the memory to scale as p 0 But the 
memory requirement for the matrix is Θ(n²):
  - Therefore n² = Θ(p), i.e., n² = c × p.
- Substituting into the speedup formula, we get
    S' = t_c c p / (t_c c + t_s log p + t_w √(c p)) = c₁ p / (c₂ + c₃ log p + c₄ √p)
- Thus, the scaled speedup with a memory constraint is Θ(√p).

Scaled Speedup For Matrix-Vector Products, Continued

Consider a time scaling constraint:
- Require the parallel time to be constant as the number of processors increases.
- But the parallel time is dominated by n²/p.
  - Therefore n²/p = c, or n² = c × p.
- This is the same requirement as in the memory-constrained case, and so the same scaled speedup results. Thus, the scaled speedup with a time constraint is also Θ(√p).

Scaled Speedup For Matrix-Matrix Products

The serial time T_S for performing a matrix-matrix product with matrices of size n×n is
    T_S = t_c n³
where t_c is the time for a multiply-add operation. Suppose the parallel time T_P for a simple parallel algorithm (Section 8.2.1 of the text) is
    T_P = t_c n³/p + t_s log p + 2 t_w n²/√p
Then the speedup is
    S = t_c n³ / (t_c n³/p + t_s log p + 2 t_w n²/√p)

Scaled Speedup For Matrix-Matrix Products, Continued

Consider a memory scaling constraint:
- Require the memory to scale as Θ(p).
- But the memory requirement for the matrix is Θ(n²).
  - Therefore n² = Θ(p), i.e., n² = c × p.
- Substituting into the speedup formula, we get
    S' = t_c (c p)^(3/2) / (t_c c^(3/2) √p + t_s log p + 2 t_w c √p) = Θ(p)
- Thus, the scaled speedup with a memory constraint is Θ(p), that is, linear.

Scaled Speedup For Matrix-Matrix Products, Continued

Consider a time scaling constraint:
- Require the parallel time to be constant as the number of processors increases.
- But the parallel time is dominated by n³/p.
  - Therefore n³/p = c, or n³ = c × p.
- Substituting into the speedup formula, we get
    S' = t_c c p / (t_c c + t_s log p + 2 t_w c^(2/3) p^(1/6)) = Θ(p^(5/6))
- Thus, the scaled speedup with a time constraint is Θ(p^(5/6)), that is, sublinear.

Serial Fraction

- Used, as with the other measures, to indicate the nature of the scalability of a parallel algorithm.
- What is it? Assume the work W is broken into two parts:
  - The part that is totally serial,
denoted T_ser.
    >> We assume this includes all the interaction time.
  - The part that is totally parallel, denoted T_par.
  - Then the work is W = T_ser + T_par.
- Define the serial fraction as f = T_ser / W.
- Now we seek an expression for f in terms of p and S, in order to study how f changes with p.

Serial Fraction, Continued

From the definition of T_P:
    T_P = T_ser + T_par/p = f W + (1 − f) W/p
Using the relation S = W/T_P and solving for f gives a formula for f in terms of S and p:
    f = (1/S − 1/p) / (1 − 1/p) = (p/S − 1) / (p − 1)
- It is not clear how f varies with p here.
  - A growing f indicates rising overhead; if f stays constant or grows only slowly with increasing p, the system is considered scalable.
- Let's look at what this formula tells us for the matrix-vector product.

Serial Fraction Example

For the matrix-vector product:
    f = [p/(p − 1)] · (t_s log p + t_w n) / (t_c n²) ≈ (t_s log p + t_w n) / (t_c n²)
- This indicates that the serial fraction f grows only slowly (logarithmically) with increasing p, and so the parallel algorithm is considered scalable.

Memory System Performance
CS 442/EECE 432
Brian T. Smith, UNM CS Dept
Spring 2003 (2/18/2003)

Introduction

The clock rate of your CPU does not alone determine its performance.
- Nowadays, memory speeds are becoming the limitation.
- Hardware designers are creating architectures that try to overcome memory speed limitations.
  - However, the hardware is designed to be efficient under only some models of how the programs running on it are designed.
- Thus, careful program design is essential to obtain high performance.
  - We introduce these issues by looking at simple matrix operations and modeling their performance, given certain characteristics of the CPU architecture.

Memory Issues: Definitions

Consider an architecture with several caches.
- Questions (from the hydrant example):
  >> How long does it take to receive a word in a particular cache once requested?
  >> How many words can be retrieved in a unit of time once the first one is received?
- Assume we request a word of memory and receive a block of data of size b words containing the desired word:
- Latency: the length of time l (usually in nanoseconds) to receive the first word after the request.
- Bandwidth: the number of (millions of) words per second received with one request.

Hypothetical Machine: No Cache

- Its clock rate is 1 GHz, so 1 cycle = 1 ns.
- It takes 100 cycles (100 ns) to obtain a word once requested: the latency is 100 ns.
- Suppose the block size is 1 word. The bandwidth is thus 10 megawords per second (Mwords/s).
- Suppose the machine has two multiply-add units. A multiply and an add operation can be performed in 1 cycle on each unit.
  - This CPU is fast: it can perform 4 floating-point operations per cycle, so its peak performance is 4 GFLOPS.
  - However, this machine is very slow in practice, as shown below.
- Consider a dot-product operation:
  - Each step is a multiply and add, accessing 2 elements of the 2 vectors and accumulating the result in a register.
  - The 2 elements arrive in cache every 200 ns, and the 2 flops on them are performed in 1 cycle.
  - Thus the machine runs at 100 ns per floating-point operation, or at a rate of 10 MFLOPS — a factor of 400 slower than the peak.

A Second Machine (Cache), BUT The Problem Is Different

Let's consider matrix multiplication with a reasonably sized cache: 32 KB, with block size 1 word (assume 8 or fewer bytes per word).
- Two things are different here:
  - The problem is different:
    >> The dot-product part of matrix multiplication averages out to 1 operation for each data item (2n operands, 2n operations).
    >> Matrix multiplication as a whole has data reuse, that is, order n³ operations for order n² data.
  - The second machine has a cache:
    >> A cache is an intermediate storage area which the CPU accesses in 1 cycle (low latency), but which stores sufficient data to take advantage of data reuse — the important aspect of matrix multiplication.

A Second Machine, Continued

- Same processor as before: 1 GHz.
- Same latency from memory to cache: 100 ns.
- Consider multiplying two 32×32 matrices A and B, producing a third matrix C.
- We need 1024 words for each matrix; 3 matrices at 8 bytes per word is 24 KB, which fits in the 32 KB cache.
- Fetching the 2 input matrices into the cache is 2048 words at one word every 100 ns, which takes 204.8 μs.
- Performing the 2n³ operations for the matrix multiply, that is, 64K operations (2·32³), takes 2·32³/4 ns, or about 16.4 μs (the machine performs 4 floating-point operations per ns, as above).
- Thus the flop rate is 64K / (204.8 + 16.4) μs ≈ 296 MFLOPS.
  - The textbook has 303 because it treats K as exactly 1000, not 1024.
  - A lot better than 10 MFLOPS — but nowhere near the peak of 4 GFLOPS.
  - The notion of reusing data in this way is called temporal locality: many operations on the same data occur close together in time.

A Third Machine: Increase The Memory-Cache Bandwidth

To increase the bandwidth to memory, increase the block size — say, from 1 word to 4 words.
- As stated, this implies that the data path from memory to cache is 128 bits wide (4 × 32 bits/word).
- For the dot-product algorithm:
  - A request for a[1] brings in a[1..4].
    - The a[1] takes 100 ns, but a[2], a[3], and a[4] arrive at the same time.
  - Similarly, a request for b[1] brings in b[1..4].
    - The request for b[1] is issued one cycle after that for a[1], but the bus is busy bringing a[1..4] into the cache.
  - Thus, after about 200 ns the dot-product computation starts and proceeds 1 cycle at a time, completing at a[4] and b[4].
  - Next, the request for a[5] brings in a[5..8], and so on.
- Thus the CPU is performing approximately 8 flops every roughly 204 ns, or 1 operation per ~25 ns, or about 40 MFLOPS.

The Third Machine Analyzed In Terms Of Cache-Hit Ratio

The cache-hit ratio is the number of memory accesses satisfied in cache divided by the total number of memory accesses.
- In this case, the first access in every 4 accesses is a miss and the remaining 3 are hits, so the cache-hit ratio is 75%.
- Assume the dominant overhead is the misses:
  - Then 25% of the memory latency is the average overhead per access, or 25 ns (25% of the 100 ns memory latency).
- Because the
dot product has one operation per word accessed, this also works out to 40 MFLOPS.
- A more accurate estimate is 0.75 × 1 ns + 0.25 × 100 ns per word,
  >> or 25.75 ns per flop, or about 38.8 MFLOPS.

Actual Implementations

- 128-bit-wide buses are expensive.
- The usual implementation is to pipeline a 32-bit bus, so that the words in the line (block) arrive on successive clock cycles after the first item is received.
- This modifies the above calculation a little:
  - Instead of 4 words received together after a 100 ns latency, the 4 items arrive after 100 + 3 ns.
  - However, a multiply-add operation can start after each item arrives, so the result is essentially the same: 40 or so MFLOPS.

Spatial Locality Issue

It is clear that the dot product is taking advantage of the consecutiveness of the elements of the vectors.
- This is called spatial locality: the elements are close together in memory.
- Consider a different problem:
  - Taking a linear combination of the columns of a matrix, determined by a vector x (this is actually forming Ax).
    >> The elements of a column are not consecutive in C, which stores matrices row by row.
    >> That is, consecutive elements of a column are separated in memory by an entire row.
    >> In this case the accesses are not spatially local, and essentially all accesses to the elements of every column are cache misses.

First, The Simpler Column-Sum Algorithm

Consider the problem of computing the sums of the columns of a 1000×1000 matrix b.
- This is related to one of the norms of a matrix.
- But let's look at how this performs. Consider the following code:

    for (i = 0; i < 1000; i++) {
        column_sum[i] = 0.0;
        for (j = 0; j < 1000; j++)
            column_sum[i] += b[j][i];
    }

- This code performs very poorly:
  >> column_sum fits in cache, and its memory accesses are consecutive.
  >> In the inner loop, however, b's elements are accessed by columns, and consecutive elements in a column are 1000 elements apart.
  >> Thus they are unlikely to be in the same cache line, so every access experiences the maximum latency delay (100 ns for our model machines).

An Alternative Algorithm

Consider the following rewrite, which produces the same column sums:

    for (i = 0; i < 1000; i++)
        column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        for (i = 0; i < 1000; i++)
            column_sum[i] += b[j][i];

- In this case, the inner loop accesses b by rows.
  - The elements in a row are consecutive, so a memory access brings in multiple elements (4 for our model machine) at a time, and thus the performance is reasonable for this machine.
- But poor performance will occur again if column_sum and the row length are so large that together they do not fit into cache.
  - The later elements then replace the early elements of column_sum, so that the accesses to column_sum are all cache misses.
  - The solution is to "tile": break the iterations into blocks of consecutive elements that fit into the cache.

Peak Floating-Point Performance Versus Peak Memory Bandwidth

The performance issue with our model machine is that the peak floating-point performance outstrips the peak memory bandwidth.
- For fast microprocessors, the ratio is around 100 MFLOPS per MB/s of bandwidth.
  >> We solve the problem by modifying the algorithms to hide the large memory latencies.
  >> Many believe that the optimization phase of compilers should transform the code to obtain the optimal performance.
- For large-scale vector processors, the ratio is around 1 MFLOP per MB/s of bandwidth.
  >> There, these modifications are typically unnecessary, but they don't hurt the computation and sometimes help.

Hiding Memory Latency

Consider the example of getting information from the Internet using a browser.
- You wait a long time, sometimes. What can you do?
1. While reading one page, we anticipate the next pages we will read and therefore begin fetches for them in advance.
   - This corresponds to prefetching pages in anticipation of their need.
2. We open multiple browser windows and begin accesses in each window in parallel.
   - This corresponds to multiple threads running in parallel.
3. We request many pages in order.
   - This corresponds to pipelining with spatial locality.
Multithreading To Hide Memory Latency

Consider the matrix-vector multiplication Ab.
- Each row-by-vector inner product is an independent computation, so create a different thread for each computation, as follows:

    for (k = 0; k < n; k++)
        create_thread(dot_product, get_row(a, k), b);

- As separate threads:
  >> On the first cycle, the first thread accesses the first pair of data elements for the first row and waits for the data to arrive.
  >> On the second cycle, the second thread accesses the first pair of elements for the second row and waits for the data to arrive.
  >> And so on, until l units of time (the latency) have passed.
  >> Then the first thread performs a computation and requests its next data.
  >> Then the second thread performs a computation and requests its next data.
  >> And so on, so that after the first latency of l cycles, every cycle is performing a computation.

Prefetching To Hide Memory Latency

Advance the loads to ahead of when the data is needed.
- The problem is that the cache space may be needed for computation between the prefetch and the use of the prefetched data.
  >> This is no worse than not performing the prefetch, because the prefetch memory unit is typically an independent functional unit.
- The dot product (or a vector sum) again provides an example:
  >> The first two elements of a and b are requested in a loop.
  >> The processor sees that the next two are needed for the next iteration and, in the next cycle, requests them, and so on.
  >> Assume the first item takes 100 ns to obtain, and the requests for the others are issued on consecutive cycles.
  >> The processor waits about 101 cycles for the first pair, performs the computation, and the next pair is there on the next cycle, ready for computation, and so on.

Impact On Memory Bandwidth

Prefetching and multithreading increase the bandwidth requirements to memory. Compare a 32-thread computation experiencing a cache-hit ratio of 25% (because all the threads share the same cache and memory access) with a single-thread computation experiencing a 90% cache-hit ratio.
- The memory bandwidth requirement for the first case is estimated to be 3 GB/sec.
- The memory bandwidth requirement for the second case is estimated to be 400 MB/sec.
