Parallel Programming CS 475
Popular in Course
Popular in ComputerScienence
This 63 page Class Notes was uploaded by Betty Kertzmann on Tuesday September 22, 2015. The Class Notes belongs to CS 475 at Colorado State University taught by Sanjay Rajopadhye in Fall. Since its upload, it has received 42 views. For similar materials see /class/210181/cs-475-colorado-state-university in ComputerScienence at Colorado State University.
Reviews for Parallel Programming
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 09/22/15
Sanjay Raj opaclhye Colorado State University Fal12008 Week 4 I Objectives l MPI Beyond the simplest example I Block data distributions and their analysis I Function MPIBCClS l Performance tuning E improving sequential performance Dbetter algorithm D avoiding communication redundant computation Outline l The problem and a sequential algorithm l Parallelization strategies I Data decomposition l Parallel algorithm development 8 analysis I Benchmarking l Performance tuning The problem I Find all the prime numbers up to a given number n I By filtering out the multiples ofknown primes l llany strategies 393 Start with a sequential algorithm and systematically parallelize it using our known and trusted approach 0 Think out ofthe box a completely different approach Seguential Algorithm Complexity n ln 1n n Algorithm contd Create an array oF numbers 2 n none of which is marked Invariant the smallest unmarked number is a prime k lt 2 k is the next prime number repeat Mark oFF all multiples of k as nonprimes Set k to the next unmarked number which must be a prime until done Pseudo code For i i markedi o k nde arked0 markedl 1 while kltn or 392 iltn i quot7 ik markedlild Whil markedindex 7 do nothing now index has the rst unmarked number so k index Analysis amp Improvement I Where does the program spend its time I How to improve 339 1fXah is a composite number then at least one of a or b is less than or equal to X algorithmic improvement Improved code For i0 ilt imar1ltedi o k ndex 2 markele markedl 1 while kkltn oufer loop has only sqr Hn iferarions For ikk iltn i iF i ok markedi 1 while markedirdex do nofhing no 39 has the rsf unmarked number so l k inde Sequential performance first o or k i iF ik markedil For ikk iltn i k m rked39 While mamaleindexD do nofhing now index has he rsf unmarked number so k index 1 39 l Exercxse Wars the three Versions of the code and measure the mnnlng time for asymptoueeuy meteesmg Values of n I Prionties ushmpxove asymptoue tunmng hme better algorithms ex get constant factor gems better sequenmt programs Only then thmxs of parallehzahon I Para ehza on I Aglomem on two steps e computmonnlmskxs nsmgle updnte matth ofm mnyelemmt All updntes othe same element done by nsmgleprocess r Fmtheethetets nddmoml ngglomemuon mmy mm elements nsslgned to the Sims process Decomposmon dmde the army mto pteces u eyeuc mocmon Block allounon I Dommn Otha n ocmon Break I Thmk oumde the box Streammg a ocatlon Each process is responsible for npnme annber mum me at nut a Baalem my marked at mmaYked but an my afmtegem m lm I Fm numbEns npnme Apxocess lters out the am number n sees mypnmi And Wales out in mnythh all mulhples ofmypnme deleted HWI Clari cation I Fust experiment a bit v nd a large enough w mnnmg nme On one processor 1 about 100 sec l Them keep 11 xed change the number ofprocessors gather mnmng time data does x xve thh your expecmom hypotheses I Cychc VS block a ocatlon I Cychc allocatxon nmplelogxcto determmewho ls responsible fonheem element r his loud mbzlnnce 51mm to the quotmemng nppx nah I Blockallocahon m lend balmced v Glxgh y complxutedlogx Wm ms not mumple om Ench process should get enhez Ix 0 WM elemems r Howto dxsmbutethem u m I Simplest approach I Sxmple Block allocatxon 1 Fm p71 processors get My elements 7 125 one gets whatever xemzms ExamplesProblems lt1006y1amp1ds17 17 17 17 17 15 lt1009gty1amp1ds11 11 11 11 11 11 11 11 1 lt10011gty1e1ds 1o 10 1o 10 1o 10 1o 10 1o 10 o lt2018gty1eldsZZZZZZZZZZOOO JOE OO mast hmme Pmcessms have nu am I How to resolve Basic mapping functions Gwen an element 1 wh1ch processor owns 19 r H funcuon r H m funchon memorylocmon on theloczl mnympwhezextxs stored Gwen a processor what elements does it owngt pa 1 function relation a paf andpa fust and last elements owned by processoxp Sqwhat is the H m moeom Compose theta and ref functions mduse zef 39 0 71170 men a processorP and an address 11 what element ofthe ongmal array does that hold f0H Block distribution 21 I szst 7 processors get exactly MPi elements I Remainder gethpj I Mappmg funcnons gtp1uPJ mom 0 11 12 1mm mmoa ye 1 zap mouon mm mmq LnF 1 j L or v Wm Double oox Mnyhnve to be computed o mdymm1czlly I Block distribution 31 l Reverse engineef from a simple functmn paf funcnon mm LpnPj l Everything follows from 1t pa funonon 1mm Lp1rzFj e1 PWO I13917J I Block distribution 14 I This 1s what you should follow I Vzrrznt of the srrnplest strategy First 21 processors get WP elements Para elization Strategy l FLrst processor has elements I We sure that this 1s more thzh 411 so all pnmes used to mark offare determined by the rst or last one gets whatever remains process Powwy more than the other I FLrst processor mforrns others ofthe prune Again Simple mapping function seem the code I Processors mark off them section of the array I Outline I Background amp History I Message pas singprogramming model I SPMD EDistnbuted memory l Writing MP1 programs Sanjay Raj opaclhye Colorado State University I comp mg g and hen king Fall 2008 Week 3 Background amp History Foundation I Prior to 1990 vendor speci c libraries 1 No portability Parallel Virtual Machine PVM ORNL 1989 public release 1993 u enables portable parallel programmmg l Parallel sic efforts to develop an open portable general purpose parallel programming 19924994 1 Parallel 10 1997 l Nearly ubiquitous on bigiron l Assembly language ofparallel programming rs similar advantagesdisadvantages l SPMD Single Program Multiple Data v wnte a smgle progmm a usingaspeclal hbmn a compile itwith spectal compiler that 1mm about the Library a execute at m an ennronmen created bythe mmronm eat vanables of your session as on bass with a speoal Comde on your local machme e g Lam WI 0 0 MP1 hbmxy provides funcuons for coordmahng and oommumcaung amongst the instances 39 1 Example program I Circuit is hardwired into the program I 16 inputs 64k 65536 possibilities to test f O rh input OOO0000000000000 65536 rh input 1111111111111111 I Write a function that takes argument i and tests the i th input I Outer loop that calls this function 65536 times and prints success 39 Structure of MP1 program From h r rpscompu ringllnlgovtutorialsmpi MIPI inrlln le le 1 Declarations prototypes etc Program Begins Serial code quot 392 e n Farallelcodebegins sage passing tall Tamime MIPI Emitmum Payauel code ends Serial code Program Ends 1 MP1 1nit l Headers include ltmpihgt Includes the library API I MPIIni rampargc ampargv Every program must have this before a call to any other MP1 function Not necessarily the first executed statement May be used to pass broadcast command line argmnents to all processes system dependent Communicators l Opaque object that provides the environment in which the processes interact Quinn In English this means t name of a group of processes t default predefined communicator provided by the system I MPICOMMWORLD includes all the 1VP1 processes I Most MP1 functions require a communicator as an argument specifies the collection of processes that participate in that communication function dep endent Communicators MPCOMMWORLD 9 0 W l MPLCommsizecommunicafor comm int ampsize Communicator IS mput size is returne Number ofprocesses 1n the Communicator MPLCommrankcommunicafor comm inf Swank Communicator IS mput rank is returned id ofthe ca 1mg process in the communicator Integer between 0 and size I All Processes have a Private copy ofALL Program Variables Cyclic allocation of work mainint argc char argv int 1 id p void Checkicircuii int int MPIilnitUQargc ampargv MP17CoanrankMP17C0MM7W0RLD amp1 d MP17CoInILsizeMP17C0MM7WORLD ampp for i id 1 lt NUMREPS i p checkicircuiuid i printfquotProcess d is donenquot id fflushstdout MPIiFinalize return 0 39 1 Compiling and executing l Please see U A cs475L hm httpwww Execution notes I Nonedetetmmtsttc Order ofprian outputs Mulhple punts bysnme ptooess you always occunn the otdetthey ate executed quot calls by differmt ptooesses you be queued uhpomhoe of flush the system may mteueoye the ptooessmg of the commands 1 Extensions I In addmon to pnntmg out the solutxons we want to count how many mputs Satisfy the Circuit Introduce your rst collective Communication call I MPIReduce combmes Values m different processes mto a smgle resu t Details I Each ptooess has a oouhtet to mamtzm how many soluhohs it found I a o 9 2 B 9 m t C rt msfmble l Ihttoduoe yout stst ooueouye oommumoduoh call at the end of the loop MPIReduce oombmes Values m diffexmt ptooesses mto a smgle tesult I Call after the for loop I MP1 Reduce I inf MPIReduce void sendbuf pfr To rs argumenf void recvbuf pfr To rst resulf inf count number of values To combine MPID 1fafype dafafype Type MPIO op operator inf roof des nufion MPIComm comm loa Sanjay Rajopadhye Colorado State University Fall 2008 Week 5 Objectives l Sieve of Eratosthenes algorithm analysis I Optimizations El Broadcast removal El Loop interchange for caches DMarking only odd numbers Analysis I Sequential complexity Tn 0n ln ln n Almost linear time complexity 39 1 Analysis I Sequential complexity Tn 0n ln ln n log10 n Number of digits in decimal representation ofn log2 n Number ofbits in the binary representation ofn 39 log log10 n Digits in decimal representation of that log log2 n bits in the binary representation of that Almost linear time complexity Parallel program analysis I Time to mark one cell X Sequential execution time X 7 ln ln 7 I Number ofbroadcasts I ln I l Time for a single broadcast two models A logp K I Total execution time Tnixnlnlnn Zir logp p lnn 1 Improving the program I Avoid broadcasts homework assignment 1 39 39 Each process independently compute all primes up to 71 Each processor independently marks off its section of the global array Processes communicate only once with MPIReduce to determine the total number ofprimes I BUT FIRST SEE WHAT COMPILER CAN DO 39 USE OPTIMIZATION FLAGS Initial sequential part For i0 ilt5qr rn i markedi O k index 2 i0 marked0 marked1 1 while kkltsqr rn For ikk iltsqr rn i k markedi1 while markedindegtlt primesj k index loop rhrough marked amp s rore addi rional values in ro primes numprimes ro ral number of primes Found Parallel part For j0 jltnumprimes j F rs r mul riple of primesj in my sec rion For iF iltmysize i primesj markedi1 code ro coun r The number of primes and MPIReduce ii To processor 0 39 J LocalitV Improvement I Loop blocking tiling partitioning I Typically applied to compute intensive loops to opt1m1ze Communication cost Memory locality register cache main memory disk parallel machine utilization processors coresgt etc Mafr39ix mul riplicafion ideal 0013 0075 The uniform Memory Hierarchy Model of Conpututiorf39 B Alpa39n L Curta39 E Feig T Selka Algorithrricu 1992 Tiling the sieve For J39O jltnumprimes j F rst multiple of primesj in my sectionquot For iF iltmysize i primesj markedi1 39 Tiling the sieve For J39O jltnumprimes j F rst multiple oF primesj in my section For iF iltmysize i primesj markedi1 For J39O jltnumprimes j F rst multiple oF primesj in my section start largest multiple oF BKSIZE leq Fquot For iistart ii lt mysize ii BKSIZE F rst multiple oF primesj geq ii For iF iltminiiBKSIZElmysize iprimesj markedi1 39 i Tiling the sieve This changes nothing in the code all iterations are executedin exactly the same order as before Only With a more complicated loop structure For J39O jltnumprimes j F rst multiple oF primesj in my section start largest multiple oF BKSIZE leq Fquot For iistart ii lt mysize ii BKSIZE F39 rst multiple oF primesj geq ii For iF iltminiiBKSIZE1mysize iprimesj markedi1 39 4 Pull the ii loop outside the j loop For J39O jltnumprimes j F rst multiple of primesj in my section start largest multiple of BKSIZE leq Fquot For iistart ii lt mysize ii BKSIZE F rst multiple of primesj geq ii For iF iltminiiBKSIZE1mysize iprimesj markedi1 39 1 Pull the 11 loop outside the 139 loop For J39O jltnumprimes j F rst multiple of primesj in my section start largest multiple of BKSIZE leq Fquot For iistart ii lt mysize ii BKSIZE F rst multiple of primesj geq ii For iF iltminiiBKSIZE1mysize iprimesj markedi1 I 1 But is this legal l That s the trick I You need to ensure that the code correctly does What the original one did Performance Speeduv wvl itsdf Speedup Num Pvucessms Performance Spaadup speedup wrt sievz Num Processm 851 15787 44870904 3002147 22516097 18027931 15024339 13072409 1 1292232 8937185 8013412 6296979 5963036 84669903 44574816 29787484 22288853 17824314 14790665 N m A A 00 a a 1 1073728 axloo com 500 Bax lea ANA 5298138 19916327 10116669 1536564 136129 877094 4409445 2943989 2207137 1777574 14811 16 1262348 11 17034 0893259 0751872 0636131 0576545 39 J Summary I More involved problem I Reductions and Broadcasts I Block Distribution l Sequential Performance is crucial Key to high performance computing Sanjay Rajopadhye Colorado State University Fall 2008 Week 6 1 Objectives I HWO Feedback I Allocation and distribution of 2 D arrays l Floyds s Algorithm 0713 sequential complexity D Simple loop structure I Point to point communications l Computation Communication granularity DTradeoffs Writing Reports l Stats N50 reports a N 10 pages 4 max 50 20 minutes average I HWO Objectives Measure program running times contrived setting I function of procs and mputs Analyze data Hypothesize analytic functions that t them 39 Report Open ended deliberately Reports You will do a LOT ofwork Do you report everything Do you report it chronologically What is the story that you want to tell What is the punch line How to build up What is the MINIMUM that you need to say to get to it Grades W carrot vs stick Allpairs Shortest paths A B C D A 0 6 0 B 4 0 0 1 C 0 0 0 5 D 0Q 3 m 0 E 10 5 5 2 Adjacency Matrix Containing Distances 1 Allpairs Shortest paths l Matrix multiplication based approach 39 39 All paths oflength 1 adjacency matrix paths oflength 2 DE length 3 Di minDJv length 71 D5 IninDJ39l Di Flovd s Algorithm For k e 0 0 n 1 For i e 0 0 n 1 For j e 0 0 n 1 aij 6 min aij aik akj endFor endFor endFor 1 Why it works I Don t consider path lengths I Consider the intermediate nodes on a path Parallelization l Task Channel model Partitioning I Domain decomposition Quinn s approach I Too early consider 113 independent tasks Communication I 2112 independent broadcasts Agglomeration and Mapping two steps I data partitioning I stripes IOWewlSe 39 columniw1se Data Allocation 1D l Example m mallocNsizeoFdouble m is a pointer allocated on runtime stack 39 39 the actual storage is on the heap Stack 5 Data Allocation 2D arrav 1 Data Allocation 2D arrav Floaf A Adafa Adafa Floaf malloc m n sizeo oa A Floaf nalloc m sizeo oaf For i 0 i lt m i Ai ampAdaain File InputOutput I l Disk access times signi cantly slower than inter processor communication I All processors read from disk too inef cient serialized f 7 con icts l One processor reads and sends to the others What is the communication pattern What is the drawback File InputOutput II MP1 Send MPISend void buf s rar ring address of message in r coun r number of elemen rs oF MPIDa ra rype d rype rhis da ra rype in r des r id of des rina rion wildcard allowed in r rag user de ned rag mus r ma rch recv MPIComm communica ror MP1 Receive MPIRecv void buf s rar ring address of message in r coun r number of elemen rs of MPIDa ra rype d rype rhis da ra rype in r src id of des rina rion wildcard allowed in r rag user de ned lag mus r ma rch send MPIComm communica ror Summary I Classic algorithm I 2 D data alloaction and distribution Row striped Column striped I Point to point communication MPISend MPIReCV S 475 G Sanjay Rajopadhye Colorado State University Fall 2008 Name card info outside I Fold your card LENGTHWISE so that it can be placed on your desk I Neatly write your name what you want to be called in BIG BOLD LETTERS using the markers that are circulating I Prop it up on your desk so that it can be seen from the front of the classroom Name card info inside I Name Sanjay Rajopadhye I What you want to be called Sanjay I Pronunciation if applicable In Indian names a is almost always pronounced as a short u sound as in gun fun etc or a long aa sound as in calm bard etc Sanjay is pronounced as Sunjuy Rajopadhye RaajOhpaathyay in the paath make the t sound like a 1 Don t worry if you don t get it right it s almost always mispronounced even in India This is true in many parts 0s Asia eg pronounce Bagdad I Major eg CS ECE etc I Status eg senior grad etc I Interesting fact Every member in my family was born in a different country Teaching Assistant I Name Tomofumi Yuki I What you want to be called Tomofumi I Pronunciation if applicable I Major Computer Science I Status Grad MS I Interesting fact Drinks over ten cups of tea every day Teaching Assistant I Name Kaustubh Gadkari I What you want to be called Kaustubh I Pronunciation if applicable 6 33 cowstubh pronounce the u as in put not as is tub I Major Computer Science I Status Grad MS I Interesting fact Finalist at National Astronomy Olympiad 2000 Also suffers from vertigo Teaching Assistant I Name Nissa Osheim distance class only I What you want to be called Nissa I Pronunciation if applicable I Major Computer Science I Status Grad PhD I Interesting fact Named after a mythical Scandinavian creature Moore s Law is dead Long live Moore s Law 39 Moore s Law as formulated l Empirical observation 1965 by Gordon Moore I The number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially doubling approximately every two years I This has held true till now and is expected to hold for at least another ve years 39 I Moore s Law interpretationsl l Chip designers we had better ensure that the exponential growth is maintained or else the competition will I Main reason that Moore s law is being sustained Other features of semiconductors are also growing exponentially but at different rates I Frequency I Memory density and speed bandwidth and latency I Hard drive and other I 0 devices capacity and speed I Power I General public Performance is increasing so we can expect build more and more powerful applications Corollary of exponential growth I When two quantities grow exponentially but at different rates their ratio also grows exponentially Consider y1axaandy2bxfora2b21 y axfora21 3 2 17 Gaps or walls of Moore s Law Memory gap wall I Memory bandwidth grows much more slowly than processor speeds since mid 808 this was addressed by ever increasing on chip caches Devoting such a large fraction of silicon resources to just storage is not very effective ILP instruction level parallelism wall I One way to exploit increased clock frequency was to increase the instruction level parallelism on chip deeply pipelined out of order VLIW etc leading to complicated control logic Overhead for this as the degree of parallelism is scaled is too high Power wall I Power dissipation ability is also increasing exponentially but at such a slow rate that it has effectively peaked This is a brick wall Long Live Moore s Law The goal is to deliver performance to the masses The only solution to the power wall is NOT to increase frequency as aggressively but to have multiple processor cores l The processor is the new transistor Number of cores will continue to grow at an exponential rate 39 Challenges of multicores manvcores l Parallel programming is critical move from the realm of a small number of expert programmers Who have challenging applications that must be resolved to general purpose computing A sequential program is a slow program I Parallel programming is hard Grand challenge Enable average programmers to write codes for massively parallel computers Without making the programming task more dif cult This requires long term research on automatic parallelization languages OS fault tolerance etc In the meantime there s this class Class Objectives I Study the foundations of parallel programming I Write parallel programs In C possibly Fortran for GRAD 510 Using MP1 and OpenMP style programming paradigms l Execute and measure performance I Analyze performance I Write reports 39 1 Class Format I Hands on I Weekly Lab Monday 12 1 or 2 3 covers many hands on details I Write effective and scalable parallel programs You Will submit your programs a few days before the homework is due half the story Measure analyze and report performance second half Plotting Data cardinal rule I Visually a straight line conveys the most information I If your data is not a linear function massage it so that what you plot looks like a linear function Then deduce the original function Massaging Data I If you think the function is polynomial y fx a0 a1x anx z anxquot asymptotically log y 10g an nlog x y39a39nx39 Massaging Data I If the function is exponential y f x bax log y 10g b xlog a b a x Massaging Data I If the function is hyperbolic y f x y E x scaled reciprocal of y y f 1 this is the speedup function Massaging Data I If the function is hyperbolic with a non zero asymptote a y fx c x l a CM not really a straight line a cx but linear for small 6 and x Sanjay opadhye Colorado State University Fall 2008 Lecture 2 I Topics I History of Supercomputing l Seeking Concurrency UData Parallelism D Functional Parallelism DPipehning l Example Section 16 l Example exercise adding up numbers Historical Development l Supercomputing specialized domain I Bigscienti c questions grand challenges I Simulations augment and or replace experimental Validation Modern Scienti c Method obsebvahon expeziment Seeking Parallelism I Data dependence Graphs Common abstractton in Computer Sc1ence 39 Nodes typlcally tasks I Edges typlcally precedence constraints l Expose parallelism and independence Data Parallelism l Large nurnber of1ndependent operauons I Usually ne grain l Dataeparallel collecuon onented languages Early c1rca80 DPcCVMPC imperative some commercial hspNESLfuncnonal HPF l llgh Performance Fortran Fortran95 MatlabR etc Rapdemd I Data Parallelism example I Standard code For i0iltNi ni bil EU I Datarparallel code AEC Functional Parallelism l Different operations on different data a 2 b 3 m nb5 p n b2 z nmp4p Pipeline Paralleli Pipeline Parallelism Exploit parual results p0 10 p0 10 pl n0nl p1 p0nl p2 n0nln2 p2 pln2 p3 n0nln2n3 p3 p2n3 I i 1 Document 1 Classification I Clustering Algorithm face recognition Al etc Input documents I Para elism Oppor uni es l Functional Parallelism c Compute a vector for each element document Draw data dependence graph Determine a Vector to represent each cluster eg centroid Look for independent partitions sets ofnodes such that there IS no path from one to the ot er Repeat Evaluate a quotgoodnessquot functaon Update cluster Vectors to quotimprovequot funcnon e g by migrating each document to its quotclosestquot cluster and recompuhng cluster Vectors until convergence or max number Ofiterations Output cluster Vectors Data dependence g1th Build docmnehl vectof Compute goodness mam r Recompule eluslerv Ghoose cluste f Vector Outputolu er vectors Data Parallelism l Sarne operauon apphed to ladependent sets ofdata Generating the document vectors Assigning cluster vectors Finding distance from document to cluster vectors I Finding closest cluster Updating membership 0 f clusters Recomputing cluster vectors Para el Progxammjng I Once you have found the parallelism how to express 1t Extend a Compiler Extend an existing sequential language Create a new language E tend a compiler l Automatic parz ehzauon Detect parallelism from a sequential program transform code annotate for paranehsm produce para11e1 code I Fortrzn parallehzatlon almost 40 years ofresearen 2007 Turing award Winner Fran Allen Extend a com iler 2 New Langyages l Advantages l Back to baslcs create a parallel language No change for user Advantages Clean start avold epeatlng mlstakes Code compatlblllty No retrzlnlng l Dlsadvantages Dlsadvantages Clean start avold leverzglng exlstlng technology yety dlf cult Utoplano I Another Example exercis 1 Compute the sum of1000 4edlglt numbers uslng 1000 gt I Library ba 1 extension I Add functlons for parallelism related activities Createspeclfy processes accountants Communlcate l You have stack of 1000 lndex cards l Accountants are m huge hall and have calculators r Armngedln 25 cows of 40 each an communlcate only to the four posslbly fewer naghbonng accountants NSEW l You may choose to to use as many ofthe 1000 as you die bullds ofexlstlng technology lack compllet Support not perfectly satlsfactory Issues sttnbute cards as quxckly as possxble Accumulate answers as quxckly as possxble Adapt to use arbitrary number ofaccountznts Plot running mm as a funcnon of ofzccountants Add another plot ms me for addmg up 10000 numbers Analyzediscuss Sanjay Raj opadliye Colorado State University Fall 2008 Week 2 Outline l Exercise conclusion I Parallel program design I Task channel model ne od Adding numbers On a grid Broadcast Y L39 matrix multiplication Exercise conclusion Precise model U Octopus accountant simultaneously sendreceive from any neighbor unit cost CEach communication event may have an arbitrarily large stack ofcards cost independent of Volume Alternate model Cl Cost is directly proportional to Volume Data Distribution Efficiently broadcast the numbers to processors l Row wise or columnwise F Each processor along the edge gets a stac Keeps whatever is needed for its rowcolumn 40 er25 L39 Forwards the rest to the processor to south or east C Initiates an independent broadcast along its rowcolumn I Analysis D Total number of steps 65 ComputationAccumulation Ef clently compute and collect answers Each processor computes the sum on numbers Commumcate backward to the corner processor lntersperse wlth one addmon operatlon Analysis Number of steps 2 265 Totolexecuuon urne 197 I Can we do better 1 Tradeoff Use fewer processors onlysoo ln aZSXZU god E Localdata 4 Commumcahon dnstznce 45 I Totalhme 139 opumzp 10x20 god 95 5 10x10 god 70 5x10 god 65 5x5 good 70 New twlst what 1f the problem SIZE 1 dlfferentgt Tradeoff ganalytlcall Assum no p ns P processors arrangele o squore god to mnlmlze peorneter N numbers to odd no overlop of communlcouon amp computouon l Totol execuuon urne NP 3 P e term maenses mth P is o squoreroot oohnormal ordegree o 5 md other decrenses hyperbolxczlly Notewe orenot lnterested m osymptouc behnmorherebutmmerwlth o trodeorr What volue ofP rmnlrnlzes the me Set deovoove to zero md solve I 139 Task Channel Model I Task program wrth local memory plus lO ports Taskchannel model Foster s design methodologV Communication through channe1s Four main step i pro ucerconsumexvxew Partitioning receiverblocks unul datais available from sender Break up the computation into multiple parts tasks asynchronous communicauon C mumcmon l Limitation Create channe1s between the tasks difficult to descnbe collecuve communicauo Agglomeration Group tasks together for convenienceload balance Ma ns introduce extensions clouds to group tasks thatparucipate in such acunty PP Allocate taskegroups to processors Partitioning Partitioning issues Both data and computation must be partitioned Many more primitive tasks than processors in the target machine Data domain decomposition Minimize redundant computations and data First lelde the data Homogeneous tasks Functional decomposition Dmde the computamnzl ow of the algorithm scaling number oftasks grows with prob1em size functional decomposition does not scale Goa1 create as many primitive tasks as possib1e Communication Insert channels for tasks that need to communlcate l Local communlcatlon each task needs to communlcate wlth a few tasks Global communlcatlon many tasks lnvolved ln a Slngle communlcatlon event t Clutters the graph avold but keep track quot Drawback of the methodology quotoolleouve oommunloatlontquot are a powerful detlgn paraolgm e g matnx mulupboauon later 1 Communication issues I Communlcatlon should be balanced l Favor local communlcatlon l Parallellsm ln communlca lo n tasks should pexfoxrn oommunloauont lndependently Parallellsm n the computa lon t tasks should be able to perform oomputabont concurrently Agglomer ation Now group tasks lnto clusters throttle parallelism Improve localle I Achleve load balance slmpllfy programmlng effort Achleve scalablllty Agglomeration issue Agglomeratlon should lncrease localltyc Rep o k llcate tas s i take less time than avolded oomrnunloauon ll w 1 tea ng Clusters have tlmllar tlze oomputataon commumcahon I Number ofclusters ls u lnoreatlng funonon ofproblem tlze at small at pottlble but no less than number Ofprocessoxs Mo l catlons to exlstlng code are reasonable ldeally parameterlzable Case studies broadcast one processor out ofP has a smgle value and we want every processor to have a copy ofthe same value Device ef clent algorlthms to achieve thls Asslgn tasks to processors P and orchestrate a schedule Marn goal ocessor utlllzatlonload balance I Stauc Structured communrcauon Slmpll ed use ofFosters methodology no agglomeranon man campumnpmr runner armnrnnn ens nrpnpmssn tzsks 7 processors and no mapprng mun ampunurnpnmr sync nnpn mun than nuskucmd mm W Hmong 30mm Speclzl condmons 21156 when we take processor archrtecture I Dynamrc Into account Frequen Commumc ro cto us accountant gnd t at n dvnarnrc load bnlmcmg nlgonthm p Mmy shorr hyed tasks no mtermsk cornmuucauons tunrhme schedule r Ideal fully connected mmth when my pm of pmesson cm communrcate but a processor cannot srrnultaneously communrcate wrth rnuluple other processors Broadc t 1d lower bound Broadca Ideal lower bound I Tlme for a processor to receive the value must be at least equal to its dlstance number ofhops in the grid graph from the corner source Atanynme let say thatlrprocessors have a co Y hen y our rules no more C an 2k processors can have a copy only rwxse communlcatlons are allowed at the next step Inmally only one processor has a copy This 15 the perlmeter ofthe god P Thls successlve doubllng lmplles that at least lg P steps are needed The data dlstnbutlon phase ofthe addrnumbers exerclse can be adapted to achreye thls Ifthe model 15 changed so that a processor may communlcate Slmultzneously wlth b others for some constant b there 15 only a constant factor lmprovement Not slgm cznt ls asymptotlc analysrs W cruclzl In real llfe 1 oadcast 1deal algorlthmz Classxc strategy m computer sclence dwlde and conquef At any stage there are a certzm number of copies say X Dmde up the processors mto X groups and mdependently broadcast each opy mto the gen 5 How to aente l e libel the groups I who is the mmntor Solution proposed m 2125 t In t s ocesz o ls responsible for brondcasnng to everyone lxespamxble m an the add plasmas two mdependent subpxoblems I Reduc uon gadd1t10n1 l Dual problem of broadcast Now anew ngtp mtroduce llmlted agglometanon and mappmg I Model Tlme to eommumeate a single Value A Tlmeforanthmehc opera 0 i I Naive solutlon Allnodes send data to root 5 Root adds an up 0 Tnrllx Bette Solution dlvlderznd conquer I eductlon d1V1de amp conquer I Do the bozdczst ln reverse I Each step also equxzes a compute operatxon l Total tlme rzpykze x Partltlonmg commumcatlon complete bmary tree on nodes I Agglomerahon amp Mappmg I Complete bmary tree Merge everyleft chxld to 1t parent Bmomlzl tree see g 313816 m text Anzlysls Vt T Z171x1gpx n 7xlgpx p n Body Problem Famhomng one tasls per body I Commumcahon complete graph collecuve commumcataon evant ls eyeryonemust send all lts smgle mrquedatato everyone else r e asrmultaneousbroadcast Lhef Bzute force r steps each processor receryes data from eyery other cooxdlnmon ls ndetml hit needs to be resolved Betterway dyrde and con If h qu r alfthenodeshaye already adareyed an all gnLhex mongst themselves 39 me more step to complete allrgnthez 0 21M nodes A11 gather Analysis vae stmtegy rA l Dlvldezndrconquer p lg r1 steps r A L Messnge slze grows at each step Renllshc commumcmon model iffme cost funcuon Txme a txmsm message afvalume v V rA7 Allrgzthex under af ne communrcataon model geometnc senes n Tare Mg n 7 i3 I 1 n Body gconcl 1 Many more partrcles tlaan process a rs t kperprocessox and aglomemte ppamcles mto each cluster Modrty the allegathex Alternate methodology Tasks for each st Aglomem on flxstpx01ects along tlaetamerterataon Anzlysls r4134 51 TreeHg H52 1 I Input Output Taskeclaannel model does not deal expllcitly wlth 10 Solutlon add auxlllarytzsks tlme also af ne functhn but dlfferent latency and bandwrdtla N a Dlstnbutlng the lnputs scatter g accountants example a Dlvlde amp conquer scatter rn af ne model M134 n TMMED PRAM Model Yet Another model to parallel ptogtammmg ldeal SIMD machme Completely Ignore communlcatlon Assume a shared memory globally zccesslble by all ptocessots In constant time Xhat ltptocessots access the same locatloh I PRAM Model V ariantsl l Coh lct tesolutaoh foxboth reads andwntes 39 Dlsallow w nkez machme but probably mote renhshc l Allow but provide atesolutloh mechamsm EREW Excluslve Read Excluslve Xnte no coh lcts allowed CREW ERCW tately cohsldeted cRcw Xnte coh lcts resolved by atblttaty myofthewntexs succeeds nonrdetezmlmshczlly pnonty thewntexwlth the hlghest pnonty succeeds mo e nlgonthm must ehsute that all waters mate the same value t comblhmg most powexful the values axe comblued usmg ah assodatlve etc 39 com th opentot e 3 sum max PRAM Matrix Multiplication I Work out On your sheet WhV PRAM Provldes lower bounds the best you can do m an ldezl sltuatloh plus mdlcatloh ofhow much you d lose moving to a te machme Good startlng polnt tot taslechahhel model a Evolutaoh otthe mathx multapllcataoh example