### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# Introduction to Parallel Processing CS 442

UNM

GPA 3.76

### View Full Document

## 53

## 0

## Popular in Course

## Popular in ComputerScienence

This 49 page Class Notes was uploaded by Trent Dare on Wednesday September 23, 2015. The Class Notes belongs to CS 442 at University of New Mexico taught by Staff in Fall. Since its upload, it has received 53 views. For similar materials see /class/212211/cs-442-university-of-new-mexico in ComputerScienence at University of New Mexico.

## Reviews for Introduction to Parallel Processing

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/23/15

Basic Communication Operations CS 442EECE 432 Brian T Smith UNM CS Dept Spring 2003 412003 commiops 1 Introduction 0 Outline Collective interactions 7 Those that involve two or more processes Oneto all broadcasts and allto one reductions Allto all broadcasts and reductions AllReduce and pre x sum operations Scatter and gather operations Allto all personalized communication Circular shi 412003 commiops 2 Why Study The Detailed Algorithms It provides simple applications to show how the cost model IAv mtw actually indicates which algorithms are best and why It provides a survey of the commonly used collective communication operations and basis for understanding how they work and what operations they perform It shows how to build the algorithms for complex networks from simple networks 2d meshes from ld rings It discusses the mathematical and computational cleanness and naturalness of the hypercube and mesh networks for many applications It demonstrates the design approach quotdesign to the hypercube mesh network and map to the real network as a viable approach 7 The point here is that modern networks are homogeneous for the most part and one only needs to worry about the fact that they are not when they become overloaded you are supposed to design away from overloads and contentions 412003 commiops 3 OneToAll Broadcast And Its Dual A single process sends the same data m to all processes 0 Eg matrix vector multiplication Its dual is allto one 0 That is all processes send a different item to one and reduce it to one item received 0 The reduction is one of the useful associative operations such as sum product minimum maxrmum or39ed an 39 Why is this not tnv1al 0 Because the straightforward way serializes the process gtgt That is process 0 sendsm to each process one at at time gtgt Thus it takes order p time where p is the number of processes 0 And we can do a lot better 412003 commiops 4 One To All Broadcast and Single Node Accumulation Fmedmfbg Smglomut accumuluuun D M is the reduced value of M39s from all Figure OnerloraH bruadcast and singlernode accumulauun processes say Cupyllght r H394 EGHJQWH CHWWHQS Publishing Co surnM39s One to all broadcast sage M is sent by process 0 to processes 01 p71 Single node accumulation or reductio A lprocesses send M o process 0 wrth a reduction say surn 412003 remap Recursive Doubling Algorithm For Broadcast The rst processor sends to a neighbor e two processors simultaneously send to two more processors 7 Them 4 processors send simultaneously to 4 other processors and so on e message gets to everyone in log p phases and maybe log 1 time provide there is no conges 39 n e 1 use quotphasesquot for the high level parts ofthe communication Phases are usually serial and steps n different phases are concurrent e 1 use quotstepsquot for the concurrent parts of each phase 412003 remnant OneToAll Broadcast On A Ring Log p phases with no congestion but multiple links per step except for the last phase 0 Notice the very regular pattern of communication via process number 0 Can you figure out the pattern in general for a ring of 2k processors 412003 commiops 7 AllToOne Reduction On A Ring phase 1 lt phase 1 The phases are reversed with the reduction as well as communication occurring in parallel 7 Eg On phase 1 P7 P5 P3 send to P6 P4 P2 and P6 P4 P2 perform a reduction say sum on the data they have and the data they receive and so on for phases 2 and 3 412003 commiops 8 How Do You Do This In MP1 0 Outline let k log p and assume p is a power of 2 For each 139 from 0 to kil log p phases 7 for each from 0 to 2 fl 2 steps gtgt Establish a fromjrocess and togrocess as fromjrocess j2kquot tOJJrocess fromjrocess 2161 gtgt lfmy process is the fromjrocess send the data to the togrocess gtgt lfmy process is the togrocess receive the data from the fromjrocess 412003 commiops 9 Matrix Vector Multiply P P P Vector b Output Onet o all broadcast Allt o one sum Matrix A l72 42 The diagram is drawn for 16 processes to multiply a 4gtlt4 matriXA times 4gtltl vector 1 with 4gtltl vector resulty 412003 commiops 10 A 2D Mesh Configuration Treat a 2d mesh configuration as a set of 1d wraparound rings in each direction x and y 7 Example a square mesh with p processors 7 Np processors in the x direction and Np processors in the y direction 0 Perform the broadcast in two phases 7 Phase 1 gtgt The process with the data broadcasts its data to all processes in its row x direction 7 Phase 2 gtgt All processes Np process in their co 7 How would you broadcast data in a 3d mesh of them in this row broadcast their data to every lurnn 412003 commiops phase 1 Onetoall broadcast from process 0 to all 16 processes in a 2d square mesh of 16 processes in 4 phases 412003 commiops A Hypercube OneToAll Broadcast 39 A hypercube of size p 201 can be considered to be a ddimensional mesh with 2 processors in each dimension Thus we use the same algorithm e dphaseswith 1 ormore steps per phase The number ofsteps depends on which phase andrs 2H for the kth phase Wm cannian 13 Hypercube Broadcast Figule 35 OnertoeaH prom serum 3 mreestrmerrsrorret hypercube The binary represemmruns or processor tar bets ere shown rn parentheees Cupytht rjr i994 BenjaminCummings pnetrsmrrg Co 39 3 phase broadcast k th phase with 2 steps Wm cannian m A Binary Balanced Tree Broadcast With A Processor At Every Node 0 A phase for every level of the tree Each phase is 2quot steps 7 2 branches from each node of the binary tree As the phase increases the number of concurrent operations increases 7 at level 1 2 concurrent steps 7 at level 2 4 concurrent steps 7 at level 3 8 concurrent steps 7 and so on at level k 2k concurrent steps 412003 commiops A Binary Balanced Tree Broadcast With Switches At NonLeaf Nodes 0 Suppose p 2 that is d levels Broadcast is like the ring algorithm with p processors in which d phases are used 7 process 0 sends to process 2 1 7 process 0 and 2 1 sends to processes 2 2 away from themselves 7 at the k phase whatever processes have the data send to processes 2445 away from themselves 7 See next slide for d 3 7 Notice that there is no congestion in the switches gtgt That is the messages traveling concurrently never go through the same switch or over the same path 412003 commiops Balanced Binary Tree With Switches 0 Notice no congestion of the messages 412003 commiops 17 Detailed Algorithms 0 For p 2d processes the communication algorithms are very similar for the wraparound ring mesh hypercube and balanced tree The algorithms are given for the hypercube mainly because the code is the simplest but the most dif cult to explain it uses the exclusive or operator 412003 commiops 18 Hypercube Broadcast From Process 0 procedure ONEiTOiALLiBC d myiid X Example eg1ri1lask 2d 7 1 mde 3910039 f0 39d71d0wnt00d0 13 ask mask XOR 2quot V ifmy7id AND mask hen mask 111 if myiid AND 2 0 t 11 Values 139 2 1 0 XOR 2 Phase 139 Values mask 3901139 3900139 3900039 Dimension ExecR0S SS Neighbor Only nodes with 039s in the last ibits communicate endfor Nodes with a 1 in the tLhbit receive end ONE7T07ALL7BC 0 NOTE SPMD program each process namedmy id executes this program 412003 commiops 19 Hypercube Broadcast From Any Process source procedure GENERALiONEiTOiALLiBC d myiid source X egm myivirtualiid myiid XOR source mask d 7 1 fort d7 1 downto 0 d0 mask mask XOR 2quot if myivz39rtualiid AND mask 0 then if myivirtualiid AND 2quot 0 then virtualidestination myivirtualiid XOR 2 sendX t0 virtualidestination XOR source else virtualisource myivirtualiid XOR 2quot receive X from virtualisource XOR source endelse endif endfor end GENERALiONEiTOiALLiBC 412003 commiops 20 Hypercube Sum Reduction To Process 0 On A Single Data Item Per Process procedure ALLiTOiONEiREDUCE d myiid rrg X sum begm everse the order and direction of the communication compared With the onet oall broadcast slide 19 if myiid AND 2 it then msgidestination myiid XOR 2 send sum to msgidestination else msgisource myiid XOR 2 receiveX from msgisource sum sum X endelse endif mask mask XOR 2 or end ALLiTOiONEiREDUCE 0 Note that mask is updated after the sendreceive operations whereas with onetoall broadcast it is updatedbefore the sendreceive operations 412003 commiops 21 Hypercube Sum Reduction To Process 0 On A VectorX of 172 Data Items procedure ALLiTOiONEiREDUCEiMilTEMS d myiid mX sum begin to 7 39 Forthe m item case m sums are mask quot 0 1 accumulated on eachprocess for i I 0 to d 7 1 do but only process 0 ends up with if myiid AND mask the correct values if myiid AND 2 it 0 then msgidestination myiid XOR 2 send sum to msgidestination else msgisource myiid XOR 2 receiveX from msgisource forj 0 to m7 1 do sumj sumj Xj endelse endif mask mask XOR 2 endfor end ALLiTOiONEiREDUCEiMilTEMS AssurneXis a vector of m data and the reduction is summation 412003 commiops 22 Cost Analysis For OneToAll Broadcast and AllToOne Reduction Because the hypercube ring mesh and balanced binary tree are very similar the cost of these operations is similar 0 Same number ofphases that is d log p phases 0 Although each phase is made up of one or more steps the steps involve a communication with the same number of words 7 Thus the communication time is IAv m tw log p 0 Compare this cost with the naive way 7 Note that we are assuming the reduction operation is a trivial cost relative to communication 412003 commiops 23 MP1 OneToAll Broadcast Procedures Oneto all broadcast C and CH error MP17Bcast srcidestidata count type romiproc communicator 39 Fortran call MP17Bcast srcidestidata count type amp romiproc communicator error 0 where Srcidestidata is the source of the data to be sent from process romiproc and is the destination in processes other than fr Omiproc count is the number of words to be broadcast type is the MP1 datatype 5 rcidestidat a see next slide f romiproc is the source process from which the data is being broadcast e communicator is the communicator handle for the process group on 412003 which the broadcast is occurring such as MPLCOMMiWORLD error is an integer indicating the success or failure of the procedure commiops 24 MPI Types Named constants are used to specify the NIPI types 7 For Fortran MPLINTEGER MPIiREAL MPIiDOUBLEiPRECISION MPLCHARARACTER MPIiLOGICAL 7 For C MPIilNTMPIiFLOATMPIiDOUBLEMP17CHAR 7 For C MPI INT MPIFLOAT MPI DOUBLE MPICHAR Named constants that specify the reduction operations 7 For Fortran and C MPLSUM MPIiMAXMP17BANDMPIiLAND 7 For C MPISUM MPIMAX MPIBAND MPILAND 412003 commiops 25 MPI OneToAll Reduction Procedures Oneto a11 reduction 0 Cand CH error MPIiReduce srcidata destidata count type operation toiproc communicator Fortran call MPIiReduce srcidata destidata count type amp operation toiproc communicator error 0 where 7 Srcidata is the source ofthe data to be sent from each process to the process fromiproc and is the destination in processes other than f romiproc 7 count is the number ofwords in s rcidata oftype type 7 type is the MPI datatype of Srcidata and destidata see previous slide 7 operation is the reduction operation to be performed see previous slide 7 toiproc is the process to which the reduced data is being sent 7 communicator is the communicator handle for the process group on which the reduction is occurring such as MPLCOMMiWORLD 7 error is an integer indicating the success or failure of the procedure 412003 commiops 25 MPI Blocking Sends And Receives Blocking send and blocking rece1ve C and C error MPlisenCU srcidata count type tag to roc communicator error MPliReCVT destidata count type tag fromiproc communicator status Fortran call MPlisenCU srcidata count type tag toiproc amp communicator error call MPliRe cv destidata count type tag fromiproc amp tag communicator status error where 7 s rcidata is the source data to be sent to process toip roc and destidata is the received data from process f romip roc 7 count type and communicator as before 7 toip roc is the process to which the data is being sent 7 fromip roc is the process from which the data is being sent 7 tag is an integer representing a message identi er 7 s t tus is aninteger army of size MPlisTATUsisIZE representing the st atus of the receive operation 7 e rro r is an integer indicating the success or failure ofthe procedure 412003 commiops 27 AllToAll Broadcast And Reduction Allto all broadcast called lPIAllgather in MPI 7 Each process sends its item to all processes 7 Each process receives an array of items one from each process 0 The naive way is to perform p onetoall broadcasts 7 This would take 15 m tWXp logp The efficient way is to concatenate messages traversing the same link taking t5 m twp7l Allto all reduction 7 All processes receive the reduction of data from all the processes that is 0 the kth process receives one item which is the reduction ofthe k th item from each process 412003 commiops 28 All To All Broadcast And Reduction For broadcast process 0 receives an M from each process For reduction this IS the reduction P 1 me Mlvs Rom 5 ocesses MHWGH bh d bl Mu Mi MN m Vin Mu G D 4 uh1chhwmummn K Fi ule39 A HoraH bwoadcast and mumnude accmnmauon Co n 0w 994 BemmmCummm s Pubhshm Co PV 9 J g g Amman For reductron this is the reductror ed quot oftheM s so all processes 412003 cumminps All To All Broadcast on Linear Array And Ring 0 There are p71 phases ofthe broadcast In each phase each process sends its message to the neighbor to the right and concatenates it with the messages it has already A erpil phases each process has 17 messages epel messages sent to it e the original message it started with 7 See the next slide for a gure for p 8 7 See the slide after that for the algorithIn 412003 cumminps AllToAll Broadcast Figure On A Ring Communication Phase 1 E P Messages at node at the beginning ofthis phase 412003 AllToAll Broadcast Algorithm On A Ring procedure ALLiToiALLchiRlNG myiid myimsg p result begin lEft I myiidi l mOdp lt Determine the process to my le right myiidJr l modp andto my right resu quot my msg msg result for i 3 1 top 1 d9 7quot Message received fromprevious send msg to right 4 step is passed along receive msg from left result result U msg endfor 4 The nal value of result is the end ALLiToiALLchiRlNG concatenation of messages from all processes 412003 commiops AllToAll Sum Reduction Algorithm On A Ring procedure ALLiToiALLiREDiRlNG myit39d myimsg p result begin left my id7 1 modp DeteIijiethe 7 1 7 process to my le tght myithr 1 modp39 andtomyngm ecv r r 0 for sum reduction 39 The message recv came from the process F1 to the right ofrne em send tem to left receive recv from right end or result Wirm myiid recv end ALLiToiALLiREDiRlNG 412003 commiops 33 Trace Of The Messages For All To All Surn Reduction M7 ltT 7 M 6 4 T M5 lt39 M 4 Zlt1 M ltTltT M ltT 7 M1 ltT 3341 M 0 P0 P1 P2 P3 P4 P5 P6 P7 For phase 1 and myiid0 i 1j 1 temp msg1 send temp receive reev from P1 For phase 2 and myiid0 i 2j 2 temp msg2recv send temp receive reev from P1 412003 commiops 34 Trace Of The Messages For All To All Sum Reduction M7 aLH H i HJ HJ HLkLJr aL M A tH t lt ltLlt La 393 M5 H FH H i HLNJ t lt Llt 6 4lt M 4 A ltJ ltz lt 4 L4J 45 J M0 q LRJ t4J FH F4 l F4LF1 l F P0 P1 P2 P3 P4 P5 P6 P7 0 After phase 7 Pk has the sum ofthe messages Mk from every process 412003 commiops 35 AllToAll Broadcast on Mesh of p Processes As with the previous one toall interactions this interaction is alltoall ring broadcasts in two phases First along all rows All nodes end up with a message of size m Wp Second downup all columns All nodes now end up with a message of size mp 412003 commiops 35 AllToAll Broadcast Algorithm on Mesh of p Processes procedure ALLiToiALLchiMESH myiid myimxg p result Communication along r0 I 12 ailm jdmn l timgjgih s hp J ng mg exulr o a 9 fori1top71 do send msg to right receive mxg from lefr result I remit U mg mm 9 6 9 Communicagun along columns up idi pmudp d n 39 my mg 6 0 9 mg 1 Check this algorithm out I sendmsgtu ri ht receive mxg from lefr result remltu mg on a 3X3 mesh endfur end ALLiToiALLchiMESH 412003 commiops 37 AllToAll Broadcast Diagrams on 3 X3 Mesh O 678 345 012 012 39 ution after rowwise broadcast 012 01214 5618 01z34 01z34 012341 56 78 5678 5678 Data distribution after subsequent column wise broadcast 412003 commiops 38 AllToAll Broadcast On The Hypercube 0 It is an extension of the mesh algorithm to cl log p dimensions gtgt Recall there are only two processes in ea h dimension so that there is only one interaction per dimension The XOR operation on the current dimension is used to determine the number of the quotpartnerquot process 7 The XOR operationmyiid XOR 2 changes the ith of myiid thus computing the number of the partner in the same dimension 412003 commiops AllToAll Broadcast Algorithm On The Hypercube procedure ALLiTOiALLchiHCUBH myiid myimsg d result e in result myimsg for i 0tod7 1 do partner myiid XOR 2 send result topartner receive msg from partner resultresultu ms endfor end ALLiToiALLchiHCUBE 412003 commiops AllToAll Broadcast Diagrams On 3d Hypercube 0 Note the algorithm slides 40 and 42 is a pipelined algorithm and turns out to be much more efficient than p onetoall broadcasts using the hypercube algorithm see slides 1314 0 The algorithm slides 40 and 42 Performs simultaneous pairs of sendreceives on disjoint nearest neighbor parts of the cube Each phase feeds to the next stage of communication so that after d phases all messages have been received by all processes 412003 commiops 41 a Initial distribution of messages be the rstphas 0 7 0 7 0 7 d Final distribution ofmessages 42 0 3 0 3 412003 c Distribution ofmessages commiops before third phase ALLTOALLBCHCUBE For a 2d Cube Variables From Algorithm On Slide 40 TUBU101 U111 412003 commiops 43 AllToAll Sum Reduction Algorithm On The Hypercube procedure ALLiToiALLiREDiHCUBH myiid msg d result egin recviloc 0 foridilt00do partner myiid XOR 2 j myiid A 2 k myiid XOR 2 AND 2 sendiloc recviloc k recviloc recviloc j send msg sendiloc sendiloc 2 7 l to partner receive temp 0 2 7 l frompartner formdilt00do msg recviloc m msg recviloc m tempm endfor endfor result msgmy71d end ALLiToiALLiREDiHCUBE 412003 commiops 44 ALLTOALLREDHCUBE For a 2d Cube Variables From Algorithm On Slide 44 412003 m o s 7 m f Notation M from node p eg 03 1s message 0 from nod ieu Cost Analysis Of AllToAll Broadcast On a ring or linear array of p processes slides 31 32 7 It takes p71 phases ofa message of size m 7 Thus the time to broadcast on a ring or linear array is T 4 tVnXPel On a mesh with p processes slides 373 8 7 The rst phase rows is Wp7l steps with message size m 7 The second phase columns is also Wp7l steps but with message size m p 7 Thus the time to broadcast on a mesh is T 4 MOOrel IS WVPXVP I 2 W71 was 71 412003 commiops Cost Analysis Of AllToAll Broadcast Continued 0 On a hypercube with p processes of dimension d log 7 There are d phases slides 40 42 7 The message doubles in size each successive phase 7 Thus for phase 139 the message is ofsize 2quot1 m Thus the time is d71 T 20 2 twm dt twmp l i0 412003 commiops Analysis Of AllToAll Broadcast The term proportional to the network bandwidth is the same for the ring mesh and hypercube with the same number of processors 0 Thus for long messages the times are the same 7 The hypercube is no better than the other con gurations for long messages 7 This is because of the pipelined nature of the hypercube algorithm gtgt This pipelined idea can be used to improve performance of a gorithms such as LU factorization and back substitution 0 For short messages the hypercube is the least 412003 commiops Contention Of Hypercube AllTo All Algorithm Mapped To A Ring gt messages Third phase of the 3d hypercube mapped to the ring Slide 42c 412003 commiops 49 AllReduce Operation Allreduce is the following operation 0 Each process starts with a value ofsize m and receives the same reduced value ofsize m of all the values 0 It has the same effect as an alltoone reduction followed by a onetoall broadcast 0 However it is much more efficient to use the alltoall broadcast algorithm and replace the concatenation operation U by the associative operation of the reduction like Allreduce is implemented as a variation of allto all broadcast see next slide and slide 40 0 For a hypercube it has a cost like T ts twm log p 7 Cf for alltoall broadcast on a hypercube tslogp WMpil 7 Note however the message does not increase in size like allto all broadcast 412003 commiops 50 7 x local sum x outgoing message a lnlu39al distribution of values before the rst phase 7 4 7 4 7 0 3 412003 c Distribution of sums before third phase 0 7 comm ops d Final distribution ofsums 0 7 AllReduce Algorithm On The Hypercube procedure ALLiREDUCEiHCUBE myiid myivalue d result begin result myivalue fori0tod71do partner my id XOR 2 send result minartner result resu endfor end ALLiREDUCEiHCUBE 412003 receive value from artner lt e V commiops Note U in alltoall broadcast algorithm replaced with for the allreduce algon39thm Cf Slide 40 52 PrefixSum Operation 0 Like an allreduce except that each node receives a partial or prefix sum of the values 7 Suppose processt has value nk for eachk from 0 to p71 7 Then process Pk ends up with the sum of the values no n1 nk Again based on the allto all broadcast Also called a scan operation A more general version is where the nk are vectors of values 7 Then the kth partial sum of each component ends up inPk Eg Suppose the values 3 l 4 0 2 are on 5 processes 7 The result of a prefix sum operation is the value 3 on process 0 4 3l on process 1 8 3l4 on process 2 8 3l40 on process 3 anle 3l 402 on process 4 412003 commiops Hypercube PrefiX Sum Diagrams 7 o x 1oca1 sum x outgoing message 0 1 a Initial distribution of va1ues before the rst phase b Distribution of sums before second phase 61 7 W 71 0 7 0 7 0 7 412003 0 Distribution of sums before commiops d Fina1 dismbutjoh ofpre x 54 third phase sums PrefixSum Algorithm On The Hypercube grocedure PREFIXisUMsiHCUBH myiid myinumber d result egln result myinumber msg resu t39 The message sent is the full a umulation he result is only the accumulation of values from processes earlier in the receive ms from 39 Processes order msg msg er if partner lt myiid then result rest1 msg endfor end PREFIXisUMsiHCUBE 412003 commiops Scatter and Gather Operations 0 The scatter operation sends a unique message from a specific process to all other processes Also called one to all personalized communication Different than a broadcast in that a broadcast sends the same one message to all other processes gtgt There is no duplication of data Its dual is the gather operation gtgt Gather a different data item from all processes gtgt Also called concatenation 412003 commiops Diagrams For Gather Scatter Operations Scatter from one to all 5 M0 M1 M1 6 Gatherfromalltoone Q Use a modi ed oneto all broadcast algorithm for scatter operation as follows 7 The rst send is the concatenation of all messages In the next hase each process sends half the received message to the partner and keeps halfthe message 7 e until ofthe message becomes of size 1 Reverse the process for gather concatenate instead of split on each send 412003 commiops 0113 a Initial distribution of be the rstphas messages 412003 0 Dismbution of messages comm ops d Final distribution ofmessages before third phase 58 Cost Analysis Of Scatter and Gather Operations There are log p steps on the hypercube 0 Each steps sendsreceives amessage of decreasing size each decrease is by a factor of2 gtgt The sum ofthe message lengths is 20 21 239H which is p71 0 Thus the time for the scatter operation is T Slogp W39Mp fl 0 The time for a gather is the same 7 Because the same algorithm is being used for the ring and 2d mesh the times for these topologies is the same as for the hypercube Exercise 47 412003 commiops 59 AllToAll Personalized Operations Each processes sends a different message to each other process also called total exchange 7 Notice this communication performs a transpose of the data MM M M owl 11H F l l M M H0 H1 IHJH Mai Mm Mprm M10 Mm M1971 615 b 323 i 37 Allt oall personalized communication Mao 412003 commiops 50 Matrix Transposition Consider an nxn matrix with n processes 0 Suppose each process has one row 0 An alltoall personalized communication will transpose the matrix amongst the processes 7 That is a process that has the kth row before the communication Will ve the kth column after the communication 7 Suppose p is smaller than n the usual case but for now p divides n 0 Now place np rows per processes and break the rows intop blocks of size np X n p 0 An alltoall personalized communication with blocks of size np gtlt np nearly performs a transpose 7 The transpose is completed by transposing the blocks once they arrive at their destination 412003 commiops o1 Diagram Of Matrix Transposition P0 A 21y Top left initially Pl AT 211 P2 Bottom right a er transpose P3 Reconsider this diagram Where the elements are blocks of a matrix After the blocks are exchangeot they MUST be transposedto complete the operation This transpose operation is used in 2 d FFT codes for example 412003 commiops o2 AllToAll Personalized Operations On A Ring Consider p processors each with p pieces of data of size In The operation is performed in p l phases on a p processor ring In the first phase each process sends all but one of its data to its neighbor in a message ofsize mp 71 Each process keeps on item of data and sends the rest to its neighbor on the second phase it sends mp 72 pieces of data And so on until there is no data to send 7 At this point each process has a portion of data from every process including 1 item it originally had 7 See the next slide Where the tuple fk designates a piece of data from process j destined for process k 412003 commiops AllToAll Personalized Comm On A 6 Processor Ring 5 my lt54 w 5 p p 3 0 64 727Tf4 9 7 a 3 Keeps 40 gt 394 4 eeps 30 quot372 5 5 661 270 Receives 10 412003 commiops 54 Cost Analysis On A Ring Of AllTo All Personalized Communication 0 From the previous diagram for a ring There are 1 steps for a p processor ring For the k step there are mng words sent Thus the cost is p71 T Z tstwmp k p ltvtwmpp l2 k1 I quotlth 2P 1 This is optimal ignoring IAv because it is usually negligible 412003 commiops 55 AllToAll Personalized Operations On A Mesh Perform the operation in 2 phases 7 For the rst phase perform an allto all personalized operation in each row 0 To do this the messages on each processor in a row need to be grouped so that gtgt the messages in the kth group are destined for the kth process in gtgt Now perform all row alltoall personalized communication 7 For the second phase perform an all toall personalized operation in each column 0 To do this the messages after the first phase in each process are grouped so that gtgt the messages in the kth group are destined for the kth process in the column gtgt Nowperform all column allto all personalized communications 412003 commiops All To All Personalized Operation On A 3X3 Mesh An Example of the 2 Communication Phases Using 361 Notation Part 1 8707873786 818487 87278757878 505356 Data distribution and arrangement before 0 606366 707376 616467 7174 77 626568 12357 578 3707373156 3717374lvll7 3727375lvl378D 4707473lvl4 4717474lvl477 4727475lvl478D sor ng 35 62650s Cx727578 21 7 222528 87278757878 0707073lvlov5lv 1707173lvlL6 0717074lvlov7lv 1717174717 6717674lvl677 37273751978 020508 121518 717477 424548 Data distribution and 8391398394398397 525558 arrangement before 87078737876 A the column 3114081 3 Ins0137 303336 0 2 0 5 0 8 personallzed 404346 414447 1392391S1S communicatiOn and 5707531576 515457 v 2727275lvl278 before resorting 07070731m6 17071731076 412003 lm nsln sp commiops 0717074lvlov7 17171741077 2717274lvl277D Sorting Process Before Phase 2 Communication 0 The sorting for process 8 is illustrated below After phase 1 communication by rows process 8 has 6526508 727578 828588 The result after sorting the messages in process 8 before phase 2 communication by columns 652752a82 6575585 687888 412003 commiops 68 All To All Personalized Operation On A 3X3 Mesh An Example of the 2 Communication Phases Using 361 Notation Part 2 575777578757 Data distribution and 68 7888 arrangement before um12152 354555 37874787578 commumcation after sorting 02 12 22 0 oysi lys zysi x3x 081si 28 384858 081828 4 aslnslmsp 67077707807 67377737837 67677767876D 67177717817 6747774lvlsv4 67777777877D 3707470150 3737473lvl573 3767476lvl576D 3717471151 37474741574 3777477lvl577D 07071701070 Ovllvllvllvllll 07371731073 0747174lvll4 0767176lvlzv6D 0777177lvlzv7D 07771771977 07571751975 374757 354555 061626 677787 657585 aiming o 667686 Flgal d alta dllstnbuat m 07371731931 041424 07271721972 a er e co 3737473l 3 344454 3 2 4 2 5 2 toall personalized 573 M77318 54 TQM874 639239739239839239 communlcation o 39 39 39 39 39 00l020 39 011121 412003 3 0 4 0 5 0 3vllvl4vllvl571 69 607080 mm 617181 Cost Analysis On A Mesh Of All ToAll Personalized Communication From the previous diagram for a mesh 0 There are Wp7l steps for ap processor mesh for phase 1 row communication but they are concurrent 7 The messages are of size mp 7 Thus the cost is substituting into the ring cost TttwmJEJE2JZ 1rxtwm 2JE 1 0 For the next step there IS the same communlcatlon With the same message lengths 0 However there are two sorting steps which takes trmp where I is the time to perform a read and write to memory 7 This is negligible compared to the t and tW terms 0 Thus the cost is T 21A twmp 1 0 This is optimal ignoring IS because it is usually negligible 412003 commiops 70 AllToAll Personalized Operations On A Hypercube Treat the ddimensional hypercube as a d dimensional mesh and perform an allto all personalized communication in each dimension d phases 0 This works and results in the communications shown on the next slide 7 At all phases each process has ppackets of size m 7 Each phase sorts its messages into two parts halves gtgt One half it keeps the size of the part is p2 packets gtgt The second part it sends to its partner in the kth dimension for the kth phase 412003 commiops All ToAll Personalized Operation On A 3d Hypercube An Example the 3 C smunication Phases Us 39 3611 N tion nomz Ms 6 quot67 lt6ogtlt62gt ms 70 m lt611gtylt63gt 6 NEWS 77gtgt 20 A 20ZZ Z6 0 0 32 36 z1z3 z7 33 17 0 57 40 3 0 lt4 Zgt lt 0 lt5o5z 56 sum 57 WNW ly6 00 07 110 117 00 oz 06 gaffe 053 mam m aInitialdistributionof b Distribution ofmessages a ers in rstphas 711 M75 M61 X615 73 7y7636y7gt z53135 7Z3Zy7gt 515541M415 535y7434y7 D messages before the rst phase 606lh 66 IMAM 76 mount 26 Q 3031 36 I 7173 77 61 1 67 20Zy43034 22Zy6321316gt OMAN mam 57 Hams w 01gtltosgtlt11gtlt15gt gtlt04gtlt1l0x14gt 03071317 00012 06 00 1 0 1 Z 1 6 1M 1 y D 012 Y Ylyzxmad h f Distn39 utiono messages c Distnbutlon ofmessages 412003 mme P5 after sorting in second phase 72 before second phase 02 lto6gtlt12gtlt16gt sgtlt31gtlt2sgt lt22gtlt26gtlt32gtlt36gtgt Z0ZA303A o lt2zgtlt26gtlt2zgtlt26gtgt 9 Z lt4ogtlt44gtlt5ogtltlgt e 37Z3Z7Y lt42gtlt46gtlt52gtlt7 l 404450 6 y y y y y 11554145 535y7434y7gt 0 01gtltosgtlt11gtlt15gt 00 04 10 14 00041014 03071317 Eioi zgi ioi ygg W v v Y W e Disthbutiohormessdg s 16 a Distribution ofmessages m39hgihsecohd ase before third phase a e lt42lt5Z6Zgt7Zgt 46566676 MM1ZgtltZZlt3Z 0616263 MM 1 6M 216M316 465y6661716gt QZM12ZYZ3Z 4 y y y y y y 2 4050160 445464 04ly4k y4 445464 0010z030 00MLOMZOMLOL 0 A l A A A A 39 412003 third phase after sorting commiops 5 FM d smbmwn messag s Cost Analysis On A Hypercube Of All ToAll Personalized Communication There are log p phases 0 Each process hasp messages at every step 0 In each phase each process sends concurrently half its messages to 1ts partner 0 Thus the cost is T tstwmp2logp 0 Again one must count the time to sort the messages but as before this is I mp per phase logp phases gt Because t39 is very small relative to t and tw it can be neg ec e 0 However this cost is not optimal The next slides give the optimal algorithm for long messages 412003 commiops 74 An AllToAll Personalized Optimal For Long Messages Algorithm On A Hypercube Consider an algorithm in which each node communicates with every other node p l times 7 Use the Ecube routing pattern to avoid congestion as follows 0 Each process sends to a partner llink away in the first two dimensions Each process sends to a process 2links away in the subspace ofthe above 2 dimensions 0 Each process sends to the remaining process llink away in the the next dimension 0 Each process sends to a process 2links away in the new subspaces and so on Each process sends to a process 3links away into a new subspace and so on 7 See the next two slides for an illustration for a 3d cube 7 The routing that must be used to avoid congestion is the Ecube routing sorting the dimensions in ascending order commiops 412003 75 b Second phase 1 away 39 2 di 39 en nd m 412003 0 Third phase 2away in commiops d Fourth phase 1 away in the 76 the 1st two subspaces next dimension 7 Phase Algorithm For 3D Hypercube e Seventh phase 3away in the full space 412003 commiops e Sixth phase 2away in the rdZ d subspace e Seventh phase path taken message 77 Code For AllToALL Personalized Communication On A d Dimensional Hypercube procedure ALLiToiALLiPERSONALiHCUBH d myiid forilt02 171 do partner myiid XOR i sendey 1017mm topartner receive pigmmm rom partner endfor end ALLiTOiALLiPERSONALiHCUBE 412003 commiops Review the previous slides notice that in the ith phase process myiid sends a message to process myiidXORi Also notice that no process sends a message to itself 78 Cost Analysis For ECube Routed Messages For AllToAll Personalized Communication Each process sends one message of length m to a partner in each phase p71 phases 0 Thus the cost is T ts twmp l The cost of the nonE cube routing algorithm is T l Av twmp2log p 0 The Ecube algorithm is better for large m by a factor of log 102 0 For small m IS will dominant and the factor logp2 is considerably smaller than the factor p 7l 7 Thus the nonEcube algorithm is faster for small messages 412003 commiops 79 Circular Shift Operations On A Mesh This operation sends a message of size m to a process q maybe 1 processors awa That is process myiid sends to process myiidJrq modp in a world ofp processes 7 Use Jl in string and image pattern matching codes 7 In a ring perform min q pq neighbortoneighbor communications in one direction 7 his is because it is wmpped around in the same way the modulo mctionis de ned 7 The mesh algorithm is based on the ring algorithm as follows 7 Shi the rows q mod Vp to the right gtgt This is correct except for the elements shi ed off the right which must be shi ed up to the next row 7 Shi the rst q mod Vp columns up one position to compensate 7 Shi all columns up LqVpi positions This can be improved by shi ing in the x and y directions the minimum distance and if x shift towards the left is used then shi the last columns down instead of up in the second step 412003 commiops 80 Circular Shift 5 Positions On A 4gtlt4 Mesh 6 1 J 63 60 b Results after the first phase before second phase for backward row shift a Initial distribution of data Arrows indicate communication to take place in the first phase Shift Arrows indicate communication right 5 mod 4 positions to take place in the second phase shift up the first 5 mod 4 columns up l54l positions 412003 commiops 81 Circular Shift 5 Positions On A 4gtlt4 Mesh Cont39d 0 Results after the second phase before third phase for column shift operation d Final distribution of data Arrows indicate communication to take place in the third phase shift all columns up l54l positions 412003 commiops 82 Cost Analysis For The Circular Shift Operation On A Mesh There are three phases of communication Assume that the shifts are optimized so that they use the minimum distance shift either left or right or up or down of at most LWpZJ shifts The rst and third phases send messages of size m for at most LWpZJ times The compensation shift may be multiple columns but is at most a shift of one position up or down Thus the cost is at most T IAv twm 1 412003 commiops 83 Circular Shift Operation On A Hypercube There are 3 reasonable algorithms 1 Map a ring of 2d nodes onto a hypercube of the same number of nodes using the re ected Gray code RGC Section 27 page 67 This mapping and ordering of the nodes has the property that any two nodes that are a distance 2 J apart in terms of the node numbers for j gt0 are exactly 2 links part and for j 0 any pair ofnodes distance by l are 1 link apart Using as many phases as 1 bits in q perform a forward shi equal to the distance of each bit in q The sum ofthese shifts is equal to q 2 Same as above but perform forward and backward shifts so that the number of shifts needed is reduced by nearly a factor of2 eg to perform a forward shift of 6 42 perform a backward shift of 2 instead Use Ecube routing on pointtopoint communication This algorithm is the best and is optimal for long messages 9 412003 commiops 84 RGC Mapping Of A 8 Node Ring To An 8 Node Hypercube Usual numbering Numbering using the RGC mapping Notice 091 are 1 1in apa from a ring to a hypercube Where as 192 are 2 links pm Notice 091 192 are all 1 link apart 0 92 193 294 are all 2 links apart and so on 412003 commiops 85 A 5 Shift On An 8 Node Hypercube Using The GC Numbering Part 1 First phase of a 4 shi Secondphase ofa4 shitt Arrows indicate the communication to be performed in this phase n indicates the message locatedinthis node before communication begins Arrows indicate the communication to be performed in this phase n indicates the message locatedinthis node before communication begins A forward 5 shift is a forward 4shi followed by a forward lshi A backward 3 shi is equivalent to a forward 5 shi A backward 3 shift is a backward 2 shi followed by a backward 1 shi 412003 commiops A 5 Shift On An 8 Node Hypercube Using The RGC Numbering Part 2 Third phase is a 1shi Final con gumtion a er the 4 shi and 1shi Arrows indicate the communication to Opera ons are complete be performed in this phase n indicates the message locatedinthis node before communication begins This is a 5shi eg message 0 node 5 1 gt6 2 7 3 0 4 gt15gt2 6 gt3 7gt gt4 412003 commiops 87 The Ecube Routed Shifts For An 8 Node Hypercube quotquotquot quot 39I I 3shi Recall the Ecube routing uses the exclusiveor of the source and destination node numbers and sends the message along the dimension indicated by the least signi cant 1bit of this exor Example 3shi from node 3 goes to node 6 send message to node 2 then node 6 412003 commiops 88 The Ecube Routed Shifts For An 8 Node Hypercube Continued Notice no link has more than two messages routed on it and when there are two they are opposite in directions thus no congestion or contention 412003 commiops Cost Analysis For The Circular Shift Operation On A Hypercube Algorithm 1 RGC numbering with forward shifts only gtgt log p phases with 2 steps in each phase 2 links except for the lshi operation gtgt Thus the cost is T ts twm2log p l Algorithm 2 RGC numbering with backward and forward shifts gtgt Shi distance is at mostp2 gtgt Thus the cost is T ts twm logp Algorithm 3 Ecube routing gtgt One pointtopoint communication with no contention gtgt Thus the cost is T t twm 412003 commiops Improving The Speed Of Some Operations There are tWO approaches Splitting and routing the messages in parts for each kind of interaction Onetoall broadcast Alltoone reduction Allreduce Using all ports connected to a node simultaneously 412003 commiops 91 Issues With Splitting The Message Into Smaller Pieces Can we do better by splitting the message M in p equal parts assuming p processors and performing the communication over different paths with shorter messages 7 Yes provided the size of M say m is large enough 0 For short messages the previous algorithms are better 0 The improvement comes from increasing the factor multiplying t5 and decreasing the factor multiplying IW in the cost formula Thus depending upon IS IW and p there is a cutoff for m 7 That is for messages of size m lt me for some me it is better to use the previous algorithms 7 That is for messages of size m 2 me for some me it is better to use the new algorithms given on the next slides 412003 commiops 92 OneToAll Broadcast On A Hypercube Split the message into p equal parts of size mp 0 Perform a scatter operation of the p parts to the p processes from the root process 7 The cost for this step from slide 59 is T Jogp tWmpp 71 0 Perform an alltoall broadcast of the p parts top processes The cost for this step from slide 47 is T Jogp tWmpp 71 7 The cost for both steps is Tm ZIJogp 2twm 7 Compare this with the onetoall broadcast cost of a message of size m from slide 23 which is T IAv m tw logp 7 For large m the quotmessage split approach is a factor of 0510g p faster 7 But the latency term is a factor 210g p larger 0 Similar improvements are seen for the ring and mesh topologies 412003 commiops 93 AllToOne Reduction And AllReduce This is the dual of the onetoall broadcast 7 Thus the improvement is similar for the hypercube mesh and ring An allreduce operation is as if semantically equivalent to an allto one reduction is followed by an oneto all broadcast 0 Thus the improvement is potentially the same 0 However it is even better because 7 we can skip the redundant operations in the middle they cancel That is gtgt An alltoone reduction is an alltoall reduction followed by a gather operation gtgt An oneto all broadcast is a scatter operation followed by an allto all broadcast 0 Thus perform an alltoall reduction followed by an alltoall broadcast 7 The cost for a hypercube is T e 2tSlogp 2Wquot 7 Compare this with slide 50 T 15 twm logp 412003 commiops AllPort Communication The algorithms so far have used the single p0rt communication model 0 We have assumed the one for outboard traffic and the one for inbound traffic can be used simultaneously 0 However most topologies has multiple 2way porm gtgt A ring has two ports le and two ports right gtgt A wraparound mesh has two ports in each of the north south east and west gtgt A hypercube of p nodes has 2 log p ports If the multiple ports are used the model used is an allport communication model 412003 commiops 95 All Port Communication Continued There is potential improvement in certain cases 0 For the hypercube using all ports means the potential for reducing the IW term by a factor of log p on the onetoall and alltoall broadcast and the personalized communications 0 For the ring and mesh the cost is reduced by at best a constant constant number of additional ports for these topologies and no improvement in the asymptotic rates constant factors do not affect these rates However there are major limitations 0 Very difficult and complicated to program 0 In cases where the communication costs are small compared to computation it makes no different 0 Not clear that the bus bandwidths and node architectures can handle the increased data movement by using multiple ports 0 A notable exception to this last point is large NUMA shared memory machines such as SGI Origin IBM SP with SMP nodes where multiple memory can match multiple ports 412003 commiops 95 Summary Of Costs For Comm On The Hypercube CrossSection Bandwidth Hypercube Time broadcast min 4 tm ogp 2tAlogp rm 1 reduction broadcast tAlogp twmQFl l Scatter Crosssection or bisection bandwidth requirement I Needed to attain the opemtion times I These times are valid for other topologies with this bisection Width I Ifnot available need to multiply the twterm by the factor that it misses these requirements 412003 commiops 97 broadcast 412003 commiops 98

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

#### "I signed up to be an Elite Notetaker with 2 of my sorority sisters this semester. We just posted our notes weekly and were each making over $600 per month. I LOVE StudySoup!"

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.