# PARALLEL & DIST COMPUTING CSC 4310

GSU

This 130-page set of class notes was uploaded by Mallie Crist on Monday, September 21, 2015. The notes belong to CSC 4310 at Georgia State University, taught by Staff in Fall. Since its upload, it has received 38 views. For similar materials, see /class/209898/csc-4310-georgia-state-university in Computer Science at Georgia State University.



## Chapter 10: Numerical Algorithms

### Matrices: A Review

An $n \times m$ matrix has $n$ rows and $m$ columns; element $a_{i,j}$ sits in row $i$ and column $j$, with $a_{0,0}$ at the top left and $a_{n-1,m-1}$ at the bottom right.

*Figure 10.1: An n × m matrix.*

Slides for *Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers*, Barry Wilkinson and Michael Allen, Prentice Hall, 1999. All rights reserved.

### Matrix Addition

Involves adding corresponding elements of each matrix to form the result matrix. Given the elements of $A$ as $a_{i,j}$ and the elements of $B$ as $b_{i,j}$, each element of $C$ is computed as

$$c_{i,j} = a_{i,j} + b_{i,j} \qquad (0 \le i < n,\; 0 \le j < m)$$

### Matrix Multiplication

Multiplication of two matrices $A$ and $B$ produces the matrix $C$ whose elements $c_{i,j}$ ($0 \le i < n$, $0 \le j < m$) are computed as

$$c_{i,j} = \sum_{k=0}^{l-1} a_{i,k}\, b_{k,j}$$

where $A$ is an $n \times l$ matrix and $B$ is an $l \times m$ matrix.

*Figure 10.2: Matrix multiplication, C = A × B (row i of A paired with column j of B; the partial results are summed into c_{i,j}).*

### Matrix-Vector Multiplication

Matrix-vector multiplication follows directly from the definition of matrix-matrix multiplication by making $B$ an $n \times 1$ matrix (vector). The result is an $n \times 1$ matrix (vector).

*Figure 10.3: Matrix-vector multiplication, c = A × b.*

### Relationship of Matrices to Linear Equations

A system of linear equations can be written in matrix form:

$$A\mathbf{x} = \mathbf{b}$$

Matrix $A$ holds the $a_{i,j}$ constants, $\mathbf{x}$ is a vector of the unknowns, and $\mathbf{b}$ is a vector of the $b_i$ constants.

### Implementing Matrix Multiplication

**Sequential code.** Assume throughout that the matrices are square ($n \times n$). The sequential code to compute $A \times B$ could simply be

```c
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        c[i][j] = 0;
        for (k = 0; k < n; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
```

This algorithm requires $n^3$ multiplications and $n^3$ additions, leading to a sequential time complexity of $O(n^3)$.

### Parallel Code

With $n \times n$ matrices, one can obtain:

- Time complexity of $O(n^2)$ with $n$ processors.
- Time complexity of $O(n)$ with $n^2$ processors, where one element of $A$ and $B$ is assigned to each processor. Cost optimal, since $O(n^3) = n \times O(n^2) = n^2 \times O(n)$.
- Time complexity of $O(\log n)$ with $n^3$ processors, by parallelizing the inner loop. Not cost optimal, since $O(n^3) \ne n^3 \times O(\log n)$.

$O(\log n)$ is the lower bound for parallel matrix multiplication.

### Partitioning into Submatrices

Suppose each matrix is divided into $s^2$ submatrices, each of $(n/s) \times (n/s)$ elements. Using the notation $A_{p,q}$ for the submatrix in submatrix row $p$ and submatrix column $q$:

```c
for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;                      /* clear elements of submatrix */
        for (r = 0; r < s; r++)        /* submatrix multiplication and */
            Cp,q = Cp,q + Ap,r * Br,q; /* add to accumulating submatrix */
    }
```

The line `Cp,q = Cp,q + Ap,r * Br,q;` means "multiply submatrices $A_{p,r}$ and $B_{r,q}$ using matrix multiplication and add the result to submatrix $C_{p,q}$ using matrix addition". This is known as *block matrix multiplication*.

*Figure 10.4: Block matrix multiplication.*

*Figure 10.5: Submatrix multiplication: multiplying A₀,₀ × B₀,₀ and accumulating toward C₀,₀, for 4 × 4 matrices divided into four 2 × 2 submatrices.*
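The triple loop and the block decomposition above can be sketched in ordinary Python (the slides use C-like pseudocode; function names such as `block_matmul` are illustrative, not from the slides):

```python
def matmul(A, B):
    """Plain triple-loop product: n^3 multiplications and n^3 additions."""
    n, l, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(l):
                C[i][j] += A[i][k] * B[k][j]
    return C

def block_matmul(A, B, s):
    """Divide n x n matrices into s^2 submatrices of (n/s) x (n/s) each."""
    n = len(A)
    w = n // s                            # submatrix width, n/s
    C = [[0] * n for _ in range(n)]
    for p in range(s):                    # submatrix row of C
        for q in range(s):                # submatrix column of C
            for r in range(s):            # C[p,q] += A[p,r] * B[r,q] as blocks
                for i in range(p * w, (p + 1) * w):
                    for j in range(q * w, (q + 1) * w):
                        for k in range(r * w, (r + 1) * w):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Both produce identical results: the block version performs the same $n^3$ scalar operations, merely grouped so that each $(n/s) \times (n/s)$ block product could be handed to a different processor.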
### Direct Implementation

One processor computes each element of $C$, so $n^2$ processors would be needed. Each processor needs one row of elements of $A$ and one column of elements of $B$; some of the same elements are sent to more than one processor. Submatrices can be used instead of individual elements.

*Figure 10.6: Direct implementation of matrix multiplication: processor P_{i,j} holds row i of A and column j of B and computes c_{i,j}.*

### Analysis

Assuming $n \times n$ matrices (not submatrices):

**Communication.** With separate messages to each of the $n^2$ slave processors (each receiving $2n$ elements and returning one result):

$$t_{comm} = n^2(t_{startup} + 2n\,t_{data}) + n^2(t_{startup} + t_{data})$$

A broadcast along a single bus would yield

$$t_{comm} = (t_{startup} + n^2 t_{data}) + n^2(t_{startup} + t_{data})$$

The dominant time is now in returning the results.

**Computation.** Each slave performs, in parallel, $n$ multiplications and $n$ additions; i.e., $t_{comp} = 2n$.

### Performance Improvement

Using a tree construction, $n$ numbers can be added in $\log n$ steps using $n$ processors, giving a computational time complexity of $O(\log n)$ using $n^3$ processors.

*Figure 10.7: Accumulation using a tree construction.*

### Recursive Implementation

*Figure 10.8: Submatrix multiplication and summation: each of the eight quadrant products (e.g., A_{p,p}B_{p,p} and A_{p,q}B_{q,p}) is handled by one of the processes P0 through P7.*

### Recursive Algorithm

```c
mat_mult(App, Bpp, s)
{
    if (s == 1)                   /* if submatrix has one element */
        C = A * B;                /* multiply elements */
    else {                        /* else continue to make recursive calls */
        s = s / 2;                /* number of elements in each row/column */
        P0 = mat_mult(App, Bpp, s);
        P1 = mat_mult(Apq, Bqp, s);
        P2 = mat_mult(App, Bpq, s);
        P3 = mat_mult(Apq, Bqq, s);
        P4 = mat_mult(Aqp, Bpp, s);
        P5 = mat_mult(Aqq, Bqp, s);
        P6 = mat_mult(Aqp, Bpq, s);
        P7 = mat_mult(Aqq, Bqq, s);
        Cpp = P0 + P1;            /* add submatrix products to     */
        Cpq = P2 + P3;            /* form submatrices of the final */
        Cqp = P4 + P5;            /* matrix                        */
        Cqq = P6 + P7;
    }
    return (C);                   /* return final matrix */
}
```

Apply the same algorithm on each submatrix recursively.

### Mesh Implementations: Cannon's Algorithm

Uses a mesh of processors with wraparound connections (a torus) to shift the $A$ elements (or submatrices) left and the $B$ elements (or submatrices) up:

1. Initially, processor $P_{i,j}$ has elements $a_{i,j}$ and $b_{i,j}$ ($0 \le i < n$, $0 \le j < n$).
2. Elements are moved from their initial positions to "aligned" positions. The complete $i$th row of $A$ is shifted $i$ places left and the complete $j$th column of $B$ is shifted $j$ places upward. This places element $a_{i,j+i}$ and element $b_{i+j,j}$ in processor $P_{i,j}$; these are a pair of those required in the accumulation of $c_{i,j}$.
3. Each processor $P_{i,j}$ multiplies its elements.
4. The $i$th row of $A$ is shifted one place left and the $j$th column of $B$ is shifted one place upward. This brings together the next pair of elements of $A$ and $B$ required in the accumulation.
5. Each processor $P_{i,j}$ multiplies the elements brought to it and adds the result to the accumulating sum.
6. Steps 4 and 5 are repeated until the final result is obtained ($n - 1$ shifts, with $n$ rows and $n$ columns of elements).

*Figure 10.9: Movement of A and B elements.*

*Figure 10.10: Step 2: alignment of elements of A and B (row i shifted i places, column j shifted j places).*

*Figure 10.11: Step 4: one-place shift of elements of A and B.*

### Analysis

**Communication.** Given $s^2$ submatrices, each of size $m \times m$, the initial alignment requires a maximum of $s - 1$ shift operations for $A$ and for $B$.
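The six steps above can be simulated on one machine; in this sketch (the name `cannon` is illustrative), cell $(i,j)$ of two shifted arrays stands in for processor $P_{i,j}$:

```python
def cannon(A, B):
    n = len(A)
    # Step 2: alignment. Shift row i of A left i places and column j of B
    # up j places, so cell (i,j) holds a[i][i+j] and b[i+j][j] (mod n).
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for _ in range(n):                        # n multiply steps, n - 1 shifts
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]  # steps 3/5: multiply-accumulate
        # step 4: shift A one place left and B one place up (wraparound)
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

After the alignment, cell $(i,j)$ multiplies $a_{i,k} b_{k,j}$ for $k = i + j \bmod n$; each one-place rotation advances $k$ by one, so $n$ multiply-accumulate steps sum all the terms of $c_{i,j}$.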
After the alignment, a further $s - 1$ shift operations are required, and each shift operation involves $m \times m$ elements:

$$t_{comm} = 4(s-1)\left(t_{startup} + m^2 t_{data}\right)$$

or a communication time complexity of $O(sm^2) = O(mn)$.

**Computation.** Each submatrix multiplication requires $m^3$ multiplications and $m^3$ additions. Hence, with $s - 1$ shifts,

$$t_{comp} = 2sm^3 = 2m^2 n$$

or a computational time complexity of $O(m^2 n)$.

### Systolic Array

The $a$ elements are pumped into the array from the left, with a one-cycle delay between successive rows, and the $b$ elements are pumped in from the top; each cell accumulates one element $c_{i,j}$.

*Figure 10.12: Matrix multiplication using a systolic array (pumping action of the a and b elements, one cycle delay per row/column).*

**Code.** Each processor $P_{i,j}$ repeatedly performs the same algorithm (after `c` is initialized to zero):

```c
recv(&a, P[i][j-1]);     /* receive from left            */
recv(&b, P[i-1][j]);     /* receive from above           */
c = c + a * b;           /* accumulate value for c[i][j] */
send(&a, P[i][j+1]);     /* send to right                */
send(&b, P[i+1][j]);     /* send downward                */
```

which accumulates the required summations.

### Matrix-Vector Multiplication

*Figure 10.13: Matrix-vector multiplication using a systolic array (the same pumping action with a single column of cells).*

### Solving a System of Linear Equations

$$
\begin{aligned}
a_{n-1,0}x_0 + a_{n-1,1}x_1 + a_{n-1,2}x_2 + \cdots + a_{n-1,n-1}x_{n-1} &= b_{n-1}\\
&\;\;\vdots\\
a_{2,0}x_0 + a_{2,1}x_1 + a_{2,2}x_2 + \cdots + a_{2,n-1}x_{n-1} &= b_2\\
a_{1,0}x_0 + a_{1,1}x_1 + a_{1,2}x_2 + \cdots + a_{1,n-1}x_{n-1} &= b_1\\
a_{0,0}x_0 + a_{0,1}x_1 + a_{0,2}x_2 + \cdots + a_{0,n-1}x_{n-1} &= b_0
\end{aligned}
$$

which in matrix form is $A\mathbf{x} = \mathbf{b}$. The objective in solving this system is to find values for the unknowns $x_0, x_1, \ldots, x_{n-1}$, given values for $a_{0,0}, \ldots, a_{n-1,n-1}$ and $b_0, \ldots, b_{n-1}$.

### Gaussian Elimination

The objective is to convert the general system of linear equations into a *triangular* system of equations, which can then be solved by back substitution.
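As a reference for what the systolic array of Figure 10.13 computes, a direct sketch of $\mathbf{c} = A\mathbf{b}$ (the name `matvec` is illustrative):

```python
def matvec(A, b):
    """Matrix-vector product: the n x 1 special case of matrix multiplication."""
    return [sum(A[i][k] * b[k] for k in range(len(b))) for i in range(len(A))]
```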
Gaussian elimination uses the characteristic of linear equations that any row can be replaced by that row added to another row multiplied by a constant. It starts at the first row and works toward the bottom row. At the $i$th row, each row $j$ below it is replaced by

$$\text{row } j - (\text{row } i)\,\frac{a_{j,i}}{a_{i,i}}$$

The constant used for row $j$ is $a_{j,i}/a_{i,i}$; this makes all the elements in the $i$th column below the $i$th row zero.

*Figure 10.14: Gaussian elimination: stepping through rows i, clearing column i below the diagonal (earlier columns already cleared to zero).*

### Partial Pivoting

If $a_{i,i}$ is zero or close to zero, we will not be able to compute the quantity $a_{j,i}/a_{i,i}$. The procedure must be modified into so-called *partial pivoting*, by swapping the $i$th row with the row below it that has the largest absolute element in the $i$th column (if there is one). Reordering the equations does not affect the system. In the following we do not consider partial pivoting.

### Sequential Code

Without partial pivoting:

```c
for (i = 0; i < n - 1; i++)          /* for each row, except last          */
    for (j = i + 1; j < n; j++) {    /* step through subsequent rows       */
        m = a[j][i] / a[i][i];       /* compute multiplier                 */
        for (k = i; k < n; k++)      /* modify last n - i elements of row j */
            a[j][k] = a[j][k] - a[i][k] * m;
        b[j] = b[j] - b[i] * m;      /* modify right side                  */
    }
```

The time complexity is $O(n^3)$.

### Parallel Implementation

The $i$th row (with its $b_i$ element) is broadcast to the processors holding the rows below it; earlier columns of those rows have already been cleared to zero.

*Figure 10.15: Broadcast in the parallel implementation of Gaussian elimination.*

### Analysis

**Communication.** $n - 1$ broadcasts are performed sequentially; the $i$th broadcast contains $n - i + 1$ elements:

$$t_{comm} = \sum_{i=0}^{n-2}\left(t_{startup} + (n - i + 1)\,t_{data}\right)$$

or a time complexity of $O(n^2)$.

**Computation.** After the row broadcast, each processor $P_j$ beyond the broadcast processor $P_i$ computes its multiplier and operates upon $n - j + 2$ elements of its row. Ignoring the computation of the multiplier, there are $n - j + 2$ multiplications and $n - j + 2$ subtractions:

$$t_{comp} = 2\sum_{j=1}^{n-1}(n - j + 2)\;=\;O(n^2)$$

Efficiency will be relatively low, because all the processors before the processor holding row $i$ do not participate in the computation again.

### Pipeline Configuration

*Figure 10.16: Pipeline implementation of Gaussian elimination (rows broadcast along the pipeline).*

### Strip Partitioning

*Figure 10.17: Strip partitioning: rows n/p, 2n/p, 3n/p, … divide the matrix among processors P0, P1, P2, P3.*

Processors do not participate in the computation after their last row is processed.

### Cyclic-Striped Partitioning

An alternative that equalizes the processor workload.

*Figure 10.18: Cyclic partitioning to equalize the workload.*

### Iterative Methods: Jacobi Iteration

Iteration formula: the $i$th equation rearranged to have the $i$th unknown on the left side:

$$x_i^{(k)} = \frac{1}{a_{i,i}}\left[b_i - \sum_{j \ne i} a_{i,j}\, x_j^{(k-1)}\right]$$

The superscript indicates the iteration: $x_i^{(k)}$ is the $k$th iteration of $x_i$, and $x_j^{(k-1)}$ is the $(k-1)$th iteration of $x_j$.

The time complexity of the direct method is significant, at $O(N^2)$ with $N$ processors. The time complexity of the iterative method depends upon the number of iterations and the required accuracy. A particular application of the iterative method is in solving a system of *sparse* linear equations.
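A single-process sketch of elimination followed by back substitution, here *with* the partial pivoting described above (the slide code omits it); `solve` is an illustrative name:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(A)
    A = [row[:] for row in A]          # work on copies
    b = b[:]
    for i in range(n - 1):
        # partial pivoting: bring up the row with the largest |a[r][i]|
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for j in range(i + 1, n):      # clear column i below the pivot
            m = A[j][i] / A[i][i]      # multiplier a[j][i] / a[i][i]
            for k in range(i, n):
                A[j][k] -= A[i][k] * m
            b[j] -= b[i] * m
    x = [0.0] * n                      # back substitution on the triangle
    for i in range(n - 1, -1, -1):
        s = sum(A[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```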
### Laplace's Equation

$$\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} = 0$$

The objective is to solve for $f$ over the two-dimensional space having coordinates $x$ and $y$. For a computer solution, *finite difference* methods are appropriate: the two-dimensional solution space is discretized into a large number of solution points.

*Figure 10.19: Finite difference method (solution space discretized into a mesh of points).*

If the distance $\Delta$ between the points in the $x$ and $y$ directions is made small enough,

$$\frac{\partial^2 f}{\partial x^2} \approx \frac{1}{\Delta^2}\left[f(x+\Delta, y) - 2f(x,y) + f(x-\Delta, y)\right]$$

$$\frac{\partial^2 f}{\partial y^2} \approx \frac{1}{\Delta^2}\left[f(x, y+\Delta) - 2f(x,y) + f(x, y-\Delta)\right]$$

Substituting into Laplace's equation, we get

$$f(x-\Delta,y) + f(x,y-\Delta) + f(x+\Delta,y) + f(x,y+\Delta) - 4f(x,y) = 0$$

Rearranging, we get

$$f(x,y) = \frac{f(x-\Delta,y) + f(x,y-\Delta) + f(x+\Delta,y) + f(x,y+\Delta)}{4}$$

The formula can be rewritten as an iterative formula:

$$f^{k}(x,y) = \frac{f^{k-1}(x-\Delta,y) + f^{k-1}(x,y-\Delta) + f^{k-1}(x+\Delta,y) + f^{k-1}(x,y+\Delta)}{4}$$

where $f^{k}(x,y)$ is the value obtained from the $k$th iteration and $f^{k-1}(x,y)$ is the value obtained from the $(k-1)$th iteration.

*Figure 10.20: Mesh of points numbered in natural order (x₁ through x₁₀₀ for a 10 × 10 mesh, with boundary points around the edges; see text).*

### Relationship with a General System of Linear Equations

Using natural ordering, the $i$th point is computed from the $i$th equation:

$$x_{i-n} + x_{i-1} - 4x_i + x_{i+1} + x_{i+n} = 0$$

which is a linear equation with five unknowns (except for equations corresponding to boundary points).
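The iterative formula above can be sketched directly (the name `jacobi_sweeps` is illustrative; the boundary rows and columns are held fixed, and each sweep replaces every interior point by the average of its four neighbours from the previous sweep):

```python
def jacobi_sweeps(f, sweeps):
    """Jacobi iteration for Laplace's equation on a square grid f."""
    n = len(f)
    for _ in range(sweeps):
        g = [row[:] for row in f]          # new iterate, boundary copied over
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                g[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1])
        f = g                              # all updates use the old iterate
    return f
```

With all boundary values equal to 1, the interior converges to 1 (the harmonic solution), which makes a convenient sanity check.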
In general form, the $i$th equation becomes

$$a_{i,i-n}x_{i-n} + a_{i,i-1}x_{i-1} + a_{i,i}x_i + a_{i,i+1}x_{i+1} + a_{i,i+n}x_{i+n} = 0$$

where $a_{i,i} = -4$ and $a_{i,i-n} = a_{i,i-1} = a_{i,i+1} = a_{i,i+n} = 1$.

*Figure 10.21: Sparse matrix for Laplace's equation (equations with a boundary point on the diagonal are unnecessary for the solution; boundary entries are included to complete the system; see text).*

### Gauss-Seidel Relaxation

Uses some of the newly computed values of the current iteration to compute other values in that iteration.

*Figure 10.22: Gauss-Seidel relaxation with natural order, computed sequentially (points before the current point already computed in this iteration).*

**Gauss-Seidel iteration formula:**

$$x_i^{(k)} = \frac{1}{a_{i,i}}\left[b_i - \sum_{j=1}^{i-1} a_{i,j}\,x_j^{(k)} - \sum_{j=i+1}^{N} a_{i,j}\,x_j^{(k-1)}\right]$$

where the superscript indicates the iteration. For Laplace's equation with natural ordering of the unknowns, at the $k$th iteration two of the four values (the points before the $i$th element) are taken from the $k$th iteration, and the two values of the points after the $i$th element are taken from the $(k-1)$th iteration:

$$f^{k}(x,y) = \frac{f^{k}(x-\Delta,y) + f^{k}(x,y-\Delta) + f^{k-1}(x+\Delta,y) + f^{k-1}(x,y+\Delta)}{4}$$

### Red-Black Ordering

The red points are computed first, then the black points. All red points can be computed simultaneously, and likewise all black points can be computed simultaneously.

*Figure 10.23: Red-black ordering.*

**Red-black parallel code:**

```c
forall (i = 1; i < n; i++)
    forall (j = 1; j < n; j++)
        if ((i + j) % 2 == 0)    /* compute red points */
            f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
forall (i = 1; i < n; i++)
    forall (j = 1; j < n; j++)
        if ((i + j) % 2 != 0)    /* compute black points */
            f[i][j] = 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1]);
```

### Higher-Order Difference Methods

More distant points could be used in the computation. The following update formula uses eight nearby points to update a point:

$$f^{k}(x,y) = \frac{1}{60}\Big[16f^{k-1}(x-\Delta,y) + 16f^{k-1}(x,y-\Delta) + 16f^{k-1}(x+\Delta,y) + 16f^{k-1}(x,y+\Delta) - f^{k-1}(x-2\Delta,y) - f^{k-1}(x,y-2\Delta) - f^{k-1}(x+2\Delta,y) - f^{k-1}(x,y+2\Delta)\Big]$$

*Figure 10.24: Nine-point stencil.*

### Overrelaxation

Improved convergence can be obtained by adding the factor $(1-\omega)x_i$ to the Jacobi or Gauss-Seidel formulae; the factor $\omega$ is the *overrelaxation parameter*.

Jacobi overrelaxation formula:

$$x_i^{(k)} = \frac{\omega}{a_{i,i}}\left[b_i - \sum_{j \ne i} a_{i,j}\,x_j^{(k-1)}\right] + (1-\omega)\,x_i^{(k-1)}$$

where $0 < \omega < 1$.

Gauss-Seidel successive overrelaxation:

$$x_i^{(k)} = \frac{\omega}{a_{i,i}}\left[b_i - \sum_{j=1}^{i-1} a_{i,j}\,x_j^{(k)} - \sum_{j=i+1}^{N} a_{i,j}\,x_j^{(k-1)}\right] + (1-\omega)\,x_i^{(k-1)}$$

where $0 < \omega \le 2$. If $\omega = 1$, we obtain the Gauss-Seidel method.

### Multigrid Method

First, a coarse grid of points is used; with these points, the iteration process will start to converge quickly. At some stage, the number of points is increased to include the points of the coarse grid plus extra points between them. The initial values of the extra points can be found by interpolation, and the computation continues with this finer grid. The grid can be made finer and finer as the computation proceeds, or the computation can alternate between fine and coarse grids. The coarser grids take distant effects into account more quickly and provide a good starting point for the next finer grid.
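Red-black ordering and the overrelaxation formula above can be combined in a short sketch (the name `redblack_sor_sweep` is illustrative; points with $i+j$ even are "red", odd are "black", so each half-sweep updates one colour everywhere and those updates are mutually independent):

```python
def redblack_sor_sweep(f, omega=1.5):
    """One red half-sweep then one black half-sweep of SOR, in place."""
    n = len(f)
    for parity in (0, 1):                  # 0: red points, 1: black points
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 == parity:
                    gs = 0.25 * (f[i-1][j] + f[i][j-1] + f[i+1][j] + f[i][j+1])
                    f[i][j] = omega * gs + (1 - omega) * f[i][j]
    return f
```

Because all points of one colour depend only on the other colour, the updates within a half-sweep could be executed in parallel, which is the point of the ordering.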
*Figure 10.25: Multigrid processor allocation (coarsest grid points and finer grid points mapped onto processors).*

## Chapter 7: Load Balancing and Termination Detection

*Load balancing*: used to distribute computations fairly across processors in order to obtain the highest possible execution speed.

*Termination detection*: detecting when a computation has been completed. This is more difficult when the computation is distributed.

*Figure 7.1: Load balancing: (a) imperfect load balancing leading to increased execution time; (b) perfect load balancing.*

### Static Load Balancing

Performed before the execution of any process. Some potential static load-balancing techniques:

- *Round robin algorithm*: passes out tasks in sequential order of processes, coming back to the first when all processes have been given a task.
- *Randomized algorithms*: select processes at random to take tasks.
- *Recursive bisection*: recursively divides the problem into subproblems of equal computational effort while minimizing message passing.
- *Simulated annealing*: an optimization technique.
- *Genetic algorithm*: another optimization technique (described in Chapter 12).

There are several fundamental flaws with static load balancing, even if a mathematical solution exists:

- It is very difficult to estimate accurately the execution times of various parts of a program without actually executing the parts.
- Communication delays vary under different circumstances.
- Some problems have an indeterminate number of steps to reach their solution.
### Dynamic Load Balancing

Performed during the execution of the processes. All the previous factors are taken into account by making the division of load dependent upon the execution of the parts as they are being executed. This incurs an additional overhead during execution, but it is much more effective than static load balancing.

**Processes and processors.** The computation is divided into work or *tasks*, and *processes* perform these tasks. Processes are mapped onto processors. Since our objective is to keep the processors busy, we are interested in the activity of the processors; however, we often map a single process onto each processor, so we will use the terms *process* and *processor* somewhat interchangeably.

Dynamic load balancing can be classified as one of the following:

- *Centralized*: tasks are handed out from a centralized location; a clear master-slave structure exists.
- *Decentralized*: tasks are passed between arbitrary processes. A collection of worker processes operate upon the problem and interact among themselves, finally reporting to a single process. A worker process may receive tasks from other worker processes, and may send tasks to other worker processes to complete or pass on at their discretion.

### Centralized Dynamic Load Balancing

The master processor holds the collection of tasks to be performed. Tasks are sent to the slave processes; when a slave process completes one task, it requests another task from the master process. Terms used: *work pool*, *replicated worker*, *processor farm*.
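A toy shared-memory stand-in for the centralized work pool, assuming Python threads play the slaves and a queue plays the master's task collection (all names are illustrative; real implementations would use message passing):

```python
import queue
import threading

def run_work_pool(tasks, n_workers=4):
    """Workers repeatedly 'request' (get) a task, process it, post a result."""
    work = queue.Queue()
    results = queue.Queue()
    for t in tasks:
        work.put(t)

    def worker():
        while True:
            try:
                t = work.get_nowait()   # request a task from the "master"
            except queue.Empty:
                return                  # no more tasks: this worker stops
            results.put(t * t)          # stand-in for the real computation

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    out = []
    while not results.empty():
        out.append(results.get())
    return sorted(out)
```

Termination here is the simple case from the slides: the queue empties and no slave generates new tasks, so an empty queue alone is a safe stopping condition.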
*Figure 7.2: Centralized work pool: the master process holds a queue of tasks; slaves send "request task" messages (possibly submitting new tasks) and receive tasks.*

**Termination.** Stopping the computation when the solution has been reached. When tasks are taken from a task queue, the computation terminates when both:

1. The task queue is empty, and
2. Every process has made a request for another task without any new tasks being generated.

It is *not* sufficient to terminate when the task queue is empty if one or more processes are still running, since a running process may provide new tasks for the task queue. In some applications, a slave may detect the program termination condition by some local termination condition, such as finding the item in a search algorithm.

### Decentralized Dynamic Load Balancing

**Distributed work pool.**

*Figure 7.3: A distributed work pool: the master holds the initial tasks; processes M0 through M(n-1) each hold a queue and serve their own slaves.*

**Fully distributed work pool.** Processes execute tasks from, and pass tasks to, each other.

*Figure 7.4: Decentralized work pool.*

### Task Transfer Mechanisms

**Receiver-initiated method.** A process requests tasks from other processes it selects. Typically, a process requests tasks from other processes when it has few or no tasks to perform. This method has been shown to work well at high system load. Unfortunately, it can be expensive to determine process loads.

**Sender-initiated method.** A process sends tasks to other processes it selects. Typically, a process with a heavy load passes out some of its tasks to others that are willing to accept them. This method has been shown to work well for light overall system loads. Another option is to have a mixture of both methods. Again, it can be expensive to determine process loads, and in very heavy system loads, load balancing can also be difficult to achieve because of the lack of available processes.

*Figure 7.5: Decentralized selection algorithm requesting tasks between slaves (each slave runs a local selection algorithm).*

### Process Selection

Algorithms for selecting a process:

- *Round robin algorithm*: process $P_i$ requests tasks from process $P_x$, where $x$ is given by a counter that is incremented after each request, using modulo $n$ arithmetic ($n$ processes), excluding $x = i$.
- *Random polling algorithm*: process $P_i$ requests tasks from process $P_x$, where $x$ is a number selected randomly between $0$ and $n - 1$ (excluding $i$).

### Load Balancing Using a Line Structure

*Figure 7.6: Load balancing using a pipeline structure (master process P0 at one end).*

The master process $P_0$ feeds the queue with tasks at one end, and the tasks are shifted down the queue. When a worker process $P_i$ ($1 \le i < n$) detects a task at its input from the queue and is idle, it takes the task from the queue. The tasks to the left then shuffle down the queue so that the space held by the task is filled, and a new task is inserted into the left end of the queue. Eventually, all processes have a task and the queue is filled with new tasks. High-priority or larger tasks could be placed in the queue first.

**Shifting.** Actions could be orchestrated by using messages between adjacent processes, for left and right communication and for the current task: if the buffer is empty, make a request and receive a task; if the buffer is full, send the task onward when requested.

*Figure 7.7: Using a communication process in line load balancing.*

### Code Using Time Sharing Between Communication and Computation

Master process ($P_0$):

```c
for (i = 0; i < no_tasks; i++) {
    recv(P1, request_tag);         /* request for task      */
    send(&task, P1, task_tag);     /* send tasks into queue */
}
recv(P1, request_tag);             /* request for task */
send(&empty, P1, task_tag);        /* end of tasks     */
```
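The two selection rules above can be sketched as small helper functions (illustrative names, not from the slides):

```python
import random

def round_robin_victim(i, n, counter):
    """P_i requests from P_x, x taken from a modulo-n counter, skipping i."""
    x = counter % n
    if x == i:                     # never select yourself
        counter += 1
        x = counter % n
    return x, counter + 1          # (victim, counter for the next request)

def random_polling_victim(i, n):
    """P_i requests from a uniformly random P_x with x != i."""
    x = random.randrange(n - 1)    # draw from 0 .. n-2
    return x if x < i else x + 1   # shift past i, keeping the draw uniform
```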
Process $P_i$ ($1 \le i < n$):

```c
if (buffer == empty) {
    send(P[i-1], request_tag);          /* request new task         */
    recv(&buffer, P[i-1], task_tag);    /* task from left process   */
}
if ((buffer == full) && (!busy)) {      /* get next task            */
    task = buffer;                      /* get task                 */
    buffer = empty;                     /* set buffer empty         */
    busy = TRUE;                        /* set process busy         */
}
nrecv(P[i+1], request_tag, request);    /* check msg from right     */
if (request && (buffer == full)) {
    send(&buffer, P[i+1]);              /* shift task forward       */
    buffer = empty;
}
if (busy) {                             /* continue on current task */
    /* do some work on task; if task finished, set busy to FALSE */
}
```

A nonblocking `nrecv()` is necessary to check for a request being received from the right.

### Nonblocking Receive Routines

**PVM.** The nonblocking receive `pvm_nrecv()` returns a value that is zero if no message has been received. A probe routine, `pvm_probe()`, can be used to check whether a message has arrived without actually reading it; a normal receive routine is subsequently needed to accept and unpack the message.

**MPI.** The nonblocking receive `MPI_Irecv()` returns a *request handle*, which is used in subsequent completion routines to wait for the message or to establish whether the message has actually been received at that point (`MPI_Wait()` and `MPI_Test()`, respectively). In effect, `MPI_Irecv()` posts a request for a message and returns immediately.

### Tree Structure

Tasks are passed from a node into one of the two nodes below it when that node's buffer is empty.

*Figure 7.8: Load balancing using a tree (a task moves down when requested).*

### Distributed Termination Detection Algorithms

**Termination conditions.** At time $t$, termination requires the following conditions to be satisfied:

1. Application-specific local termination conditions exist throughout the collection of processes at time $t$.
2. There are no messages in transit between processes at time $t$.

The subtle difference between these termination conditions and those given for a centralized load-balancing system is having to take into account messages in transit. The second condition is necessary for the distributed system because a message in transit might restart a terminated process. It is also more difficult to recognize: the time that messages take to travel between processes is not known in advance.

**Using acknowledgment messages.** Each process is in one of two states:

1. *Inactive*: without any task to perform.
2. *Active*.

The process that sent the task that made a process enter the active state becomes its *parent*. Whenever a process receives a task, it immediately sends an acknowledgment message, *except* when the task comes from its parent process. It sends an acknowledgment message to its parent only when it is ready to become inactive, i.e., when:

- its local termination condition exists (all tasks are completed),
- it has transmitted all its acknowledgments for tasks it has received, and
- it has received all the acknowledgments for the tasks it has sent out.
it has sent out.

The last condition means that a process must become inactive before its parent process. When the first process becomes idle, the computation can terminate.

Figure 7.9 Termination using message acknowledgments (parent sends first task; tasks and acknowledgments pass between the other processes; final acknowledgment returns to the parent)

Ring Termination Algorithms

Single-pass ring termination algorithm:

1. When P0 has terminated, it generates a token that is passed to P1.
2. When Pi (1 <= i < n) receives the token and has already terminated, it passes the token onward to Pi+1. Otherwise, it waits for its local termination condition and then passes the token onward. Pn-1 passes the token to P0.
3. When P0 receives a token, it knows that all processes in the ring have terminated. A message can then be sent to all processes informing them of global termination, if necessary.

The algorithm assumes that a process cannot be reactivated after reaching its local termination condition. This does not apply to work pool problems in which a process can pass a new task to an idle process.

Figure 7.10 Ring termination detection algorithm (token passed to the next processor when the local termination condition is reached)

Figure 7.11 Process algorithm for local termination

Dual-Pass Ring Termination Algorithm

Can handle processes being reactivated, but requires two passes around the ring. The reason for reactivation is for process Pi to pass a task to Pj, where j < i, and
after a token has passed Pj. If this occurs, the token must recirculate through the ring a second time. To differentiate these circumstances, tokens are colored white or black. Processes are also colored white or black. Receiving a black token means that global termination may not have occurred, and the token must be recirculated around the ring again.

The algorithm is as follows (again starting at P0):

1. P0 becomes white when it has terminated and generates a white token to P1.
2. The token is passed through the ring from one process to the next when each process has terminated, but the color of the token may be changed. If Pi passes a task to Pj, where j < i (that is, before this process in the ring), it becomes a black process; otherwise it is a white process. A black process will color a token black and pass it on. A white process will pass on the token in its original color (either black or white). After Pi has passed on a token, it becomes a white process. Pn-1 passes the token to P0.
3. When P0 receives a black token, it passes on a white token; if it receives a white token, all processes have terminated.

Notice that in both ring algorithms, P0 becomes the central point for global termination. Also, it is assumed that an acknowledge signal is generated for each request.

Figure 7.12 Passing task to previous processes

Tree Algorithm

The local actions described in Figure 7.11 can be applied to various structures, notably a tree structure, to indicate that processes up to that point have terminated.

Figure 7.13 Tree termination

Fixed Energy Distributed Termination Algorithm

Uses the notion of a fixed quantity within the system,
colorfully termed "energy."

The system starts with all the energy being held by one process, the master process. The master process passes out portions of the energy with the tasks to processes making requests for tasks. If these processes receive requests for tasks, the energy is divided further and passed to these processes. When a process becomes idle, it passes the energy it holds back before requesting a new task. A process will not hand back its energy until all the energy it handed out is returned and combined into the total energy held. When all the energy is returned to the root and the root becomes idle, all the processes must be idle and the computation can terminate.

Significant disadvantage: dividing the energy will be of finite precision, and adding the partial energies may not equate to the original energy. In addition, one can only divide the energy so far before it becomes essentially zero.

Shortest Path Problem

Finding the shortest distance between two points on a graph. It can be stated as follows:

Given a set of interconnected nodes where the links between the nodes are marked with weights, find the path from one specific node to another specific node that has the smallest accumulated weights.

The interconnected nodes can be described by a graph. The nodes are called vertices and the links are called edges. If the edges have implied directions (that is, an edge can only be traversed in one direction), the graph is a directed graph.

A graph could be used to find a solution to many different problems, for example:

1. The shortest distance between two towns or other points on a map, where the weights represent distance.
2. The quickest route to travel, where the weights represent time (the quickest route may not be the shortest route if different modes of travel are available; for example, flying to certain towns).
3. The
least expensive way to travel by air, where the weights represent the cost of the flights between cities (the vertices).
4. The best way to climb a mountain given a terrain map with contours.
5. The best route through a computer network for minimum message delay, where the vertices represent computers and the weights represent the delay between two computers.
6. The most efficient manufacturing system, where the weights represent hours of work.

The best way to climb a mountain will be used as an example.

Example: The Best Way to Climb a Mountain

Figure 7.14 Climbing a mountain (base camp and possible intermediate camps)

Figure 7.15 Graph of mountain climb

Weights in the graph indicate the amount of effort that would be expended in traversing the route between two connected camp sites. The effort in one direction may be different from the effort in the opposite direction (downhill instead of uphill) - a directed graph.

Graph Representation

Two basic ways that a graph can be represented in a program:

1. Adjacency matrix - a two-dimensional array, a, in which a[i][j] holds the weight associated with the edge between vertex i and vertex j, if one exists.
2. Adjacency list - for each vertex, a list of vertices directly connected to the vertex by an edge, and the corresponding weights associated with the edges.

The adjacency matrix is used for dense graphs; the adjacency list is used for sparse graphs. The difference is based upon space (storage) requirements. Accessing the adjacency list is slower than accessing the adjacency matrix.

           Destination
        A   B   C   D   E   F
    A   ∞  10   ∞   ∞   ∞   ∞
    B   ∞   ∞   8  13  24  51
Source
    C   ∞   ∞   ∞  14   ∞   ∞
    D   ∞   ∞   ∞   ∞   9   ∞
    E   ∞   ∞   ∞   ∞
∞  17
    F   ∞   ∞   ∞   ∞   ∞   ∞

(a) Adjacency matrix

Figure 7.16 Representing a graph: (a) adjacency matrix; (b) adjacency list (each source vertex points to a list of (vertex, weight) pairs terminated by NULL)

Searching a Graph

Two well-known single-source shortest-path algorithms:

Moore's single-source shortest-path algorithm (Moore, 1957)
Dijkstra's single-source shortest-path algorithm (Dijkstra, 1959)

which are similar. Moore's algorithm is chosen because it is more amenable to parallel implementation, although it may do more work. The weights must all be positive values for the algorithm to work. (Other algorithms exist that will work with both positive and negative weights.)

Moore's Algorithm

Starting with the source vertex, the basic algorithm implemented when vertex i is being considered is as follows. Find the distance to vertex j through vertex i and compare it with the current minimum distance to vertex j. Change the minimum distance if the distance through vertex i is shorter. In mathematical notation, if di is the current minimum distance from the source vertex to vertex i and wi,j is the weight of the edge from vertex i to vertex j, we have

dj = min(dj, di + wi,j)

Figure 7.17 Moore's shortest-path algorithm (vertex i to vertex j)

Data Structures and Code

A first-in first-out vertex queue is created to hold a list of vertices to examine. Initially, only the source vertex is in the queue. The current shortest distance from the source vertex to vertex i will be stored in the array dist[i] (1 <= i < n, for n vertices, where vertex 0 is the source vertex). At first, none of these distances are known and the array elements are initialized to infinity. Suppose w[i][j] holds the weight of the edge from vertex i to vertex j (infinity if no edge). The code could be of the
form

newdist_j = dist[i] + w[i][j];
if (newdist_j < dist[j]) dist[j] = newdist_j;

When a shorter distance is found to vertex j, vertex j is added to the queue (if not already in the queue), which will cause vertex j to be examined again - an important aspect of this algorithm which is not present in Dijkstra's algorithm.

Stages in Searching a Graph

Example: starting at vertex A, the two key data structures - the vertex queue and the current minimum distances dist[] - evolve as follows (∞ = not yet reached).

Initial values:
vertex:  A   B   C   D   E   F
dist:    0   ∞   ∞   ∞   ∞   ∞     (vertex queue holds A)

After examining A (updates B):
dist:    0  10   ∞   ∞   ∞   ∞

After examining B (updates F, E, D, and C):
dist:    0  10  18  23  34  61

After examining E (updates F):
dist:    0  10  18  23  34  51

After examining D (updates E):
dist:    0  10  18  23  32  51

After examining C (to D): no changes.

After examining E again (updates F):
dist:    0  10  18  23  32  49

No more vertices to consider. We have the minimum distance from vertex A to each of the other vertices, including the destination vertex F. Usually the actual path is also required in addition to the distance; then the path needs to be stored as distances are recorded. The path in our case is A -> B -> D -> E -> F.

Slides for Parallel
Programming Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall 1999. All rights reserved. Page 177

Sequential Code

Specific details of maintaining the vertex queue are omitted. Let next_vertex() return the next vertex from the vertex queue, or no_vertex if none. Assume that an adjacency matrix is used, named w[][]:

while ((i = next_vertex()) != no_vertex)   /* while a vertex */
  for (j = 1; j < n; j++)                  /* get next edge */
    if (w[i][j] != infinity) {             /* if an edge */
      newdist_j = dist[i] + w[i][j];
      if (newdist_j < dist[j]) {
        dist[j] = newdist_j;
        append_queue(j);                   /* vertex to queue if not there */
      }
    }                                      /* no more vertices to consider */

Parallel Implementations

Centralized Work Pool

The centralized work pool holds the vertex queue, vertex_queue[], as tasks. Each slave takes vertices from the vertex queue and returns new vertices. Since the structure holding the graph weights is fixed, this structure could be copied into each slave. Assume a copied adjacency matrix.

Master:

while (vertex_queue() != empty) {
  recv(P_ANY, source = Pi);                /* request task from slave */
  v = get_vertex_queue();
  send(&v, Pi);                            /* send next vertex and */
  send(&dist, &n, Pi);                     /* current dist array */
  recv(&j, &dist[j], P_ANY, source = Pi);  /* new distance */
  append_queue(j, dist[j]);                /* append vertex to queue and update distance array */
};
recv(P_ANY, source = Pi);                  /* request task from slave */
send(Pi, termination_tag);                 /* termination message */

Slave (process i):

send(P_master);                            /* send request for task */
recv(&v, P_master, tag);                   /* get vertex number */
if (tag != termination_tag) {
  recv(&dist, &n, P_master);               /* and dist array */
  for (j = 1; j < n; j++)                  /* get next edge */
    if (w[v][j] != infinity) {             /* if an edge */
      newdist_j = dist[v] + w[v][j];
      if (newdist_j < dist[j]) {
        dist[j] = newdist_j;
        send(&j, &dist[j], P_master);      /* add vertex to queue, send updated distance */
      }
    }
}

Slides for Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice
Hall 1999. All rights reserved. Page 179

Decentralized Work Pool

A convenient approach is to assign slave process i to search around vertex i only, and for it to have the vertex queue entry for vertex i, if this exists in the queue. The array dist[] will also be distributed among the processes, so that process i maintains the current minimum distance to vertex i. Process i also stores an adjacency matrix/list for vertex i, for the purpose of identifying the edges from vertex i.

Search Algorithm

The search is activated by loading the source vertex into the appropriate process. Vertex A is the first vertex to search. The process assigned to vertex A is activated. This process will search around its vertex to find distances to connected vertices. The distance to process j will be sent to process j for it to compare with its currently stored value, replacing it if the currently stored value is larger. In this fashion, all minimum distances will be updated during the search. If the contents of dist[i] change, process i will be reactivated to search again.

Figure 7.18 Distributed graph search (master process; each slave process holds one vertex, its edge weights w[], and its distance)

Slave (process i):

recv(newdist, P_ANY);
if (newdist < dist) {
  dist = newdist;
  vertex_queue = TRUE;            /* add to queue */
} else
  vertex_queue = FALSE;
if (vertex_queue == TRUE)         /* start searching around vertex */
  for (j = 1; j < n; j++)         /* get next edge */
    if (w[j] != infinity) {
      d = dist + w[j];
      send(&d, Pj);               /* send distance to proc j */
    }

Simplified slave (process i):

recv(newdist, P_ANY);
if (newdist < dist) {
  dist = newdist;                 /* start searching around vertex */
  for (j = 1; j < n; j++)         /* get next edge */
    if (w[j] != infinity) {
      d = dist + w[j];
      send(&d, Pj);               /* send distance to proc j */
    }
}

A mechanism is
necessary to repeat the actions and terminate when all processes are idle; it must also cope with messages in transit.

Simplest solution: use synchronous message passing, in which a process cannot proceed until the destination has received the message.

Note that a process is only active after its vertex is placed on the queue, and it is possible for many processes to be inactive, leading to an inefficient solution. The method is also impractical for a large graph if one vertex is allocated to each processor. In that case, a group of vertices could be allocated to each processor.

Chapter 4 Partitioning and Divide-and-Conquer Strategies

Partitioning Strategies

Partitioning simply divides the problem into parts.

Example: Adding a sequence of numbers

Divide the sequence into m parts, which are added independently to create partial sums.

Figure 4.1 Partitioning a sequence of numbers into parts and adding the parts (parts x0 ... x(n/m)-1, x(n/m) ... x(2n/m)-1, ..., x(m-1)(n/m) ... x(n-1); partial sums are added to form the final sum)

Using separate send()s and recv()s

Master:

s = n/m;                                   /* number of numbers for slaves */
for (i = 0, x = 0; i < m; i++, x = x + s)
  send(&numbers[x], s, Pi);                /* send s numbers to slave */
sum = 0;
for (i = 0; i < m; i++) {                  /* wait for results from slaves */
  recv(&part_sum, P_ANY);
  sum = sum + part_sum;                    /* accumulate partial sums */
}

Slave:

recv(numbers, s, P_master);                /* receive s numbers from master */
part_sum = 0;
for (i = 0; i < s; i++)                    /* add numbers */
  part_sum = part_sum + numbers[i];
send(&part_sum, P_master);                 /* send sum to master */

Using Broadcast/Multicast Routine

Master:

s = n/m;                                   /* number of numbers for slaves */
bcast(numbers, s, P_slave_group);          /* send all numbers to slaves */
sum = 0;
for (i = 0; i < m; i++) {                  /* wait for results from slaves */
  recv(&part_sum, P_ANY);
  sum = sum + part_sum;                    /* accumulate partial sums */
}

Slave:
bcast(numbers, s, P_master);               /* receive all numbers from master */
start = slave_number * s;                  /* slave number obtained earlier */
end = start + s;
part_sum = 0;
for (i = start; i < end; i++)              /* add numbers */
  part_sum = part_sum + numbers[i];
send(&part_sum, P_master);                 /* send sum to master */

Using scatter and reduce routines

Master:

s = n/m;                                   /* number of numbers for slaves */
scatter(numbers, &s, P_group, root = master);       /* send numbers to slaves */
reduce_add(&sum, &s, P_group, root = master);       /* results from slaves */

Slave:

scatter(numbers, &s, P_group, root = master);       /* receive s numbers */
                                                    /* add numbers */
reduce_add(&part_sum, &s, P_group, root = master);  /* send sum to master */

Analysis

Sequential: requires n - 1 additions, with a time complexity of O(n).

Parallel (using individual send and receive routines):

Phase 1 - Communication: t_comm1 = m(t_startup + (n/m)t_data)
Phase 2 - Computation: t_comp1 = n/m - 1
Phase 3 - Communication (returning partial results using send/recv routines): t_comm2 = m(t_startup + t_data)
Phase 4 - Computation (final accumulation): t_comp2 = m - 1

Overall:

t_p = t_comm1 + t_comm2 + t_comp1 + t_comp2
    = 2m*t_startup + (n + m)t_data + m + n/m - 2
    = O(n + m)

The parallel time complexity is worse than the sequential time complexity.

Divide and Conquer

Characterized by dividing a problem into subproblems that are of the same form as the larger problem. Further divisions into still smaller subproblems are usually done by recursion. A sequential recursive definition for adding a list of numbers is

int add(int *s)                /* add list of numbers, s */
{
  if (number(s) <= 2) return (n1 + n2);    /* see explanation */
  else {
    Divide(s, s1, s2);         /* divide s into two parts, s1 and s2 */
    part_sum1 = add(s1);       /* recursive calls to add sub lists */
    part_sum2 = add(s2);
    return (part_sum1 + part_sum2);
  }
}

Figure 4.2 Tree construction (initial problem divided recursively into final tasks)

Slides for
Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall 1999. All rights reserved. Page 81

Parallel Implementation

Figure 4.3 Dividing a list into parts (original list divided among the processes)

Figure 4.4 Partial summation (partial sums combined into the final sum)

Parallel Code

Suppose we statically create eight processors (or processes) to add a list of numbers.

Process P0:

divide(s1, s1, s2);            /* division phase: divide s1 into two, s1 and s2 */
send(s2, P4);                  /* send one part to another process */
divide(s1, s1, s2);
send(s2, P2);
divide(s1, s1, s2);
send(s2, P1);
part_sum = *s1;                /* combining phase */
recv(&part_sum1, P1);
part_sum = part_sum + part_sum1;
recv(&part_sum1, P2);
part_sum = part_sum + part_sum1;
recv(&part_sum1, P4);
part_sum = part_sum + part_sum1;

The code for process P4 might take the form

Process P4:

recv(s1, P0);                  /* division phase */
divide(s1, s1, s2);
send(s2, P6);
divide(s1, s1, s2);
send(s2, P5);
part_sum = *s1;                /* combining phase */
recv(&part_sum1, P5);
part_sum = part_sum + part_sum1;
recv(&part_sum1, P6);
part_sum = part_sum + part_sum1;
send(&part_sum, P0);

Similar sequences are required for the other processes.

Analysis

Assume n is a power of 2. The communication setup time, t_startup, is not included.

Communication. Division phase:

t_comm1 = (n/2)t_data + (n/4)t_data + (n/8)t_data + ... + (n/p)t_data = (n(p - 1)/p)t_data

Combining phase:

t_comm2 = t_data * log p

Total communication time:

t_comm = t_comm1 + t_comm2 = (n(p - 1)/p)t_data + t_data * log p

Computation:

t_comp = n/p + log p

Total parallel execution time:

t_p = (n(p - 1)/p)t_data + t_data * log p + n/p + log p

Figure 4.5 Part of a search tree (found / not found)

Slides for Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and
Michael Allen, Prentice Hall 1999. All rights reserved. Page 84

M-ary Divide and Conquer

The task is divided into more than two parts at each stage. Example: task broken into four parts. The sequential recursive definition would be

int add(int *s)                /* add list of numbers, s */
{
  if (number(s) <= 4) return (n1 + n2 + n3 + n4);
  else {
    Divide(s, s1, s2, s3, s4); /* divide s into s1, s2, s3, s4 */
    part_sum1 = add(s1);       /* recursive calls to add sublists */
    part_sum2 = add(s2);
    part_sum3 = add(s3);
    part_sum4 = add(s4);
    return (part_sum1 + part_sum2 + part_sum3 + part_sum4);
  }
}

A tree in which each node has four children is called a quadtree.

Figure 4.6 Quadtree

Figure 4.7 Dividing an image (image area; first division into four parts; second division)

Divide-and-Conquer Examples

Sorting Using Bucket Sort

The range of numbers is divided into m equal regions: 0 to a/m - 1, a/m to 2a/m - 1, 2a/m to 3a/m - 1, ... One bucket is assigned to hold the numbers that fall within each region. The numbers are placed into the appropriate buckets. The numbers in each bucket will be sorted using a sequential sorting algorithm. Works well if the original numbers are uniformly distributed across a known interval, say 0 to a - 1.

Figure 4.8 Bucket sort (unsorted numbers placed into buckets; sort contents of buckets; merge lists into sorted numbers)

Sequential time:

t_s = n + m((n/m)log(n/m)) = n + n log(n/m) = O(n log(n/m))

Parallel Algorithm

Bucket sort can be parallelized by assigning one processor for each bucket; this reduces the second term in the preceding equation to (n/p)log(n/p) for p processors (where p = m).

Unsorted
numbers (p processors); buckets; sort contents of buckets; merge lists; sorted numbers.

Figure 4.9 One parallel version of bucket sort

Further Parallelization

Partition the sequence into m regions, one region for each processor. Each processor maintains p small buckets and separates the numbers in its region into its own small buckets. These small buckets are then emptied into the p final buckets for sorting, which requires each processor to send one small bucket to each of the other processors (bucket i to processor i).

Figure 4.10 Parallel version of bucket sort (unsorted numbers, p processors; small buckets; empty small buckets into large buckets; sort contents of buckets; merge lists; sorted numbers)

Analysis

Phase 1 - Computation and Communication (partition numbers):
t_comp1 = n
t_comm1 = t_startup + t_data * n

Phase 2 - Computation (sort into small buckets):
t_comp2 = n/p

Phase 3 - Communication (send to large buckets), if all the communications could overlap:
t_comm3 = (p - 1)(t_startup + (n/p^2)t_data)

Phase 4 - Computation (sort large buckets):
t_comp4 = (n/p)log(n/p)

Overall:

t_p = t_startup + t_data*n + n/p + (p - 1)(t_startup + (n/p^2)t_data) + (n/p)log(n/p)

It is assumed that the numbers are uniformly distributed to obtain these formulas. The worst-case scenario would occur when all the numbers fell into one bucket.

all-to-all routine

Could be used for Phase 3 - sends data from each process to every other process.

Figure 4.11 All-to-all broadcast (each process's send buffer is distributed among the receive buffers of processes P0 ... Pn-1)

The all-to-all routine will actually transfer the rows of an array to columns.

Figure 4.12 Effect of all-to-all on an array (P0 ... P3)
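The sequential bucket sort of Figure 4.8 can be sketched in C roughly as follows. This is a minimal illustrative sketch, not code from the slides: the function name bucket_sort, the bucket-index formula, and the fixed per-bucket capacity are assumptions, and qsort() stands in for the per-bucket sequential sorting algorithm.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison function for the per-bucket sequential sort (qsort). */
static int cmp_int(const void *p, const void *q) {
    int x = *(const int *)p, y = *(const int *)q;
    return (x > y) - (x < y);
}

/* Sort n numbers in [0, a) using m buckets: place each number in its
   region's bucket, sort each bucket, then concatenate (merge) the buckets. */
void bucket_sort(int *numbers, int n, int a, int m) {
    int *buckets = malloc((size_t)n * (size_t)m * sizeof(int)); /* m buckets, each with capacity n */
    int *count = calloc((size_t)m, sizeof(int));                /* current size of each bucket */
    for (int i = 0; i < n; i++) {
        int b = (int)((long)numbers[i] * m / a);  /* region (bucket) index, 0..m-1 */
        buckets[b * n + count[b]++] = numbers[i];
    }
    int k = 0;
    for (int b = 0; b < m; b++) {                 /* sort each bucket, then concatenate */
        qsort(&buckets[b * n], (size_t)count[b], sizeof(int), cmp_int);
        memcpy(&numbers[k], &buckets[b * n], (size_t)count[b] * sizeof(int));
        k += count[b];
    }
    free(buckets);
    free(count);
}
```

Concatenating the sorted buckets yields a fully sorted list because every number in bucket b is no larger than any number in bucket b+1. In the parallel version of Figure 4.9, each of the p = m per-bucket sorts would be performed by a separate processor.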
Numerical Integration

A general divide-and-conquer technique divides the region continually into parts and lets some optimization function decide when certain regions are sufficiently divided.

Example: numerical integration

I = ∫ f(x) dx

Can divide the area into separate parts, each of which can be calculated by a separate process. Each region could be calculated using an approximation given by rectangles.

Figure 4.13 Numerical integration using rectangles (f(x) over the interval p to q)

A Better Approximation

Aligning the rectangles:

Figure 4.14 More accurate numerical integration using rectangles

Figure 4.15 Numerical integration using the trapezoidal method

Static Assignment

SPMD pseudocode:

Process Pi:

if (i == master) {                     /* read number of intervals required */
  printf("Enter number of intervals ");
  scanf("%d", &n);
}
bcast(&n, P_group);                    /* broadcast interval to all processes */
region = (b - a)/p;                    /* length of region for each process */
start = a + region * i;                /* starting x coordinate for process */
end = start + region;                  /* ending x coordinate for process */
d = (b - a)/n;                         /* size of interval */
area = 0.0;
for (x = start; x < end; x = x + d)
  area = area + f(x) + f(x + d);
area = 0.5 * area * d;
reduce_add(&integral, &area, P_group); /* form sum of areas */

A reduce operation is used to add the areas computed by the individual processes. (The calculation can be simplified somewhat by algebraic manipulation - see text.)

Adaptive Quadrature

A method whereby the solution adapts to the shape of the curve. Example: use three areas, A, B, and C. The computation is
terminated when the area computed for the largest of the A and B regions is sufficiently close to the sum of the areas computed for the other two regions.

Figure 4.16 Adaptive quadrature construction (f(x) with regions A, B, and C)

Some care might be needed in choosing when to terminate.

Figure 4.17 Adaptive quadrature with false termination (f(x) with regions A and B equal and C = 0)

Might cause us to terminate early, as the two large regions are the same (i.e., C = 0).

Gravitational N-Body Problem

The objective is to find the positions and movements of bodies in space (say, planets) that are subject to gravitational forces from other bodies, using Newtonian laws of physics. The gravitational force between two bodies of masses m_a and m_b is given by

F = G m_a m_b / r^2

where G is the gravitational constant and r is the distance between the bodies. Subject to forces, a body will accelerate according to Newton's second law:

F = ma

where m is the mass of the body, F is the force it experiences, and a is the resultant acceleration.

Let the time interval be Δt. Then, for a particular body of mass m, the force is given by

F = m(v^(t+1) - v^t) / Δt

and a new velocity

v^(t+1) = v^t + F Δt / m

where v^(t+1) is the velocity of the body at time t + 1 and v^t is the velocity of the body at time t. If a body is moving at a velocity v over the time interval Δt, its position changes by

x^(t+1) - x^t = v Δt

where x^(t+1) is its position at time t + 1. Once bodies move to new positions, the forces change and the computation has to be repeated.

Three-Dimensional Space

In a
three-dimensional space having a coordinate system (x, y, z), the distance between the bodies at (x_a, y_a, z_a) and (x_b, y_b, z_b) is given by

r = sqrt((x_b - x_a)^2 + (y_b - y_a)^2 + (z_b - z_a)^2)

The forces are resolved in the three directions, using, for example,

F_x = (G m_a m_b / r^2)((x_b - x_a)/r)
F_y = (G m_a m_b / r^2)((y_b - y_a)/r)
F_z = (G m_a m_b / r^2)((z_b - z_a)/r)

where the particles are of mass m_a and m_b and have coordinates (x_a, y_a, z_a) and (x_b, y_b, z_b).

Sequential Code

The overall gravitational N-body computation can be described by the algorithm

for (t = 0; t < tmax; t++) {       /* for each time period */
  for (i = 0; i < N; i++) {        /* for each body */
    F = Force_routine(i);          /* compute force on ith body */
    v_new[i] = v[i] + F * dt / m;  /* compute new velocity and */
    x_new[i] = x[i] + v_new[i] * dt; /* new position (leap-frog) */
  }
  for (i = 0; i < N; i++) {        /* for each body */
    x[i] = x_new[i];               /* update velocity and position */
    v[i] = v_new[i];
  }
}

Parallel Code

The algorithm is an O(N^2) algorithm (for one iteration), as each of the N bodies is influenced by each of the other N - 1 bodies. It is not feasible to use this direct algorithm for most interesting N-body problems, where N is very large. The time complexity can be reduced using the observation that a cluster of distant bodies can be approximated as a single distant body of the total mass of the cluster, sited at the center of mass of the cluster.

Figure 4.18 Clustering distant bodies (distant cluster of bodies approximated by its center of mass)

Barnes-Hut Algorithm

Start with the whole space, in which one cube contains the bodies (or particles). First, this cube is divided into eight subcubes. If a subcube contains no particles, the subcube is deleted from further consideration. If a subcube contains more than one body, it is recursively divided until every subcube contains one body. This process creates an
octtree - that is, a tree with up to eight edges from each node. The leaves represent cells, each containing one body.

After the tree has been constructed, the total mass and center of mass of the subcube are stored at each node. The force on each body can then be obtained by traversing the tree, starting at the root and stopping at a node when the clustering approximation can be used, e.g., when

r >= d/θ

where d is the dimension of the subcube and θ is a constant, typically 1.0 or less (θ is called the opening angle). Constructing the tree requires a time of O(n log n), and so does computing all the forces, so that the overall time complexity of the method is O(n log n).

Figure 4.19 Recursive division of two-dimensional space (particles; subdivision direction; partial quadtree)

Orthogonal Recursive Bisection

Consider a two-dimensional square area. First, a vertical line is found that divides the area into two areas, each with an equal number of bodies. For each area, a horizontal line is found that divides it into two areas, each with an equal number of bodies. This is repeated until there are as many areas as processors, and then one processor is assigned to each area.

Figure 4.20 Orthogonal recursive bisection method

Chapter 3 Embarrassingly Parallel Computations

A computation that can be divided into a number of completely independent parts, each of which can be executed by a separate processor.

Figure 3.1 Disconnected computational graph (embarrassingly parallel problem): input data; processes; results

Figure 3.2 Practical embarrassingly parallel computational graph with dynamic process creation and the
master-slave approach

Embarrassingly Parallel Examples

Low-level image operations:

(a) Shifting: Object shifted by Δx in the x-dimension and Δy in the y-dimension:

  x′ = x + Δx
  y′ = y + Δy

where x and y are the original and x′ and y′ are the new coordinates.

(b) Scaling: Object scaled by a factor Sx in the x-direction and Sy in the y-direction:

  x′ = xSx
  y′ = ySy

(c) Rotation: Object rotated through an angle θ about the origin of the coordinate system:

  x′ = x cos θ + y sin θ
  y′ = −x sin θ + y cos θ

Figure 3.3 Partitioning into regions for individual processes: (a) square region for each process, (b) row region for each process (640 × 480 image)

Pseudocode to Perform Image Shift

Master

  for (i = 0, row = 0; i < 48; i++, row = row + 10) /* for each process */
    send(row, P_i);                    /* send row no. */
  for (i = 0; i < 480; i++)            /* initialize temp */
    for (j = 0; j < 640; j++)
      temp_map[i][j] = 0;
  for (i = 0; i < (640 * 480); i++) {  /* for each pixel */
    recv(oldrow, oldcol, newrow, newcol, P_ANY);  /* accept new coords */
    if !((newrow < 0) || (newrow >= 480) || (newcol < 0) || (newcol >= 640))
      temp_map[newrow][newcol] = map[oldrow][oldcol];
  }
  for (i = 0; i < 480; i++)            /* update bitmap */
    for (j = 0; j < 640; j++)
      map[i][j] = temp_map[i][j];

Slave

  recv(row, P_master);                 /* receive row no. */
  for (oldrow = row; oldrow < (row + 10); oldrow++)
    for (oldcol = 0; oldcol < 640; oldcol++) {   /* transform coords */
      newrow = oldrow + delta_x;       /* shift in x direction */
      newcol = oldcol + delta_y;       /* shift in y direction */
      send(oldrow, oldcol, newrow, newcol, P_master); /* coords to master */
    }

Analysis

Sequential: two additions per pixel for an n × n image, so

  t_s = 2n² = O(n²)

Parallel:

Communication: the master sends a row number to each of the p processes and receives back the old and new coordinates of each of the n² pixels individually:

  t_comm = p(t_startup + t_data) + n²(t_startup + 4t_data) = O(p + n²)

Computation: each slave transforms n²/p pixels, with two additions per pixel:

  t_comp = 2n²/p = O(n²/p)

Overall Execution Time

  t_p = t_comp + t_comm

For constant p, this is O(n²). However, the constants hidden in the communication part far exceed those constants in the computation in most practical situations.

Mandelbrot Set

Set of points in a complex plane that are quasi-stable (will increase and decrease but not exceed some limit) when computed by iterating the function

  z(k+1) = z(k)² + c

where z(k+1) is the (k + 1)th iteration of the complex number z = a + bi, and c is a complex number giving the position of the point in the complex plane. The initial value for z is zero. Iterations are continued until the magnitude of z is greater than 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector, given by

  z_length = sqrt(a² + b²)

Computing the complex function z(k+1) = z(k)² + c is simplified by recognizing that

  z² = (a + bi)² = a² − b² + 2abi

i.e., a real part that is a² − b², and an imaginary part that is 2ab. The next iteration values can be produced by computing

  z_real = z_real² − z_imag² + c_real
  z_imag = 2 z_real z_imag + c_imag

Seq. Routine (computing value of one pt, returning no. of iterations)

  structure complex { float real; float imag; };

  int cal_pixel(complex c)
  {
    int count, max;
    complex z;
    float temp, lengthsq;
    max = 256;
    z.real = 0; z.imag = 0;
    count = 0;                         /* number of iterations */
    do {
      temp = z.real * z.real - z.imag * z.imag + c.real;
      z.imag = 2 * z.real * z.imag + c.imag;
      z.real = temp;
      lengthsq = z.real * z.real + z.imag * z.imag;
      count++;
    } while ((lengthsq < 4.0) && (count < max));
    return count;
  }

Scaling Coordinate System

For computational efficiency, let

  scale_real = (real_max - real_min) / disp_width;
  scale_imag = (imag_max - imag_min) / disp_height;

Including
scaling, the code could be of the form

  for (x = 0; x < disp_width; x++)     /* screen coordinates x and y */
    for (y = 0; y < disp_height; y++) {
      c.real = real_min + ((float) x * scale_real);
      c.imag = imag_min + ((float) y * scale_imag);
      color = cal_pixel(c);
      display(x, y, color);
    }

where display() is a routine to display the pixel (x, y) at the computed color.

Figure 3.4 Mandelbrot set (real and imaginary axes from −2 to 2)

Parallelizing Mandelbrot Set Computation

Static Task Assignment

Master

  for (i = 0, row = 0; i < 48; i++, row = row + 10) /* for each process */
    send(&row, P_i);                   /* send row no. */
  for (i = 0; i < (480 * 640); i++) {  /* from processes, any order */
    recv(&c, &color, P_ANY);           /* receive coordinates/colors */
    display(c, color);                 /* display pixel on screen */
  }

Slave (process i)

  recv(&row, P_master);                /* receive row no. */
  for (x = 0; x < disp_width; x++)     /* screen coordinates x and y */
    for (y = row; y < (row + 10); y++) {
      c.real = min_real + ((float) x * scale_real);
      c.imag = min_imag + ((float) y * scale_imag);
      color = cal_pixel(c);
      send(&c, &color, P_master);      /* send coords, color to master */
    }

Dynamic Task Assignment — Work Pool/Processor Farms

Figure 3.5 Work pool approach (a pool of tasks (xa, ya), (xb, yb), …; slaves return results and request a new task)

Coding for Work Pool Approach

Master

  count = 0;                           /* counter for termination */
  row = 0;                             /* row being sent */
  for (k = 0; k < procno; k++) {       /* assuming procno < disp_height */
    send(&row, P_k, data_tag);         /* send initial row to process */
    count++;                           /* count rows sent */
    row++;                             /* next row */
  }
  do {
    recv(&slave, &r, color, P_ANY, result_tag);
    count--;                           /* reduce count as rows received */
    if (row < disp_height) {
      send(&row, P_slave, data_tag);   /* send next row */
      row++;                           /* next row */
      count++;
    } else
      send(&row, P_slave, terminator_tag);  /* terminate */
    rows_recv++;
    display(r, color);                 /* display row */
  } while (count > 0);

Slides for Parallel Programming Techniques and
Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 1999. All rights reserved. Page 70

Slave

  recv(y, P_master, ANYTAG, source_tag);  /* receive 1st row to compute */
  while (source_tag == data_tag) {
    c.imag = imag_min + ((float) y * scale_imag);
    for (x = 0; x < disp_width; x++) {    /* compute row colors */
      c.real = real_min + ((float) x * scale_real);
      color[x] = cal_pixel(c);
    }
    send(&x, &y, color, P_master, result_tag);  /* row colors to master */
    recv(y, P_master, source_tag);        /* receive next row */
  }

Figure 3.6 Counter termination (count of rows outstanding in slaves: incremented when a row is sent, decremented when a row is returned; terminate when row = disp_height and count = 0)

Analysis

Sequential:

  t_s ≤ max_iter × n, i.e., t_s = O(n)

for n pixels, with at most max_iter iterations per pixel.

Parallel program:

Phase 1: Communication. A row number is sent to each of the s slaves:

  t_comm1 = s(t_startup + t_data)

Phase 2: Computation. Slaves perform their Mandelbrot computation in parallel:

  t_comp ≤ (max_iter × n)/s

Phase 3: Communication. Results are passed back to the master using individual sends:

  t_comm2 = n(t_startup + t_data)

Overall:

  t_p ≤ (max_iter × n)/s + (n + s)(t_startup + t_data)

Monte Carlo Methods

The basis of Monte Carlo methods is the use of random selections in calculations.

Example: To calculate π. A circle is formed within a square. The circle has unit radius, so that the square has sides 2 × 2.

Figure 3.7 Computing π by a Monte Carlo method (total area of square = 4, area of circle = π)

The ratio of the area of the circle to the square is given by

  (Area of circle)/(Area of square) = π(1)²/(2 × 2) = π/4

Points within the square are chosen randomly, and a score is kept of how many points happen to lie within the circle. The fraction of points within the circle will be π/4, given a sufficient number of randomly selected samples.

Computing an Integral

One quadrant of the construction in Figure 3.7 can be described by the integral

  ∫₀¹ √(1 − x²) dx = π/4

A random pair of numbers (xr, yr) would be generated, each between 0 and 1, and then counted as in the circle if yr ≤ √(1 − xr²), that is, yr² + xr² ≤ 1.

Figure 3.8 Function being integrated in computing π by a Monte Carlo method

Alternative (Better) Method

Use the random values of x to compute f(x) and sum the values of f(x):

  Area = ∫ f(x) dx from x1 to x2 ≈ lim (N→∞) (1/N) Σ (i = 1 to N) f(xr) (x2 − x1)

where xr are randomly generated values of x between x1 and x2.

Example: Computing the integral

  I = ∫ (x² − 3x) dx from x1 to x2

Sequential Code

  sum = 0;
  for (i = 0; i < N; i++) {            /* N random samples */
    xr = rand_v(x1, x2);               /* generate next random value */
    sum = sum + xr * xr - 3 * xr;      /* compute f(xr) */
  }
  area = sum * (x2 - x1) / N;

The routine rand_v(x1, x2) returns a pseudorandom number between x1 and x2.

Parallel Implementation

Figure 3.9 Parallel Monte Carlo integration (master supplies random numbers; slaves compute partial sums)

Pseudocode

Master

  for (i = 0; i < N/n; i++) {          /* n = no. of random numbers for slave */
    for (j = 0; j < n; j++)
      xr[j] = rand();                  /* load numbers to be sent */
    recv(P_ANY, req_tag, P_source);    /* wait for a slave to make request */
    send(xr, &n, P_source, compute_tag);
  }
  for (i = 0; i < slave_no; i++) {     /* terminate computation */
    recv(P_i, req_tag);
    send(P_i, stop_tag);
  }
  sum = 0;
  reduce_add(&sum, P_group);

Slave

  sum = 0;
  send(P_master, req_tag);
  recv(xr, &n, P_master, source_tag);
  while (source_tag == compute_tag) {
    for (i = 0; i < n; i++)
      sum = sum + xr[i] * xr[i] - 3 * xr[i];
    send(P_master, req_tag);
    recv(xr, &n, P_master, source_tag);
  }
  reduce_add(&sum, P_group);

Random Number Generation
The most popular way of creating a pseudorandom number sequence, x1, x2, x3, …, xi, xi+1, …, xn, is by evaluating xi+1 from a carefully chosen function of xi, often of the form

  xi+1 = (a xi + c) mod m

where a, c, and m are constants chosen to create a sequence that has similar properties to truly random sequences.

Parallel Random Number Generation

It turns out that if

  xi+1 = (a xi + c) mod m

then

  xi+k = (A xi + C) mod m

where A = a^k mod m, C = c(a^(k−1) + a^(k−2) + … + a¹ + a⁰) mod m, and k is a selected jump constant.

Figure 3.10 Parallel computation of a sequence

Chapter 11 Image Processing

Low-level Image Processing

Operates directly on a stored image to improve or enhance it. The stored image consists of a two-dimensional array of pixels (picture elements).

Figure 11.1 Pixmap (origin at top left; picture element — pixel)

Many low-level image-processing operations assume monochrome images and refer to pixels as having gray level values or intensities.

Computational Requirements

Suppose a pixmap has 1024 × 1024 pixels and 8-bit pixels. The storage requirement is 2^20 bytes (1 Mbyte). Suppose each pixel must be operated upon just once. Then 2^20 operations are needed in the time of one frame. At 10^−8 second/operation (10 ns/operation), this would take 10 ms. In real-time applications, the speed of computation must be at the frame rate (typically 60-85 frames/second). All pixels in the image must be processed in the time of one frame, that is, in 12-16 ms. Typically, many high-complexity operations must be performed, not just one
operation.

Point Processing

Operations that produce an output based upon the value of a single pixel.

Thresholding: Pixels with values above a predetermined threshold value are kept and others below the threshold are reduced to 0. Given a pixel, xi, the operation on each pixel is

  if (xi < threshold) xi = 0; else xi = 1;

Contrast Stretching: The range of gray level values is extended to make details more visible. Given a pixel of value xi within a range xl to xh, the contrast is stretched to the range xL to xH by

  xi = (xi − xl)((xH − xL)/(xh − xl)) + xL

Gray Level Reduction: The number of bits used to represent the gray level is reduced. A simple method would be to truncate the lesser significant bits.

Histogram

Shows the number of pixels in the image at each gray level.

Figure 11.2 Image histogram (number of pixels at each gray level, 0 to 255)

Sequential code:

  for (i = 0; i < height_max; i++)
    for (j = 0; j < width_max; j++)
      hist[p[i][j]] = hist[p[i][j]] + 1;

where the pixels are contained in the array p[][], and hist[k] will hold the number of pixels having the kth gray level. This is similar to adding numbers to an accumulating sum, and similar parallel solutions can be used for computing histograms.

Smoothing, Sharpening, and Noise Reduction

Smoothing suppresses large fluctuations in intensity over the image area and can be achieved by reducing the high-frequency content. Sharpening accentuates the transitions, enhancing the detail. Noise reduction suppresses a noise signal present in the image. These often require a local operation with access to a group of pixels around the pixel to be updated. A common group size is 3 × 3:

  x0 x1 x2
  x3 x4 x5
  x6 x7 x8

Figure
11.3 Pixel values for a 3 × 3 group

Mean

A simple smoothing technique is to take the mean, or average, of a group of pixels as the new value of the central pixel. Given a 3 × 3 group, the computation is

  x4′ = (x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)/9

where x4′ is the new value for x4.

Sequential Code

Nine steps to compute the average for each pixel, or 9n for n pixels — a sequential time complexity of O(n).

Parallel Code

The number of steps can be reduced by separating the computation into four data-transfer steps, in lockstep fashion.

Figure 11.4 Four-step data transfer for the computation of mean (step 1: each pixel adds the pixel from its left; step 2: each pixel adds the pixel from its right; step 3: each pixel adds the pixel from above; step 4: each pixel adds the pixel from below)

Figure 11.5 Parallel mean data accumulation (steps 1 and 2 accumulate the row sums x0+x1+x2, x3+x4+x5, and x6+x7+x8; steps 3 and 4 accumulate these row sums from above and below)

Median

Sequential Code

The median can be found by ordering the pixel values from smallest to largest and choosing the center pixel value (assuming an odd number of pixels). With a 3 × 3 group, suppose the values in ascending order are y0, y1, y2, y3, y4, y5, y6, y7, and y8. The median is y4. This suggests that all the values in the group must first be sorted, and then the fifth element taken to replace the original value of the pixel. Using bubble sort, in which the lesser values are found first in order, sorting could in fact be terminated after the fifth lowest
value is obtained. The number of steps for finding each median is given by 8 + 7 + 6 + 5 + 4 = 30 steps, or 30n for n pixels.

Parallel Code — An Approximate Sorting Algorithm

First, a compare-and-exchange operation is performed on each of the rows, requiring three steps. For the ith row we have

  p[i][j−1] ↔ p[i][j]
  p[i][j] ↔ p[i][j+1]
  p[i][j−1] ↔ p[i][j]

where ↔ means compare, and exchange if the left gray level is greater than the right gray level. Then the columns are sorted using three compare-and-exchange operations:

  p[i−1][j] ↔ p[i][j]
  p[i][j] ↔ p[i+1][j]
  p[i−1][j] ↔ p[i][j]

The value in p[i][j] is taken to be the fifth-largest pixel value. This does not always select the fifth-largest value, but it is a reasonable approximation, and it only requires six steps.

Figure 11.6 Approximate median algorithm requiring six steps (the row steps place the largest and next-largest values in each row; the column steps place the next-largest value in each column)

Weighted Masks

The mean method could be described by a weighted 3 × 3 mask. Suppose the weights are w0, w1, w2, w3, w4, w5, w6, w7, and w8, and the pixel values are x0, x1, x2, x3, x4, x5, x6, x7, and x8. The new center pixel value, x4′, is given by

  x4′ = (w0x0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6 + w7x7 + w8x8)/k

The scale factor, 1/k, is set to maintain the correct grayscale balance in the image after the operation. Often k is given by w0 + w1 + w2 + w3 + w4 + w5 + w6 + w7 + w8. The summation of products, Σ wixi, from the two functions w and x is the (discrete) cross-correlation of f with w (written as f ⊗ w).

Figure 11.7 Using a 3 × 3 weighted mask (mask w applied to pixels x produces result x4′)

Figure 11.8 Mask to compute mean (all nine weights are 1; k = 9)

Figure 11.9 A noise reduction mask

Figure 11.10 High-pass sharpening filter mask (center weight 8, other weights −1; k = 9)

The computation done with this mask is

  x4′ = (8x4 − x0 − x1 − x2 − x3 − x5 − x6 − x7 − x8)/9

Edge Detection

Highlighting the edges of an object, where an edge is a significant change in gray level intensity.

Gradient and Magnitude

Let us first consider a one-dimensional gray level function, f(x) (say along a row). If the function f(x) were differentiated, the first derivative, ∂f/∂x, measures the gradient of the transition and would be a positive-going or negative-going spike at a transition. The direction of the transition (either increasing or decreasing) can be identified by the polarity of the first derivative.

Figure 11.11 Edge detection using differentiation (intensity transition, first derivative, second derivative)

Image Function

A two-dimensional discretized gray level function, f(x, y).

Gradient (magnitude):

  ∇f = sqrt((∂f/∂x)² + (∂f/∂y)²)

Gradient direction:

  φ(x, y) = tan⁻¹((∂f/∂y)/(∂f/∂x))

where φ is the angle with respect to the y-axis. The gradient can be approximated to

  ∇f ≈ |∂f/∂y| + |∂f/∂x|

for reduced computational effort.

Figure 11.12 Gray level gradient and direction (image, constant intensity, gradient)

Edge Detection Masks

For discrete functions, the derivatives are approximated by differences. The term ∂f/∂x is the difference in the x-direction, and ∂f/∂y is the difference in the y-direction. We might consider computing
the approximate gradient using pixel values x5 and x3 to get ∂f/∂x, and x7 and x1 to get ∂f/∂y, i.e.,

  ∂f/∂x ≈ x5 − x3
  ∂f/∂y ≈ x7 − x1

so that

  ∇f ≈ |x7 − x1| + |x5 − x3|

Two masks are needed, one to obtain x7 − x1 and one to obtain x5 − x3. The absolute values of the results of each mask are added together.

Prewitt Operator

The approximate gradient is obtained from

  ∂f/∂y ≈ (x6 − x0) + (x7 − x1) + (x8 − x2)
  ∂f/∂x ≈ (x2 − x0) + (x5 − x3) + (x8 − x6)

Then

  ∇f ≈ |x6 − x0 + x7 − x1 + x8 − x2| + |x2 − x0 + x5 − x3 + x8 − x6|

which requires using the two 3 × 3 masks.

Figure 11.13 Prewitt operator

Sobel Operator

The derivatives are approximated to

  ∂f/∂y ≈ (x6 + 2x7 + x8) − (x0 + 2x1 + x2)
  ∂f/∂x ≈ (x2 + 2x5 + x8) − (x0 + 2x3 + x6)

Operators implementing first derivatives will tend to enhance noise. However, the Sobel operator also has a smoothing action.

Figure 11.14 Sobel operator

Figure 11.15 Edge detection with a Sobel operator

Laplace Operator

The Laplace second-order derivative is defined as

  ∇²f = ∂²f/∂x² + ∂²f/∂y²

approximated to

  ∇²f = 4x4 − (x1 + x3 + x5 + x7)

which can be obtained with the single mask

   0 −1  0
  −1  4 −1
   0 −1  0

Figure 11.16 Laplace operator

Figure 11.17 Pixels used in Laplace operator (upper pixel x1, left pixel x3, right pixel x5, lower pixel x7)

Figure 11.18 Effect of the Laplace operator

The Hough Transform

Purpose: to find the parameters of the equations of lines that most likely fit the sets of pixels in an image. A line is described by the equation

  y = ax + b

where the parameters a and b uniquely describe the particular line; a is the slope and b is the intercept on the y-axis. A search for those lines with the most pixels mapped onto them, trying each possible line in turn, would be computationally prohibitively expensive (a brute-force algorithm).

Suppose the equation of the line is rearranged as

  b = −xa + y

Every point that lies on a specific line in the x-y space will map into the same point in the a-b space (the parameter space). Therefore, we can find the number of points on each possible line in the a-b space by simply counting whenever a mapping maps into a point in the a-b space; i.e., the pixel (x1, y1) in the image maps into the line b = −x1a + y1 in the parameter space.

Figure 11.19 Mapping a line into a-b space ((a) x-y plane, (b) parameter space)

Finding the Most Likely Lines

A single point (x1, y1) can be mapped into the points of a line, b = −x1a + y1, in the a-b space. In the mapping process, discrete values will be used to a coarse prescribed precision, and the computation is rounded to the nearest possible a-b coordinates. The mapping process is done for every point in the x-y space. A record is kept of those a-b points that have been obtained by incrementing the corresponding accumulator. Each accumulator will have the number of pixels that map into a single point in the parameter space. The points in the parameter space with locally maximum numbers of pixels are chosen as lines.

The method will fail for vertical lines and with lines that approach this extreme. To avoid the problem, the line equation is rearranged to polar coordinates:

  r = x cos θ + y sin θ

where r is the perpendicular distance to the origin in the original x-y coordinate system, and θ is the angle between r and the x-axis.

Figure 11.20 Mapping a line into r-θ space ((a) x-y plane, (b) r-θ plane)

The θ value will be in degrees and, very conveniently, will also be the gradient angle of the line (with respect to the x-axis). Each line will map to a single point in the r-θ space. A vertical line will simply map into a point where θ = 0° (if a positive intercept with the x-axis) or θ = 180° (if a negative intercept with the x-axis). A horizontal line has θ = 90°. Each point in the x-y space will map into a curve in the r-θ space; i.e.,
(x1, y1) maps into the curve r = x1 cos θ + y1 sin θ.

Implementation

Assume the origin is at the top left corner.

Figure 11.21 Normal representation using image coordinate system

The parameter space is divided into small rectangular regions. One accumulator is provided for each region. The accumulators of those regions that a pixel maps into are incremented. This process must be done for all the pixels in the image. If all values of θ were tried (i.e., incrementing θ through all its values), the computational effort would be given by the number of discrete values of θ, say k intervals. With n pixels, the complexity is O(kn). The computational effort can be reduced significantly by limiting the range of lines for individual pixels using some criteria. A single value of θ could be selected based upon the gradient of the line.

Figure 11.22 Accumulators, acc[r][θ], for the Hough transform

Sequential Code

  for (x = 0; x < xmax; x++)           /* for each pixel */
    for (y = 0; y < ymax; y++) {
      sobel(&x, &y, dx, dy);           /* find x and y gradients */
      magnitude = grad_mag(dx, dy);    /* find magnitude if needed */
      if (magnitude > threshold) {
        theta = grad_dir(dx, dy);      /* atan2() fn */
        theta = theta_quantize(theta);
        r = x * cos(theta) + y * sin(theta);
        r = r_quantize(r);
        acc[r][theta]++;               /* increment accumulator */
        append(r, theta, x, y);        /* append point to line */
      }
    }

Finally, when all pixels have been considered, the values held in each accumulator will give the number of pixels that can map onto the corresponding line. The most likely lines are those with the most pixels, and the accumulators with the local maxima are chosen. An algorithm for choosing the local maxima must be devised.

Slides for Parallel Programming Techniques and Applications
using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 1999. All rights reserved. Page 291

Parallel Code

Since the computation for each accumulator is independent of the other accumulations, it could be performed simultaneously, although each requires read access to the whole image. Left as an exercise.

Transformation into the Frequency Domain

Fourier Transform

There are many applications for the Fourier transform in science and engineering. In the area of image processing, the Fourier transform is used in image enhancement, restoration, and compression. The image is a two-dimensional discretized function, f(x, y), but let us first start with the one-dimensional case. For completeness, let us first review the results of the Fourier series and Fourier transform concepts from first principles.

Fourier Series

The Fourier series is a summation of sine and cosine terms, and can be written as

  x(t) = a0/2 + Σ (j = 1 to ∞) [ aj cos(2πjt/T) + bj sin(2πjt/T) ]

T is the period (1/T = f, where f is a frequency). By some mathematical manipulation, we can get the more convenient description of the series:

  x(t) = Σ (j = −∞ to ∞) Xj e^(2πijt/T)

where Xj is the jth Fourier coefficient in a complex form and i = √−1. The Fourier coefficients can also be computed from specific integrals.

Fourier Transform

Continuous Functions: The previous summation is developed into the integral

  x(t) = ∫ X(f) e^(2πift) df

where X(f) is a continuous function of frequency. The function X(f) can be obtained from

  X(f) = ∫ x(t) e^(−2πift) dt

X(f) is the spectrum of x(t), or, more simply, the Fourier transform of x(t). The original function, x(t), can be obtained from X(f) using the first integral given, which is the inverse Fourier transform.

Slides for Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 1999. All rights
reserved. Page 293

Discrete Functions

For functions having a set of N discrete values, replacing the integral with a summation leads to the discrete Fourier transform (DFT), given by

  X(k) = (1/N) Σ (j = 0 to N−1) x(j) e^(−2πijk/N)

and the inverse discrete Fourier transform, given by

  x(k) = Σ (j = 0 to N−1) X(j) e^(2πijk/N)

for 0 ≤ k ≤ N − 1. The N (real) input values, x0, x1, x2, …, x(N−1), produce N (complex) transform values, X0, X1, X2, …, X(N−1).

Fourier Transforms in Image Processing

A two-dimensional Fourier transform is

  X(l, m) = Σ (j = 0 to N−1) Σ (k = 0 to M−1) x(j, k) e^(−2πi(jl/N + km/M))

where j and k are row and column coordinates, 0 ≤ j ≤ N − 1 and 0 ≤ k ≤ M − 1. Assume the image is square, where N = M. The equation can be rearranged into

  X(l, m) = Σ (j = 0 to N−1) [ Σ (k = 0 to N−1) x(j, k) e^(−2πikm/N) ] e^(−2πijl/N)

The inner summation is a one-dimensional DFT operating on N points of a row, to produce a transformed row. The outer summation is a one-dimensional DFT operating on N points of a column. We might write

  X(j, m) = Σ (k = 0 to N−1) x(j, k) e^(−2πikm/N)

Hence, the two-dimensional DFT can be divided into two sequential phases, one operating on rows of elements and one operating on columns of (transformed) elements.

Figure 11.23 Two-dimensional DFT (transform rows, then transform columns)

Applications

Frequency filtering can be described by the convolution operation

  h(j, k) = g(j, k) ⊛ f(j, k)

(the same operation as the cross-correlation operation for symmetrical masks), where g(j, k) describes the weighted mask (filter) and f(j, k) describes the image. It can be shown that the Fourier transform of the convolution of two functions is given by the product of the transforms of the individual functions. Hence, the convolution of two functions can be obtained by taking the Fourier transforms of each function, multiplying the transforms

  H(j, k) = G(j, k) × F(j, k)

(element-by-element multiplication), where F(j, k) is the Fourier transform of f(j, k) and G(j, k) is the Fourier transform of g(j, k), and then taking the inverse
transform to return the result into the original (spatial) domain.

Figure 11.24 Convolution using Fourier transforms: (a) direct convolution, h(j, k) = g(j, k) ⊛ f(j, k); (b) transform the image f(j, k) to F(j, k) and the filter g(j, k) to G(j, k), multiply to get H(j, k) = G(j, k) × F(j, k), and take the inverse transform to recover h(j, k)

Parallelizing the Discrete Fourier Transform

Starting from

  X(k) = (1/N) Σ (j = 0 to N−1) x(j) e^(−2πijk/N)

and using the notation w = e^(−2πi/N),

  X(k) = (1/N) Σ (j = 0 to N−1) x(j) w^(jk)

The w terms are called the twiddle factors. Each input value has to be multiplied by a twiddle factor. The inverse transform can be obtained by replacing w with w^(−1).

Sequential Code

  for (k = 0; k < N; k++) {            /* for every point */
    X[k] = 0;
    for (j = 0; j < N; j++)            /* compute summation */
      X[k] = X[k] + w^(j * k) * x[j];
  }

where X[k] is the kth transformed point, x[j] is the jth input, and w = e^(−2πi/N). The summation step requires complex number arithmetic. The code can be rewritten as

  for (k = 0; k < N; k++) {
    X[k] = 0;
    a = 1;
    for (j = 0; j < N; j++) {
      X[k] = X[k] + a * x[j];
      a = a * w^k;
    }
  }

where a is a temporary variable.

Elementary Master-Slave Implementation

One slave process of N slave processes could be assigned to produce one of the transformed values, i.e., the kth slave process produces X[k]. The parallel time complexity with N slave processes is O(N).

Figure 11.25 Master-slave approach for implementing the DFT directly
wk could be performed by a separate pipeline stage xll39 Process f Values for next iteration Xk X a k X a Figure 1126 One stage of a pipeline implementation of DFT algorithm Slides for Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers Barry Wilkinson and Michael Allen Prentice Hall 1999 All rights reserved Page 298 Output sequence X01X11X2X3l a Pipeline structure Figure 1127 Discrete Fourier transform with a pipeline X0 Xlll X2 X3 X4 X5 X6 TTTTTTT PN71 PN72 I I Plpehne stages I P 2 P 1 P0 gt Time b Timing diagram Slides for Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers Barry Wilkinson and Michael Allen Prentice Hall 1999 All rights reserved Page 299 DFT as a MatrixVector Product The kth element of the discrete Fourier transform is given by 0 l 2 3 Nil Xkx0w x1w x2w x3w xN1w and the whole transform can be described by a matrixvector product X0 1 1 1 1 1 x0 X1 1 w w2 W3 wNi1 x1 X2 1 w2 W4 W6 w2ltN7U x2 X3 l l W3 W6 W9 w3ltN7U x3 N Xk 1 wk w2k w3k wNilk xk XNi1 1wN71w2N71W3N71I IWN71N71 xNil Note wo 1 Hence the parallel methods of producing a matrixvector product as described in Chapter 10 can be used for the discrete Fourier transform Fast Fourier Transform Method of obtaining discrete Fourier transform with a time complexity of ON logN instead of ONZ Let us start with the discrete Fourier transform equation 7 1 N71 k X k 7 v 2 xj wl 0 where w e ZMN Slides for Parallel Programming Techniques and Applications using Networked Workstations and Parallel Computers Barry Wilkinson and Michael Allen Prentice Hall 1999 All rights reserved Page 300 Many formulations The version that starts by dividing the summation into the following two parts 1 Nzyi 2k N271 21k 7 J J Xk N Z x2jw Z x2jlw 0 0 where the rst summation processes the x s with even indices and the second summation processes the x s with odd indices Rearranging we get 1 1 m 2jk k 1 m 2jk X k 2N2 xZJW W N2 E 
or, since w^2 = e^(-2*pi*i/(N/2)),

    X[k] = (1/2) [ (1/(N/2)) sum_{j=0}^{N/2-1} x[2j] * (w^2)^(jk)  +  w^k * (1/(N/2)) sum_{j=0}^{N/2-1} x[2j+1] * (w^2)^(jk) ]

Each summation can now be recognized as an N/2-point discrete Fourier transform, operating on the N/2 even-indexed points and the N/2 odd-indexed points, respectively:

    X[k] = (1/2) ( X_even[k] + w^k * X_odd[k] )    for k = 0, 1, ..., N-1

where X_even is the N/2-point DFT of the numbers with even indices, x[0], x[2], x[4], ..., and X_odd is the N/2-point DFT of the numbers with odd indices, x[1], x[3], x[5], ....

Now suppose k is limited to 0, 1, ..., N/2-1, the first N/2 values of the total N values. The complete sequence can be divided into two parts:

    X[k] = (1/2) ( X_even[k] + w^k * X_odd[k] )

and

    X[k + N/2] = (1/2) ( X_even[k] + w^(k+N/2) * X_odd[k] ) = (1/2) ( X_even[k] - w^k * X_odd[k] )

since w^(k+N/2) = -w^k, where 0 <= k < N/2. Hence, we could compute X[k] and X[k + N/2] using two N/2-point transforms.

Figure 11.28 Decomposition of an N-point DFT into two N/2-point DFTs: the input sequence is split into an N/2-point DFT of the even-indexed points (giving X_even) and an N/2-point DFT of the odd-indexed points (giving X_odd, multiplied by w^k), for k = 0, 1, ..., N/2-1.

Decomposition

Each of the N/2-point DFTs can be decomposed into two N/4-point DFTs, and the decomposition can be continued until single points are to be transformed. A 1-point DFT is simply the value of the point. The computation is often depicted in the butterfly form shown in Figure 11.29.

Figure 11.29 Four-point discrete Fourier transform.

Figure 11.30 Sixteen-point DFT decomposition: repeatedly splitting into even- and odd-indexed subsequences leaves the inputs x[0], x[8], x[4], x[12], x[2], x[10], x[6], x[14], x[1], x[9], x[5], x[13], x[3], x[11], x[7], x[15] in bit-reversed order (0000, 1000, 0100, 1100, ...).

Sequential Code

The sequential time complexity is essentially O(N log N), since there are log N steps and each step requires a computation proportional to N, where there are N numbers. The algorithm can be implemented recursively or iteratively.
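A recursive implementation falls straight out of the two half-size relations above. The following is a minimal Python sketch (the function name `fft` is mine; N must be a power of two, and the factor of 1/2 applied at each level reproduces the slides' overall 1/N scaling):

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT using the slides' splitting:
      X[k]       = (X_even[k] + w^k * X_odd[k]) / 2
      X[k + N/2] = (X_even[k] - w^k * X_odd[k]) / 2
    with w = e^(-2*pi*i/N).  A 1-point DFT is simply the value of the
    point, which terminates the recursion."""
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    even = fft(x[0::2])      # N/2-point DFT of the even-indexed points
    odd = fft(x[1::2])       # N/2-point DFT of the odd-indexed points
    w = cmath.exp(-2j * cmath.pi / N)
    X = [0j] * N
    for k in range(N // 2):
        t = (w ** k) * odd[k]
        X[k] = (even[k] + t) / 2
        X[k + N // 2] = (even[k] - t) / 2
    return X
```

Both branches halve the problem, giving the log N levels of work proportional to N claimed above; the results agree with the direct O(N^2) summation.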
Parallelizing the FFT Algorithm

Binary Exchange Algorithm

Figure 11.31 Sixteen-point FFT computational flow.

Figure 11.32 Mapping four processes onto the 16-point FFT computation: each process handles four rows of the flow graph (P0: rows 0000-0011, P1: rows 0100-0111, P2: rows 1000-1011, P3: rows 1100-1111), taking the inputs x[0], ..., x[15] to the outputs X[0], ..., X[15].

Analysis

Computation. Given p processors and N data points, at each step each processor will compute N/p points, and each point requires one multiplication and one addition. With log N steps, the parallel time complexity is given by

    t_comp = O((N/p) log N)

Communication. If p = N, communication occurs at each step, and there is one data exchange between pairs of processors at each of the log N steps. Suppose a hypercube or another interconnection network that allows simultaneous exchanges is used. In that case, the communication time complexity is given by

    t_comm = O(log N)

If p < N, the communication time complexity is simply given by

    t_comm = O(log p)

Transpose Algorithm

If the processors are organized as a two-dimensional array, communications would first take place between processors in each column, and then between processors in each row.

Figure 11.33 FFT using the transpose algorithm - first two steps.
During the first two steps, all communication would be within a processor, as shown. During the last two steps, the communication would be between processors. In the transpose algorithm, between the first two steps and the last two steps, the array elements are transposed.

Figure 11.34 Transposing the array for the transpose algorithm.

After the transpose, the last two steps proceed, but now they involve communication only within the processors. The only communication between processors is to transpose the array.

Figure 11.35 FFT using the transpose algorithm - last two steps.
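The data movement in the transpose algorithm can be mimicked serially. Below is a hedged Python sketch in the spirit of the description above (all names are mine, and the serial `dft_row` stands in for whatever local transform each processor row would actually run): the N = N1*N2 points are laid out as an N1 x N2 grid, every row is transformed locally, twiddle factors are applied, the grid is transposed (the single step that would need inter-processor communication), and the rows are transformed locally again.

```python
import cmath

def dft_row(v):
    """Unscaled DFT of one row (the 'local' transform in this sketch)."""
    m = len(v)
    w = cmath.exp(-2j * cmath.pi / m)
    return [sum(v[j] * w ** (j * k) for j in range(m)) for k in range(m)]

def fft_transpose(x, N1, N2):
    """Serial sketch of a transpose-style FFT for N = N1*N2 points:
    local row transforms, twiddle multiplication, one transpose (the
    only step needing inter-processor communication), then local row
    transforms again.  The slides' 1/N scaling is applied at the end."""
    N = N1 * N2
    wN = cmath.exp(-2j * cmath.pi / N)
    # lay the input out as a grid: A[j1][j2] = x[j1 + N1*j2]
    A = [[x[j1 + N1 * j2] for j2 in range(N2)] for j1 in range(N1)]
    # step 1: transform every row locally
    B = [dft_row(row) for row in A]
    # step 2: multiply by the twiddle factors w_N^(j1*k2)
    B = [[B[j1][k2] * wN ** (j1 * k2) for k2 in range(N2)] for j1 in range(N1)]
    # step 3: transpose the grid -- the communication step
    C = [[B[j1][k2] for j1 in range(N1)] for k2 in range(N2)]
    # step 4: transform every row locally again
    D = [dft_row(row) for row in C]
    # gather the result: X[N2*k1 + k2] = D[k2][k1] / N
    return [D[k % N2][k // N2] / N for k in range(N)]
```

The transpose is the only all-to-all data exchange; everything else stays within a "row" of processors, which is exactly the property the slides' analysis exploits.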
