### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# Parallel Systems CSE 4500

UCONN

GPA 3.73

### View Full Document

## 30

## 0

## Popular in Course

## Popular in Engineering Computer Science

This 18 page Class Notes was uploaded by Americo Huel on Thursday September 17, 2015. The Class Notes belongs to CSE 4500 at University of Connecticut taught by Alexander Shvartsman in Fall. Since its upload, it has received 30 views. For similar materials see /class/205928/cse-4500-university-of-connecticut in Engineering Computer Science at University of Connecticut.

## Reviews for Parallel Systems

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/17/15

CSE 4500 228 Fall 2009 Selected Notes Set 7 Alexander A Shvartsrnan Computer Science and Engineering University of Connecticut Copyright 20022009 by Alexander A Shvartsmanr All rights reserved 7 Asynchronous Algorithms for Parallel Cooperation N THIS chapter we give examples of asynchronous algorithms for a basic parallel cooperation problem that we call the Do All problemi We also note that fully asynchronous algorithms tolerate arbitrary delays between the local steps of individual processors In particular such algorithms tolerate in nite delays thus they are also able to tolerate crashes of the processors provided at least one processor does not crash Given a collection of N independent idempotent similarlysized tasks and P processors perform all tasks The techniques for solving the Do All problem in asynchronous settings differ substantially from the techniques used to solve this problem in synchronous settings where the knowledge of time allows any processor to make decisions based on the timely progress of other processors We rst discuss the Do All problem in Section 7ili In Section 72 we give a simple optimal twoprocessor algorithmi Then in Section 73 we give a three processor algorithm called algorithm T In the next handout in Sections 817 83 we give a more general algorithm for l S P S Ni 71 The Do All Problem We assume that it takes a unit of time to perform any task and we abstract performing a task by writing a known value eg 1 to shared memory We assume that the memory is initially clear The tasks are represented by a shared array AHMN initialized to zerosi Then to perform a task i a processor reads Ai if the read value is 0 the task needs to be performed and the processor performs the assignment AM I 1 else the task has already been performed and nothing needs to be done Clearly this abstraction has the property that the CSE 4500 228 Fall 2009 Set 7 Page 2 tasks are independent idempotent and similarly sized Having an algorithm that solves the problem of initializing an array to 17s we can easily transform it to perform an arbitrary collection of N independent idempotent and similarly sized tasks by replacing the assignment AM I l77 everywhere it occurs by the following Perform task i I llquot More generally we de ne Do All to be the following problem Given a set of tasks AluN and P processors perform each task at least once Solving Do All when P N is one of the simplest computations that can be performed by a PRAMl The following optimal PRAM program solves Do All in constant time thus achieving linear speedupl forall processors pid LN parbegin perform Apid parend Next suppose that P Ni lt easy to modify the program above so that each processor performs lNPl tasksi Padding is used to include empty tasks when P does not divide Ni forall processors pid 1P parbegin for i 0 to lNPl 71 do perform Apid i as P rof parend The above algorithm takes time TM NP and its work or cost is C P TM P NP N Thus this solution is also optimal and in particular it achieves linear speedup 5 eAUTm NNP Pl 72 An algorithm for two processors We now give a simple and optimal solution for the Do All problem for N tasks using two processors iiei P 2i Quite different considerations are necessary when designing a parallel algorithm in which the number of processors is much smaller than the size of the input The goal in this situation when the underly ing machine is synchronous is to nd a method whose parallel time complexity is at most the sequential time complexity divided by the number of processors plus a small additive overhead ie for the Do All problem the parallel time must be N2 01 to be considered optimal here we assume that it takes a unit of time to perform one taski Note that constant factors are important CSE 4500 228 Fall 2009 Set 7 Page 3 and cannot be hidden in Onotationi When considering algorithms for asyn chronous models the goal is to have the parallel work complexity iiei cost be equal to the sequential complexity plus small additive overhead ie for Do All optimality is achieved when work is N 01 For the Do All problem it is easy to achieve this goal with two processors Twoprocessor DoAll Algorithm 39 The processor with P113 0 henceforth P0 reads and then writes locations sequentially starting at 1 and moving up pro cessor P1 reads and then writes locations sequentially starting at N and moving down Both processors stop when they read a 1 Clearly the work performed by the two processors is at most N 1 each task but the last is performed exactly once while the nal task is performed at most twice if both processors nd it not done Thus the work is no more than 2 N 1 73 Algorithm T a threeprocessor algorithm The rst nontrivial case is that of three processorsl There are two important points to the algorithm we are about present the implementation is nontrivial even though the idea is simple and the case of four processors is still open This makes our algorithm interestingi Here is an intuitive description of a three processor algorithmi Processor P0 works leftto right processor P1 works rightto left and P2 works starting from the middle and alternately expanding in both directions If P0 and P2 meet they both know that an entire pre x of the tasks has been performed Processor P0 then jumps to the leftmost not performed by itself or P2 and P2 jumps to the new middle of unperformed tasks A meeting of P1 and P2 is symmetric When P0 and P1 meet the computation is complete lntuitively processors can maintain an upper bound on the number of un performed tasks remaining that starts at N and is halved every time a collision occursl Thus at most logN collisions are experienced by each processor High level pseudocode for the algorithm is given in Figure 7117 lmplementation of the highlevel algorithm requires some form of commu nication among the asynchronous processorsi At a collision a processor must determine which processor previously performed the task ln the case of a col lision with P2 a processor must also determine what portion of the task array to jump over This communication may be implemented either by writing ad ditional information to the cells of the array or by using auxiliary variablesl If the task array that the processors are working on is also used to hold auxiliary information implementation is straightforward When processor P2 performs a task at the left respl right end of its area it records the location of the next unwritten cell to the right respl lefti P0 and P1 write the values 71 and N1 respectively to signal no unperformed tasks A total of NOlog N reads and N Olog N writes are required on the asynchronous modeli CSE 4500 228 Fall 2009 Set 7 Page 4 ln summary7 algorithm T provides the following boundsi Theorem 731 The Do All problem for three processors can be solved with N Olog N writes to and N Olog N reads from the task arrayi OH set current position to 1 repeat if no collision then perform the task and increment current position else if collision with P1 then exit else collision with P2 set current position at the right of 132 s area until current position gt N T1 set current position to N repeat if no collision then perform task and decrement current position else if collision with P0 then exit else collision with P2 set current position at the left of 132 s area until current position lt l T2 initialize middle and boundaries of current work area repeat if no collision then perform the next 2 tasks away from the middle else if collision with P0 then set left boundary at rightmost task written by P2 set middle halfway between left and right boundaries else collision with 1 set right boundary at leftmost task performed by P2 set middle halfway between left and right boundaries until done Figure 717 A highlevel description of algorithm Ti Processor with PH 239 executes thread Til CSE 4500 228 Fall 2009 Selected Notes Set 1 Alexander A Shvartsrnan Computer Science and Engineering University of Connecticut Copyright 20022009 by Alexander A Shvartsmanr All rights reserved 1 PRAM Models Measures of Ef ciency and Notation ORMULATING suitable models of parallel computation goes hand in hand with the study of algorithms and their complexity Here we revisit the model of computation that is the subject of our presentation and overview its major variants We discuss the complexity measures that we use to characterize the ef ciency of algorithms expressed in the selected models We introduce the highlevel notation commonly used to specify parallel algorithms and we discuss the implementation and architectural issues related to the abstract models we study 11 Computing with Multiple Processors Multiple cooperating processors are often deployed to increase ef ciency of com putation according to the following criteria Increase system throughput by performing multiple looselycoupled or unre lated tasks in parallel This ranges from assigning processing elements to speci c system tasks erg7 augmenting central processors with dedicated inputoutput processors7 to assigning generalpurpose processors to spe ci c application tasks erg7 providing several processors in a timesharing system so that multiple applications can be executed concurrently Reduce response time for a particular task by partitioning it into components t at can be executed in parallel and assigning a dedicated processing el ement to each such component erg7 searching a large le in parallel by assigning parts of the le to individual processors CSE 4500 228 Fall 2009 Set 1 Page 2 Systems meeting the rst objective have for some time been available through integrated hardware and operating system software solutions Working towards the second objective has proved more dif cult since discovering and developing ef cient parallel algorithms is in general a very challenging task ln some cases existing sequential algorithms can be parallelized but this is timeconsuming and such techniques are dif cult to generalize Although generalpurpose tech niques such as pipelining are able to speed up the execution time of sequential algorithms such speedups are constant and have little impact on the ability of the algorithm to be executed on multiple pipelined processorsi Thus speeding up a single large computational task by using multiple processors requires that there exist an ef cient and fasterthan sequential parallel algorithm for that task 12 PRAM and Variants ln the study of algorithms for the tightlycoupled synchronous shared memory multiprocessors one abstract model attracted signi cant amount of attention This model is the Parallel Random Access Machine or PRAMl The PRAM model was formulated for the setting where the processors need to cooperate in working towards a common computational goal The PRAM combines the simplicity of a single processor RAM Random Access Machine model with the power of parallelisml The PRAM model is used widely in parallel algorithms researchi lt is a convenient and elegant model and a plethora of ef cient algorithms has been developed for this model 121 The Basic PRAM Model Since there are several standard variants of the model we start by describing the general characteristics of the PRAM familyi Common features of PRAM models A PRAM is de ned as a set of synchronous processors that can concurrently access shared memory 0 There are P serial RAM processors with unique identi ers PID in a compact range eg in the range 1 i i i i ach processor has access to its PID and the number of processors Pl 0 The global memory is accessible by all processors and is denoted as shared the P processors can concurrently access the shared locations and memory access takes unit time CSE 4500 228 Fall 2009 Set 1 Page 3 0 There are Q shared memory cells and the input of size N S Q is stored in the rst N cellsi Except for the cells holding the input all other memory cells are cleared iiei contain zeroesi The processors have access to the input size N l 0 Each processor has local memory denoted as private The local memory is not accessible to other processors 0 All memory cells are capable of storing log maxN P bitsi It is usually also assumed that the size of the shared memory Q is polynomial in the number of processors P and the size of the input N and that the private memory is relatively small eigi polylogarithmic in P or even constant The PRAM instruction set is essentially that of a conventional RAMi The instructions executed by the processors are de ned in terms of three synchronous cyc es Read cycle a processor reads a value from a location in shared memory into its private memory Compute cycle a processor performs a computation using the contents of its private memory Write cycle a processor writes a value from a location in private memory to a location in shared memory We also need to consider the selection of the next instruction to execute This is usually the next instruction in the instruction stream but it can be selected by a conditional or an unconditional branchl PRAM variants There are several variants of the basic model each de ning the allowable pat terns of shared memory access There are three main variants EREW the exclusiveread exclusivew te variant allows no concurrent access by more than one processor to the same memory location CREW the concurrentread exclusivewrite variant allows arbitrary concurrent read access to a memory location but no concurrent write access to the same location CROW the concurrentread cancurrent wn39te variant allows arbitrary concur rent read and write access to the same memory location With the concur rent writes enabled the following three write policies are distinguished 1 COMMON CROW in which all concurrently writing processors must write the same value CSE 4500 228 Fall 2009 Set 1 Page 4 2 ARBITRARY CRCW in which the write of a single processor is arbi trarily chosen to succeed an 3 PRIORITY CROW in which the processor with the highest PID suc ceeds in writing its value Among these variants the EREW is the weakest and the PRIORITY CRCW is the strongest PRAM variantl Any algorithm for the weaker variant also works in the stronger variantl The converse is not true but stronger variants can be simulated by the weaker variants at a modest cost eg at a polylogarithmic time overheadl In particular it can be shown that a PRIORITY PRAM can be simulated on an EREW PRAM with a slowdown that is logarithmic in the number of processors Finally when showing lower bounds an impractically strong model called the IDEAL PRAM is sometimes use I n t is model there are no limits on the sizes of the private and shared memories and a processor can perform arbitrary computation on its private memory in unit time The IDEAL PRAMs are also classi ed as EREW CREW and CRCW IDEAL PRAMsl Such IDEAL PRAMs were used in showing lower bounds results for the settings without failures When proving some of the lower bounds for the faultprone models one can also consider a powerful PRAM model in which processors are able to read and locally process the entire shared memory at unit cost We call such models the memory snapshot models In much of the exposition we use the concurrentread concurrentwrite CROW PRAMI In the algorithms we present all concurrently writing processors write the same value making our approach independent of the CRCW conventions for concurrent writesl In the cases where we deal with other than the CRCW variant we will explicitly point this out 122 Advantages and Limitations of the PRAM Model To understand why the PRAM model is widely adopted for algorithmic study we take a brief look at the PRAMls sequential ancestor the RAMI The reasons for the success of the RAM model of computation have been summarized by Leslie Va iant The phenomenon of general purpose computation in the sequential domain is both speci c and powerful We take the view here that it can be characterized by the simultaneous satisfaction of three requirements First there exists at least one highlevel language 1 that humans nd satisfactory for expressing arbitrary complex algorithmsl Second there exists a machine architecture M in t is case the von Neumann model that can be implemented ef ciently in existing technologies Finally there exists a compilation algorithm A that can transform arbitrary programs in M to run on M ef ciently CSE 4500 228 Fall 2009 Set 1 Page 5 The s me harmony is yet to be achieved in the domain of decentralized parallel and distributed computation There is no universally accepted high level programming paradigm M while there are several feasible but dissimilar machines M that dictate a particular style of programming that suits speci c compilation algorithms A At the abstract level however the PRAM model comes close to satisfying the three requirements In the next section we discuss a highlevel programming language notation for the model Other notations used in the literature differ essentially at the syntactic level These notations have been widely accepted and used for specifying PRAM programs Likewise techniques exist for efficient compilation of such highlevel language programs for execution on the abstract PRAMi The problem of course is that implementations of general synchronous shared memory machines using extant technologies are scalable only to modest ranges of processors and memory However efficient simulations of PRAMs on some realizable parallel architectures exist eigi hypercube and where general simulations are not available some PRAM algorithms have been mapped onto the target architecture It is worthwhile to pursue the study of PRAM computations since improvements in PRAM algorithms in many cases result in improvements in algorithms for practical architectures Finally lower bounds are also informative The computing folklore states that one can always slow down programs by using real hardware77 In other words theoretically provable inef ciencies for parallel computation are usu ally ampli ed in practice By t quot the A 39 quot 39 quot for the PRAM we can in many cases conclude that better performance cannot be achieved on any other realistic parallel architecture We now review the main idealized assumptions in the PRAM model Memory access concurrency processors can read from in the CREW and CROW variant and write to in the CROW variant the same shared mem ory cell concurrentlyi In the extreme case all processors are able to access a given cell concurrentlyi Memory access bandwidth processors can access shared memory in par allel and such access takes unit time Global synchronization of processors processors proceed in lockstep and an observer outside the PRAM can associate a global time77 with every event iei although the processors do not have access to global time77 and t e PRAM does not provide a global clock processors can keep accurate local clocks by counting their steps and communicating through shared memory Faultfree execution environment the processors memory and proces sormemory communication are free of faults Each of the above assumptions increase the attractiveness of the PRAM model as the target of parallel algorithm development At the same time these assump CSE 4500 228 Fall 2009 Set 1 Page 6 tions make it dif cult for the PRAM to be implemented as a scalable architecture and decrease the signi cance of PRAM results The challenges of synchronizing subcomponents providing constant length access paths to memory cells com municating between components at a high bandwidth and of masking failures are present even in uniprocessorsi Because of the large number of processors in the parallel setting the physical limitations encountered in meeting these challenges in a scalable fashion become problematic much sooner There are a number of proposals that address the above assumptions How ever we consider synchrony to be a desirable enabling property of parallel com putationi Completely removing the assumption of synchrony undermines the very goal of parallel computation 7 the goal whose objective is to achieve a predictable Pifold speedup as compared to the best sequential algorithms We thus proceed with the 39 that the h d y parallelism is a valuable abstraction because it yields a model with a widely ac ceptable programming paradigm that makes the speci c speedup goals directly quanti able lt is therefore important to continue projecting this abstraction externally while hiding the lessthanperfect reality in a transparent and an ef cient way 123 Parallelism and E iciency Speeding up computation is one of the central reasons for using parallel com putersi If a certain task can be done in sequential time T1 using a uniprocessor then ideally we would like to perform the same task using P processors in par allel time TP such that Tp T1 P Obtaining such speedup is an important goal of parallel algorithm designi lt is very important to measure speedup relative to the most ef cient se quential algorithm 7 achieving speed up over poor sequential algorithms is not that meaningfuli T us me measure speedup as follows Let TfN be the time of the best possible uniprocessor algorithm for a certain problem with inputs of size N in order to establish that this is indeed the best algorithm one needs to contrast its time with the lower bound on time for the problem being solved If a Pprocessor parallel algorithm solves this problem in time Tp N then the achieved speedup S is de ned as 5 TfNTPN The speedup is linear when S P ln the setting where multiple processors cooperate on a collection of common tasks in order to hope to achieve a speedup close to linear in P we need to assume that the processors are operating either in synchrony or at least at comparable speedsi This assumption is in fact implemented in highly parallel computing devices such as systolic arrays and SIMD singleinstruction multiple data devices such as vector processorsi ln the synchronous setting where there CSE 4500 228 Fall 2009 Set 1 Page 7 are P processors that operate in lockstep ilei operate in complete synchrony the extent to which the goal of ef cient speedup is reached can be quanti ed as we de ne next The parallel work sometimes referred to as cost of synchronous parallel algorithms measures the total number of machine instructions executed by the processors during the computation It is expressed as the product of the number of processors P and the parallel time Tp N on inputs of size Ni Thus the work W of a Pprocessor synchronous parallel algorithm running in time Tp N is W P Tp N This measure requires that we account for all available computing resources by counting all instructions executed by processors including the instructions of any idle processors The notion of work is of course meaningful for uniprocessors as well Let T1 N be the time complexity of a sequential algorithm solving a certain problem on inputs of size Ni We observe that since 1 W P TPN 1 T1N T1N thus the notions of work cost and time are equivalent for sequential algorithmsi For a particular problem a parallel algorithm is considered optimal when the parallel work on a problem of size N is related by a multiplicative constant to the best known or best achievable sequential time TfN for the same problem That is for optimal parallel computation we require that W P TPltNgt eltTrltNgtgt Note that in this case the speedup is linear in P 7 in fact workoptimality is achieved if and only if the speedup linear in Pi We say that a parallel algorithm is e cient or polylogarmically e cient when its work has polylogarithmic overhead relative to the best sequential al gorithm W P TPN 9mm 10g0lt1gt N Note that if a speci c parallel algorithm is ef cient then its speedup is near linear in P within a polylogarthimic factor that is S TrltNgt TPltNgt eonlog N where c is a positive constant that depends on the algorithmi We say that a parallel algorithm is polynomially e cient when its work has small polynomial overhead relative to the best sequential algorithm W P TPltNgt eltTrltNgt NE where 5 is a constant such that 0 lt 5 lt 1 We require that 5 lt 1 because if we allow 5 2 1 then a trivial parallel algorthm for P N processors where CSE 4500 228 Fall 2009 Set 1 Page 8 each processor simply executes the best sequential algorithm can be considered ef cient since its work is W P TfN Tf N N1i Note that if a parallel algorithm is polynomially ef cient then its speedup is away from from linear by a polynomial factor that is 5 TfN TPN GODNW where 5 is a positive constant that depends on the algorithmi Of course for the computation to be practical in all cases the constants hidden in the bigoh notation must be sma i The number of processors P must also be scalable with the problem parame ters For achieving a meaningful betterthanconstant speedup for an input of size N we must be able to increase the number of processors P so that it is non trivial compared to N andor to the best sequential time TfNi For example it is desirable for a given problem to achieve parallel time polylogarithmic in N by deploying suf ciently many processors i Note that using work as the main complexity measure is more strict than using parallel time alone This is because the goal of minimizing parallel time is in some cases achieved by using a polynomial number of processors eigi quadratic in input sizei Such an algorithm cannot be considered workef cient if its work far exceeds the best sequential time for the same problemi For some problems one can establish lower bounds on work WLB N P of parallel algorithms such that WLB N P wTfNi Clearly in this case no optimal parallel algorithms existi Therefore we extend the notion of optimality and we call a parallel algorithm workoptimal if WltN7Pgt P TPltNgt 0ltWLBltN7Pgtgtl 124 Synchrony and Cooperation We take the view that synchrony is an important component of achieving the promise of the massivelyparallel ef cient computationi When a system is not completely synchronous then at least we must be able to make assumptions about the time it takes for a processor to perform a unit of work and to com municatei Such systems are sometimes called timedasynchronousi In working towards the goal of ef cient computation it is reasonable to assume that the processors available to a computation are all equally powerfuli It makes little sense to include for example processors that are much slower than the rest In designing for quot quot in a or t39 d system we count on the availability of suf cient computational resources whose behavior is within the designed parametersi Of course we need to be able to deal with processors with erratic timing behavior but the computation need not be exclu sively concerned with such deviants provided of course the correctness can be maintained This brings us to another important point In developing highly parallel al gorithms it is assumed that the processors are willing participants in a solution CSE 4500 228 Fall 2009 Set 1 Page 9 that is the processors are there to cooperate and do not need to be coerced into cooperation We must also be able to deal with unexpected deviations from such willing cooperation but such cases are the exception not the norm ln particular we can rely on good citizenship77 of the cooperating processors not to exhibit arbitrary or malicious behavior 125 Architectural Considerations To combine isolated processors and memories there is a variety of interconnec tion topologies that yield models of computation and realizable architectures covering the spectra of paradigms from sharedmemory to messagepassing and from tightlycoupled to looselycoupled For example the PRAM parallel ran dom access machine model can be viewed as a sharedmemory tightlycoupled model Systolic array architectures are based on a messagepassing tightly coupled model Wide area networks fall into the messagepassing loosely coupled model while local area clusters of processors sharing disk storage belong to the sharedmemory looselycoupled mo el These are just few of the possi ble data points and additional re nements can be made for example on the basis of the ability to perform ne vs coarsegrain computation Note that the sharedmemorymessage passing dichotomy alone is not suf cient to determine whether a decentralized machine constitutes a suitable parallel processing plat form For example a collection of PCs sharing a le server is a s aredmemory machine not very wellsuited for solving nelygrained parallel problems while a hypercube multiprocessor is a messagepassing machine that is a basis for several contemporary parallel machines Here we consider the models of computation in the sharedmemory tightly coupled paradigm which in addition are suited for negrain parallelism The PRAM is a convenient abstraction of such models which has been successfully used in the development of ef cient parallel algorithms Several ef cient simu lations of such models in other parallel architectures have been developed eg hypercubes making many sharedmemory algorithms transparently portable to such architectures ln other cases an algorithm can be mapped directly onto another architecture ln the previous section we have discussed some of the advantages and disadvantages of the model we chose as the basis for our exposition Ef cient or optimal parallel algorithms have been developed for many funda mental computation tasks These include adding N integers sorting N records and manipulating lists and trees eg computing the rankings of the elements of a Nsize list and computing a preorder numbering or subtree sizes of a Nsize tree These algorithms play an important role in realizing the promise of high speedups using massive parallelism Example 121 Sorting Sorting is one of the most fundamental operations in computing ln the domain of sequential computation sorting N values requires NlogN operations Thus in the parallel setting the ef ciency and speedup CSE 4500 228 Fall 2009 Set 1 Page 10 Synchronous parallelism Asynchronous concurrency with concurrent memory access with interleaved memory access x 0 I 0 parbegin cobegin x z 1 x z x 1 parend 11 x1x2 Figure 11 Synchronous parallel and asynchronous concurrent executionsi goals are to be able to sort in polylogarithmic in N time using a linear in number of processors These goals have been achieved for several models of decentralized computation ranging from shared memory models to message passing xeddegree network models Optimal logarithmic time algorithms have also been discovered D 126 A Programming Notation PRAM algorithms are usually presented as high level programs that can be com piled into assemblylevel PRAM instructions using conventional techniques The programs are described in a model independent fashion using blockstructured notation with the parbeginparend construct specifying parallelism and other conventional control structures such as ifthenelse and whiledo to control the ow of execution within a single processor In reading the parallel code it is important to keep in mind that the semantics of programs in the synchronous parallel model differ from the semantics of programs in the asynchronous con current shared memory mo e i The most striking difference between the asynchronous concurrent model with interleaved memory access and the parallel model is observed for the syn chronous CROW memory access discipline In the synchronous case we have concurrent execution and assignment while in the asynchronous case the atomic actions that make up the program elements are interleaved in some order Consider the two program fragments in Figure 11 each executed by two processors We use the parbeginparend brackets in this monograph to denote parallel execution with memory access concurrencyi In this example we use the accepted cobegincoend brackets to denote concurrent execution with in terleaving The statements or statement blocks separated by are executed concurrently The expressions in braces describe the state of the computation and the angled brackets are used to denote atomic actions For the synchronous program the value of z is deterministically computed to be 1 that is the synchronous program is equivalent to a single multiple assignment 11 I z 11 177 In the asynchronous case the value of z is either 1 or 2 which is caused by interference due to possible interleavings within CSE 4500 228 Fall 2009 Set 1 Page 11 the programi Even if we do not require the atomic evaluation of z 1 then I 1 still holds in the synchronous case while the results are essentially unde ned for the asynchronous programi These differences illustrate the unusual feature of the CROW PRAM syn chronous parallel readwrite to the same memory location When two processors execute the program segment both processors read 1 simultaneously and store the incremented value to 1 simultaneously establishing z 1 In order to reason about the synchronous parallel programs one needs to specify precisely the relative timing of all the commands and statements It may be intuitively understood that the parallel assignments in Figure 111 are synchronized since they have identical timing characteristics However if the right hand side of the second assignment was changed to a more complicated expression that takes longer or a different number of instructions to evaluate then to synchronize the assignment we would have to specify that the eval uations of the righthand side expressions are performed rst followed by the parallel assignmentsi In most cases we are dealing with the programs executed by the individual PRAM processors are identical but the processors have their own instruction counters and they may be working on different data Such model is called the singleprogram multipledata model or SPMDi The SPMD programs where the number of processors and their PIDs are known can be expressed as fo lows forall processors pid 1P parbegin Pr ram pid parend Each processor executes the sequential Programpid based on its proces sor identi eri This can be equivalently expressed as parbegin Program1 Program2 ProgramP parend The SPMD model can be viewed as a generalization of the SIMD single instruction multipledata model when in the SIMD case each processor can either execute the current instruction or perform on a noop based on certain criteria On the other hand SPMD can also be viewed as a restriction of the MIMD multipleinstruction multipledata model by providing means of syn chronizationi The asynchronous shared memory model in its general form is essentially the MIMD modeli Special care must be taken when constructing SPMD parallel programs con sisting of the parbeginparend constructs above and the conventionallylooking ifthenelse and whiledo blocksi For example in the following program CSE 4500 228 Fall 2009 Set 1 Page 12 forall processors pid 12 parbegin if expressionpid then blockl else block fi parend the speci c timing of operations performed by the two processors if each takes a different branch in the if need to be precisely de ned for the program to make sense A similar situation exists with the while guard do77 loop con struct when some of the processors exit the construct after evaluating the guard expression and some continue executing the loop A number of techniques exist for unambiguously compiling these constructs into machine language instructionsi These techniques typically use processor masking approaches and ensure synchrony by pad ing77 the shorter execution paths by an appropriate number of noop instructions Such transformations affect only the lexical structure of the program text and do not affect the com putational complexity of algorithms Examining this in more detail is outside our scope In our algorithms we avoid the ambiguity and enforce synchrony by using the while loops in which the guard expressions are evaluated identi cally by all processors and in which the if thenelse branches consist of simple assignments whose meaning and timing is made clear by the program texti For example in the program such as the one below we always evaluate the righthand side expressions rst and then perform the assignment all in paral lel so that the pre and postconditions holdi To ensure this at compile time the number of machine instructions in the righthand side evaluation of the rst more complicated assignment is counted This number is used to determine the number of noops necessary to append to the second assignment s righthand side evaluation to ensure that both evaluation consist of the same number of instructions for simplicity assuming identical timing for all instructions x 1 y 2 forall processors pid 12 parbegin x5y0 The programming notation we overviewed is generally easy to work with and reason about W ile in the asynchronous concurrent model one needs to worry about the interleavings of and the interference among the threads of execution in the synchronous parallel model we have an essentially mechanical process of understanding the deterministic timing characteristics of the programs7 lexical structure We will be using such notation in presenting algorithms for multiprocessorsi

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

#### "Selling my MCAT study guides and notes has been a great source of side revenue while I'm in school. Some months I'm making over $500! Plus, it makes me happy knowing that I'm helping future med students with their MCAT."

#### "There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

#### "It's a great way for students to improve their educational experience and it seemed like a product that everybody wants, so all the people participating are winning."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.