Class Note for CMPSCI 691 at UMass(10)
CMPSCI 691W: Parallel and Concurrent Programming
Lecture 11: March 15
Lecturer: Emery Berger (Spring 2006). Scribe: Ed Walters.

This lecture covers multiprogrammed processors and the Hood library.

11.1 Multiprogramming Techniques

- Static partitioning. Static partitioning divides the serial work T_1 (the time it takes a single processor to do all the work) evenly among P lightweight processes. If each process runs in parallel, each performs T_1/P worth of work, producing a linear speedup. Static partitioning is appropriate when the user's computation is the sole process running on the system, and is often accomplished using shared-memory or MPI techniques. In terms of performance, static partitioning suffers a sharp drop when the number of processes exceeds the number of processors, and it also copes poorly with an unknown or dynamic number of processors.

- Multiprogramming. When another process runs concurrently on the system, the number of processors actually available, P_A, will be less than the total P. The most desirable result is to fully utilize all of the available parallelism, i.e. T = T_1/P_A. Unfortunately, some of the work may not be fully parallelizable, leaving leftover chunks that increase the execution time. Multiprogramming used to be acceptable for applications such as scientific computing, but this is no longer the case.

- Dynamic scheduling. This scheme partitions the work into many (potentially millions of) threads of small granularity. The scheme is generally implemented by creating a user-level thread scheduler and library, since pthreads would be prohibitively expensive. If we define the processor average P_A as the time-averaged number of processors available for execution, as determined by the kernel, then this scheme aims to achieve an execution time of T = T_1/P_A regardless of kernel scheduling.

11.2 Implementation of Dynamic Scheduling and Hood

- The DAG model. Multithreaded computation is often modeled as a directed acyclic graph (DAG), where each node represents a single unit of work (an instruction) that takes one cycle to execute. An edge represents a dependency between two units of work: an edge (u, v) means u must execute before v. We limit the graph to a single source and a single potential spawn per node, limiting the out-degree to two. In this model, the total work T_1 is the number of nodes, and the maximum possible speedup is governed by T_∞, the running time on an infinite number of processors, which is the length of a longest path through the DAG. Note that a node can only be executed once it is ready, i.e. all of its ancestors in the graph have been executed.

- Hood's non-blocking work stealer: theory. The Hood user-level threads library uses a non-blocking work stealer. Unlike Cilk, Hood does not use locks, so Hood does not have to worry about situations such as a swapped-out lock holder or a changing number of processors. Therefore Hood can provide tighter scheduling guarantees. In theory, the expected execution time is

    E[T] = O(T_1/P_A + T_∞ · P/P_A),    (11.1)

  even when the kernel is adversarial, and the bound is optimal to within a constant factor. Moreover, for any ε > 0,

    T = O(T_1/P_A + (T_∞ + lg(1/ε)) · P/P_A)

  with probability at least 1 − ε. In practice, T ≤ T_1/P_A + T_∞ · P/P_A, and furthermore T ≈ T_1/P_A whenever P is small compared to the average parallelism T_1/T_∞.

- Basics of work stealing.
  1. Deques: In work stealing, each process maintains a deque (double-ended queue) of ready threads. A process grabs work by popping a thread from the bottom of its deque and executing it. If the current thread blocks or terminates, the process pops another. If the current thread spawns, the process pushes one of the two threads onto the bottom of the deque and executes the other.
  2. Stealing: Stealing occurs when an idle process takes a thread from the top of the deque of a randomly chosen victim process.
  3. Non-blocking operations: Since the Hood deques are non-blocking, atomic load/test/store operations must be used to maintain synchronization. In addition, only a constant number of attempts at any deque operation are needed before some operation succeeds: someone will eventually succeed.
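The deque discipline in items 1 and 2 can be sketched as follows. This is a sequential stand-in only (the real Hood deques are non-blocking and rely on atomic operations, per item 3), and the class and method names are invented for illustration:

```python
from collections import deque

class WorkStealingDeque:
    """Sequential sketch of a work-stealing deque.

    The owning process pushes and pops at the bottom;
    idle thieves steal from the top.
    """
    def __init__(self):
        self._items = deque()

    # Owner operations (bottom of the deque).
    def push_bottom(self, thread):
        self._items.append(thread)

    def pop_bottom(self):
        return self._items.pop() if self._items else None

    # Thief operation (top of the deque).
    def steal(self):
        return self._items.popleft() if self._items else None

q = WorkStealingDeque()
for t in ["a", "b", "c"]:       # owner spawns three threads
    q.push_bottom(t)
stolen = q.steal()              # a thief takes the oldest thread, "a"
own = q.pop_bottom()            # the owner continues with the newest, "c"
```

Note that a thief always takes the oldest thread in the deque, which tends to sit near the root of the computation and thus to represent a larger chunk of remaining work.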
  One further rule: if a process has completed one steal attempt and is about to immediately attempt another, it first yields. This yield is important for ensuring progress, because a working process can be swapped out; in the worst case, all of the remaining processes spin, making steal attempts on empty deques. With yielding, the working processes are not starved. Without yielding, steal attempts rise dramatically as the number of processes grows, resulting in a precipitous drop in performance once the number of processes exceeds the available processors.

- Scheduling lower bounds. Using this model we can obtain lower bounds on execution time. At each time step i = 1, ..., T, the kernel schedules a specific subset of the P processes, each of which executes a single instruction. Let p_i denote the number of processes scheduled by the kernel at step i. We can now define the processor average as P_A = (1/T) · Σ_{i=1..T} p_i, so the execution time is T = (1/P_A) · Σ_{i=1..T} p_i; as the processor average increases, the execution time decreases. We immediately get T ≥ T_1/P_A, since Σ p_i ≥ T_1: at the very least, all of the work must be done. We can also get the bound T ≥ T_∞ · P/P_A, because the kernel can force us to run only the critical path, so Σ p_i ≥ T_∞ · P. In summary, the bounds reflect that there must be at least T_∞ (the critical path length) steps i with p_i > 0, and at each such step the kernel can potentially schedule p_i = P processes.

- Scheduling upper bounds. We now prove two theorems about the running time of the non-blocking work stealer.

  Theorem 11.1. Call a scheduler greedy if at each step i the number of nodes executed is min(p_i, number of ready nodes). Any greedy schedule in this context has an upper time bound of T_1/P_A + T_∞ · P/P_A.

  Proof. Since T = (1/P_A) · Σ_{i=1..T} p_i, we want to show that (1/P_A) · Σ p_i ≤ T_1/P_A + T_∞ · P/P_A, i.e. Σ p_i ≤ T_1 + T_∞ · P. During each step, each scheduled process pays one token, which is placed in one of two buckets: the work bucket, used when a process executes a node, or the idle bucket, used when the process does not execute. At the end of the execution there are exactly T_1 tokens in the work bucket. On the other hand, there are at most T_∞ steps at which any process places a token in the idle bucket (whenever a process idles under a greedy schedule, every ready node executes, so the remaining critical path shrinks by one), and at each such step at most P tokens are placed in the idle bucket. This gives the upper bound Σ p_i ≤ T_1 + T_∞ · P. ∎

  Theorem 11.2. The non-blocking work stealer runs in expected time O(T_1/P_A + T_∞ · P/P_A).

  Proof. Suppose there are S steal attempts. We again place tokens in buckets. If a process executes a node, it places a token in the work bucket as above, finishing with O(T_1) tokens at the end of the execution. Otherwise the process attempts a steal, and places a token in the steal bucket, finishing with O(S) tokens there. We define an enabling edge as an edge (u, v) in the DAG such that the execution of u causes v to become ready; the node u is termed the designated parent of v. These edges in turn form an enabling tree. Note that the enabling tree is a dynamically determined structure. Next we informally prove the following lemma.

  Lemma 11.3. For any deque, at all times during the execution of the work-stealing algorithm, the designated parents of the nodes in the deque lie on a root-to-leaf path in the enabling tree.

  Proof. For any process at any time during execution, let v_0 be the ready node of the thread being executed, and let v_1, v_2, ..., v_k be the ready nodes of the rest of the threads in the process's deque, ordered from bottom to top. For i = 0, 1, ..., k, let u_i be the designated parent of v_i. Everything currently in the deque is the child of an element of the enabling tree, and every time an element of the deque is executed (and therefore something is pushed onto the deque), we move down the enabling tree. Therefore, since the nodes v_i are ordered, so are their corresponding parents u_i: each u_i is an ancestor of u_{i+1} in the enabling tree. ∎

  We can use the above lemma to define a potential at each step i: each
ready node u has potential φ_i(u) = 3^(T_∞ − d(u)), where d(u) is the depth of u in the enabling tree. The full potential at step i is Φ_i = Σ_u φ_i(u), summed over all such ready nodes u. These potentials capture the fact that each successful steal removes a node near the top of a deque and decreases the remaining potential by roughly a factor of 1/3. Weighting by depth in the enabling tree is important because nodes higher on the critical path have much more effect on the readiness of subsequent nodes than nodes lower on the critical path; the topmost node by itself contributes a constant fraction of the initial potential Φ_0 = 3^(T_∞). Each round of P steal attempts causes the potential to decrease by a constant fraction, so the potential reaches 0 after O(T_∞ · P) steal attempts; therefore the expected number of steals is E[S] = O(T_∞ · P). By the token count above, the total number of tokens is O(T_1 + S) = O(T_1 + T_∞ · P), since every token goes into either the work bucket or the steal bucket. Dividing by P_A, the expected running time is O(T_1/P_A + T_∞ · P/P_A). ∎

- Performance. Since we know the execution time satisfies T ≤ c_1 · T_1/P_A + c_∞ · T_∞ · P/P_A, we can derive a utilization metric: T_1/(P_A · T) ≥ 1/(c_1 + c_∞ · ρ), where ρ = P/(T_1/T_∞) is called the normalized number of processes. Therefore, for all applications and inputs, we can provide a lower bound on utilization based on the normalized number of processes. Looking at the utilization figures for the knary benchmark and some real applications shows that the model is a decent fit. Hood itself also performs admirably: there is no precipitous drop in performance once the number of processes exceeds the number of processors. If we vary the number of available processors by running the subject application concurrently with the synthetic application cycler, which randomly spawns processes, we find that the model also works very well. We conclude by stating that the non-blocking work stealer provides good performance, with well-defined bounds, on commodity systems.

11.3 Related Work

11.3.1 Coscheduling

Coscheduling is a technique whereby all of a computation's processes are scheduled in parallel. It is also called gang scheduling. It is a very poor choice for certain computation mixes: for example, running two computations on a four-processor machine, one with four processes and one with only one, will produce poor results, because the poor fit causes a number of processing slots to be wasted. On the other hand, this solution works well for resource-intensive applications and heavily communicating processes, i.e. applications with blocking processes.

11.3.2 Process Control

With process control, processes are dynamically created and killed by each computation to fit the number of processors available. This solution complements the non-blocking work stealer, because process creation and destruction can be tied to the contents of its deques, and process control can therefore keep P close to P_A. On the other hand, since the number of processes tracks the number of available processors so closely, there is less opportunity to improve performance through latency hiding.
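The token-counting bound of Theorem 11.1 (Σ p_i ≤ T_1 + T_∞ · P for any greedy schedule) can be checked empirically with a small simulation. This is only a sketch: the DAG encoding, the randomized stand-in for an adversarial kernel, and all function names here are invented for illustration:

```python
import random

def critical_path(deps):
    """T_inf: length (in nodes) of a longest path through the DAG.
    deps maps each node to the list of its parents."""
    memo = {}
    def depth(u):
        if u not in memo:
            memo[u] = 1 + max((depth(p) for p in deps[u]), default=0)
        return memo[u]
    return max(depth(u) for u in deps)

def greedy_schedule(deps, P, seed=0):
    """Run a greedy schedule against a kernel that picks p_i at random.
    Each scheduled process pays one token; returns (steps, total tokens)."""
    rng = random.Random(seed)
    children = {u: [] for u in deps}
    waiting = {u: len(ps) for u, ps in deps.items()}  # count of unexecuted parents
    for u, ps in deps.items():
        for p in ps:
            children[p].append(u)
    ready = [u for u, n in waiting.items() if n == 0]
    steps = tokens = 0
    while ready:
        p_i = rng.randint(1, P)               # kernel schedules p_i processes this step
        tokens += p_i                         # each scheduled process pays one token
        run, ready = ready[:p_i], ready[p_i:] # greedy: execute min(p_i, #ready) nodes
        for u in run:
            for v in children[u]:             # executing u may enable its children
                waiting[v] -= 1
                if waiting[v] == 0:
                    ready.append(v)
        steps += 1
    return steps, tokens
```

On a toy diamond-shaped DAG with T_1 = 6 nodes and T_∞ = 4, the total tokens paid never exceed T_1 + T_∞ · P, regardless of the kernel's choices, matching the theorem.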