
# CMPSCI 691W Parallel and Concurrent Programming

## Lecture 11: March 15

Lecturer: Emery Berger (Spring 2006). Scribe: Ed Walters.

This lecture covers multiprogrammed processors and the Hood library.

### 11.1 Multiprogramming Techniques

- **Static partitioning.** Static partitioning divides the serial work T1 (the time it takes a single processor to do all the work) evenly among P lightweight processes. If the processes run in parallel, each performs T1/P worth of work, producing a linear speedup. Static partitioning is appropriate when the user's application is the sole process running on the system, and it is often implemented with shared-memory or MPI techniques. In terms of performance, static partitioning suffers a sharp drop when the number of processes exceeds the number of processors, and it also copes poorly with an unknown or dynamic number of processors.

- **Multiprogramming.** When another process is running concurrently on the system, the number of processors available, PA, will be less than the total P. The most desirable result is to fully utilize all of the available parallelism, i.e. T = T1/PA. Unfortunately, some of the work may not be fully parallelizable, leaving leftover chunks that increase the execution time. Multiprogramming used to be acceptable for applications such as scientific programming, but this is no longer the case.

- **Dynamic scheduling.** This scheme partitions the work into many (potentially millions of) threads of small granularity. It is generally implemented by creating a user-level thread scheduler and library, since pthreads would be prohibitively expensive. If we define the processor average PA as the time-averaged number of processors available for execution, as determined by the kernel, then this scheme aims for an execution time of T = T1/PA regardless of kernel scheduling.

### 11.2 Implementation of Dynamic Scheduling and Hood

- **The DAG model.** Multithreaded computation is often modeled as a directed acyclic graph in which each node represents a single unit of work: an instruction that takes one cycle to execute. The edges represent the dependencies between units of work: for an edge (u, v), u must execute before v. We limit the graph to a single source and a single potential spawn for each node, limiting out-degree to two. In this model, the work T1 is the number of nodes, and the maximum possible speedup is T1/T∞, where T∞ (the run time on an infinite number of processors) is the length of a longest path through the DAG. Note that a node can only be executed once it is *ready*, i.e. all of its ancestors in the graph have been executed.

- **Hood's non-blocking work stealer: theory.** The Hood user-level threads library uses a non-blocking work stealer. As opposed to Cilk, Hood does not use locks, so it does not have to worry about situations such as swapped-out lock holders or a changing number of processors. Therefore, Hood can provide tighter scheduling guarantees. In theory, the expected value of the execution time is

  E[T] = O(T1/PA + T∞·P/PA),   (11.1)

  where the kernel is adversarial, and the bound is optimal to within a constant factor: for any ε > 0,

  T = O(T1/PA + (T∞ + lg(1/ε))·P/PA)

  with probability at least 1 − ε. In practice, T ≤ T1/PA + T∞·P/PA, and furthermore T ≈ T1/PA whenever P is small compared to the average parallelism T1/T∞.

- **Basics of work stealing.**
  1. *Deques.* In work stealing, each process maintains a deque of ready threads. A process grabs work by popping a thread from the bottom of its deque and executing it. If that thread blocks or terminates, the process can pop another one. If the executing thread spawns, the process pushes one of the two threads onto the bottom of the deque and executes the other.
  2. *Stealing.* Stealing occurs when an idle process takes a thread from the top of the deque of a randomly chosen victim process.
  3. *Non-blocking operations.* Since the Hood deques are non-blocking, atomic load/test/store operations must be used to maintain synchronization. In addition, any deque operation takes only a constant number of attempts before success: someone will eventually succeed.
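The deque discipline described above can be sketched in a few lines. This is an illustrative Python sketch, not Hood's API: the class and method names are invented, and a lock stands in for the non-blocking atomic load/test/store operations that Hood actually uses.

```python
import threading

class WorkStealingDeque:
    """Sketch of the work-stealing deque discipline: the owner pushes and
    pops threads at the bottom; idle thieves steal from the top. A lock
    stands in for non-blocking atomic operations, which are not modeled."""

    def __init__(self):
        self._items = []                  # bottom of the deque = end of list
        self._lock = threading.Lock()

    def push_bottom(self, thread):        # owner: a spawn pushes one thread
        with self._lock:
            self._items.append(thread)

    def pop_bottom(self):                 # owner: grab the most recent work
        with self._lock:
            return self._items.pop() if self._items else None

    def steal_top(self):                  # thief: take the oldest thread
        with self._lock:
            return self._items.pop(0) if self._items else None

victim = WorkStealingDeque()
for t in ("t1", "t2", "t3"):
    victim.push_bottom(t)
assert victim.pop_bottom() == "t3"   # the owner works from the bottom
assert victim.steal_top() == "t1"    # a thief steals from the top
```

Owner operations touch only the bottom of the deque and thieves only the top, which is what lets a real implementation make the common (owner) path cheap and confine contention to steals.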
If a process has completed one steal attempt and is about to immediately attempt another, it first yields. This yield is important for ensuring progress, because a working process can be swapped out; in the worst case, all of the remaining processes can end up spinning, making steal attempts on empty deques. With yielding, the working processes will not be starved. Without yielding, steal attempts rise dramatically with an increasing number of processes, resulting in a precipitous drop in performance once the number of processes exceeds the available processors.

- **Scheduling lower bounds.** Using this model we can obtain lower bounds on execution time. At each time step i = 1, ..., T, the kernel decides to schedule a specific subset of the P processes, each of which executes a single instruction. Let p_i denote the number of processes scheduled by the kernel at step i. We can now define the processor average as PA = (1/T)·Σ_{i=1..T} p_i, so the execution time is T = (1/PA)·Σ_{i=1..T} p_i. Therefore, as the processor average increases, the execution time decreases. We immediately get the bound T ≥ T1/PA, since Σ p_i ≥ T1: at the very least, we have to do all of the available work. We can also get the bound T ≥ T∞·P/PA, because the kernel can force us to run the critical path serially, so that Σ p_i ≥ T∞·P. In summary, our bounds dictate that there must be at least a critical path's length T∞ worth of steps i with p_i > 0, and for each such step the kernel can potentially schedule p_i = P processes.

- **Scheduling upper bounds.** We now prove two theorems about the run time of the non-blocking work stealer.

  **Theorem 11.1.** *A scheduler is greedy if at each step i the number of nodes executed is min(p_i, number of ready nodes). Any greedy schedule in this context has execution time at most T1/PA + T∞·P/PA.*

  *Proof.* Since T = (1/PA)·Σ_{i=1..T} p_i, we want to prove that (1/PA)·Σ p_i ≤ T1/PA + T∞·P/PA, i.e. Σ p_i ≤ T1 + T∞·P. During each step, each scheduled process pays one token, which is placed in one of two buckets: the *work bucket*, used when the process executes a node, or the *idle bucket*, used when the process does not execute. At the end of the execution there are T1 tokens in the work bucket. On the other hand, there are at most T∞ steps at which any process places a token in the idle bucket, and at each such step at most P tokens are placed in the idle bucket. This gives the upper bound Σ p_i ≤ T1 + T∞·P. ∎

  **Theorem 11.2.** *The non-blocking work stealer runs in expected time O(T1/PA + T∞·P/PA).*

  *Proof.* Suppose there are S steal attempts; we again break the tokens into buckets. If a process is executing, it places a token in the work bucket, as above, finishing with O(T1) tokens at the end of the execution. Otherwise, the process tries to steal and therefore places a token in the *steal bucket*, finishing with O(S) tokens there. Define an *enabling edge* as an edge (u, v) of the DAG such that the execution of u causes v to become ready; the node u is termed the *designated parent* of v. The enabling edges in turn form an *enabling tree*. Note that the enabling tree is a dynamically determined structure. Next, we informally prove the following lemma.

  **Lemma 11.3.** *For any deque, at all times during the execution of the work-stealing algorithm, the designated parents of the nodes in the deque lie on a root-to-leaf path in the enabling tree.*

  *Proof.* For any process at any time during its execution, let v0 be the ready node of the thread being executed, and v1, v2, ..., vk the ready nodes of the rest of the threads in the process's deque, ordered from bottom to top. For i = 0, 1, ..., k, let u_i be the designated parent of v_i. Everything currently in the deque is a child of an element of the enabling tree, and every time a node in the deque is executed (thereby putting something new onto the deque), we move down the enabling tree. Therefore, since all of the nodes v_i are ordered, so are their corresponding parents u_i: each u_i is an ancestor of u_{i−1} in the enabling tree. ∎

  We can use this lemma to define potentials at each step i: each ready node u has potential φ_i(u) = 3^(T∞ − d(u)), where d(u) is the depth of u in the enabling tree, and the full potential at step i is Φ_i = Σ φ_i(u) over all ready nodes u. These potentials capture the fact that each steal attempt decreases the remaining potential by roughly a constant fraction (about 1/3). Including the depth in the enabling tree is important because nodes higher on the critical path have much more effect on the readiness of subsequent nodes than nodes lower on the critical path; the topmost node of a deque by itself contributes a constant fraction of that deque's potential. The initial potential is Φ_0 = 3^T∞. Each batch of P steal attempts causes the potential to decrease by a constant fraction, so the potential reaches 0 after O(T∞·P) steals; therefore the expected number of steals is E[S] = O(T∞·P). From the token argument above, the total number of tokens is O(T1 + S) = O(T1 + T∞·P), since every token goes into either the work bucket or the steal bucket. Therefore the expected run time is O(T1/PA + T∞·P/PA). ∎

- **Performance.** Since we know the execution time satisfies T ≤ c1·T1/PA + c∞·T∞·P/PA, we can provide a utilization metric: T1/(PA·T) ≥ 1/(c1 + c∞·P·T∞/T1). The ratio P·T∞/T1 is called the *normalized number of processes*; for all applications and inputs, we can therefore provide a lower bound on utilization based on the normalized number of processes. The utilization figures for the knary benchmark and for some real applications show that the model is a decent fit. Hood itself also performs admirably: there is no precipitous drop in performance after the number of processors is exceeded by the number of processes. If we vary the number of available processors by running the subject application concurrently with the synthetic application cycler, which randomly spawns processes, we find that the model also works very well. We conclude that the non-blocking work stealer provides good performance, with well-defined bounds, on commodity systems.

### 11.3 Related Work

#### 11.3.1 Coscheduling

Coscheduling, also called gang scheduling, is a technique whereby all of a computation's processes are scheduled in parallel. It is a very poor choice for certain computation mixes: for example, running two computations on a four-processor machine, one with four processes and one with only one, will produce poor results, because the poor fit causes a number of processing slots to be wasted. On the other hand, this solution works well for resource-intensive applications and heavily communicating processes, i.e. applications with blocking processes.

#### 11.3.2 Process Control

With process control, processes are dynamically created and killed by each computation to fit the number of processors available. This solution complements the non-blocking work stealer, because process creation and destruction can be tied to deque contents, and process control can therefore keep P close to PA. On the other hand, since the number of processes is tied so closely to the number of available processors, there is less chance to improve performance through latency hiding.
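The greedy-schedule bound (T ≤ T1/PA + T∞·P/PA) can be checked numerically on a small DAG. The sketch below is my own construction, not code from the lecture: it builds a binary spawn tree, which satisfies the model's single-source, out-degree-two restriction, and runs a greedy schedule with a constant p_i = P, so PA = P and the bound reduces to T ≤ T1/P + T∞.

```python
def greedy_schedule(children, indeg, P):
    """Greedy schedule: at each step execute min(P, #ready) nodes.
    Returns the number of steps T taken to finish the whole DAG."""
    indeg = dict(indeg)                     # don't mutate the caller's copy
    ready = [u for u, d in indeg.items() if d == 0]
    steps, done, total = 0, 0, len(indeg)
    while done < total:
        batch, ready = ready[:P], ready[P:]  # execute up to P ready nodes
        steps += 1
        done += len(batch)
        for u in batch:                      # executing u may enable children
            for v in children.get(u, ()):
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps

def binary_tree(depth):
    """Binary spawn tree: single source, out-degree two, `depth` spawn levels."""
    children, indeg = {}, {0: 0}
    frontier, nxt = [0], 1
    for _ in range(depth):
        new_frontier = []
        for u in frontier:
            kids = [nxt, nxt + 1]
            nxt += 2
            children[u] = kids
            for v in kids:
                indeg[v] = 1
            new_frontier.extend(kids)
        frontier = new_frontier
    return children, indeg

children, indeg = binary_tree(4)
T1 = len(indeg)      # total work: 2^5 - 1 = 31 nodes
T_inf = 5            # longest path: one node per level
for P in (1, 2, 4, 8):
    T = greedy_schedule(children, indeg, P)
    assert T <= T1 / P + T_inf   # greedy bound with p_i = P, so PA = P
```

With P = 1 the schedule degenerates to serial execution (T = T1); with very large P it is limited only by the critical path, which is exactly the trade-off the two lower bounds describe.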

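The claim above that the topmost node of a deque contributes a constant fraction of the deque's potential can be seen numerically: because the designated parents of a deque's nodes lie on a single root-to-leaf path, node depths increase strictly from the top of the deque downward, so the weights 3^(T∞ − d(u)) decay at least geometrically. The depths below are invented for illustration, and the potential function follows the simplified form used in these notes (the published analysis uses a slightly different exponent, but the decay argument is the same).

```python
T_INF = 10   # critical-path length of a hypothetical computation

def phi(depth):
    """Potential of a ready node at the given enabling-tree depth."""
    return 3 ** (T_INF - depth)

# Depths of the ready nodes in one deque, top first. Depths strictly
# increase toward the bottom, since the designated parents lie on one
# root-to-leaf path of the enabling tree. Values chosen for illustration.
depths = [2, 3, 5, 6, 9]
weights = [phi(d) for d in depths]

# Geometric decay: sum(weights) <= phi(top) * (1 + 1/3 + 1/9 + ...)
#                               =  1.5 * phi(top),
# so the topmost node holds at least 2/3 of the deque's potential.
top_share = weights[0] / sum(weights)
assert top_share >= 2 / 3
```

This is why stealing from the top is the right choice: each successful steal removes a constant fraction of the victim's potential, which drives the O(T∞·P) bound on the expected number of steals.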