This 112-page set of class notes was uploaded by Vito Kilback on Wednesday, September 23, 2015. The notes belong to CS 370 at Drexel University, taught by Staff in Fall. Since its upload, it has received 38 views. For similar materials see /class/212463/cs370-drexel-university in Computer Science at Drexel University.
Scheduling Algorithms
Jay Kothari (jayk@drexel.edu)
CS 370 Operating Systems
July 9, 2008

CPU Scheduling
[Figure: the ready queue feeds the CPU; a job leaves for an I/O queue on an I/O request, is preempted back to the ready queue when its time slice expires, or waits until an awaited interrupt occurs.]
- Earlier we talked about the lifecycle of a thread: active threads work their way from the ready queue, to running, to various waiting queues.
- Question: how is the OS to decide which of several tasks to take off a queue?
  - The obvious queue to worry about is the ready queue; others can be scheduled as well, however.
- Scheduling: deciding which threads are given access to resources from moment to moment.

Assumptions about Scheduling
- CPU scheduling was a big area of research in the early '70s.
- Many implicit assumptions for CPU scheduling:
  - One program per user.
  - One thread per program.
  - Programs are independent.
- Clearly these are unrealistic, but they simplify the problem so it can be solved.
  - For instance: is "fair" about fairness among users or among programs? If I run one compilation job and you run five, you get five times as much CPU on many operating systems.
- The high-level goal: dole out CPU time to optimize some desired parameters of the system.

Program Behavior and Scheduling: CPU Bursts
[Figure: a program alternates CPU bursts (load, store, add, read from file, write to file) with I/O bursts (waiting for I/O); a histogram of burst duration in milliseconds shows many short bursts and few long ones.]
- Execution model: programs alternate between bursts of CPU and I/O.
  - A program typically uses the CPU for some period of time, then does I/O, then uses the CPU again.
- Each scheduling decision is about which job to give the CPU for use by its next CPU burst.
- With timeslicing, a thread may be forced to give up the CPU before finishing its current CPU burst.

What Is Important in a Scheduling Algorithm?
Scheduling Policy Goals/Criteria
- Minimize response time.
  - Minimize the elapsed time to do an operation or job; response time is what the user sees: the time to echo a keystroke in an editor, the time to compile a program.
  - Real-time tasks: must meet deadlines imposed by the world.
- Maximize throughput: maximize operations or jobs per second.
  - Throughput is related to response time, but not identical: minimizing response time will lead to more context switching than if you only maximized throughput.
  - Two parts to maximizing throughput: minimize overhead (for example, context switching) and use resources efficiently (CPU, disk, memory, etc.).
- Fairness.
  - Share the CPU among users in some equitable way.
  - Fairness is not minimizing average response time: you can get better average response time by making the system less fair.
- How can we evaluate a scheduling algorithm?

First-Come, First-Served Scheduling (1)
- First-Come, First-Served (FCFS); also "First In, First Out" (FIFO) or "run until done."
  - In early systems, FCFS meant one program was scheduled until done, including its I/O; now it means a thread keeps the CPU until it blocks.
- Example: burst times P1 = 24, P2 = 3, P3 = 3, with processes arriving in the order P1, P2, P3.
  - Gantt chart: P1 runs 0-24, P2 runs 24-27, P3 runs 27-30.
  - Waiting times: P1 = 0, P2 = 24, P3 = 27; average waiting time = (0 + 24 + 27)/3 = 17.
  - Average completion time = (24 + 27 + 30)/3 = 27.

FCFS (2)
- Example continued: suppose the processes arrive in the order P2, P3, P1.
  - Gantt chart: P2 runs 0-3, P3 runs 3-6, P1 runs 6-30.
  - Waiting times: P1 = 6, P2 = 0, P3 = 3; average waiting time = (6 + 0 + 3)/3 = 3.
  - Average completion time = (3 + 6 + 30)/3 = 13.
- In the second case, average waiting time is much better (before, it was 17), and average completion time is better (before, it was 27).
- FIFO pros and cons:
  - Pro: simple.
  - Con: short jobs get stuck behind long ones. Supermarket analogy: getting milk, you're always stuck behind a cart full of small items. (Upside: you get to read about Britney Spears.)

Pros/Cons of FCFS; Round Robin (RR) Scheduling
- FCFS scheme: potentially bad for short jobs; depends on submit order.
  - If you are first in line at the supermarket with milk, you don't care who is behind you.
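The FCFS arithmetic above is easy to check mechanically. This is a small sketch (not from the notes); the helper name is made up for illustration, and jobs are assumed to all arrive at time 0:

```python
def fcfs_times(bursts):
    """Average waiting and completion time under FCFS, given CPU-burst
    lengths listed in arrival order (all jobs arrive at time 0)."""
    t, waits, completions = 0, [], []
    for burst in bursts:
        waits.append(t)          # a job waits until all earlier jobs finish
        t += burst
        completions.append(t)    # job completes once its burst has run
    n = len(bursts)
    return sum(waits) / n, sum(completions) / n
```

Running it on the two orderings from the example reproduces the slide's numbers: `fcfs_times([24, 3, 3])` gives averages of 17 and 27, while `fcfs_times([3, 3, 24])` gives 3 and 13.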
- Round Robin scheme:
  - Each process gets a small unit of CPU time (a time quantum), usually 10-100 milliseconds.
  - After the quantum expires, the process is preempted and added to the end of the ready queue.
  - With n processes in the ready queue and time quantum q: each process gets 1/n of the CPU time, in chunks of at most q time units, and no process waits more than (n-1)q time units.
- Performance:
  - q large => FCFS.
  - q small => interleaved (really small => hyperthreading).
  - q must be large with respect to the context switch; otherwise the overhead is too high (all overhead).

RR with Time Quantum = 20 Milliseconds
- Example: burst times P1 = 53, P2 = 8, P3 = 68, P4 = 24.
  - Gantt chart: P1 0-20, P2 20-28, P3 28-48, P4 48-68, P1 68-88, P3 88-108, P4 108-112, P1 112-125, P3 125-145, P3 145-153.
  - Waiting times: P1 = (68-20) + (112-88) = 72; P2 = 20 - 0 = 20; P3 = (28-0) + (88-48) + (125-108) = 85; P4 = (48-0) + (108-68) = 88.
  - Average waiting time = (72 + 20 + 85 + 88)/4 = 66.25.
  - Average completion time = (125 + 28 + 153 + 112)/4 = 104.5.
- Thus, Round Robin pros and cons: better for short jobs, and fair; but context-switching time adds up for long jobs.

Round Robin Discussion
- How do you choose the time slice?
  - What if it is too big? Response time suffers.
  - What if it is infinite? You get back FIFO.
  - What if it is too small? Throughput suffers.
- Actual choices of timeslice:
  - Initially, the UNIX timeslice was one second. That worked OK when UNIX was used by one or two people, but what if three compilations are going on? Three seconds to echo each keystroke!
  - In practice, we need to balance short-job performance and long-job throughput.
  - A typical time slice today is between 10 ms and 100 ms; typical context-switching overhead is 0.1 ms - 1 ms, so roughly 1% of overhead is due to context switching.

Comparing FCFS and RR
- Assuming zero-cost context switching, is RR always better than FCFS?
- Simple example: 10 jobs, each taking 100 s of CPU time; an RR scheduler with a quantum of 1 s; all jobs start at the same time.
  - Completion times: under FCFS, job i finishes at 100·i seconds; under RR, the jobs finish at 991, 992, ..., 1000 seconds.
  - Both RR and FCFS finish the last job at the same time, but average response time is much worse under RR.
- RR is bad when all jobs are the same length. Also, cache state must be shared among all jobs with RR but can be devoted to each job with FIFO, so the total time for RR is longer even with a zero-cost switch.

FCFS with Different Time Quantum
[Table: completion times for the same workload under FCFS and RR with various quanta omitted.]

What If You Could See the Future?
- Could we always mirror the best FCFS?
- Shortest Job First (SJF): run whatever job has the least amount of computation to do. Sometimes called Shortest Time to Completion First (STCF).
- Shortest Remaining Time First (SRTF): the preemptive version of SJF: if a job arrives and has a shorter time to completion than the remaining time on the current job, immediately preempt the CPU. Sometimes called Shortest Remaining Time to Completion First (SRTCF).
- These can be applied either to a whole program or to the current CPU burst of each program.
  - The idea is to get short jobs out of the system: a big effect on short jobs, only a small effect on long ones. The result is better average response time.

Discussion
- SJF/SRTF are the best you can do at minimizing average response time.
  - Provably optimal (SJF among nonpreemptive policies, SRTF among preemptive ones).
  - Since SRTF is always at least as good as SJF, focus on SRTF.
- Comparison of SRTF with FCFS and RR:
  - What if all jobs are the same length? SRTF becomes the same as FCFS (i.e., FCFS is the best you can do if all jobs are the same length).
  - What if jobs have varying length? Under SRTF and RR, short jobs are not stuck behind long ones.

Example: SRTF (1)
- Three jobs, with C's I/O shown on a timeline: A and B are CPU-bound and each runs for a week; C is I/O-bound, looping 1 ms of CPU then 9 ms of disk I/O.
  - Running alone, C would use 90% of the disk; A or B alone could use 100% of the CPU.
- With FIFO: once A or B gets in, it keeps the CPU for two weeks.
- What about RR or SRTF? Easier to see with a timeline.

Example: SRTF (2)
[Figure: timeline comparing RR with a 100 ms time slice, RR with a 1 ms time slice, and SRTF for jobs A, B, and C.]

Last Word on SRTF
- Starvation: SRTF can lead to starvation if there are many small jobs; large jobs may never get to run.
- Somehow we need to predict the future. How can we do this?
  - Some systems ask the user: when you submit a job, you have to say how long it will take. To stop cheating, the system kills the job if it takes too long.
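The RR-vs-FCFS completion-time claim can be checked with a tiny round-robin simulator. This is a sketch (not from the notes): the function name is made up, all jobs are assumed to arrive at time 0, and context switches are assumed free:

```python
from collections import deque

def round_robin(bursts, q):
    """Completion time of each job under round robin with quantum q.
    All jobs arrive at time 0; context-switch cost is assumed zero."""
    ready = deque(enumerate(bursts))   # (job id, remaining time)
    t, done = 0, [0] * len(bursts)
    while ready:
        i, rem = ready.popleft()
        run = min(q, rem)              # run for one quantum or until done
        t += run
        if rem > run:
            ready.append((i, rem - run))   # preempted: back of the queue
        else:
            done[i] = t                    # job finished at time t
    return done
```

For the 10 jobs of 100 s each: with q = 100 the schedule degenerates to FCFS (completions 100, 200, ..., 1000, average 550), while with q = 1 every job finishes in the last 10 seconds (991, ..., 1000, average 995.5), illustrating why RR's average response time is much worse when all jobs are the same length.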
- But even non-malicious users have trouble predicting the runtime of their jobs.
- Bottom line: you can't really know how long a job will take.
  - However, you can use SRTF as a yardstick for measuring other policies: it is optimal, so you can't do any better.
- SRTF pros & cons: optimal average response time; but it is hard to predict the future, and it is unfair.

How About Predicting the Next CPU Burst Length?
- Adaptive: changing policy based on past behavior.
  - Used in CPU scheduling, in virtual memory, in file systems, etc.
  - This works because programs have predictable behavior: if a program was I/O-bound in the past, it is likely to be I/O-bound in the future. (If computer behavior were random, this wouldn't help.)
- Example: SRTF with estimated burst length.
  - Use an estimator function on previous bursts: let t(n-1), t(n-2), t(n-3), etc. be the previous CPU burst lengths, and estimate the next burst as τ(n) = f(t(n-1), t(n-2), t(n-3), ...).
  - The function f could be one of many different time-series estimation schemes (Kalman filters, etc.).
  - For instance, exponential averaging: τ(n) = α·t(n-1) + (1-α)·τ(n-1), with 0 < α ≤ 1.

Multi-Level Feedback Scheduling
- Another method for exploiting past behavior.
- Multiple queues, each with a different priority; the higher-priority queues are often considered foreground tasks.
- Each queue has its own scheduling algorithm, e.g., foreground RR, background FCFS.
  - Sometimes there are multiple RR priorities with the quantum increasing exponentially: highest 1 ms, next 2 ms, next 4 ms, etc.
- Adjust each job's priority as follows (details vary):
  - A job starts in the highest-priority queue.
  - If its timeout expires, drop it one level.
  - If its timeout doesn't expire, push it up one level (or to the top).

Scheduling Details
- The result approximates SRTF: CPU-bound jobs drop like a rock, while short-running I/O-bound jobs stay near the top.
- Scheduling must also be done between the queues:
  - Fixed-priority scheduling: serve all from the highest priority, then the next priority, etc.
  - Time slice: each queue gets a certain amount of CPU time, e.g., 70% to the highest, 20% to the next, 10% to the lowest.
- Countermeasure: a user action that can foil the intent of the OS designer.
  - For multi-level feedback, put in a bunch of meaningless I/O to keep the job's priority high. Of course, if everyone did this, it wouldn't work.
  - Example: an Othello program playing against a competitor, where the key was to do its computing at a higher priority than the competitor's. Putting in printf's made it run much faster.

Fairness
- What about fairness?
  - Strict fixed-priority scheduling between queues is unfair (run the highest, then the next, etc.): long-running jobs may never get the CPU. In Multics, a shutdown of the machine found a 10-year-old job.
  - We must give long-running jobs a fraction of the CPU even when there are shorter jobs to run.
  - Tradeoff: fairness is gained by hurting average response time.
- How to implement fairness?
  - Could give each queue some fraction of the CPU. But what if there is one long-running job and 100 short-running ones? Like express lanes in a supermarket: sometimes the express lanes get so long that you get better service by going into one of the other lines.
  - Could increase the priority of jobs that don't get service (this is what is done in UNIX). But this is ad hoc: at what rate should you increase priorities? And as the system gets overloaded, no job gets CPU time, so everyone increases in priority => interactive jobs suffer.

Lottery Scheduling
- Yet another alternative: lottery scheduling.
  - Give each job some number of lottery tickets; on each time slice, randomly pick a winning ticket.
  - On average, CPU time is proportional to the number of tickets given to each job.
- How to assign tickets?
  - To approximate SRTF, short-running jobs get more, long-running jobs get fewer.
  - To avoid starvation, every job gets at least one ticket, so everyone makes progress.
- Advantage over strict priority scheduling: it behaves gracefully as load changes.
  - Adding or deleting a job affects all jobs proportionally, independent of how many tickets each job possesses.

Lottery Scheduling Example
- Assume short jobs get 10 tickets and long jobs get 1 ticket.
[Table: CPU share received by each job for various mixes of short and long jobs omitted.]
- What if there are too many short jobs to give reasonable response time? In UNIX, if the load average is 100, it is hard to make progress.
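The ticket-drawing mechanism above is simple enough to sketch directly. This illustrative code is not from the notes; the function name, seed, and job labels are made up, and a fixed seed is used so the run is repeatable:

```python
import random

def lottery_schedule(tickets, nslices, seed=0):
    """Run nslices lottery drawings; each job's chance of winning a slice
    is proportional to its ticket count, so on average CPU share tracks
    tickets held."""
    rng = random.Random(seed)          # seeded for reproducibility
    jobs = list(tickets)
    weights = [tickets[j] for j in jobs]
    counts = dict.fromkeys(jobs, 0)
    for _ in range(nslices):
        winner = rng.choices(jobs, weights)[0]   # draw a winning ticket
        counts[winner] += 1
    return counts
```

With a 10:1 ticket split, a short job should receive close to 10/11 of the slices, and no job with at least one ticket can starve forever.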
- One approach: log some user out.

Scheduling Algorithm Evaluation
- Deterministic modeling: take a predetermined workload and compute the performance of each algorithm for that workload.
- Queueing models: a mathematical approach for handling stochastic workloads.
- Implementation/simulation: build a system which allows actual algorithms to be run against actual data. This is the most flexible and general approach.
[Figure: a trace of actual process execution drives simulations of FCFS, SJF, and RR (q = 14), producing performance statistics for each.]

Last Word on Scheduling
- When do the details of the scheduling policy and fairness really matter? When there aren't enough resources to go around.
- When should you simply buy a faster computer (or network link, or expanded highway, or ...)?
  - One approach: buy it when it will pay for itself in improved response time, assuming you're paying for worse response time in reduced productivity, customer angst, etc.
  - You might think you should buy a faster X when X is utilized 100%, but usually response time goes to infinity as utilization approaches 100%.
- An interesting implication of this curve: most scheduling algorithms work fine in the linear portion of the load curve and fail otherwise. This argues for buying a faster X when you hit the knee of the curve.

Recap: Scheduling (1)
- Scheduling: selecting a waiting process from the ready queue and allocating the CPU to it.
- FCFS scheduling: run threads to completion in order of submission.
  - Pro: simple. Con: short jobs get stuck behind long ones.
- Round-Robin scheduling: give each thread a small amount of CPU time when it executes; cycle between all ready threads.
  - Pro: better for short jobs. Con: poor when jobs are the same length.

Recap: Scheduling (2)
- Shortest Job First (SJF) / Shortest Remaining Time First (SRTF): run whatever job has the least amount of computation (or least remaining computation) to do.
  - Pro: optimal average response time. Cons: hard to predict the future; unfair.
- Multi-level feedback scheduling: multiple queues of different priorities, with automatic promotion/demotion of process priority in order to approximate SJF/SRTF.
- Lottery scheduling: give each thread a priority-dependent number of tokens (short tasks => more tokens); reserve a minimum number of tokens for every thread to ensure forward progress and fairness.

Page Replacement Strategies
Jay Kothari (jayk@drexel.edu), Maxim Shevertalov (max@drexel.edu)
CS 370 Operating Systems
Summer 2008

Page Replacement Policies
- Why do we care about the replacement policy?
  - Replacement is an issue with any cache, but it is particularly important with pages: the cost of being wrong is high, because we must go to disk. We must keep important pages in memory, not toss them out.
- FIFO (First In, First Out): throw out the oldest page.
  - Fair: every page lives in memory for the same amount of time.
  - Bad: it throws out heavily used pages instead of infrequently used pages.
- MIN (Minimum): replace the page that won't be used for the longest time.
  - Great, but we can't really know the future; it makes a good comparison case, however.
- RANDOM: pick a random page for every replacement.
  - The typical solution for TLBs: simple hardware.
  - Pretty unpredictable, which makes it hard to make real-time guarantees.

Page Replacement Policies, continued
- LRU (Least Recently Used): replace the page that hasn't been used for the longest time.
  - Programs have locality, so if something hasn't been used for a while, it is unlikely to be used in the near future. LRU seems like it should be a good approximation to MIN.
- How to implement LRU? Use a list:
  - On each use, remove the page from the list and place it at the head; the LRU page is at the tail.
  - Problems with this scheme for paging: we need to know immediately when each page is used, so that we can change its position in the list; that is many instructions for each memory access. In practice, people approximate LRU (more later).

Example: FIFO
- Suppose we have 3 page frames, 4 virtual pages, and the reference stream A B C A B D A D B C B.
- Consider FIFO page replacement: 7 faults.
  - When referencing D, replacing A is a bad choice, since we need A again right away.
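The fault counts for this reference stream are easy to reproduce. This sketch (not from the notes; the helper name is made up) counts faults under FIFO and LRU using an ordered map as the frame set:

```python
from collections import OrderedDict

def count_faults(refs, nframes, policy="fifo"):
    """Count page faults on a reference string under FIFO or LRU."""
    frames = OrderedDict()   # insertion order = eviction order
    faults = 0
    for page in refs:
        if page in frames:
            if policy == "lru":
                frames.move_to_end(page)   # refresh recency on a hit
            continue                       # hit: FIFO order is unchanged
        faults += 1
        if len(frames) == nframes:
            frames.popitem(last=False)     # evict oldest / least recent
        frames[page] = True
    return faults
```

On the stream A B C A B D A D B C B with 3 frames this gives 7 faults for FIFO and 5 for LRU, matching the slides (LRU makes the same decisions as MIN here). It also reproduces Belady's anomaly, discussed below: on 1 2 3 4 1 2 5 1 2 3 4 5, FIFO takes 9 faults with 3 frames but 10 with 4.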
Example: MIN
- Suppose we have the same reference stream: A B C A B D A D B C B.
- Consider MIN page replacement: 5 faults.
  - Where will D be brought in? Look for the page not referenced farthest in the future.
- What will LRU do? The same decisions as MIN here, but that won't always be true.

When Will LRU Perform Poorly?
- Consider the stream A B C D A B C D A B C D (with 3 frames).
  - LRU performs the same as FIFO here: every reference is a page fault.
  - MIN does much better.

Graph of Page Faults vs. Number of Frames
- One desirable property: when you add memory, the miss rate goes down.
- Does this always happen? It seems like it should, right?
  - No: Belady's anomaly. Certain replacement algorithms (FIFO) don't have this obvious property.

Adding Memory Doesn't Always Help the Fault Rate
- Does adding memory reduce the number of page faults?
  - Yes for LRU and MIN; not necessarily for FIFO (this is called Belady's anomaly).
- After adding memory:
  - With FIFO, the contents of memory can be completely different.
  - In contrast, with LRU or MIN, the contents of memory with X pages are a subset of the contents with X+1 pages.

Implementing LRU
- Perfect: timestamp each page on each reference and keep a list of pages ordered by time of reference. Too expensive to implement in reality, for many reasons.
- Clock algorithm: arrange the physical pages in a circle with a single clock hand.
  - An approximation to LRU (which is itself an approximation to MIN).
  - Replace an old page, not the oldest page.
- Details:
  - Hardware keeps a "use" bit per physical page and sets it on each reference. If the use bit isn't set, the page hasn't been referenced in a long time. (In Nachos, hardware sets the use bit in the TLB; you have to copy it back to the page table when the TLB entry gets replaced.)
  - On a page fault: advance the clock hand (not in real time) and check the use bit. 1 => used recently: clear the bit and leave the page alone. 0 => selected as the candidate for replacement.
  - The hand will always find a page or loop forever. Even if all use bits are set, it will eventually loop around => FIFO.

Clock Algorithm: Not Recently Used
[Figure: the set of all pages in memory arranged in a circle; a single clock hand advances only on page faults, checking for pages not used recently and marking pages as not used recently.]
- What if the hand is moving slowly? A good sign: not many page faults and/or we find a page quickly.
- What if the hand is moving quickly? Lots of page faults and/or lots of reference bits set.
- One way to view the clock algorithm: a crude partitioning of pages into two groups, young and old. Why not partition into more than 2 groups?

Nth-Chance Version of the Clock Algorithm
- Nth-chance algorithm: give each page N chances.
  - The OS keeps a counter per page, counting sweeps.
  - On a page fault, the OS checks the use bit: 1 => clear the use bit and also clear the counter (the page was used in the last sweep); 0 => increment the counter, and if count = N, replace the page.
  - This means the clock hand has to sweep by N times without the page being used before the page is replaced.
- How do we pick N?
  - Why pick a large N? A better approximation to LRU (N = 1K is a really good approximation).
  - Why pick a small N? More efficient; otherwise we might have to look a long way to find a free page.
- What about dirty pages?
  - It takes extra overhead to replace a dirty page, so give dirty pages an extra chance before replacing them. Common approach: clean pages use N = 1; dirty pages use N = 2 (and are written back to disk when N = 1).

Clock Algorithm Details
- Which bits of a PTE are useful to us?
  - Use: set when the page is referenced; cleared by the clock algorithm.
  - Modified: set when the page is modified; cleared when the page is written to disk.
  - Valid: OK for the program to reference this page.
  - Read-only: OK for the program to read the page, but not modify it (for example, for catching modifications to code pages).
- Do we really need a hardware-supported modified bit?
  - No. It can be emulated (as in BSD Unix) using the read-only bit: initially mark all pages as read-only, even data pages. On a write, trap to the OS; the OS sets a software modified bit and marks the page as read/write. Whenever a page comes back in from disk, mark it read-only again.

Clock Algorithm Details, continued
- Do we really need a hardware-supported use bit? No. It can be emulated, similar to the above.
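The clock sweep described above can be sketched in a few lines. This is illustrative code, not from the notes (the class and method names are made up); the "hardware" use bit is just a Python list here:

```python
class Clock:
    """Clock (second-chance) page replacement over a fixed set of frames."""
    def __init__(self, nframes):
        self.frames = [None] * nframes   # physical pages in a circle
        self.use = [0] * nframes         # per-frame use bit
        self.hand = 0
    def reference(self, page):
        """Reference a page; return True if it caused a page fault."""
        if page in self.frames:
            self.use[self.frames.index(page)] = 1   # hardware sets use bit
            return False
        while True:                       # fault: sweep until use bit == 0
            if self.use[self.hand]:
                self.use[self.hand] = 0   # used recently: clear, move on
                self.hand = (self.hand + 1) % len(self.frames)
            else:
                self.frames[self.hand] = page        # replace this old page
                self.use[self.hand] = 1
                self.hand = (self.hand + 1) % len(self.frames)
                return True
```

Note the FIFO degeneration the slide mentions: on the stream A B C A D with 3 frames, all use bits are set when D faults, so the hand clears every bit, loops around, and evicts A even though A was just re-referenced.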
- Mark all pages as invalid, even if they are in memory.
- On a read to an invalid page, trap to the OS; the OS sets the use bit and marks the page read-only.
- Get the modified bit the same way as before: on a write, trap to the OS (the page is either invalid or read-only), set the use and modified bits, and mark the page read/write.
- When the clock hand passes by, reset the use and modified bits and mark the page as invalid again.
- Remember, however, that the clock is just an approximation of LRU.
  - Can we do a better approximation, given that we have to take page faults on some reads and writes to collect use information?
  - We need to identify an old page, not the oldest page. Answer: a second-chance list.

Second-Chance List Algorithm (VAX/VMS)
[Figure: new pages are paged in from disk to the front of the active list (directly mapped pages, marked RW, managed FIFO); overflow from its end moves to the front of the second-chance list (marked invalid, managed LRU); victims at the end of the second-chance list are paged out to disk.]
- Split memory in two: an active list (read/write) and a second-chance (SC) list (invalid).
- Pages on the active list are accessed at full speed; otherwise, a page fault occurs.
  - Always move the overflow page from the end of the active list to the front of the SC list, and mark it invalid.
  - If the desired page is on the SC list: move it to the front of the active list and mark it RW.
  - If it is not on the SC list: page it in to the front of the active list, mark it RW, and page out the LRU victim at the end of the SC list.

Second-Chance List Algorithm, con't
- How many pages for the second-chance list?
  - If 0 => FIFO. If all => LRU, but with a page fault on every page reference.
  - Pick an intermediate value. The result is:
    - Pro: few disk accesses; a page only goes to disk if it is unused for a long time.
    - Con: increased overhead trapping to the OS (a software/hardware tradeoff).
- With page translation, we can adapt to any kind of access the program makes.
  - Later we will show how to use page translation and protection to share memory between threads on widely separated machines.
- Question: why didn't the VAX include a use bit?
  - Strecker (the architect) asked the OS people; they said they didn't need it, so it wasn't implemented. He later got blamed, but the VAX did OK anyway.

Free List
[Figure: the clock hand sweeps the set of all pages in memory, advancing as needed to keep a free list full ("background"); free pages are handed to processes on demand.]
- Keep a set of free pages ready for use in demand paging.
  - The free list is filled in the background by the clock algorithm or another technique (a pageout demon).
  - Dirty pages start copying back to disk when they enter the list (like the VAX second-chance list).
  - If a page is needed before it is reused, just return it to the active set.
- Advantage: faster page faults; a page (or pages) can always be used immediately on a fault.

Demand Paging, More Details
- Does a software-loaded TLB need a use bit? Two options:
  - Hardware sets the use bit in the TLB; when a TLB entry is replaced, software copies the use bit back to the page table.
  - Software manages TLB entries as a FIFO list; everything not in the TLB is on a second-chance list, managed as strict LRU.
- Core map:
  - Page tables map virtual page => physical page. Do we need a reverse mapping (i.e., physical page => virtual page)?
  - Yes. The clock algorithm runs through page frames, and with sharing there are multiple virtual pages per physical page; we can't push a page out to disk without invalidating all its PTEs.

Allocation of Page Frames (Memory Pages)
- How do we allocate memory among different processes?
  - Does every process get the same fraction of memory? Different fractions? Should we completely swap some processes out of memory?
- Each process needs a minimum number of pages.
  - We want to make sure that all processes loaded into memory can make forward progress.
  - Example: the IBM 370 needs 6 pages to handle the SS MOVE instruction: the instruction is 6 bytes and might span 2 pages, plus 2 pages to handle the "from" operand and 2 pages to handle the "to" operand.
- Possible replacement scopes:
  - Global replacement: a process selects a replacement frame from the set of all frames; one process can take a frame from another.
  - Local replacement: each process selects only from its own set of allocated frames.
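The two-list mechanism above can be sketched with a pair of ordered maps. This is illustrative code, not the VAX/VMS implementation: the class name and return strings are made up, the active list is FIFO (accesses to mapped pages don't trap, so their order can't be refreshed), and the SC list yields its oldest entry as the victim:

```python
from collections import OrderedDict

class SecondChanceList:
    """Sketch of a VAX/VMS-style second-chance list."""
    def __init__(self, active_size, sc_size):
        self.active_size, self.sc_size = active_size, sc_size
        self.active = OrderedDict()   # mapped RW: accessed at full speed
        self.sc = OrderedDict()       # marked invalid: any access faults
        self.evicted = []             # victims paged out to disk
    def access(self, page):
        if page in self.active:
            return "hit"              # no fault, no bookkeeping possible
        if page in self.sc:           # soft fault: rescue page, no disk I/O
            del self.sc[page]
            self._activate(page)
            return "soft fault"
        self._activate(page)          # hard fault: page in from disk
        return "hard fault"
    def _activate(self, page):
        self.active[page] = True
        if len(self.active) > self.active_size:
            old, _ = self.active.popitem(last=False)   # FIFO overflow
            self.sc[old] = True                        # to front of SC list
            if len(self.sc) > self.sc_size:
                victim, _ = self.sc.popitem(last=False)  # LRU victim
                self.evicted.append(victim)
```

A soft fault is cheap (a trap, no disk access), which is exactly the pro/con tradeoff the slide describes.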
Fixed-Priority Allocation
- Equal allocation (fixed scheme): every process gets the same amount of memory.
  - Example: 100 frames, 5 processes => each process gets 20 frames.
- Proportional allocation (fixed scheme): allocate according to the size of the process.
  - Let s_i = the size of process p_i, S = Σ s_i, and m = the total number of frames; then the allocation for p_i is a_i = (s_i / S) × m.
- Priority allocation: a proportional scheme using priorities rather than size (the same type of computation as the previous scheme).
  - Possible behavior: if process p_i generates a page fault, select for replacement a frame from a process with a lower priority number.
- Perhaps we should use an adaptive scheme instead: what if some application just needs more memory?

Page-Fault Frequency Allocation
- Can we reduce capacity misses by dynamically changing the number of pages per application?
[Figure: page-fault rate vs. number of frames; above an upper bound on the fault rate, increase the number of frames; below a lower bound, decrease it.]
- Establish an acceptable page-fault rate:
  - If the actual rate is too low, the process loses a frame.
  - If the actual rate is too high, the process gains a frame.
- Question: what if we just don't have enough memory?

Thrashing
[Figure: CPU utilization vs. degree of multiprogramming rises, then collapses once the system starts thrashing.]
- If a process does not have enough pages, the page-fault rate is very high. This leads to:
  - Low CPU utilization.
  - The operating system spending most of its time swapping to disk.
- Thrashing: a process is busy swapping pages in and out.
- Questions: how do we detect thrashing, and what is the best response to it?

Locality in a Memory Reference Pattern
- Program memory access patterns have temporal and spatial locality.
- The group of pages accessed along a given time slice is called the working set.
- The working set defines the minimum number of pages needed for the process to behave well.
- Not enough memory for the working set => thrashing; better to swap out the process.

Working Set Model
[Figure: a page reference string with working-set windows of size Δ; at time t1 the working set is {1, 2, 5, 6, 7}, and at time t2 it is {3, 4}.]
- Δ = the working-set window: a fixed number of page references (example: 10,000 instructions).
- WS_i = the working set of process P_i = the total set of pages referenced in the most recent Δ; it varies in time.
  - If Δ is too small, it will not encompass an entire locality; if Δ is too large, it will encompass several localities; if Δ = ∞, it will encompass the entire program.
- D = Σ |WS_i| = the total demand for frames.
  - If D > m => thrashing.
  - Policy: if D > m, then suspend/swap out processes. This can improve overall system behavior by a lot.

What About Compulsory Misses?
- Recall that compulsory misses are misses that occur the first time a page is seen:
  - Pages that are touched for the first time.
  - Pages that are touched after a process is swapped out and swapped back in.
- Clustering: on a page fault, bring in multiple pages around the faulting page.
  - Since the efficiency of disk reads increases with sequential reads, it makes sense to read several sequential pages.
- Working-set tracking: use an algorithm to try to track the working set of an application; when swapping a process back in, swap in its working set.
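The proportional-allocation formula a_i = (s_i / S) × m can be made concrete. This sketch is not from the notes; the function name is made up, and since the formula yields fractions, a rounding rule is assumed here (floor, then hand leftover frames to the largest remainders) so that exactly m frames are allocated:

```python
def proportional_allocation(sizes, m):
    """Allocate m frames in proportion to process size: a_i = s_i/S * m."""
    S = sum(sizes.values())
    raw = {p: s * m / S for p, s in sizes.items()}   # exact fractional share
    alloc = {p: int(r) for p, r in raw.items()}      # floor each share
    leftover = m - sum(alloc.values())
    # distribute remaining frames to the largest fractional remainders
    for p in sorted(raw, key=lambda p: raw[p] - alloc[p], reverse=True)[:leftover]:
        alloc[p] += 1
    return alloc
```

With two processes of sizes 10 and 127 and m = 62 frames, the shares are 10/137 × 62 ≈ 4.5 and 127/137 × 62 ≈ 57.5, giving allocations of 5 and 57; equal-sized processes reduce to the equal-allocation scheme (100 frames, 5 processes => 20 each).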
Paging Summary
- Replacement policies:
  - FIFO: place pages on a queue; replace the page at the end.
  - MIN: replace the page that will be used farthest in the future.
  - LRU: replace the page that was used farthest in the past.
  - Clock algorithm: an approximation to LRU. Arrange all pages in a circular list and sweep through them, marking them as not in use; if a page is not used for one pass, it can be replaced.
  - Nth-chance clock algorithm: another approximation to LRU; give pages multiple passes of the clock hand before replacing them.
  - Second-chance list algorithm: yet another approximation to LRU; divide pages into two groups, one of which is truly LRU and managed on page faults.
- Working set: the set of pages touched by a process recently.
- Thrashing: a process is busy swapping pages in and out.
  - A process will thrash if its working set doesn't fit in memory; then we need to swap out a process.

Memory, Kernel, and Address Spaces
Maxim Shevertalov, Jay Kothari
CS 370 Operating Systems
7/31/2008

Virtualizing Resources
- Physical reality: different processes/threads share the same hardware.
  - Need to multiplex the CPU (just finished: scheduling).
  - Need to multiplex the use of memory (today).
  - Need to multiplex disks and devices (later in the term).
- Why worry about memory sharing?
  - The complete working state of a process and/or the kernel is defined by its data in memory and registers.
  - Consequently, we cannot just let different threads of control use the same memory: physics dictates that two different pieces of data cannot occupy the same locations in memory.
  - And we probably don't want different threads to even have access to each other's memory (protection).

Important Aspects of Memory Multiplexing
- Controlled overlap:
  - The separate state of threads should not collide in physical memory; obviously, unexpected overlap causes chaos.
  - Conversely, we would like the ability to overlap when desired, for communication.
- Protection:
  - Prevent access to the private memory of other processes.
  - Different pages of memory can be given special behavior: read-only, invisible to user programs, etc.
  - Kernel data is protected from user programs, and programs are protected from themselves.

Example of General Address Translation
[Figure: Prog 1 and Prog 2 each have their own virtual address space (code, data, heap, stack); translation maps 1 and 2 place them into a single physical address space that also holds the OS code, OS data, and OS heap & stacks.]

Two Views of Memory
[Figure: the CPU issues virtual addresses, which pass through a translation box to become physical addresses in memory; untranslated reads or writes bypass the box.]
- Recall: an address space is all the addresses and state a process can touch; each process and the kernel has a different address space.
- Consequently, there are two views of memory:
  - The view from the CPU: what the program sees (virtual memory).
  - The view from memory: physical memory.
  - The translation box converts between the two views.
- Translation helps to implement protection: if task A cannot even gain access to task B's data, there is no way for A to adversely affect B.
- With translation, every program can be linked/loaded into the same region of the user address space; overlap is avoided through translation, not relocation.

User => Kernel: System Calls
- We can't let the inmate (user) get out of the padded cell on its own: that would defeat the purpose of protection.
- So how does the user program get back into the kernel?
[Figure: a user process executes in user mode (mode bit = 1); a system call traps into kernel mode (mode bit = 0), the kernel executes the system call, and the return from the system call switches back to user mode (mode bit = 1).]
- A system call is a voluntary procedure call into the kernel: hardware provides the controlled user => kernel transition.
- Can any kernel routine be called? No, only specific ones: the system call ID is encoded into the system call instruction, and the index forces a well-defined interface with the kernel.

Summary
- Memory is a resource that must be shared.
  - Controlled overlap: only shared when appropriate.
  - Translation: change virtual addresses into physical addresses.
  - Protection: prevent unauthorized sharing of resources.
- Simple protection through segmentation: base/limit registers restrict the memory accessible to the user, and can be used to translate as well.
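The base/limit check-then-relocate step is small enough to sketch. This illustrative code is not from the notes (the function name is made up); it models a single segment's translation:

```python
def base_limit_translate(vaddr, base, limit):
    """Base-and-limit (segmentation) translation: bounds-check the
    segment-relative address, then add the base to relocate it."""
    if vaddr >= limit:                       # access past end of segment
        raise MemoryError("segmentation violation: address out of range")
    return base + vaddr                      # physical address
```

Because every address is relative to the segment, moving the program just means changing `base`; the program still sees memory starting at 0.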
Segmentation with Base and Limit Registers
(figure: the virtual address is checked against the Limit register, raising an error if out of range, and otherwise added to the Base register to form the physical address)
- Could use base & limit for dynamic address translation (often called "segmentation"):
  - Alter the address of every load/store by adding the base
  - User allowed to read/write within the segment
- Accesses are relative to the segment, so they don't have to be relocated when the program is moved to a different segment
- This gives the program the illusion that it is running on its own dedicated machine, with memory starting at 0
  - Program gets a continuous region of memory
  - Addresses within the program do not have to be relocated when the program is placed in a different region of DRAM

Issues with the Simple Segmentation Method
(figure: as processes of different sizes are loaded and terminated over time, free memory is broken into scattered holes)
- Fragmentation problem
  - Not every process is the same size
  - Over time, memory space becomes fragmented
  - Really bad if you want space to grow dynamically (e.g., the heap)
- Hard to do interprocess sharing
  - Want to share code segments when possible
  - Want to share memory between processes
  - Helped by providing multiple segments per process
- Need enough physical memory for every process

More Flexible Segmentation
(figure: the user's logical view, main program, subroutine, and stack, versus the segments' actual placement in physical memory)
- Logical view: multiple separate segments
  - Typical: code, data, stack
  - Others: memory sharing, etc.
- Each segment is given a region of contiguous memory
  - Has a base and a limit
  - Can reside anywhere in physical memory

Implementing the Multi-Segment Model
(figure: a segment table of base/limit pairs (Base0/Limit0 through Base4/Limit4) with access bits; the segment number from the virtual address selects an entry, the offset is checked against the limit, and the base is added to form the physical address)
- Segment map resides in the processor
  - Segment number mapped into a base/limit pair
  - Base added to offset to generate the physical address
  - Error check catches offsets out of range
- As many chunks of physical memory as entries
  - Segment addressed by a portion of the virtual address
- What is the V/N bit? Can mark segments as invalid; requires a check as well
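The multi-segment translation just described can be sketched in a few lines of Python. This is an illustrative model only: the `SEG_TABLE` contents, the 16-bit address with a 2-bit segment number, and the function names are assumptions for the example, not any real hardware interface.

```python
# Sketch of multi-segment address translation: the top bits of the virtual
# address select a segment; the offset is checked against that segment's
# limit and then added to its base.
SEG_BITS = 2       # hypothetical 2-bit segment number
OFFSET_BITS = 14   # hypothetical 14-bit offset

# (base, limit) per segment -- illustrative values, not from the slides.
# Segment 2 is absent, i.e., marked invalid (the V/N check).
SEG_TABLE = {
    0: (0x10000, 0x2000),   # code
    1: (0x30000, 0x1000),   # data
    3: (0x50000, 0x0800),   # stack
}

def translate(vaddr):
    seg = vaddr >> OFFSET_BITS
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    if seg not in SEG_TABLE:
        raise MemoryError(f"invalid segment {seg}")
    base, limit = SEG_TABLE[seg]
    if offset >= limit:
        raise MemoryError(f"offset {offset:#x} exceeds limit {limit:#x}")
    return base + offset    # physical address
```

For example, virtual address 0x0100 falls in segment 0 and translates to 0x10100, while any address in segment 2 raises an error, mimicking the invalid-segment check.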
Implementing Paging
(figure: PageTablePtr and PageTableSize registers locate the page table; the virtual page number indexes the table, and out-of-bounds indices or bad permissions raise an access error)
- Page Table (one per process)
  - Resides in physical memory
  - Contains the physical page and permissions for each virtual page
    - Permissions include: valid bit, read, write, etc.
- Virtual address mapping
  - Offset from the virtual address is copied to the physical address
    - Example: 10-bit offset -> 1024-byte pages
  - Virtual page number is all the remaining bits
    - Example: for 32 bits, 32 - 10 = 22 bits, i.e., 4 million entries
    - Physical page number is copied from the table into the physical address
  - Check page table bounds and permissions

Multi-level Translation
(figure: a top-level table of base/limit pairs selects among page tables (page 0 through page 4); each level can raise an access error)
- What must be saved/restored on a context switch?
  - Contents of the top-level segment registers (for this example)
  - Pointer to the top-level table (page table)
- Sharing?
  - Complete segments
  - Individual pages

Two-level Page Table
(figure: a PageTablePtr register points to a top-level directory whose entries point to second-level page tables)

Inverted Page Table
- With all previous examples ("forward page tables"):
  - Size of the page table is at least as large as the amount of virtual memory allocated to processes
  - Physical memory may be much less
    - Much of process space may be out on disk or not in use
- Answer: use a hash table
  - Called an "Inverted Page Table"
  - Size is independent of the virtual address space
  - Directly related to the amount of physical memory
  - Very attractive option for 64-bit address spaces
- Cons: complexity of managing hash changes
  - Often done in hardware
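The 10-bit-offset paging example above (1024-byte pages, 22-bit virtual page number in a 32-bit address) can be sketched as a simple table lookup. The page table here is a hypothetical dict for illustration; a real table would live in physical memory and also carry permission bits.

```python
PAGE_BITS = 10               # 10-bit offset -> 1024-byte pages
PAGE_SIZE = 1 << PAGE_BITS

# Hypothetical page table: virtual page number -> physical page number.
# A missing entry stands in for a cleared valid bit.
PAGE_TABLE = {0: 5, 1: 2, 7: 9}

def translate(vaddr):
    vpn = vaddr >> PAGE_BITS          # upper 22 bits of a 32-bit address
    offset = vaddr & (PAGE_SIZE - 1)  # offset is copied unchanged
    if vpn not in PAGE_TABLE:         # invalid entry -> page fault
        raise MemoryError(f"page fault on vpn {vpn}")
    ppn = PAGE_TABLE[vpn]
    return (ppn << PAGE_BITS) | offset
```

Note how the offset passes through untouched: only the page number is remapped, which is exactly why page-aligned translation is cheap.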
Schematic View of Swapping
(figure: the operating system swaps process P1 out of main memory to the backing store and swaps another process in)
- Extreme form of context switch: swapping
  - In order to make room for the next process, some or all of the previous process is moved to disk
    - Likely need to send out complete segments
  - This greatly increases the cost of context-switching
- Desirable alternative?
  - Some way to keep only the active portions of a process in memory at any one time
  - Need finer-granularity control over physical memory

Summary
- The multi-segment model enables each process to think it has access to all the memory it needs
- Pages -> finer granularity of data, so that not everything needs to be swapped during a context switch
- Swapping means moving all or some of a process's memory from physical memory to the backing store
- Simple paging -> the page table may get very large for a sparse address space
- Combine segmentation and paging -> multi-level translation
- Multi-level translation
  - May be slow if there are lots of levels
- Inverted page table
  - Store a hash table of physical addresses
  - Size is independent of the virtual address space

Caching Concept
- Cache: a repository for copies that can be accessed more quickly than the original
  - Make the frequent case fast and the infrequent case less dominant
- Caching underlies many of the techniques that are used today to make computers fast
  - Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc.
- Only good if:
  - The frequent case is frequent enough, and
  - The infrequent case is not too expensive
- Important measure: Average Access Time = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
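The average-access-time formula is easy to plug numbers into; the hit rate and latencies below are made-up illustrative values, not figures from the slides.

```python
def average_access_time(hit_rate, hit_time, miss_time):
    """Average Access Time = Hit Rate x Hit Time + Miss Rate x Miss Time."""
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time

# e.g., 90% hits at 1 ns and misses costing 100 ns:
# 0.9 * 1 + 0.1 * 100 = 10.9 ns on average
```

The example shows why the infrequent case must not be too expensive: even a 10% miss rate is enough for the 100 ns miss penalty to dominate the average.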
Why Cache?
- Processor-DRAM memory gap (latency)
  - Processor: 60% per year (doubles every 1.5 years)
  - Memory: 9% per year (doubles every 10 years)
  - Gap grows 50% per year!
(figure: log-scale plot of processor vs. DRAM performance over time, showing the widening gap)

Another Reason to Cache: Performance
(figure: multi-level translation, segment table plus page tables, with access-error checks at each level)
- Cannot afford to translate on every access
  - At least three DRAM accesses per actual DRAM access
  - Or perhaps I/O, if the page table is partially on disk!
- Even worse: what if we are using caching to make memory access faster than DRAM access?
- Solution? Cache translations!
  - Translation cache: TLB ("Translation Lookaside Buffer")

Why Does Caching Help?
- Temporal locality (locality in time)
  - Keep things that were accessed recently closer to the processor
- Spatial locality (locality in space)
  - Move contiguous blocks to the upper levels
(figure: probability of reference is concentrated in small regions of the address space; blocks move between an upper-level and a lower-level memory, to and from the processor)

Memory Hierarchy
- Take advantage of the principle of locality to:
  - Present as much memory as in the cheapest technology
  - Provide access at the speed offered by the fastest technology
(figure: hierarchy from processor control and datapath through caches, main memory, and disk; speeds from about 1 ns to seconds, sizes from 100s of bytes to terabytes)

Where Does Data Go into a Cache?
(figure: an 8-block cache under three placement policies, with main-memory block numbers shown below)
- Direct mapped: block 12 can go only into block 4 (12 mod 8)
- Set associative: block 12 can go anywhere in set 0 (12 mod 4)
- Fully associative: block 12 can go anywhere

Sources of Cache Misses
- Compulsory (cold start, process migration, first reference): first access to a block
  - Cold fact of life: not a whole lot you can do about it
  - Note: if you are going to run billions of instructions, compulsory misses are insignificant
- Capacity
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- Conflict (collision)
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Coherence (invalidation): another process (e.g., I/O) updates memory

How to Access Data in a Cache
(figure: the address is split into tag, index (set select), and data select fields)
- Index used to look up candidates in the cache
  - Index identifies the set
- Tag used to identify the actual copy
  - If no candidates match, then declare a cache miss
- Block is the minimum quantum of caching
  - Data select field used to select data within the block
  - Many caching applications don't have a data select field
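The "block 12" placement examples above reduce to modular arithmetic; a quick sketch (the function names are ours, and the cache sizes mirror the 8-block figure):

```python
def direct_mapped_slot(block, num_blocks):
    # Direct mapped: exactly one legal slot for each memory block
    return block % num_blocks

def set_index(block, num_sets):
    # Set associative: the block may go anywhere within this one set
    return block % num_sets

# 8-block cache:        block 12 -> slot 4  (12 mod 8)
# 4 sets (2-way):       block 12 -> set 0   (12 mod 4)
# Fully associative:    any slot, no index computation at all
```

A fully associative cache has no index function; placement is decided purely by tag comparison against every entry.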
Direct Mapped Cache
(figure: the address is split into tag, index, and byte select; the index picks one cache line, whose valid bit and cache tag are checked against the address tag)
- Direct mapped: 2^N byte cache
  - The uppermost (32 - N) bits are always the cache tag
  - The lowest M bits are the byte select (block size = 2^M)
- Example: 1 KB direct mapped cache with 32 B blocks
  - Index chooses the potential block
  - Tag checked to verify the block
  - Byte select chooses the byte within the block

Set Associative Cache
(figure: a two-way set associative cache; the cache index selects a set, the valid bits and cache tags of both ways are compared in parallel, and an OR of the matches signals a hit and selects the matching cache block)
- N-way set associative: N entries per cache index
  - N direct mapped caches operating in parallel
- Example: two-way set associative cache
  - Cache index selects a set from the cache
  - The two tags in the set are compared to the input in parallel
  - Data is selected based on the tag result

Fully Associative Cache
(figure: every entry's valid bit and cache tag (e.g., 0x01) are compared against the address in parallel)
- Fully associative: every block can hold any line
  - Address does not include a cache index
  - Compare the cache tags of all cache entries in parallel
- Example: block size = 32 B
  - We need N 27-bit comparators
  - Still have byte select to choose from within the block

What Happens on a Write?
- Write through: the information is written both to the block in the cache and to the block in lower-level memory
- Write back: the information is written only to the block in the cache
  - The modified cache block is written to main memory only when it is replaced
  - Question: is the block clean or dirty?
- Pros and cons of each?
  - WT:
    - PRO: read misses cannot result in writes
    - CON: the processor is held up on writes unless writes are buffered
  - WB:
    - PRO: repeated writes are not sent to DRAM; the processor is not held up on writes
    - CON: more complex; a read miss may require write-back of dirty data
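The 1 KB direct-mapped cache with 32 B blocks from the slide above splits a 32-bit address into a 5-bit byte select, a 5-bit index (1 KB / 32 B = 32 lines), and a 22-bit tag. A sketch of that decomposition:

```python
BLOCK_BITS = 5    # 32-byte blocks -> 5-bit byte select
INDEX_BITS = 5    # 1 KB / 32 B = 32 lines -> 5-bit index

def split_address(addr):
    """Return (tag, index, byte_select) for a 32-bit address."""
    byte_select = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, byte_select
```

For example, address 0x1234 decomposes into tag 4, index 17, byte select 20, and shifting the three fields back together reproduces the original address.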
What Is in a PTE?
- What is in a Page Table Entry (PTE)?
  - Pointer to a next-level page table, or to the actual page
  - Permission bits: valid, read-only, read-write, write-only
- Example: Intel x86 architecture PTE
  - Address: same format as the previous slide (10 / 10 / 12-bit offset)
  - Intermediate page tables are called "Directories"
  - Flag bits in the low bits of the entry:
    - P: Present (same as the "valid" bit in other architectures)
    - W: Writeable
    - U: User accessible
    - PWT: Page write transparent (external cache write-through)
    - PCD: Page cache disabled (page cannot be cached)
    - A: Accessed (page has been accessed recently)
    - D: Dirty (PTE only; page has been modified recently)
    - L: L = 1 -> 4 MB page (directory entry only); the bottom 22 bits of the virtual address serve as the offset

Caching Address Translation
(figure: the virtual address is first checked against the TLB; a hit yields the physical address directly, while a miss falls back to the page-table walk)
- Question: does page locality exist?
  - Instruction accesses spend a lot of time on the same page (since accesses are sequential)
  - Stack accesses have definite locality of reference
  - Data accesses have less page locality, but still some
- Can we have a TLB hierarchy?
  - Sure: multiple levels at different sizes/speeds

What Actually Happens on a TLB Miss?
- Hardware-traversed page tables:
  - On a TLB miss, hardware in the MMU looks at the current page table to fill the TLB (it may walk multiple levels)
    - If the PTE is valid, the hardware fills the TLB and the processor never knows
    - If the PTE is marked invalid, it causes a Page Fault, after which the kernel decides what to do
- Software-traversed page tables (as in MIPS):
  - On a TLB miss, the processor receives a TLB fault
  - The kernel traverses the page table to find the PTE
    - If the PTE is valid, it fills the TLB and returns from the fault
    - If the PTE is marked invalid, it internally calls the Page Fault handler
- Most chip sets provide hardware traversal
- Modern operating systems tend to have more TLB faults, since they use translation for many things
  - Examples:
    - shared segments
    - user-level portions of an operating system

What Happens on a Context Switch?
- Need to do something, since TLBs map virtual addresses to physical addresses
  - The address space just changed, so TLB entries are no longer valid!
- Options?
  - Invalidate the TLB: simple, but might be expensive
    - What if we are switching frequently between processes?
  - Include a ProcessID in the TLB
    - This is an architectural solution: it needs hardware
- What if the translation tables change?
  - For example, to move a page from memory to disk, or vice versa
  - Must invalidate the TLB entry!
    - Otherwise, we might think that the page is still in memory!
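The x86 PTE flag bits listed earlier occupy the low bits of the entry and can be decoded with simple masks. The bit positions below follow the standard x86 layout (P is bit 0, W bit 1, U bit 2, PWT bit 3, PCD bit 4, A bit 5, D bit 6); the `decode_pte` helper and the sample value are illustrative, not a real kernel API.

```python
# Low-order flag bits of an x86 page-table entry, bits 0..6
P, W, U, PWT, PCD, A, D = (1 << b for b in range(7))

def decode_pte(pte):
    """Decode a 32-bit x86-style PTE into its flags and frame number."""
    return {
        "present":  bool(pte & P),
        "writable": bool(pte & W),
        "user":     bool(pte & U),
        "accessed": bool(pte & A),
        "dirty":    bool(pte & D),
        "frame":    pte >> 12,   # upper bits: physical page number
    }
```

For instance, a PTE of 0x5023 decodes as present, writable, accessed, kernel-only, clean, pointing at physical frame 5.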
What TLB Organization Makes Sense?
- Needs to be really fast
  - Critical path of memory access
  - In the simplest view: before the cache
  - Thus, this adds to access time (reducing cache speed)
  - Seems to argue for direct mapped or low associativity
- However, it needs to have very few conflicts!
  - With a TLB, the miss time is extremely high!
  - This argues that the cost of conflicts (miss time) is much higher than the slightly increased cost of access (hit time)
- Thrashing: continuous conflicts between accesses
  - What if we use the low-order bits of the page as the index into the TLB?
    - The first pages of code, data, and stack may map to the same entry
    - Need 3-way associativity at least?
  - What if we use the high-order bits as the index?
    - TLB mostly unused for small programs

TLB Organization
- How big does the TLB actually have to be?
  - Usually small: 128-512 entries
  - Not very big; can support higher associativity
- The TLB is usually organized as a fully-associative cache
  - Lookup is by virtual address
  - Returns physical address + other info
- What happens when fully-associative is too slow?
  - Put a small (4-16 entry) direct-mapped cache in front
  - Called a "TLB Slice"

Reducing Translation Time Further
(figure: TLB entries, each with a valid bit and access rights, are searched in parallel)
- Machines with TLBs go one step further: they overlap the TLB lookup with the cache access
  - Works because the offset is available early

Demand Paging
- Modern programs require a lot of physical memory
  - Memory per system is growing faster than 25-30% per year
- But they don't use all their memory all of the time
  - 90/10 rule: programs spend 90% of their time in 10% of their code
  - Wasteful to require all of a user's code to be in memory
- Solution: use main memory as a cache for disk
Illusion of Infinite Memory
(figure: processor (control and datapath) with TLB and page table, physical memory of about 512 MB, and a 4 GB disk together present a large virtual memory)
- Disk is larger than physical memory
  - In-use virtual memory can be bigger than physical memory
  - The combined memory of running processes can be much larger than physical memory
    - More programs fit into memory, allowing more concurrency
- Principle: transparent level of indirection (the page table)
  - Supports flexible placement of physical data
    - Data could be on disk or somewhere across the network
  - Variable location of data is transparent to the user program
    - Performance issue, not correctness issue

Demand Paging is Caching
- Since demand paging is caching, we must ask:
  - What is the block size? 1 page
  - What is the organization of this cache (i.e., direct-mapped, set-associative, fully-associative)?
    - Fully associative: arbitrary virtual -> physical mapping
  - How do we find a page in the cache when we look for it?
    - First check the TLB, then do a page-table traversal
  - What is the page replacement policy (i.e., LRU, random)?
    - This requires more explanation... (kinda LRU)
  - What happens on a miss?
    - Go to the lower level to fill the miss (i.e., disk)
  - What happens on a write (write-through, write-back)?
    - Definitely write-back. Need a dirty bit!

Demand Paging Mechanisms
- The PTE helps us implement demand paging
  - Valid: the page is in memory, and the PTE points at the physical page
  - Not valid: the page is not in memory; use the info in the PTE to find it on disk when necessary
- Suppose the user references a page with an invalid PTE?
  - The Memory Management Unit (MMU) traps to the OS
    - The resulting trap is a "Page Fault"
  - The TLB entry for the new page will be loaded when the thread is continued
  - While pulling pages off disk for one process, the OS runs another process from the ready queue
    - The suspended process sits on the wait queue

Software-Loaded TLB
- MIPS / SIMICS / Nachos: the TLB is loaded by software
  - High TLB hit rate -> OK to trap to software to fill the TLB, even if it is slower
  - Simpler hardware and added flexibility: software can maintain the translation tables in whatever format is convenient
- How can a process run without access to the page table?
  - Fast path (TLB hit with valid = 1):
    - Translation to the physical page is done by hardware
  - Slow path (TLB hit with valid = 0, or TLB miss):
    - Hardware receives a "TLB Fault"
  - What does the OS do on a TLB Fault?
    - Traverse the page table to find the appropriate PTE
    - If valid = 1, load the page table entry into the TLB and continue the thread
    - If valid = 0, perform the Page Fault steps detailed previously
    - Continue the thread
- Everything is transparent to the user process:
  - It doesn't know about paging to/from disk
  - It doesn't even know about software TLB handling
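The software-loaded-TLB path just described (walk the page table; valid = 1 fills the TLB, valid = 0 becomes a page fault) can be sketched as follows. Here `tlb`, `page_table`, and `PageFault` are hypothetical stand-ins for the kernel's real structures, used only to show the control flow.

```python
class PageFault(Exception):
    """Raised when the referenced page is not in physical memory."""

tlb = {}   # vpn -> ppn: the small, software-managed translation cache

# vpn -> (valid, ppn); vpn 1 is on disk (valid = 0)
page_table = {0: (1, 8), 1: (0, None)}

def handle_tlb_fault(vpn):
    """Software TLB-miss handler: traverse the table, fill the TLB or fault."""
    if vpn not in page_table:
        raise PageFault(vpn)
    valid, ppn = page_table[vpn]
    if not valid:              # valid = 0: page is out on disk -> page fault
        raise PageFault(vpn)
    tlb[vpn] = ppn             # valid = 1: load the entry into the TLB
    return ppn

def lookup(vpn):
    """Fast path: a TLB hit is 'hardware'; a miss traps to the handler."""
    if vpn in tlb:
        return tlb[vpn]
    return handle_tlb_fault(vpn)
```

After the first `lookup(0)` fills the TLB, subsequent lookups of the same page take the fast path without touching the page table, which is why a high hit rate makes the software trap affordable.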
Transparent Exceptions
(figure: the user's faulting instruction triggers a TLB fault; after the fault is handled, the instruction is re-fetched and restarted transparently)
- How do we transparently restart faulting instructions?
  - Could we just skip the instruction?
    - No: we still need to perform the load or store after reconnecting the physical page
- Hardware must help out by saving:
  - The faulting instruction and partial state
    - Need to know which instruction caused the fault
    - Is a single PC sufficient to identify the faulting position?
  - Processor state: sufficient to restart the user thread
    - Save/restore registers, stack, etc.
- What if an instruction has side effects?

Consider Problems
- What if an instruction has side effects?
  - Options:
    - Unwind side effects (easy to restart)
    - Finish off side effects (messy!)
  - Example 1: mov (sp)+, 10
    - What if the page fault occurs on the write through the stack pointer?
    - Did sp get incremented before or after the page fault?
  - Example 2: strcpy (r1), (r2)
    - Source and destination overlap: can't unwind in principle!
    - IBM S/370 and VAX solution: execute the instruction twice, once read-only
- What about RISC processors?
  - For instance, delayed branches?
    - Example: bne somewhere; ld r1, (sp)
    - Precise exception state consists of two PCs: PC and nPC
  - Delayed exceptions:
    - Example: div r1, r2, r3; ld r1, (sp)
    - What if it takes many cycles to discover the divide by zero, but the load has already caused a page fault?

Precise Exceptions
- Precise: the state of the machine is preserved as if the program executed up to the offending instruction
  - All previous instructions completed
  - The offending instruction and all following instructions act as if they have not even started
  - The same system code will work on different implementations
  - Difficult in the presence of pipelining and out-of-order execution!
- MIPS takes this position
- Imprecise: system software has to figure out what is where and put it all back together
- Performance goals often lead designers to forsake precise interrupts
  - System software developers, user markets, etc., usually wish they had not done this!
- Modern techniques for out-of-order execution and branch prediction help implement precise interrupts
Next Time
- Page Replacement Strategies