Operating Systems Design Principles (COP 5611)
University of Central Florida
These 150 pages of class notes for COP 5611 at the University of Central Florida (taught by Staff, Fall 2015) were uploaded by Luisa Beer on Thursday, October 22, 2015.
COP 4600, Objective 3

So far:
- Obj 1: read logon.dat and created our event list.
- Obj 2: function Boot() was called; boot.dat was read, the kernel loaded, and MEMMAP initialized.
- main(): scans the event list, gets the next event, and updates the simulation clock.
- InterruptHandler(): saves the state of the interrupted program and gives control to the CLOCK service routine (Obj 1, Simulator.c).

Objective 3: InterruptHandler -> LogonService
- Creates a PCB.
- Reads script.dat.
- Allocates memory and loads programs from the user's script.

Level I - LogonService: allocates and initializes the PCB for LOGON events.
- The PCB is stored in termtable, an array of pointers to PCBs.
- Initializes the PCB.
- Calls GetScript(pcb) (Level II) to initialize the process script and pgmid.
- Calls Nextpgm(pcb) (Level II) to allocate and load the next program from the script.

Level II - GetScript: reads a script from script.dat.
- Opens the script.dat file.
- Reads the script until LOGOFF is encountered.
- Loads the script into pcb->script. This gives us the list of programs a user will execute.

Level II - Nextpgm: makes a transition to the next program in the script, if possible.
- if pcb->pgmid > 0 && pcb->firstrb == NULL, call Deallocpgm (Level III).
- if pcb->pgmid is NULL, return.
- if the script entry at pcb->pgmid is LOGOFF, print to the out file and return.
- GetMemory (Level III): a new segment table is built.
- Loader (Level III): loads the program into MEM.
- Assigns the current PCB Ready status.

Level III - GetMemory: allocates the segment table and sets up information for each segment (e.g. from editor.dat).
- Opens editor.dat and reads the number of segments.
- Allocates a segment table.
- Initializes each segment (access bits, length, etc.) as we did in Boot.
- Allocates memory in MEM for each segment's code by calling base = Allocseg(...) (Level IV).
- if (base < 0), call compactmem() (Level IV, Obj 6).

Level III - Loader: reads editor.dat for instructions and loads each instruction into MEM.
- Calls GetInstr (Obj 2) to get opcodes and operands.
- Calls Displaypgm to echo to the out file.

Level III - Deallocpgm: frees all allocated segments for the current program, then frees the segment table. Calls Deallocseg for each segment in pcb->segtable, then free(pcb->segtable).
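The Allocseg/Deallocseg/compactmem machinery above is free-list bookkeeping. Below is a minimal, hypothetical C sketch of that bookkeeping; the names `freeblk`, `allocseg`, `deallocseg`, and `mergeseg` are illustrative stand-ins, not the course's actual code (which allocates out of MEM and may require an exact-length fit rather than first-fit):

```c
#include <stdlib.h>

/* Illustrative free-list for a segment allocator: a list of
 * (base, length) blocks kept sorted by base address. */
typedef struct freeblk {
    int base, len;
    struct freeblk *next;
} freeblk;

/* First-fit allocation: return the base of a region of at least
 * `len` units, or -1 if no block fits (the caller would compact). */
int allocseg(freeblk **list, int len) {
    for (freeblk **pp = list; *pp; pp = &(*pp)->next) {
        freeblk *b = *pp;
        if (b->len >= len) {
            int base = b->base;
            b->base += len;
            b->len  -= len;
            if (b->len == 0) { *pp = b->next; free(b); }
            return base;
        }
    }
    return -1;   /* no fit: compactmem() would run here */
}

/* One pass over the sorted list, merging adjacent free blocks
 * (the job Mergeseg does). */
void mergeseg(freeblk *list) {
    for (freeblk *p = list; p && p->next; ) {
        if (p->base + p->len == p->next->base) {
            freeblk *n = p->next;
            p->len += n->len;
            p->next = n->next;
            free(n);
        } else
            p = p->next;
    }
}

/* Return a segment to the sorted free list, then merge neighbors
 * (Deallocseg followed by Mergeseg). */
void deallocseg(freeblk **list, int base, int len) {
    freeblk **pp = list;
    while (*pp && (*pp)->base < base)
        pp = &(*pp)->next;
    freeblk *b = malloc(sizeof *b);
    b->base = base;
    b->len  = len;
    b->next = *pp;
    *pp = b;
    mergeseg(*list);
}
```

Freeing two adjacent segments leaves a single coalesced block, which is exactly why Mergeseg exists: without it, repeated allocate/free cycles would fragment MEM until only compaction could satisfy a request.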
Level IV - Deallocseg: returns a segment to the free list. In case there are two adjacent segments in the free list, calls MergeSeg.

Level IV - Allocseg: searches the free-memory list for a free segment in MEM with a length equal to that requested.
- Returns the index in MEM if enough memory is found.
- Updates the other memory parameters, e.g. TotalFree.

Level V - Mergeseg: scans the memory list and, if it finds two adjacent memory blocks, merges them into one.


The Design and Implementation of a Log-Structured File System
Presented by Christian Diercks and Rajarshi Chakraborty

Outline
- Introduction and history
- Existing file system problems
- Log-structured file systems
- File location and reading
- Free space management
- Segment cleaning
- Simulation results for Sprite LFS
- Crash recovery
- Experience with Sprite LFS
- Related work
- Conclusion

Current technology trends
- The CPU-disk gap increases: no major improvements in disk transfer rate or access time.
- Exponential increase in the size of main memory.
[Figure: technology trends for CPU, memory, and disk.]
- Applications tend to become I/O-bound.
- Solution: decouple disk-bound applications from I/O by caching file data in main memory and decreasing synchronous operations. Memory is cheap, and a larger cache helps read-type operations - but what happens to write performance?

Write operations
- Data needs to be flushed out to disk for safety reasons, so disk performance becomes dominated by write operations.
- Executing writes as soon as they occur reduces traffic, but then less than 5% of the potential bandwidth is used for new data; the rest of the time is spent seeking.

Existing file system problems
Two general problems:
1. Data is spread around the disk in a way that causes too many small accesses.
2. Writes are synchronous: the application has to wait for a write to complete. Metadata is written synchronously, and small-file workloads make synchronous metadata writes dominant.

Key points of LFS
- Use the file cache to buffer a sequence of file-system changes.
- Write the changes to disk sequentially, in a single disk write operation.
Log-structured file systems
- Small-file performance can be improved: just write everything to the disk sequentially, in a single disk write operation.
- A log-structured file system converts many small synchronous random writes into large asynchronous sequential transfers; nearly 100% of the raw disk bandwidth can be utilized.
- Two major issues need to be addressed:
  1. How to manage the free space on disk so that large extents of free space are always available for writing new data.
  2. How to retrieve information from the log.

On-disk data structures
  Data structure          Location
  Inode                   Log
  Inode map               Log
  Indirect block          Log
  Segment summary         Log
  Segment usage table     Log
  Superblock              Fixed
  Checkpoint region       Fixed
  Directory change log    Log

Logical structure of a file
- The indexed structure is the same as the FFS inode: metadata plus direct and indirect block pointers; a directory entry maps a name to an inode number.
- The location of inodes is fixed in FFS, but not in LFS.

How to locate and read a file
- Designer goal: match the read performance of Unix FFS by using index structures in the log to allow random access.
- Inode maps are used to find the disk location of an inode. They are kept in the cache to avoid many disk accesses, and are divided into blocks that are written to segments on disk.
- The checkpoint region holds the locations of the inode-map blocks and the segment usage table.

Read and write operations
- Read a block: inode-map block pointer -> inode-map block -> inode -> data block location.
- Write a block: write the data block, inode, inode-map block, and segment usage table into the current segment in memory; update the inode-map table pointer and the segment summary block.

Layout of disk: LFS vs. FFS
[Figure: creating two files, file1 in dir1 and file2 in dir2. In FFS the inodes live at fixed locations; in LFS the inodes, directory blocks, data blocks, and inode map are appended together to the log, and the location of inodes is not fixed.]
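The read path above (inode map -> inode -> data) can be sketched in C. Everything here is a simplified stand-in, not Sprite LFS's real on-disk format: the "disk" is an array of blocks, and the inode map is an in-memory table, as it would be once cached:

```c
#include <stddef.h>

/* Illustrative sketch of the LFS read path: two map lookups, then
 * the data block itself.  Structures are hypothetical stand-ins. */
#define NBLOCKS 8
#define NFILES  16

typedef struct {
    int blkaddr[NBLOCKS];   /* direct pointers into the log ("disk") */
} inode_t;

typedef struct {
    inode_t *loc[NFILES];   /* inode map: inode number -> inode */
} imap_t;

/* Read one file block, or NULL if the file does not exist. */
const char *lfs_read_block(imap_t *imap, int inum, int blkno,
                           const char *disk[]) {
    if (inum < 0 || inum >= NFILES) return NULL;
    inode_t *ip = imap->loc[inum];   /* 1. inode map -> inode       */
    if (!ip) return NULL;
    int addr = ip->blkaddr[blkno];   /* 2. inode -> block address   */
    return disk[addr];               /* 3. fetch block from the log */
}
```

With the inode map cached in memory, a read costs the same number of disk accesses as in FFS, which is the point of the "match FFS read performance" design goal.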
Free space management
- What to do when the log wraps around on disk? By this time the earlier data is likely quite fragmented; data written earlier in the log may no longer be valid ("not live") and can be discarded. By consolidating such data, the LFS is able to continue logging. From this point on, two solutions:
- Threading: do not touch live data; copy new data into the areas of the log that are dead at this point. Easy to conceptualize (just thread all live areas together), but the potential problem is excessive fragmentation.
- Copying: bring live data together in the log. This implies that some garbage collection takes place; GC may become expensive due to overheads, and long-lived files may get copied many times.
[Figure: threaded log vs. copy-and-compact, showing old and new log ends.]

Segmented logs in LFS
- LFS uses both threading and copying, to keep writes long and large.
- The disk is divided into large fixed-size extents called segments. Any given segment is always written sequentially from start to end, and live data must be copied out of a segment before the segment is rewritten. The log is threaded segment by segment.
- Segment sizes are chosen large enough that the overhead of a seek is small; they yield a high fraction of disk bandwidth even with some seeks in between.
- Segment writes contain data, directory and inode changes, and the inode map (where to find the inodes for the files).

Segments
- The segment is the unit of writing and cleaning, typically 512-1024 KB; the disk consists of segments plus the checkpoint region.
- Each segment has a segment summary block containing each block's identity (inode number, offset), used to check the validity (liveness) of each block.

Segment summary block
- Each piece of information in the segment is identified (file number, offset, etc.).
- A summary block is written after every partial-segment write.
- It helps in deciding a block's liveness: there is no free list or bitmap in LFS.
[Figure: multi-log writing, with a summary block per partial-segment write.]
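The liveness test that the summary block enables can be sketched as follows. This is an assumption-laden simplification: the arrays below stand in for the inode map's view of where each file block currently lives, and a block recorded in a summary as (inum, offset) at address `addr` is live only if the map still points there:

```c
/* Hypothetical sketch of the cleaner's liveness check using a
 * segment summary entry.  current_addr[inum][offset] holds the
 * address of the live copy of each file block (a stand-in for
 * following the inode map to the file's inode). */
#define NFILES  8
#define NBLOCKS 8

int current_addr[NFILES][NBLOCKS];

typedef struct { int inum, offset; } summary_entry;

/* A block is live iff the file's index still maps that offset to
 * this disk address; otherwise it was overwritten or deleted. */
int block_is_live(summary_entry e, int addr) {
    if (e.inum < 0 || e.inum >= NFILES) return 0;
    if (e.offset < 0 || e.offset >= NBLOCKS) return 0;
    return current_addr[e.inum][e.offset] == addr;
}
```

This is why LFS needs neither a free list nor a bitmap: a block's liveness can always be decided by comparing the summary entry against the current index structures.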
Cleaning of segments
- The process of copying live data out of a segment is called cleaning.
- To free up segments, copy live data from several segments into a new one (i.e., pack live data together): read a number of segments into memory, identify the live data, write the live data back to a smaller number of clean segments, and mark the read segments as clean.
- Various issues: how to identify live objects in a segment; how to identify the file and offset of each live block; how to update that file's inode with the new location of its live blocks.

Segment cleaning questions
- When to clean? How many segments to clean? Which segments to clean (the most fragmented)? How should live data be sorted when written out - e.g. age sort, to enhance the locality of future reads?
- Sprite LFS starts cleaning segments when the number of clean segments falls below a threshold (a few tens), and cleans a few tens at a time until the number of clean segments surpasses another threshold.
[Figure: segment cleaning, with live blocks copied to the log end.]

Write cost
- Write cost measures how busy the disk is per byte of new data written, including segment cleaning overhead (rotational latency is ignored). It is used to compare cleaning policies. A write cost of 1.0 is perfect - no cleaning overhead; 10 means that only 1/10 of disk time is spent writing new data.
- Write cost is estimated as:

    write cost = (total bytes read and written) / (new data written)
               = (read segs + write old + write new) / (new data written)

- If the average utilization of live data in the cleaned segments is u: to clean N segments, the cleaner reads N segments, writes back N*u of old (live) data, and is left with space N*(1-u) for new data. So:

    write cost = (N + N*u + N*(1-u)) / (N*(1-u)) = 2 / (1 - u)

- If u = 0, the segment need not be read at all, and the write cost is 1.

Write cost as a function of u
- Need the utilization of cleaned segments below about 0.8 (or 0.5 to beat an improved FFS). This is NOT overall disk utilization; it is the fraction of live blocks in the segments that get cleaned.
- Performance can be improved by reducing disk utilization: the less of the disk in use, the lower the write cost; the more in use, the higher.
- Best case: a bimodal distribution - the cleaner works with mostly empty segments while most of the disk sits in mostly full ones.
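The write-cost estimate above reduces to a one-line function. This mirrors the derivation in the slides (2/(1-u), with the u = 0 special case); only the function name is invented:

```c
/* Write cost of an LFS cleaner as a function of u, the fraction of
 * live data in cleaned segments: the cleaner reads a whole segment,
 * rewrites u of it, and gains 1-u of space for new data, giving
 *   (1 + u + (1-u)) / (1-u) = 2 / (1-u).
 * At u == 0 the segment need not be read at all, so the cost is 1. */
double write_cost(double u) {
    if (u == 0.0)
        return 1.0;
    return 2.0 / (1.0 - u);
}
```

At u = 0.5 the cost is 4, and at u = 0.8 it is already 10 - which is why the cleaned-segment utilization must stay well below 0.8 for LFS to pay off.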
[Figure: write cost vs. fraction of live blocks in cleaned segments, comparing a log-structured FS with FFS today and an improved FFS.]

Simulation results
- Analyze different cleaning policies under controlled conditions: a file system with a fixed number of 4 KB files, where every step overwrites one file with new data.
- Access patterns for choosing which file to overwrite:
  - Uniform: every file has an equal probability of being selected.
  - Hot-and-cold: 10% of the files are "hot" and are selected 90% of the time; 90% are "cold" and are selected 10% of the time, with an equal chance of selection within each group. Hot models short-lived data, cold models long-lived data.

Simulation results (contd.)
- Overall disk capacity utilization is fixed; there is no read traffic, only writes.
- Clean segments are exhausted, then segments are cleaned down to a threshold; each experiment runs until the write cost stabilizes.

Initial simulation results
- Greedy policies for both Uniform and Hot-and-cold: the least-utilized segments are cleaned. Uniform does no sorting of the live data in the selected segments; Hot-and-cold sorts live data by age.
[Figure: simulated write cost vs. disk capacity utilization for the greedy policies.]
- Surprise: locality and better grouping give WORSE performance than the no-locality system - Hot-and-cold with sorting performs worse than Uniform.

Segment utilization with a greedy cleaner
- Only the least-utilized of all segments is cleaned, so the utilization of every segment eventually drops to the cleaning threshold.
- The drop in utilization of the cold segments is slow, so many more segments cluster just above the cleaning point in the locality-based simulation than in the non-locality one.
[Figure: distribution of segment utilizations with the greedy cleaner.]

Cost-benefit policy
- The cleaner now chooses the segment with the highest benefit-to-cost ratio to clean. The benefit is (free space reclaimed) x (amount of time it will likely stay free). With u the utilization of the segment, the free space is 1 - u, and the time is estimated by the age of the youngest block in the segment; the cost is 1 + u (one unit to read the whole segment, u to write back the live data):

    benefit / cost = (1 - u) * age / (1 + u)
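The cost-benefit selection described above amounts to an argmax over the segment usage table. The sketch below is illustrative (the `segment` struct stands in for the usage-table records; the formula itself is the one from the slides):

```c
/* Cost-benefit segment selection: pick the segment maximizing
 * (1 - u) * age / (1 + u), where u is the fraction of live data
 * and age is the age of the youngest block in the segment. */
typedef struct { double u; double age; } segment;

double cost_benefit(const segment *s) {
    return (1.0 - s->u) * s->age / (1.0 + s->u);
}

/* Return the index of the best segment to clean, or -1 if none. */
int pick_segment(const segment *segs, int n) {
    int best = -1;
    double best_score = -1.0;
    for (int i = 0; i < n; i++) {
        double score = cost_benefit(&segs[i]);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}
```

Note how the age term produces the bimodal behavior the paper wants: a cold segment at u = 0.75 with a large age can outscore a hot segment at u = 0.15, so cold segments are cleaned at much higher utilization than hot ones.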
Cost-benefit vs. greedy
- Cost-benefit produces the desired bimodal behavior: it cleans cold segments at about u = 0.75 and hot segments at about u = 0.15. Since 90% of the writes go to hot files, mostly hot segments get cleaned.
[Figure: distribution of segment utilizations under the cost-benefit policy vs. greedy.]

Why choose the cost-benefit policy?
- About 50% less write cost than the greedy policy.
- It even outperforms the best possible Unix FFS at high disk capacity utilization, and it gets better with increasing locality.
[Figure: write cost vs. disk capacity utilization for cost-benefit, greedy, and FFS.]

Segment usage table
- One record per segment: the number of live bytes in the segment and the age of the youngest block in the segment. These values support cost-benefit cleaning; they are set when the segment is written, and a segment is reused only when its number of live bytes reaches 0.
- Blocks of this table are kept in the checkpoint regions, which are used in crash recovery.

Crash recovery
- A traditional Unix FS must scan ALL metadata to restore consistency after a reboot, which takes more and more time as storage sizes increase.
- In LFS the last operations are at the end of the log (an idea also used in other file systems and in databases), giving quicker crash recovery via two mechanisms: (1) checkpoints and (2) roll-forward.

Checkpoints
- A checkpoint is a position in the log at which the file system is consistent and complete.
- Two-phase process to create a checkpoint:
  1. Write all modified information to the log.
  2. Write the checkpoint region (at a fixed position on disk): the addresses of all blocks in the inode map and segment usage table, the current time, and a pointer to the last segment written.
- There are two checkpoint regions.

Checkpoints (contd.)
- On reboot: read the checkpoint regions and initialize the memory data structures from the most recent one.
- Two regions exist to handle a crash during a checkpoint operation: the checkpoint time is in the last block of the region, the system uses the region with the most recent time, and the time for a failed (partial) checkpoint is never recorded.
- A checkpoint is created: (1) at periodic intervals, (2) when the file system is unmounted, and (3) at system shutdown.

Controlling recovery time
- A longer checkpoint interval means less checkpoint overhead but more recovery time; a shorter interval means less recovery time but costlier normal operation.
- Alternative: checkpoint after a certain amount of new data has been written.
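The two-region checkpoint scheme can be sketched as below. The struct layout is hypothetical; what is faithful to the description above is the protocol: the timestamp is written last (the commit point), writes alternate between the two regions, and recovery picks the region with the newer time, so a crash mid-checkpoint simply leaves the other, older-but-complete region in charge:

```c
/* Hypothetical sketch of LFS's two alternating checkpoint regions. */
typedef struct {
    long time;       /* written as the LAST block of the region */
    int  last_seg;   /* pointer to the last segment written */
} checkpoint_region;

/* Overwrite the OLDER region, committing by writing the time last;
 * a crash before the timestamp leaves the newer region intact. */
void write_checkpoint(checkpoint_region cr[2], long now, int last_seg) {
    int victim = (cr[0].time <= cr[1].time) ? 0 : 1;
    cr[victim].last_seg = last_seg;
    cr[victim].time = now;   /* commit point */
}

/* After reboot, recover from the region with the most recent time. */
const checkpoint_region *recover(const checkpoint_region cr[2]) {
    return (cr[0].time >= cr[1].time) ? &cr[0] : &cr[1];
}
```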
Roll-forward
- Scan BEYOND the last checkpoint to recover as much data as possible, using the information in the segment summary blocks written after the checkpoint.
- If a new inode is found in a segment summary block, update the inode map read from the checkpoint; the new data blocks then become part of the file system.
- Data blocks without a new copy of their inode are an incomplete version on disk and are ignored by the file system.

Roll-forward (contd.)
- Adjust the utilizations in the segment usage table to incorporate live data written after the checkpoint (their utilization is 0 right after the checkpoint), and the utilizations of segments deleted or overwritten since.
- Restore consistency between directory entries and inodes.

Restoring consistency between directories and inodes
1. A special log record is written for each directory change - the directory operation log: <operation code, location of directory entry, contents (name, inode number), new reference count for the inode in the entry>.
2. If a log entry exists but the corresponding inode/directory block does not, roll-forward updates and appends the directories, inodes, etc. to the log and creates a new checkpoint.
3. A checkpoint represents consistency between the directory operation log and the inodes/directory blocks in the log.

Implementation of Sprite LFS
- Began in late 1989 and was operational by mid-1990; implemented for the Sprite network OS, installed in 5 partitions and used by about 30 users.
- Roll-forward was not implemented yet; instead a short checkpoint interval (30 s) is used, and data after the last checkpoint is discarded after reboot.

Sprite LFS vs. Unix FFS (implementation complexity)
  Sprite LFS                                Unix FFS
  Additional complexity for the             No segment cleaner implemented
  segment cleaner
  No bitmap or layout policies required     Bitmap and layout policies make it
                                            as complex as a segment cleaner
  Recovery code                             fsck code

Micro-benchmarks: small files
- Best-case performance (no cleaning): Sprite LFS vs. SunOS 4.0.3 (based on Unix FFS). Sprite LFS: segment size 1 MB, block size 4 KB; SunOS: block size 8 KB.
[Figure: small-file create/read/delete rates - Sprite LFS wins.]
- Sprite kept the disk only 17% busy while saturating the CPU; SunOS saturated the disk at 85% busy, yet only 1.2% of the potential disk bandwidth was used for new data.
Micro-benchmarks: large files
[Figure: large-file write/read bandwidth, sequential and random, Sprite LFS vs. SunOS.]
- Sequentially re-reading a randomly written file requires seeks in Sprite, hence its performance is lower than SunOS there.
- A traditional FS achieves logical locality (it assumes an access pattern); a log-structured FS achieves temporal locality (it groups recently created and modified data).

Cleaning overheads
[Table: write cost in the Sprite LFS production file systems, collected over a 4-month period - per-filesystem size, write traffic in MB/hour, segment utilization, and write cost for partitions including /user6 and /swap.]
- Better performance in practice than predicted through simulation: low write costs.
- Segment utilization of the /user6 partition: a large number of fully utilized and totally empty segments.
[Figure: fraction of segments vs. segment utilization for /user6.]

Crash recovery time
- The recovery code can time recovery for various crash scenarios: the less data written between checkpoints, the less recovery time. Recovery time also depends on the number of files written between checkpoints.

Related work
- The idea was previously implemented on write-once media, with no reclaiming of the log.
- Segment cleaning is similar to garbage collection in programming languages; with one block per file, the garbage-identifying algorithm is simpler.
- Databases use write-ahead logging for crash recovery; the scheme is similar to group commit in database systems.

Conclusion
- The motivation for designing a new, log-structured file system.
- The design and architecture of Sprite LFS, including issues like free space management and segment cleaning.
- A simulation-based study to choose the right design for implementation.
- An implementation-based study and comparison to show the superiority of Sprite LFS over a traditional FFS.

References
- Rosenblum, M. and Ousterhout, J. K. "The Design and Implementation of a Log-Structured File System." ACM Transactions on Computer Systems, Vol. 10, No. 1, February 1992, pp. 26-52.
- Rosenblum, M. "The Design and Implementation of a Log-Structured File System."

Dhanyabad (Thank you)
Initialization and Process Initiation in UNIX
Chapters 6 and 7 from Lions' Commentary on UNIX
Presented by Wade Spires, COP 5611, March 22, 2005

Outline
- Overview
- Initialization: initialize kernel segments; enable memory management
- Process initiation: create the system process; create the init process
- Summary
- References

Overview
- The path through the code: start (m40.s) -> main (main.c) -> newproc (slp.c) -> sched (main.c) -> sleep (slp.c) -> swtch (slp.c).

Memory management: start
- Initialize kernel segments, user segments, and I/O segments.
- Start the memory management unit (MMU).

PDP-11/40 memory
- 8 kernel-mode memory pages and 8 user-mode memory pages.
- Page length from 32 to 4096 words; Unix uses the 4096-word page length.
- 3 modes of memory access control.

Kernel address space mapping
[Figure: kernel pages map physical memory from 0 to _end, rounded up to a multiple of 64 bytes; a later page maps the per-process data area, and the top page maps the I/O segment.]

Virtual-to-physical address translation (16 bits to 18 bits)
- A 16-bit virtual address splits into a 3-bit Active Page Field (APF, bits 15-13) and a 13-bit displacement field (bits 12-0). The displacement splits into a 7-bit block number (bits 12-6) and a 6-bit displacement-in-block (bits 5-0).
- The APF selects an Active Page Register (APR); the processor status word selects the kernel or user set of APRs. Each APR is a pair: a Page Address Register (PAR), whose low 12 bits are the Page Address Field (PAF), and a Page Description Register (PDR).
- Physical address construction: physical block number = PAF + block number; the 18-bit physical address is the physical block number followed by the 6-bit displacement-in-block.

Worked example
- Virtual address 1010000010001010: APF = 101 (APR 5), block number = 0000010, displacement-in-block = 001010.
- With PAF = 000000001011 (from the selected PAR), the physical block number is 000000001011 + 0000010 = 000000001101, so the physical address is 000000001101 001010.
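The translation just described is simple enough to express as a C function. This is a sketch of the arithmetic only; in hardware the APF selects which PAR supplies the PAF (and the PDR checks access and length), which is elided here. Constants are octal, PDP-11 style:

```c
/* PDP-11/40 virtual-to-physical translation: 16-bit virtual address
 * plus the selected page's PAF gives an 18-bit physical address. */
unsigned translate(unsigned vaddr, unsigned paf) {
    unsigned apf = (vaddr >> 13) & 07;    /* active page field (APR #) */
    unsigned bn  = (vaddr >> 6) & 0177;   /* block number, 7 bits      */
    unsigned dib = vaddr & 077;           /* displacement in block     */
    unsigned pbn = paf + bn;              /* physical block number     */
    (void)apf;  /* in hardware, apf selects the PAR that holds `paf` */
    return (pbn << 6) | dib;              /* 18-bit physical address   */
}
```

Running the worked example through it: virtual 0120212 (= 1010000010001010) with PAF 013 yields physical 01512 (= 000000001101001010), matching the hand calculation above.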
Important memory locations (m40.s)
  USIZE = 16        Size of the user block (16 x 64 = 1024 bytes)
  PS    = 177776    Program status word
  SSR0  = 177572    Status register
  KISA0 = 172340    Kernel segment address register 0
  KISA6 = 172354    Kernel segment address register 6
  KISD0 = 172300    Kernel segment descriptor register 0
  UISA0 = 177640    User segment address register 0
  UISA1 = 177642    User segment address register 1
  UISD0 = 177600    User segment descriptor register 0
  UISD1 = 177602    User segment descriptor register 1
  IO    = 7600      I/O segment register

Initialize kernel segments (start)
- Starting at the first address and descriptor registers (KISA0 in r0, KISD0 in r1), loop over six segments: set each address register to the current pointer position, incrementing the position by 0200 (4K words in 64-byte units), and set each descriptor to 77406 (4K size, read/write).

Initialize user segment (start)
- Round _end (the end of the program code and data area) up to a multiple of 64 and right-shift by 6 to convert it to 64-byte units; store the result in KISA6.
- Set the descriptor to ((USIZE-1) << 8) | 6: a 1024-byte segment, read/write.

Initialize I/O segment (start)
- Address register set to the I/O segment (IO); descriptor set to 77406 (4K size, read/write). The 8th segment (KISA7) is mapped into the highest 4K-word segment of the physical address space.

Start MMU (start)
- Set the stack pointer to the highest word of the per-process data area: mov $_u+[USIZE*64.], sp.
- inc SSR0: memory management is enabled when the rightmost bit of the status register SSR0 is set.
- mov $30000, PS: set the previous mode of the program status word to user (30000 octal has 11 in the previous-mode bits).
- jsr pc, _main: call main. Much later, when main returns, start pushes 170000 (a user-mode PSW) and 0 onto the stack and executes rtt.
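The kernel-segment loop above can be rendered in C to make the register values concrete. This is a sketch, not the real m40.s code: the arrays stand in for the memory-mapped KISA/KISD registers, and constants are octal as in the PDP-11 source:

```c
/* C rendering of the six-iteration kernel-segment setup loop
 * sketched above from m40.s.  kisa/kisd stand in for the
 * memory-mapped segmentation registers. */
#define SEGS 6

unsigned kisa[8], kisd[8];

void init_kernel_segments(void) {
    unsigned addr = 0;
    /* Map kernel pages 0..5 onto physical memory, 4K words each. */
    for (int i = 0; i < SEGS; i++) {
        kisa[i] = addr;    /* address: current position, 64-byte units */
        kisd[i] = 077406;  /* descriptor: 4K words, read/write */
        addr += 0200;      /* advance by 4K words (0200 * 64 bytes) */
    }
}
```

After the loop, page i maps physical addresses starting at i * 0200 * 64 bytes, i.e. the low six kernel pages cover physical memory contiguously from address 0.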
Initial conditions in main
- Processor running at priority zero, in kernel mode, and with previous mode set to user.
- Kernel-mode segmentation registers set and the MMU enabled.
- All data areas used by the OS initialized.
- The stack pointer (sp, r6) points to the return address in start.

Summary of main
1. Zero out and free all of core memory.
2. Determine the amount of memory available.
3. Initialize the swap space.
4. Set up the system (kernel) process.
5. Determine the type of clock.
6. Initialize the buffer pools.
7. Call newproc to create a second process.
8. Call sched.

How to handle processes
- In main we need to: set up the system/kernel process, call newproc to create the second process, and call sched. First we need to define two structures: proc and user.

Process structure (proc.h)
  struct proc {
      p_stat;     /* current state */
      p_flag;     /* process traits */
      p_pri;      /* priority */
      p_uid;      /* user ID */
      p_ppid;     /* parent ID */
      p_addr;     /* address of the program's data segment */
      p_size;     /* size of the program image in blocks */
      p_textp;    /* pointer to text segment */
      ...
  } proc[NPROC];

User structure (user.h)
  struct user {
      u_rsav[2], u_qsav[2], u_ssav[2];   /* save r5/r6: environment and sp */
      u_procp;                            /* address of the proc structure */
      u_uisa[16], u_uisd[16];             /* user page address/descriptor registers */
      u_tsize, u_dsize, u_ssize;          /* sizes of text, data, and stack segments */
      ...
  } u;
- 35 fields in total; also concerned with saving floating-point registers, user identification, parameters for input/output operations, file access control, system call parameters, and accounting information.

proc versus user
  proc                              user
  One allocated per process         One allocated per process
  Never swapped out of memory       May be swapped when not running
  Must be available at any time     Only one mapped in at a time
  Points to its user structure      Points to its proc structure
main (continued)
- Set up the system process (main.c):
    proc[0].p_addr = *ka6;    /* address of the process's data in memory */
    proc[0].p_size = USIZE;   /* size of the process's segment */
    proc[0].p_stat = SRUN;
    proc[0].p_flag = SLOAD | SSYS;   /* runnable, in memory, never swapped out */
    u.u_procp = &proc[0];     /* save its proc position */
- Then:
    if (newproc()) {
        expand(USIZE + 1);
        estabur(...);
        copyout(icode, 0, sizeof icode);
        return;
    }
    sched();

newproc
- Initializes the second proc structure, proc[1]:
  - Locates an unused slot in the proc table.
  - Copies proc[0]'s fields into proc[1].
  - Saves the environment and stack pointers into u.u_rsav.
  - Allocates a data area in memory for proc[1] and copies proc[0]'s data area (including u.u_rsav) into it.
  - Sets the copy's u.u_procp to &proc[1].
- An exact copy of proc[0] is made, except that the value of u.u_procp in proc[1]'s data area is &proc[1].
- Returns 0 (in the parent).

Scheduling the first process: sched
- See if anyone wants to be swapped in; swap out processes until there is room.
- Disable interrupts, then look for a process that is runnable but not in memory.
- The search fails the first time: proc[0] and proc[1] are the only processes, and both are in memory.
- Call sleep to wait for a runnable swapped process, with priority set to the maximum (PSWP).

sleep
- Save the PSW; get the current process (proc[0]).
- The priority is negative, so set the status of the current process to sleeping and call swtch.

swtch
- A static variable p is set to process 0 initially; its value is preserved between calls.
- Call savu to save the stack pointer and the environment pointer for the current process in u.u_rsav.
- Call retu on proc[0] - a context switch from 0 to 0 the first time. retu resets the kernel address register for segment 6 to the value passed as an argument, and resets the stack and environment pointers to values appropriate to the revised current process, whose execution is about to be resumed.

swtch (part 2)
- Search for the highest-priority runnable process, starting at the end of the proc table. Process 1 is found, and retu is called to switch to process 1.
- Call sureg to copy the appropriate values for the current process into the user-mode segmentation registers (these were stored earlier in u.u_uisa and u.u_uisd); zeros are copied the first time.
- Return 1 to main.

Back in main (now as process 1)
- newproc returned nonzero, so main executes:
    expand(USIZE + 1);   /* allocate a new, larger area for process 1 and copy
                            the original data into it (there is no original
                            data for proc[1]) */
    estabur(...);        /* set prototype segmentation registers:
                            one data segment for proc[1] */
    copyout(icode, 0, sizeof icode);   /* copy the icode array into the
                                          start of user space */
    return;              /* back to start in m40.s, after the jump to main */

estabur
- Sets up the prototype segmentation registers, u.u_uisa and u.u_uisd.
- Segments created: text (read-only), data (read/write), and stack (read/write, expands down).
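The "search for the highest-priority runnable process" step in swtch is a scan of the proc table. Here is a hedged C sketch of that scan; the struct and constants are simplified stand-ins for the Sixth Edition originals, and in v6 a numerically lower p_pri means a higher priority:

```c
/* Sketch of swtch's scheduling scan: find the runnable, in-core
 * process with the best (numerically lowest) priority. */
#define NPROC 50
#define SRUN  3      /* runnable */
#define SLOAD 01     /* loaded in core */

struct proc {
    int p_stat, p_flag, p_pri;
} proc[NPROC];

/* Return the index of the process to run next, or -1 if none. */
int pick_next(void) {
    int best = -1, best_pri = 10000;
    for (int i = 0; i < NPROC; i++)
        if (proc[i].p_stat == SRUN && (proc[i].p_flag & SLOAD)
            && proc[i].p_pri < best_pri) {
            best_pri = proc[i].p_pri;
            best = i;
        }
    return best;
}
```

The SLOAD check is why sched and swtch cooperate: a runnable process that is not in core is swtch's signal that the swapping process has work to do.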
[Figure: virtual address space layout - text at the bottom, data above it, and the stack growing down from the top.]

start again
- Execute, in user mode, the instruction at user-mode address 0. After main returns, start does:
    mov $170000, -(sp)    / push a PSW with current mode = user
    clr -(sp)             / push PC = 0
    rtt                   / "return" to user-mode address 0

Execute icode: the init program (main.c)
  int icode[] = {
      0104413,    /* sys exec; init; initp */
      0000014,
      0000010,
      0000777,    /* br . - loop here on failure */
      0000014,    /* initp: init; 0 */
      0000000,
      0062457,    /* init: </etc/init\0> */
      0061564,
      0064457,
      0064556,
      0000164,
  };

Equivalent C program:
  char *init = "/etc/init";
  main() {
      execl(init, init, 0);
      while (1);
  }
Summary
1. User turns the computer on.
2. Segments are initialized.
3. Memory management is enabled.
4. Various components are initialized.
5. System process 0 is initialized.
6. System process 0 is copied to process 1.
7. The scheduler runs and disables interrupts.
8. proc[0] goes to sleep.
9. Switch to proc[1] (the copy of proc[0]).
10. init is loaded into proc[1].
11. init runs.
12. Initialization is complete.

References
- 6.828: Operating System Engineering, Fall 2004. MIT. <http://www.pdos.lcs.mit.edu/6828/2004/index.htm> (accessed 12 March 2005).
- John Lions. Lions' Commentary on UNIX 6th Edition, with Source Code. Peer-to-Peer Communications, 1996.
- PDP-11/40 Processor Handbook. Digital Equipment Corporation, 1972.


Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport, Massachusetts Computer Associates, Inc.
(Operating Systems Editor: R. Stockton Gaines)

Abstract: The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks
which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

Key Words and Phrases: distributed systems, computer networks, clock synchronization, multiprocess systems
CR Categories: 4.32, 5.29

Introduction

The concept of time is fundamental to our way of thinking. It is derived from the more basic concept of the order in which events occur. We say that something happened at 3:15 if it occurred after our clock read 3:15 and before it read 3:16. The concept of the temporal ordering of events pervades our thinking about systems. For example, in an airline reservation system we specify that a request for a reservation should be granted if it is made before the flight is filled. However, we will see that this concept must be carefully reexamined when considering events in a distributed system.

General permission to make fair use in teaching or research of all or part of this material is granted to individual readers and to nonprofit libraries acting for them, provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery. To otherwise reprint a figure, table, other substantial excerpt, or the entire work requires specific permission, as does republication, or systematic or multiple reproduction.

This work was supported by the Advanced Research Projects Agency of the Department of Defense and Rome Air Development Center. It was monitored by Rome Air Development Center under contract number F30602-76-C-0094. Author's address: Computer Science Laboratory, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025.
© 1978 ACM 0001-0782/78/0700-0558 $00.75

A distributed system consists of a collection of distinct processes which are
spatially separated, and which communicate with one another by exchanging messages. A network of interconnected computers, such as the ARPA net, is a distributed system. A single computer can also be viewed as a distributed system in which the central control unit, the memory units, and the input-output channels are separate processes. A system is distributed if the message transmission delay is not negligible compared to the time between events in a single process.

We will concern ourselves primarily with systems of spatially separated computers. However, many of our remarks will apply more generally. In particular, a multiprocessing system on a single computer involves problems similar to those of a distributed system because of the unpredictable order in which certain events can occur.

In a distributed system, it is sometimes impossible to say that one of two events occurred first. The relation "happened before" is therefore only a partial ordering of the events in the system. We have found that problems often arise because people are not fully aware of this fact and its implications.

In this paper, we discuss the partial ordering defined by the "happened before" relation, and give a distributed algorithm for extending it to a consistent total ordering of all the events. This algorithm can provide a useful mechanism for implementing a distributed system. We illustrate its use with a simple method for solving synchronization problems. Unexpected, anomalous behavior can occur if the ordering obtained by this algorithm differs from that perceived by the user. This can be avoided by introducing real, physical clocks. We describe a simple method for synchronizing these clocks, and derive an upper bound on how far out of synchrony they can drift.

The Partial Ordering

Most people would probably say that an event a happened before an event b if a happened at an earlier time than b. They might justify this definition in terms of physical theories of time. However, if a system is to meet a specification correctly, then that specification must be given in terms of events observable within the system. If the specification is in terms of physical time, then the system must contain real clocks. Even if it does contain real clocks, there is still the problem that such clocks are not perfectly accurate and do not keep precise physical time. We will therefore define the "happened before" relation without using physical clocks.

We begin by defining our system more precisely. We assume that the system is composed of a collection of processes. Each process consists of a sequence of events. Depending upon the application, the execution of a subprogram on a computer could be one event, or the execution of a single machine instruction could be one event. We are assuming that the events of a process form a sequence, where a occurs before b in this sequence if a happens before b. In other words, a single process is defined to be a set of events with an a priori total ordering. This seems to be what is generally meant by a process.(1) It would be trivial to extend our definition to allow a process to split into distinct subprocesses, but we will not bother to do so.

Communications of the ACM, July 1978, Volume 21, Number 7

[Fig. 1: a space-time diagram of processes P, Q, and R; dots denote events.]

We assume that sending or receiving a message is an event in a process. We can then define the "happened before" relation, denoted by →, as follows.

Definition. The relation → on the set of events of a system is the smallest relation satisfying the following three conditions: (1) If a and b are events in the same process, and a comes before b, then a → b. (2) If a is the sending of a message by one process and b is the receipt of the same message by another process, then a → b. (3) If a → b and b → c, then a → c. Two distinct events a and b are said to be concurrent if a ↛ b and b ↛ a.

We assume that a ↛ a for any event a. (Systems in which an event can happen before itself do not seem to be physically meaningful.) This implies that → is an irreflexive partial ordering on the set of all events in the system.
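For a finite execution trace, the definition can be made concrete: build the smallest relation satisfying conditions (1)-(3) by taking the transitive closure of the process-order edges and the message edges. A minimal sketch (the function and event names below are ours, not the paper's):

```python
from itertools import product

def happened_before(proc_events, messages):
    """Compute the happened-before relation for a finite execution.

    proc_events: dict mapping a process name to its list of events, in order
    messages: set of (send_event, receive_event) pairs
    Returns the set of ordered pairs (a, b) with a -> b.
    """
    edges = set(messages)                     # condition (2): message edges
    for evs in proc_events.values():          # condition (1): process order
        for i in range(len(evs) - 1):
            edges.add((evs[i], evs[i + 1]))
    hb = set(edges)                           # condition (3): transitive closure
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(hb), list(hb)):
            if b == c and (a, d) not in hb:
                hb.add((a, d))
                changed = True
    return hb

def concurrent(a, b, hb):
    """Two distinct events are concurrent if neither happened before the other."""
    return (a, b) not in hb and (b, a) not in hb
```

With processes P = p1, p2 and Q = q1, q2 and a single message from p1 to q2, the closure gives p1 → q2, while p2 and q1 are concurrent.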
It is helpful to view this definition in terms of a space-time diagram, such as Figure 1. The horizontal direction represents space, and the vertical direction represents time, later times being higher than earlier ones. The dots denote events, the vertical lines denote processes, and the wavy lines denote messages.(2) It is easy to see that a → b means that one can go from a to b in the diagram by moving forward in time along process and message lines. For example, we have p1 → r4 in Figure 1.

(1) The choice of what constitutes an event affects the ordering of events in a process. For example, the receipt of a message might denote the setting of an interrupt bit in a computer, or the execution of a subprogram to handle that interrupt. Since interrupts need not be handled in the order that they occur, this choice will affect the ordering of a process' message-receiving events.
(2) Observe that messages may be received out of order. We allow the sending of several messages to be a single event, but for convenience we will assume that the receipt of a single message does not coincide with the sending or receipt of any other message.

Another way of viewing the definition is to say that a → b means that it is possible for event a to causally affect event b. Two events are concurrent if neither can causally affect the other. For example, events p3 and q3 of Figure 1 are concurrent. Even though we have drawn the diagram to imply that q3 occurs at an earlier physical time than p3, process P cannot know what process Q did at q3 until it receives the message at p4. (Before event p4, P could at most know what Q was planning to do at q3.)

This definition will appear quite natural to the reader familiar with the invariant space-time formulation of special relativity, as described for example in [1] or the first chapter of [2]. In relativity, the ordering of events is defined in terms of messages that could be sent. However, we have taken
the more pragmatic approach of only considering messages that actually are sent. We should be able to determine if a system performed correctly by knowing only those events which did occur, without knowing which events could have occurred.

Logical Clocks

We now introduce clocks into the system. We begin with an abstract point of view in which a clock is just a way of assigning a number to an event, where the number is thought of as the time at which the event occurred. More precisely, we define a clock Ci for each process Pi to be a function which assigns a number Ci(a) to any event a in that process. The entire system of clocks is represented by the function C which assigns to any event b the number C(b), where C(b) = Cj(b) if b is an event in process Pj. For now, we make no assumption about the relation of the numbers Ci(a) to physical time, so we can think of the clocks Ci as logical rather than physical clocks. They may be implemented by counters with no actual timing mechanism.

[Fig. 2: the space-time diagram of Figure 1, with dashed tick lines drawn through the like-numbered ticks of the different processes.]

We now consider what it means for such a system of clocks to be correct. We cannot base our definition of correctness on physical time, since that would require introducing clocks which keep physical time. Our definition must be based on the order in which events occur. The strongest reasonable condition is that if an event a occurs before another event b, then a should happen at an earlier time than b. We state this condition more formally as follows.

Clock Condition. For any events a, b: if a → b then C(a) < C(b).

Note that we cannot expect the converse condition to hold as well, since that would imply that any two concurrent events must occur at the same time. In Figure 1, p2 and p3 are both concurrent with q3, so this would mean that they both must occur at the same time as q3, which would contradict the Clock Condition because p2 → p3.

It is easy to see from our definition of the relation → that the Clock Condition is satisfied if the following two conditions hold.

C1. If a and b are events in process Pi, and a comes before b, then Ci(a) < Ci(b).
C2. If a is the sending of a message by process Pi and b is the receipt of that message by process Pj, then Ci(a) < Cj(b).

Let us consider the clocks in terms of a space-time diagram. We imagine that a process' clock "ticks" through every number, with the ticks occurring between the process' events. For example, if a and b are consecutive events in process Pi with Ci(a) = 4 and Ci(b) = 7, then clock ticks 5, 6, and 7 occur between the two events. We draw a dashed "tick line" through all the like-numbered ticks of the different processes. The space-time diagram of Figure 1 might then yield the picture in Figure 2. Condition C1 means that there must be a tick line between any two events on a process line, and condition C2 means that every message line must cross a tick line. From the pictorial meaning of →, it is easy to see why these two conditions imply the Clock Condition.

We can consider the tick lines to be the time coordinate lines of some Cartesian coordinate system on space-time. We can redraw Figure 2 to straighten these coordinate lines, thus obtaining Figure 3. Figure 3 is a valid alternate way of representing the same system of events as Figure 2. Without introducing the concept of physical time into the system (which requires introducing physical clocks), there is no way to decide which of these pictures is a better representation. The reader may find it helpful to visualize a two-dimensional spatial network of processes, which yields a three-dimensional space-time diagram. Processes and messages are still represented by lines, but tick lines become two-dimensional surfaces.

Let us now assume that the processes are algorithms, and the events represent certain actions during their execution. We will show how to introduce clocks into the processes which satisfy the Clock Condition. Process Pi's clock is represented by a register Ci, so that Ci(a) is the value contained by Ci
during the event a. The value of Ci will change between events, so changing Ci does not itself constitute an event.

To guarantee that the system of clocks satisfies the Clock Condition, we will insure that it satisfies conditions C1 and C2. Condition C1 is simple; the processes need only obey the following implementation rule:

IR1. Each process Pi increments Ci between any two successive events.

To meet condition C2, we require that each message m contain a timestamp Tm, which equals the time at which the message was sent. Upon receiving a message timestamped Tm, a process must advance its clock to be later than Tm. More precisely, we have the following rule.

IR2. (a) If event a is the sending of a message m by process Pi, then the message m contains a timestamp Tm = Ci(a). (b) Upon receiving a message m, process Pj sets Cj greater than or equal to its present value and greater than Tm.

In IR2(b) we consider the event which represents the receipt of the message m to occur after the setting of Cj. (This is just a notational nuisance, and is irrelevant in any actual implementation.) Obviously, IR2 insures that C2 is satisfied. Hence, the simple implementation rules IR1 and IR2 imply that the Clock Condition is satisfied, so they guarantee a correct system of logical clocks.

Ordering the Events Totally

We can use a system of clocks satisfying the Clock Condition to place a total ordering on the set of all system events. We simply order the events by the times at which they occur. To break ties, we use any arbitrary total ordering < of the processes. More precisely, we define a relation ⇒ as follows: if a is an event in process Pi and b is an event in process Pj, then a ⇒ b if and only if either (i) Ci(a) < Cj(b) or (ii) Ci(a) = Cj(b) and Pi < Pj. It is easy to see that this defines a total ordering, and that the Clock Condition implies that if a → b then a ⇒ b. In other words, the relation ⇒ is a way of completing the "happened before" partial ordering to a total ordering.(3)

The ordering ⇒ depends upon the system of clocks Ci, and is not unique. Different choices of clocks which satisfy the Clock Condition yield different relations ⇒. Given any total ordering relation ⇒ which extends →, there is a system of clocks satisfying the Clock Condition which yields that relation. It is only the partial ordering → which is uniquely determined by the system of events.

Being able to totally order the events can be very useful in implementing a distributed system. In fact, the reason for implementing a correct system of logical clocks is to obtain such a total ordering. We will illustrate the use of this total ordering of events by solving the following version of the mutual exclusion problem. Consider a system composed of a fixed collection of processes which share a single resource. Only one process can use the resource at a time, so the processes must synchronize themselves to avoid conflict. We wish to find an algorithm for granting the resource to a process which satisfies the following three conditions: (I) A process which has been granted the resource must release it before it can be granted to another process. (II) Different requests for the resource must be granted in the order in which they are made. (III) If every process which is granted the resource eventually releases it, then every request is eventually granted.

We assume that the resource is initially granted to exactly one process. These are perfectly natural requirements. They precisely specify what it means for a solution to be correct.(4) Observe how the conditions involve the ordering of events. Condition II says nothing about which of two concurrently issued requests should be granted first.

It is important to realize that this is a nontrivial problem. Using a central scheduling process which grants requests in the order they are received will not work, unless additional assumptions are made. To see this, let P0 be the scheduling process. Suppose P1 sends a request to P0 and then sends a message to P2.
Upon receiving the latter message, P2 sends a request to P0. It is possible for P2's request to reach P0 before P1's request does. Condition II is then violated if P2's request is granted first.

To solve the problem, we implement a system of clocks with rules IR1 and IR2, and use them to define a total ordering ⇒ of all events. This provides a total ordering of all request and release operations. With this ordering, finding a solution becomes a straightforward exercise. It just involves making sure that each process learns about all other processes' operations.

(3) The ordering < establishes a priority among the processes. If a "fairer" method is desired, then < can be made a function of the clock value. For example, if Ci(a) = Cj(b) and j < i, then we can let a ⇒ b if j < Ci(a) mod N ≤ i, and b ⇒ a otherwise, where N is the total number of processes.
(4) The term "eventually" should be made precise, but that would require too long a diversion from our main topic.

To simplify the problem, we make some assumptions. They are not essential, but they are introduced to avoid distracting implementation details. We assume first of all that for any two processes Pi and Pj, the messages sent from Pi to Pj are received in the same order as they are sent. Moreover, we assume that every message is eventually received. (These assumptions can be avoided by introducing message numbers and message acknowledgment protocols.) We also assume that a process can send messages directly to every other process.

Each process maintains its own request queue, which is never seen by any other process. We assume that the request queues initially contain the single message T0:P0 requests resource, where P0 is the process initially granted the resource and T0 is less than the initial value of any clock.

The algorithm is then defined by the following five rules. For convenience, the actions defined by each rule are assumed to form a single event.

1. To request the resource, process Pi sends the message Tm:Pi requests resource to every other process, and puts that message on its request queue, where Tm is the timestamp of the message.
2. When process Pj receives the message Tm:Pi requests resource, it places it on its request queue and sends a (timestamped) acknowledgment message to Pi.(5)
3. To release the resource, process Pi removes any Tm:Pi requests resource message from its request queue and sends a (timestamped) Pi releases resource message to every other process.
4. When process Pj receives a Pi releases resource message, it removes any Tm:Pi requests resource message from its request queue.
5. Process Pi is granted the resource when the following two conditions are satisfied: (i) There is a Tm:Pi requests resource message in its request queue which is ordered before any other request in its queue by the relation ⇒. (To define the relation ⇒ for messages, we identify a message with the event of sending it.) (ii) Pi has received a message from every other process timestamped later than Tm.(6)

Note that conditions (i) and (ii) of rule 5 are tested locally by Pi.

(5) This acknowledgment message need not be sent if Pj has already sent a message to Pi timestamped later than Tm.
(6) If Pi < Pj, then Pi need only have received a message timestamped ≥ Tm from Pj.

It is easy to verify that the algorithm defined by these rules satisfies conditions I-III. First of all, observe that condition (ii) of rule 5, together with the assumption that messages are received in order, guarantees that Pi has learned about all requests which preceded its current request. Since rules 3 and 4 are the only ones which delete messages from the request queue, it is then easy to see that condition I holds. Condition II follows from the fact that the total ordering ⇒ extends the partial ordering →. Rule 2 guarantees that after Pi requests the resource, condition (ii) of rule 5 will eventually hold. Rules 3 and 4 imply that if each process which is granted the resource eventually releases it, then condition (i) of rule 5 will eventually hold, thus proving condition III.
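The five rules admit a compact sketch. In the toy run below (all names are ours), messages are delivered by direct call, so messages between any two processes arrive in the order sent, as assumed above; acknowledgments are always sent (the optimization of footnote 5 is omitted), and the queues start empty rather than with the initial T0:P0 entry, so the run begins with the resource free.

```python
class Process:
    """One participant in the mutual-exclusion algorithm (rules 1-5)."""

    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = peers        # all processes, keyed by pid
        self.clock = 0            # logical clock (IR1/IR2)
        self.queue = {}           # pid -> (timestamp, pid): outstanding request
        self.last_seen = {}       # pid -> latest timestamp received from pid

    def _tick(self, tm=0):
        self.clock = max(self.clock, tm) + 1
        return self.clock

    def request(self):            # rule 1: broadcast and enqueue own request
        tm = self._tick()
        self.queue[self.pid] = (tm, self.pid)
        for p in self.peers.values():
            if p is not self:
                p.on_request(tm, self.pid)

    def on_request(self, tm, frm):  # rule 2: enqueue, then acknowledge
        self._tick(tm)
        self.queue[frm] = (tm, frm)
        self.last_seen[frm] = tm
        self.peers[frm].on_ack(self._tick(), self.pid)

    def on_ack(self, tm, frm):
        self._tick(tm)
        self.last_seen[frm] = tm

    def release(self):            # rule 3: dequeue own request, broadcast release
        del self.queue[self.pid]
        tm = self._tick()
        for p in self.peers.values():
            if p is not self:
                p.on_release(tm, self.pid)

    def on_release(self, tm, frm):  # rule 4: drop the releaser's request
        self._tick(tm)
        self.last_seen[frm] = tm
        self.queue.pop(frm, None)

    def granted(self):            # rule 5, tested locally
        mine = self.queue.get(self.pid)
        if mine is None:
            return False
        if any(req < mine for req in self.queue.values()):
            return False          # (i): own request must be first under =>
        return all(self.last_seen.get(p, -1) > mine[0]
                   for p in self.peers if p != self.pid)  # (ii)
```

In a two-process run where P1 requests before P2, only P1 is granted the resource; after P1 releases, P2's request becomes first in every queue and P2 is granted.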
This is a distributed algorithm: each process independently follows these rules, and there is no central synchronizing process or central storage. This approach can be generalized to implement any desired synchronization for such a distributed multiprocess system.

The synchronization is specified in terms of a State Machine, consisting of a set C of possible commands, a set S of possible states, and a function e: C × S → S. The relation e(C, S) = S' means that executing the command C with the machine in state S causes the machine state to change to S'. In our example, the set C consists of all the commands Pi requests resource and Pi releases resource, and the state consists of a queue of waiting request commands, where the request at the head of the queue is the currently granted one. Executing a request command adds the request to the tail of the queue, and executing a release command removes a command from the queue.(7)

Each process independently simulates the execution of the State Machine, using the commands issued by all the processes. Synchronization is achieved because all processes order the commands according to their timestamps (using the relation ⇒), so each process uses the same sequence of commands. A process can execute a command timestamped T when it has learned of all commands issued by all other processes with timestamps less than or equal to T. The precise algorithm is straightforward, and we will not bother to describe it.

This method allows one to implement any desired form of multiprocess synchronization in a distributed system. However, the resulting algorithm requires the active participation of all the processes. A process must know all the commands issued by other processes, so that the failure of a single process will make it impossible for any other process to execute State Machine commands, thereby halting the system. The problem of failure is a difficult one, and it is beyond the scope of this paper to discuss it in any detail. We will just observe that the entire concept of failure is only meaningful in the context of physical time. Without physical time, there is no way to distinguish a failed process from one which is just pausing between events. A user can tell that a system has "crashed" only because he has been waiting too long for a response. A method which works despite the failure of individual processes or communication lines is described in [3].

(7) If each process does not strictly alternate request and release commands, then executing a release command could delete zero, one, or more than one request from the queue.

Anomalous Behavior

Our resource scheduling algorithm ordered the requests according to the total ordering ⇒. This permits the following type of "anomalous behavior." Consider a nationwide system of interconnected computers. Suppose a person issues a request A on a computer A, and then telephones a friend in another city to have him issue a request B on a different computer B. It is quite possible for request B to receive a lower timestamp and be ordered before request A. This can happen because the system has no way of knowing that A actually preceded B, since that precedence information is based on messages external to the system.

Let us examine the source of the problem more closely. Let S be the set of all system events. Let us introduce a set of events S̄ which contains the events in S together with all other relevant external events, such as the phone calls in our example. Let ↪ denote the "happened before" relation for S̄. In our example, we had A ↪ B, but A ↛ B. It is obvious that no algorithm based entirely upon events in S, and which does not relate those events in any way with the other events in S̄, can guarantee that request A is ordered before request B.

There are two possible ways to avoid such anomalous behavior. The first way is to explicitly introduce into the system the necessary information about the ordering ↪. In our example, the person issuing request A could receive the timestamp TA of that request from the system. When issuing request
B, his friend could specify that B be given a timestamp later than TA. This gives the user the responsibility for avoiding anomalous behavior.

The second approach is to construct a system of clocks which satisfies the following condition.

Strong Clock Condition. For any events a, b in S̄: if a ↪ b then C(a) < C(b).

This is stronger than the ordinary Clock Condition because ↪ is a stronger relation than →. It is not in general satisfied by our logical clocks.

Let us identify S̄ with some set of "real" events in physical space-time, and let ↪ be the partial ordering of events defined by special relativity. One of the mysteries of the universe is that it is possible to construct a system of physical clocks which, running quite independently of one another, will satisfy the Strong Clock Condition. We can therefore use physical clocks to eliminate anomalous behavior. We now turn our attention to such clocks.

Physical Clocks

Let us introduce a physical time coordinate into our space-time picture, and let Ci(t) denote the reading of the clock Ci at physical time t.(8) For mathematical convenience, we assume that the clocks run continuously rather than in discrete "ticks." (A discrete clock can be thought of as a continuous one in which there is an error of up to ½ "tick" in reading it.) More precisely, we assume that Ci(t) is a continuous, differentiable function of t except for isolated jump discontinuities where the clock is reset. Then dCi(t)/dt represents the rate at which the clock is running at time t.

(8) We will assume a Newtonian space-time. If the relative motion of the clocks or gravitational effects are not negligible, then Ci(t) must be deduced from the actual clock reading by transforming from proper time to the arbitrarily chosen time coordinate.

In order for the clock Ci to be a true physical clock, it must run at approximately the correct rate. That is, we must have dCi(t)/dt ≈ 1 for all t. More precisely, we will assume that the following condition is satisfied:

PC1. There exists a constant κ << 1 such that for all i: |dCi(t)/dt − 1| < κ.

For typical crystal controlled clocks, κ ≤ 10⁻⁶.

It is not enough for the clocks individually to run at approximately the correct rate. They must be synchronized so that Ci(t) ≈ Cj(t) for all i, j, and t. More precisely, there must be a sufficiently small constant ε so that the following condition holds:

PC2. For all i, j: |Ci(t) − Cj(t)| < ε.

If we consider vertical distance in Figure 2 to represent physical time, then PC2 states that the variation in height of a single tick line is less than ε.

Since two different clocks will never run at exactly the same rate, they will tend to drift further and further apart. We must therefore devise an algorithm to insure that PC2 always holds. First, however, let us examine how small κ and ε must be to prevent anomalous behavior. We must insure that the system S̄ of relevant physical events satisfies the Strong Clock Condition. We assume that our clocks satisfy the ordinary Clock Condition, so we need only require that the Strong Clock Condition holds when a and b are events in S̄ with a ↛ b. Hence, we need only consider events occurring in different processes.

Let μ be a number such that if event a occurs at physical time t, and event b in another process satisfies a ↪ b, then b occurs later than physical time t + μ. In other words, μ is less than the shortest transmission time for interprocess messages. We can always choose μ equal to the shortest distance between processes divided by the speed of light. However, depending upon how messages in S̄ are transmitted, μ could be significantly larger.

To avoid anomalous behavior, we must make sure that for any i, j, and t: Ci(t + μ) − Cj(t) > 0. Combining this with PC1 and PC2 allows us to relate the required smallness of κ and ε to the value of μ as follows. We assume that when a clock is reset, it is always set forward and never back. (Setting it back could cause C1 to be violated.) PC1 then implies that Ci(t + μ) − Ci(t) > (1 − κ)μ. Using PC2, it is then easy to deduce that Ci(t + μ) − Cj(t) > 0 if the following inequality holds: ε/(1 − κ) ≤ μ.
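To get a feel for the magnitudes (the numbers below are hypothetical, not from the paper): taking κ for a crystal-controlled clock and μ as the light-speed delay between processes 30 km apart, the inequality bounds the tolerable clock skew ε.

```python
# Hypothetical figures: a crystal clock's drift rate, processes 30 km apart.
kappa = 1e-6
mu = 30e3 / 3e8             # shortest transmission time: distance / speed of light
eps_max = mu * (1 - kappa)  # largest epsilon satisfying eps / (1 - kappa) <= mu
# eps_max is just under 1e-4 s: the clocks must agree to within ~100 microseconds
```

With slower channels (μ larger), the requirement on ε relaxes proportionally.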
This inequality, together with PC1 and PC2, implies that anomalous behavior is impossible.

We now describe our algorithm for insuring that PC2 holds. Let m be a message which is sent at physical time t and received at time t'. We define νm = t' − t to be the total delay of the message m. This delay will, of course, not be known to the process which receives m. However, we assume that the receiving process knows some minimum delay μm ≥ 0 such that μm ≤ νm. We call ξm = νm − μm the unpredictable delay of the message.

We now specialize rules IR1 and IR2 for our physical clocks as follows:

IR1'. For each i, if Pi does not receive a message at physical time t, then Ci is differentiable at t and dCi(t)/dt > 0.
IR2'. (a) If Pi sends a message m at physical time t, then m contains a timestamp Tm = Ci(t). (b) Upon receiving a message m at time t', process Pj sets Cj(t') equal to maximum(Cj(t' − 0), Tm + μm).(9)

(9) Cj(t' − 0) = lim δ→0 Cj(t' − δ).

Although the rules are formally specified in terms of the physical time parameter, a process only needs to know its own clock reading and the timestamps of messages it receives. For mathematical convenience, we are assuming that each event occurs at a precise instant of physical time, and different events in the same process occur at different times. These rules are then specializations of rules IR1 and IR2, so our system of clocks satisfies the Clock Condition. The fact that real events have a finite duration causes no difficulty in implementing the algorithm. The only real concern in the implementation is making sure that the discrete clock ticks are frequent enough so C1 is maintained.

We now show that this clock synchronizing algorithm can be used to satisfy condition PC2. We assume that the system of processes is described by a directed graph in which an arc from process Pi to process Pj represents a communication line over which messages are sent directly from Pi to Pj. We say that a message is sent over this arc every τ seconds if for any t, Pi sends at least one message to Pj between physical times t and t + τ. The diameter of the directed graph is the smallest number d such that for any pair of distinct processes Pj, Pk, there is a path from Pj to Pk having at most d arcs.

In addition to establishing PC2, the following theorem bounds the length of time it can take the clocks to become synchronized when the system is first started.

THEOREM. Assume a strongly connected graph of processes with diameter d which always obeys rules IR1' and IR2'. Assume that for any message m, μm ≤ μ for some constant μ, and that for all t ≥ t0: (a) PC1 holds. (b) There are constants τ and ξ such that every τ seconds a message with an unpredictable delay less than ξ is sent over every arc. Then PC2 is satisfied with ε ≈ d(2κτ + ξ) for all t ≥ t0 + τd, where the approximations assume μ + ξ << τ.

The proof of this theorem is surprisingly difficult, and is given in the Appendix. There has been a great deal of work done on the problem of synchronizing physical clocks. We refer the reader to [4] for an introduction to the subject.

Appendix. Proof of the Theorem.

From IR1' and PC1 we have, for any i,

Ci(t) ≥ Ci(t') + (1 − κ)(t − t') for all t ≥ t'.  (1)

Note that Ci(t) ≥ Ci(t') for all t ≥ t'.  (2)

Suppose process P1 at time t1 sends a message to process P2 which is received at time t2 with an unpredictable delay ≤ ξ, where t0 ≤ t1 ≤ t2. Then for all t ≥ t2 we have:

C2(t) ≥ C2(t2) + (1 − κ)(t − t2)  [by (1) and PC1]
≥ C1(t1) + μm + (1 − κ)(t − t2)  [by IR2'(b)]
= C1(t1) + (1 − κ)(t − t1) + μm − (t2 − t1) + κ(t2 − t1)
≥ C1(t1) + (1 − κ)(t − t1) − ξ.

Hence, with these assumptions, for all t ≥ t2 we have

C2(t) ≥ C1(t1) + (1 − κ)(t − t1) − ξ.  (3)

Now suppose that for i = 1, ..., n we have ti ≤ ti' <
diameter of the directed graph is the smallest number d such that for any pair of distinct processes Pj Pk there is a path from P to Pk having at most d arcs In addition to establishing PC2 the following theo rem bounds the length of time it can take the clocks to become synchronized when the system is rst started THEOREM Assume a strongly connected graph of processes with diameter d which always obeys rules IRl and IR2 Assume that for any message m am 5 u for some constant a and that for all t 2 to a PCI holds b There are constants 7 and g such that every 1 seconds a message with an unpredictable delay less than 5 is sent over every arc Then PC2 is satis ed with e z d2IC39T g for all t 2 to rd where the approximations assume 1 5 ltlt 739 The proof of this theorem is surprisingly dif cult and is given in the Appendix There has been a great deal of work done on the problem of synchronizing physical clocks We refer the reader to 4 for an intro Ct 0 gig Ct I61 Communications July 1978 of Volume 21 the ACM Number 7 t 1 t for all t 2 t Note that Cl 2 Ci t for all t 2 t 2 Suppose process P1 at time t1 sends a message to process P2 which is received at time t2 with an unpre dictable delay 5 5 where to s t 5 t2 Then for all t 2 t2 we have C520 2 Ct2 1 1ct t2 by 1 and PCI 2 C101 um 1 Kt t2 by IR2 b C101 1 KXt t1 12 10 m K02 1 Z C1t1 39 Kt 11 Hence with these assumptions for all t 2 t2 we have C320 2 C1t1 l 1ct 11 g 3 Now suppose that for i l n we have t s If lt 564 n1 to s 11 and that at time if process Pi sends a message to process PM which is received at time n1 with an unpredictable delay less than 5 Then repeated applica tion of the inequality 3 yields the following result for t Z In 5531 2 C1t1 1 Kt tx n 4 From PCl IRl and 2 we deduce that C101quot 2 C1t1 1 Kt139 t1 Combining this with 4 and using 2 we get C1t 2 C101 1 Kt t1 mg 5 for t 2 tn For any two processes P and P we can nd a sequence of processes P P0 P1 PM P n s d with communication arcs from each P to PM By hy 
pothesis b we can find times t t with ti ti 5 r and n1 If s v where u U 5 Hence an inequality of the form 5 holds with n s 1 whenever t 2 t1 d7 v For any i j and any t 11 with t1 2 to and t 2 t1 11 u we therefore have Cit 2 cz1 1 1ct t1 dg 6 Now let m be any message timestamped Tm and suppose it is sent at time t and received at time t We pretend that m has a clock Cm which runs at a constant rate such that Cmt tm and Cmt tm pm Then laquot 5 t t implies that dedt s 1 Rule IR2 b simply sets Cjt to maximum Ct 0 Cmt Hence clocks are reset only by setting them equal to other clocks For any time t 2 to pl 1 let Cx be the clock having the largest value at time tx Since all clocks run at a rate less than 1 x we have for all i and all t 2 tx ca 5 cxax 1 no xx 7 We now consider the following two cases i C is the clock Q of process Pq ii Cx is the clock Cm of a message sent at time t1 by process Pq In case i 7 simply becomes Ct s Cqtx 1 xt tx 8i In case ii since Cmt1 Cqt1 and dedt s 1 we have Cxtx s Cqt1 tx t1 Hence 7 yields 00 S Cqtl l xt t1 8ii Since tx 2 to pLl x we get Cqtx 14 1 K S Cqtx M by PCII s Cmtx U by choice of m S Cmtx tx t1ILmVm Mm S H tx t S Vm Tm by de nition of Cm cqm by IR239a1 Hence Cqtx pl 1c 5 Cqt1 so 1 t1 5 ul x and thus t1 2 to Communications July 1978 of Volume 21 the ACM Number 7 Letting t1 tt in case i we can combine 8i and 8ii to deduce that for any t t with t 2 t 2 to u l x there is a process P and a time t1 with t u l x S t S tx such that for all i Cit s Cqt1 1 xx t1 9 Choosing t and tx with t 2 tx d139 V we can combine 6 and 9 to conclude that there exists a t1 and a process R such that for all i Cqt1 1 Kt t1 dg s Ct s Cqt1 1 Kt t1 Letting t tx dr u we get 10 dTVSt t15d39rvul tc Combining this with 10 we get Cqt1 t t1 xd7 v d5 5 Ct s Cqt1 11 t tl Icdr V ul K Using the hypotheses that 1 ltlt 1 and u s 12 ltlt 7 we can rewrite l l as the following approximate inequality Cqt1 t 11 dKT 5 Ci S Cqt1 t 1 dIC T39 12 Since this holds for all i we get 
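The theorem's bound can be illustrated numerically. The toy run below (constants hypothetical and deliberately exaggerated) drives two directly connected processes (d = 1) whose clocks run at the drift extremes 1 ± κ under IR1' and IR2', exchanging timestamped messages every τ seconds with delay μ plus an unpredictable part ξ/2; after an initial settling period, the observed skew stays below d(2κτ + ξ).

```python
kappa, tau, mu, xi = 0.01, 1.0, 0.1, 0.05   # hypothetical, exaggerated constants
dt, steps = 0.001, 50000                    # simulate 50 seconds
rates = [1 + kappa, 1 - kappa]              # drift extremes allowed by PC1
clocks = [0.0, 0.0]
pending = []                                # (arrival_time, receiver, Tm)
send_every = round(tau / dt)                # one message per arc every tau seconds
max_skew = 0.0
for step in range(1, steps + 1):
    t = step * dt
    for i in (0, 1):
        clocks[i] += rates[i] * dt          # IR1': clocks advance continuously
    if step % send_every == 0:              # each process sends; delay mu + xi/2
        for i in (0, 1):
            pending.append((t + mu + xi / 2, 1 - i, clocks[i]))
    for m in [m for m in pending if m[0] <= t]:
        _, j, tm = m
        clocks[j] = max(clocks[j], tm + mu) # IR2'(b): only ever set forward
        pending.remove(m)
    if t > 10.0:                            # measure after initial settling
        max_skew = max(max_skew, abs(clocks[0] - clocks[1]))
bound = 2 * kappa * tau + xi                # theorem with d = 1: d(2*kappa*tau + xi)
```

In this run the slow clock is repeatedly pulled forward by the fast clock's timestamps, so the skew saturates well inside the bound instead of growing without limit.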
Since (12) holds for all i, we get

  |C_i(t) − C_j(t)| ≲ d(2κτ + ξ),

and this holds for all t ≥ t₀ + d(τ + ν). ∎

Note that relation (11) of the proof yields an exact upper bound for |C_i(t) − C_j(t)| in case the assumption μ + ξ ≪ τ is invalid. An examination of the proof suggests a simple method for rapidly initializing the clocks, or resynchronizing them if they should go out of synchrony for any reason: each process sends a message which is relayed to every other process. The procedure can be initiated by any process, and requires less than 2d(μ + ξ) seconds to effect the synchronization, assuming each of the messages has an unpredictable delay less than ξ.

Acknowledgment. The use of timestamps to order operations, and the concept of anomalous behavior, are due to Paul Johnson and Robert Thomas.

Received March 1976; revised October 1977

References
1. Schwartz, J.T. Relativity in Illustrations. New York U. Press, New York, 1962.
2. Taylor, E.F., and Wheeler, J.A. Space-Time Physics. W.H. Freeman, San Francisco, 1966.
3. Lamport, L. The implementation of reliable distributed multiprocess systems. To appear in Computer Networks.
4. Ellingson, C., and Kulpinski, R.J. Dissemination of system-time. IEEE Trans. Comm. Com-23, 5 (May 1973), 605-624.

Programming Languages - J.J. Horning, Editor

Shallow Binding in LISP 1.5
Henry G. Baker, Jr., Massachusetts Institute of Technology

Shallow binding is a scheme which allows the value of a variable to be accessed in a bounded amount of computation. An elegant model for shallow binding in LISP 1.5 is presented in which context switching is an environment tree transformation called rerooting. Rerooting is completely general and reversible, and is optional in the sense that a LISP 1.5 interpreter will operate correctly whether or not rerooting is invoked on every context change. Since rerooting leaves assoc[v, a] invariant for all variables v and all environments a, the programmer can have access to a rerooting primitive, shallow, which gives him dynamic control over whether accesses are shallow or deep, and which affects only the speed of execution of a program,
not its semantics. In addition, multiple processes can be active in the same environment structure, so long as rerooting is an indivisible operation. Finally, the concept of rerooting is shown to combine the concept of shallow binding in LISP with Dijkstra's display for Algol, and hence is a general model for shallow binding.

Key Words and Phrases: LISP 1.5, environment trees, FUNARGs, shallow binding, deep binding, multiprogramming, Algol display
CR Categories: 4.13, 4.22, 4.32

This research was supported by the Advanced Research Projects Agency of the Department of Defense, and was monitored by the Office of Naval Research under contract number N00014-75-C-0522. Author's present address: Computer Science Department, University of Rochester, Rochester, NY 14627. © 1978 ACM.

Chapter 14 and 15
Program Swapping & Introduction to Basic I/O
Rupesh Jain, Mikel Rodriguez

Outline
1. Introduction
2. Understanding the Sched procedure
3. Walkthrough of the program swapping code
4. Four procedures manipulating the text array
5. Basic I/O under Unix
6. The file buf.h
7. The file conf.c
8. The Swap procedure
9. Race Conditions
10. Summary
11. References

What is Program Swapping?
Processes can be selectively swapped out and swapped in, to share the limited resource of main physical memory among several processes. This is also called roll-out and roll-in. UNIX, like other timesharing and multiprogramming
systems, uses program swapping.

[Figure: processes moving between main memory and the disk swap area. Swap-out is handled by a direct call (2024) on the procedure xswap (4368); swap-in is handled by a direct call (2034) on the procedure swap (5196).]

History: originally, sched called swap directly to swap a process out, rather than xswap.

Four procedures manipulate the array of structures called text: xswap, xalloc, xfree, xccdec.

Text segments
Text segments are segments which contain only pure code and data that remain unaltered throughout program execution. Information about text segments must be stored in a central location, i.e. the text array.

Sched (1940)
Sched spends most of its time waiting in one of the following situations:

A (runout): none of the processes which are swapped out is ready to run. The situation can be changed by a call to wakeup, or to xswap called by newproc.

B (runin): there is at least one process swapped out and ready to run, but it hasn't been out more than 3 seconds, and none of the processes present in main memory is inactive or has been resident more than 2 seconds. The situation can be changed by a call to sleep.

The Sched procedure (1940)

    1940 sched()
    1941 {
    1942     struct proc *p1;
    1943     register struct proc *rp;
    1944     register n;
             ...
    1951     goto loop;

Find the process ready to run; select the one out the longest:

    1958     spl6();
    1959     n = -1;
    1960     for (rp = &proc[0]; rp < &proc[NPROC]; rp++)
    1961         if (rp->p_stat==SRUN && (rp->p_flag&SLOAD)==0 &&
    1962             rp->p_time > n) {
    1963             p1 = rp;
    1964             n = rp->p_time;
    1965         }

A search is made for the process which is ready to run and has been swapped out for the longest time. If there is no such process, situation A holds: runout++; sleep(&runout, PSWP); goto loop.

Next, see whether there is core for that process: a main memory area to hold the data segment and, if the text segment needs to be present as well, the text segment (or an increase in its size). rp = p1; if the process has a text segment (rp->p_textp) that is not already in core (x_ccount == 0), its space must be found too. malloc(coremap, size) is tried, and if there is adequate space for text and data, then goto found2 (2031).
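The search at lines 1958-1965 can be modeled outside the kernel. The sketch below mimics the v6 names but is a standalone illustration, not the kernel source.

```c
/* Illustrative model of sched's swap-in candidate selection: among
 * processes that are ready to run (SRUN) but not loaded (!SLOAD), pick
 * the one swapped out the longest (largest p_time).  Constants and
 * fields imitate v6 but this struct is our own reduction. */
#define SRUN  3
#define SLOAD 01

struct proc {
    int p_stat;   /* process state, e.g. SRUN */
    int p_flag;   /* SLOAD set if the image is in core */
    int p_time;   /* seconds resident in core or swapped out */
};

struct proc *pick_swap_in(struct proc proc[], int nproc)
{
    struct proc *best = 0;
    int longest = -1;
    for (int i = 0; i < nproc; i++) {
        struct proc *rp = &proc[i];
        if (rp->p_stat == SRUN && (rp->p_flag & SLOAD) == 0 &&
            rp->p_time > longest) {
            best = rp;
            longest = rp->p_time;
        }
    }
    return best;   /* null means "runout": nothing to swap in */
}
```

Returning a null pointer corresponds to situation A, where sched sleeps on runout.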
Otherwise a process must be swapped out. First, a search is made for an easy victim:

    for (rp = &proc[0]; rp < &proc[NPROC]; rp++)
        if ((rp->p_flag & (SSYS|SLOCK|SLOAD)) == SLOAD &&
            (rp->p_stat == SWAIT || rp->p_stat == SSTOP))
            goto found1;

If a process is waiting for an event of low precedence (state SWAIT or SSTOP, but not SSLEEP), is loaded, and is not locked, then go to the swap-out code (found1).

If the image to be swapped in has been out less than 3 seconds, situation B may hold. Otherwise, search for the process which is loaded but not locked, whose state is SRUN or SSLEEP (waiting on a high-precedence event), and which has been in memory for the longest time. If even that process has been in core less than 2 seconds, situation B holds: runin++; sleep(&runin, PSWP); goto loop. Otherwise swap it out using xswap, and the process image is flagged as not loaded. Eventually sched reads the text segment and data segment of the chosen process back into main memory and sets its load indicator.

xswap (4368)

    4368 xswap(p, ff, os)
    4369 int *p;
         ...

If the old size (os) was not supplied, the current size of the process's data segment is used. xccdec decrements the count of in-core processes which reference the text segment; if the count becomes zero, the memory area occupied by the text segment is simply returned to the free space (mfree(coremap, ...)). The core image is then released and the process is flagged as not loaded: p_flag =& ~(SLOAD|SLOCK). If runout is set, sched now has something to swap in, so wake it: runout = 0; wakeup(&runout). (xswap is also called by newproc.)

xalloc (4433)
Called by exec (3130) when a new program is being initiated; it handles the allocation of, or linking to, the text segment. In xalloc(ip), the argument ip is a pointer to the inode of the code file.

xfree (4398)
xfree is called by exit (3233) when a process is being terminated. If --xp->x_count == 0 and the sticky bit (ISVTX) is not set in the inode, the copy of the text segment in the disk swap area is abandoned.

An Introduction to Basic I/O under Unix
There are three files whose contents are essential to UNIX input/output: buf.h, conf.h, conf.c.

The file buf.h
This file declares two structures.

struct buf: this structure is the buffer header and serves as a buffer control block. Its fields include list pointers giving the buffer's position on the device and free lists, the b_flags status word, and the parameters of an I/O transfer, among them the block number (b_blkno).
struct buf (continued): the remaining fields describe the transfer in progress, including the device and the count of words not yet transferred.

struct devtab: the devtab structure contains status information for a device and serves as a list head for (a) the list of buffers associated with the device and (b) the list of outstanding I/O requests for the device.

The file conf.h
conf.h declares the configuration data, including int nblkdev (the number of block devices) and the device switch structures bdevsw and cdevsw.

The file conf.c
[Table: conf.c initializes the block and character device switches, bdevsw and cdevsw. Each row lists a device's open, close, read/write, or strategy routines (e.g. the RK disk's strategy routine), with nulldev in slots that need no action and nodev in slots whose operations are errors.]

swbuf
swbuf, an instance of struct buf, controls swapping input and output. The swap procedure (5196):

    5196 swap(blkno, coreaddr, count, rdflg)
    5197 {
    5198     register *fp;
    5199
    5200     fp = &swbuf.b_flags;
    5201     spl6();
    5202     while (*fp & B_BUSY) {
    5203         *fp =| B_WANTED;
    5204         sleep(fp, PSWP);
    5205     }
    5206     *fp = B_BUSY | B_PHYS | rdflg;
    5207     swbuf.b_dev = swapdev;
    5208     swbuf.b_wcount = -(count<<5);    /* 32 w/block */
    5209     swbuf.b_blkno = blkno;
    5210     swbuf.b_addr = coreaddr<<6;      /* 64 b/block */
    5211     swbuf.b_xmem = (coreaddr>>10) & 077;
    5212     (*bdevsw[swapdev>>8].d_strategy)(&swbuf);
    5213     spl6();
    5214     while ((*fp & B_DONE) == 0)
    5215         sleep(fp, PSWP);
    5216     if (*fp & B_WANTED)
    5217         wakeup(fp);
    5218     spl0();
    5219     *fp =& ~(B_BUSY|B_WANTED);
    5220     return(*fp & B_ERROR);
    5221 }

swap - an example
The code for swap displays some of the problems of race conditions when several processes are running together.
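The indirect call at line 5212, (*bdevsw[swapdev>>8].d_strategy)(&swbuf), is just a table lookup on the major device number. A minimal stand-alone model of that dispatch follows; the driver names here are fakes, not the real conf.c entries.

```c
/* Illustrative model of the conf.c device-switch idea: an array indexed by
 * major device number whose entries hold a strategy routine, so callers can
 * start I/O without knowing which driver they reach. */
struct buf_s { int b_blkno; int done; };

struct bdevsw_s {
    void (*d_strategy)(struct buf_s *);
};

static void fake_disk_strategy(struct buf_s *bp) { bp->done = 1; } /* stand-in driver */
static void nulldev_strategy(struct buf_s *bp)   { (void)bp; }     /* no-op slot */

static struct bdevsw_s bdevsw[] = {
    { fake_disk_strategy },  /* major 0 */
    { nulldev_strategy },    /* major 1 */
};

#define major(dev) ((dev) >> 8)   /* v6 keeps the major number in the high byte */

void start_io(int dev, struct buf_s *bp)
{
    (*bdevsw[major(dev)].d_strategy)(bp);
}
```

Adding a device then means adding a row to the table, not touching any caller.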
swap (continued) - an example

1. Process A initiates a swapping operation. It finds swbuf free, sets flags = B_BUSY | B_PHYS | rdflg, starts the transfer through the device strategy routine, and sleeps waiting for B_DONE.

2. Process B now initiates a swapping operation of its own. It finds B_BUSY set, so it sets B_WANTED (flags = B_BUSY | B_PHYS | rdflg | B_WANTED) and sleeps on the buffer.

3. An interrupt announces the completion of A's transfer; the completion routine sets B_DONE (flags = B_BUSY | B_PHYS | rdflg | B_WANTED | B_DONE) and wakes the sleepers.

What happens next depends on the order in which A and B are reactivated.

Case 1: A goes first. Process B can then finish:
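The B_BUSY/B_WANTED handshake that swap() performs around the shared swbuf can be rehearsed single-threaded. This sketch only counts wakeups instead of really sleeping, and the helper names are ours, not the kernel's.

```c
/* Single-threaded rehearsal of the B_BUSY / B_WANTED protocol around a
 * shared buffer.  Real code runs under sleep()/wakeup() and interrupt
 * priorities; here we just track the flag word and count wakeups. */
#define B_BUSY   01
#define B_WANTED 02

static int flags;
static int wakeups;   /* how many wakeup() calls were issued */

/* returns 1 if the caller got the buffer, 0 if it must "sleep" first */
int try_acquire(void)
{
    if (flags & B_BUSY) {
        flags |= B_WANTED;   /* remember that someone is waiting */
        return 0;
    }
    flags |= B_BUSY;
    return 1;
}

void release(void)
{
    if (flags & B_WANTED)
        wakeups++;           /* wakeup(fp): rouse the sleeper */
    flags &= ~(B_BUSY | B_WANTED);
}
```

The point of B_WANTED is visible in release(): the holder only bothers to issue a wakeup when somebody recorded an interest in the buffer.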
B_DONE is set, so no more sleeping is needed in A's wait loop (5214). A finds B_WANTED on, issues the wakeup (5216-5217), and finishes up, clearing B_BUSY and B_WANTED and leaving flags = B_PHYS | rdflg | B_DONE. Process B wakes up, finds the buffer free, and finishes.

Case 2: B goes first. B finds B_BUSY still set, so it sets B_WANTED again and goes back to sleep, leaving flags = B_BUSY | B_PHYS | rdflg | B_DONE | B_WANTED. When process A starts again, it finds B_DONE set and B_WANTED on, calls wakeup, clears B_BUSY and B_WANTED, and returns. Process B then wakes up and finishes.

Summary
- The sched procedure makes the decisions to swap in and swap out.
- Four procedures manipulate the text array structure: 1. xswap 2. xccdec 3. xalloc 4. xfree.
- The buf structure in buf.h is the buffer header and serves as a buffer control block.
- swbuf, an instance of the buf structure, controls the process of swapping input and output.

References
- "Lions' Commentary on UNIX 6th Edition, with Source Code" by John Lions
- http://www.uwsg.iu.edu/UAU/memory/pageswap.html ("Unix for Advanced Users")
- "The Unix I/O System" by Dennis Ritchie
- "UNIX in a Nutshell, 3rd Edition" by Arnold Robbins

A Fast File System for UNIX
Presented by Sean Mondesire, Subramanian Kasi

Outline
- Introduction
- Old File System
- New File System
- Performance Improvement
- File System Functional Enhancements
- Conclusion
- References

Introduction
- The Fast File System was developed by the Computer Systems Research Group (CSRG) at the University of California, Berkeley.
- The work was done under grants from the NSF and ARPA.
- The main goal was to increase the throughput of the old 512-byte UNIX file system by changing the underlying implementation.

Old File System
- Each disk drive is divided into one or more partitions.
- Each partition holds one file system.
- A file system consists of: boot area, superblock, inode list, data blocks.

[Figure: drive physical and logical organization, showing partitions, each containing a boot area, superblock, inode list, and data blocks.]

- One or more sectors in a track are grouped to form a data block.

- The boot area stores objects that are used in booting the system. If a file system is not to be used for booting, the boot area is left blank.
- The superblock contains basic parameters of the file system:
  - number of data blocks in the file system
  - maximum number of files
  - pointer to the free list: a linked list of all free blocks in the system, traversed during block allocation for a file.

- Within the file system are files. Each file is described by an inode. An inode contains information about ownership, time stamps, and an array of indices to data blocks:

    struct inode {
        u_short  di_mode;        /* mode and type of file */
        short    di_nlink;       /* number of links to file */
        short    di_uid;         /* owner's user id */
        short    di_gid;         /* owner's group id */
        quad     di_size;        /* number of bytes in file */
        time_t   di_atime;       /* time last accessed */
        long     di_atspare;
        time_t   di_mtime;       /* time last modified */
        long     di_mtspare;
        time_t   di_ctime;       /* time of last file status change */
        long     di_ctspare;
        daddr_t  di_db[NDADDR];  /* disk block addresses */
    };

- An inode may also contain references to single, double, and triple indirect blocks. [Figure: direct addresses and single, double, and triple indirect blocks.]

- Certain files are distinguished as directories, which contain a list of file names and their corresponding inodes. [Figure: directory entries pointing through the inode list to data blocks.]

Old File System - layout problems
A 150 MB traditional file system contains 4 MB of inodes and 146 MB of data blocks. Keeping all the inodes at one end of the partition causes a long seek time from a file's inode to its data.
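The reach of the direct plus indirect scheme is easy to work out. The sketch below uses illustrative parameter values (10 direct pointers, 128 pointers per indirect block); these are assumptions for the example, not numbers taken from the notes.

```c
/* Capacity of an inode with nd direct addresses plus one single, one
 * double, and one triple indirect block, where each indirect block holds
 * npb block pointers: nd + npb + npb^2 + npb^3 data blocks in total.
 * (Parameter values in the test are illustrative.) */
long max_file_blocks(long nd, long npb)
{
    return nd + npb + npb * npb + npb * npb * npb;
}
```

Multiplying the result by the block size gives the maximum file size the scheme can address.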
Files within a directory are also not allocated sequential slots in the inode list. [Layout: 4 MB of inodes followed by 146 MB of data blocks.]

Old File System (continued)
- Disk transfers used only a 512-byte block size.
- The next sequential data block was often not on the same cylinder, causing a seek between transfers. Reason: suboptimum allocation of data blocks to files.
- Problem with the free list: a scrambled free list. The free list is a linked list of all free blocks in a file system, whose head is stored in the superblock.
- The free list is initially ordered, but as files are created and deleted it becomes scrambled, and eventually totally random. Files then have their blocks randomly distributed over the disk, causing a seek for every block access. Throughput of 175 KB/s on a fresh file system degraded to 30 KB/s.

Old File System - summary of problems
- Long seek time from an inode to the actual data.
- Files in a directory not allocated consecutive slots in the inode list.
- Small block size (512 bytes).
- Suboptimum allocation of blocks to a file, resulting in too many seeks between block transfers.

- The first work at Berkeley was to increase the block size from 512 to 1024 bytes. File system performance doubled, though the system was still using only about 4% of the disk bandwidth. Reason: each disk transfer moved twice as much data, and most files could be described without needing to access indirect blocks. A good indication that increasing the block size helps.

New File System
- Like the old file system, each disk drive contains one or more file systems, and each file system is described by its superblock.
- The superblock is replicated to protect against failure. Since the information in the superblock is static, there is no need to access the copies unless the default superblock becomes unusable.
- Larger block size: 4096 bytes. The block size for each file system is recorded in its superblock, so file systems with different block sizes can be accessed on the same system. The decision on the block size is made at the time the file system is created.

[Figure: disk geometry, showing sectors and tracks on each platter, the heads, and cylinder 0.]
- Tracks with the same radius on different platters form a cylinder.
- The new file system divides a disk partition into one or more cylinder groups (sets of consecutive cylinders).

New File System (continued)
Each cylinder group contains bookkeeping information:
- a redundant copy of the superblock
- a bit map of the available blocks in the cylinder group (this replaced the free list)
- summary information describing the usage of the data blocks.

- All the cylinder group information could be kept on the top platter, but then all copies of the superblock would sit on one platter, and failure of that platter would lose every copy of the superblock. Solution: place the bookkeeping information for each cylinder group at an offset from the previous group's, producing a spiral structure down the disk.

New File System - optimizing storage utilization
- Problem with a large block size: UNIX systems are composed of many small files.

[Table: percentage of disk space wasted for various organizations (data only, data plus inodes, data plus inodes in a traditional UNIX file system, etc.) at increasing block sizes; waste grows with block size.]

- Space wasted is calculated as the percentage of space on the disk not containing user data.
- As block size increases, waste increases: 45.6% waste for 4096-byte file system blocks.

New File System - fragments
- Need: use large blocks without the waste. Solution: divide the blocks into one or more fragments.
- The fragment size is specified at the time the file system is created; a block can be broken into 2, 4, or 8 fragments.
- The lower bound on the fragment size is the sector size.
- Each individual fragment is addressable.

- The bit map kept for each cylinder group records the status of the fragments.

[Figure 1: example layout of blocks and fragments in a 4096/1024 file system.]

- X: fragment in use; O: fragment available.
- Fragments of adjoining blocks cannot be used as one block: fragments 6-9 cannot be used as one block, while fragments 12-15 can.
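The alignment rule behind the bitmap example (6-9 unusable, 12-15 usable) can be written as a small check. This is our own sketch of the idea, not the FFS bitmap code.

```c
/* A full block can only be assembled from free fragments that all lie
 * inside one block boundary.  map[i] != 0 means fragment i is free.
 * Free fragments straddling two blocks (like 6..9 with 4 fragments per
 * block) are therefore not usable as a block. */
int is_free_block(const int map[], int frags_per_block, int block)
{
    int base = block * frags_per_block;
    for (int i = 0; i < frags_per_block; i++)
        if (!map[base + i])
            return 0;   /* some fragment of this block is in use */
    return 1;
}
```

Only block-aligned runs count, which is exactly why a file's tail fragments can pin an otherwise nearly empty pair of blocks.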
New File System - example
An 11000-byte file stored in a 4096/1024 file system occupies:
- two full-size blocks: 4096 × 2 = 8192
- one three-fragment portion: 1024 × 3 = 3072
- total space allocated: 11264 bytes, as opposed to 12288.

New File System - writing
Space is allocated to a file every time a program executes a write system call. When a file must be expanded to hold new data, one of three conditions exists:
1. There is enough space in an already allocated block or fragment: the new data are written into the available space.
2. The file contains no fragmented blocks, but the last block has insufficient space to hold the new data: part of the data is written into that block; then, while the remainder of the new data contains more than a full block, a full block is allocated and written; once less than a full block remains, a block with the necessary fragments is located and the rest is written there.
3. The file contains one or more fragments: if the size of the new data plus the data in the fragments exceeds the size of a block, the fragment data and the new data are moved to a new block, and the process continues as in 2.

- Problem with expanding a file one fragment at a time: data may be copied many times. Solution: user programs write a full block at a time, except for the block at the end of the file.

New File System - free space reserve
- For the layout policies to be effective, the file system cannot be kept full.
- The free space reserve is the acceptable percentage of file system blocks that should remain free.
- Reason: if the number of free blocks falls to zero, system throughput is cut in half.

New File System - file system parameterization
- The old file system ignores the parameters of the underlying hardware.
- The new file system parameterizes the processor capability and the mass storage characteristics, enabling blocks to be allocated in a configuration-dependent way.

Parameters considered:
- speed of the processor
- hardware support for mass storage transfers
- characteristics of the mass storage device.

- For mass storage on disks, the allocator tries to allocate each block on the same cylinder as the previous block in the file.
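The 11000-byte fragment example above can be checked in code. A sketch, with a function name of our own choosing:

```c
/* Bytes actually allocated to a file of a given size: whole blocks first,
 * then the tail rounded up to whole fragments rather than a whole block. */
long bytes_allocated(long filesize, long blocksize, long fragsize)
{
    long full  = filesize / blocksize;              /* whole blocks */
    long tail  = filesize % blocksize;              /* leftover bytes */
    long frags = (tail + fragsize - 1) / fragsize;  /* round up to fragments */
    return full * blocksize + frags * fragsize;
}
```

With a fragment size equal to the block size the function degenerates to the old whole-block rounding, which is where the wasted space comes from.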
- These blocks also need to be rotationally well positioned, which may mean consecutive blocks or rotationally delayed blocks, depending on the hardware.

- If a processor has an I/O channel that requires no processor intervention, two consecutive blocks can be accessed without any delay. A processor without an I/O channel requires processor intervention between disk transfers to prepare for the next disk transfer.

- The parameterization uses the physical characteristics of the disk, such as the number of blocks per track and the rate at which the disk spins, and processor characteristics, such as the time to service an interrupt and the time to schedule the next disk transfer.

- Using the processor and disk characteristics, the allocation routine calculates the number of blocks that need to be skipped so that the next block in the file comes under the disk head at the appropriate time. This minimizes the time spent waiting for the disk to position itself.

New File System - rotational layout
- The cylinder group summary information includes a count of the available blocks at each rotational position.
- The superblock contains a vector, the rotational layout table; each component of this table lists the index into the block map for every data block in its rotational position.

- When looking for a block, the allocator first looks through the summary counts for a rotational position with a non-zero block count, then uses that rotational position to index into the rotational layout table to find the list from which a free block is chosen.

[Table: cylinder group summary information (rotational position and number of blocks available at it, e.g. position 1: 5 blocks; position 2: 2; position 3: 1) alongside the rotational layout table (rotational position and the list of blocks at that position).]

- If a file system is parameterized to lay out blocks with a rotational separation of 2 ms, but the processor requires 4 ms to schedule disk operations, a disk revolution is wasted on every block and throughput drops.
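The skip calculation described above can be sketched directly. The parameter names and values are our own illustrative assumptions, not figures from the paper.

```c
/* If the disk takes rev_ms per revolution and blocks_per_track blocks pass
 * under the head per revolution, each block takes rev_ms/blocks_per_track
 * ms to pass.  A processor needing setup_ms to start the next transfer
 * must skip enough blocks to cover that time. */
int blocks_to_skip(double rev_ms, int blocks_per_track, double setup_ms)
{
    double ms_per_block = rev_ms / blocks_per_track;
    int skip = 0;
    while (skip * ms_per_block < setup_ms)
        skip++;
    return skip;
}
```

A parameterized file system recomputes this once per configuration instead of hard-wiring one rotational gap for all machines.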
In the new file system, the rotational layout delay can be reconfigured based on the target machine.

Layout Policies
The policies improve performance by:
- increasing the locality of reference, to minimize seek latency
- improving the layout of data, to make larger transfers possible.
Two types of policies: global policy routines and local allocation routines.

Global Layout Policies
- Attempt to improve performance by clustering related information:
  - make decisions about the placement of new inodes and data blocks
  - decide the placement of new directories and files.
- Distribute unrelated data among different cylinder groups, for fear of too much localization.

Local Allocation Routines
- Called by the global policy routines with requests for specific blocks.
- Always allocate the requested block if it is free.
- If the requested block is not free, the four-level allocation strategy is used.

Four-Level Allocation Strategy
1. Use the next free block that is rotationally closest to the requested block on the same cylinder. [Figure: cylinder 0.]
2. If there are no free blocks on the same cylinder, a free block in the same cylinder group is used. [Figure: cylinders 0 and 1 within one cylinder group.]
3. If the cylinder group is full, use the quadratic hash function on the cylinder group number to find another cylinder group in which to look for a free block.
4. If the hash fails, use an exhaustive search over all cylinder groups.

Functional Enhancements
A few additional functional enhancements were introduced to UNIX:
- long file names
- symbolic links
- file locking
- rename
- quotas

Functional Enhancements (continued)
- Long file names: file names can be at most 255 characters.
- Symbolic links: a file that contains the pathname of another file, giving the illusion that a remote file is actually local. The specified path can be either absolute or relative.
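Steps 3 and 4 of the four-level strategy can be sketched together. The probe sequence below (cg + 1, cg + 4, cg + 9, ... mod ncg) is one plausible quadratic hash; the real FFS allocator differs in detail, and free_in[] is a toy stand-in for "this cylinder group has a free block".

```c
/* Sketch of steps 3-4: quadratically probe other cylinder groups when the
 * preferred one is full, then fall back to an exhaustive search. */
int find_cg(int cg, int ncg, const int free_in[])
{
    if (free_in[cg])
        return cg;
    for (int i = 1; i < ncg; i++) {          /* step 3: quadratic hash */
        int probe = (cg + i * i) % ncg;
        if (free_in[probe])
            return probe;
    }
    for (int i = 0; i < ncg; i++)            /* step 4: exhaustive search */
        if (free_in[i])
            return i;
    return -1;                               /* file system is full */
}
```

The hash spreads overflow allocations across distant groups quickly, while the exhaustive pass guarantees nothing free is ever missed.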
  - Absolute: C:\SchoolWork\Spring2005\COP5611\HW1.EXE
  - Relative: COP5611\HW1.EXE

Rename
- In the old file system, renaming a file required multiple calls to the system:
  - create a new copy of the existing file
  - rename the temporary file.
- This posed a threat if the system crashed or was interrupted.
- FFS added the rename system call, which:
  - guarantees the target-name file will be created
  - handles renaming of both files and directories.