Design CS 3210
Popular in Course
Popular in ComputerScienence
This 0 page Class Notes was uploaded by Alayna Veum on Monday November 2, 2015. The Class Notes belongs to CS 3210 at Georgia Institute of Technology - Main Campus taught by Staff in Fall. Since its upload, it has received 37 views. For similar materials see /class/234110/cs-3210-georgia-institute-of-technology-main-campus in ComputerScienence at Georgia Institute of Technology - Main Campus.
Reviews for Design
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 11/02/15
Accessing Files Reading and writing files VFS level complex involves VFS disk caches block devices filesystemdependent functions pagebased pagebuffer cache unification generic read and write implementations ext2 etc genericfieread readahead logic genericfiewrite special logic for regular file vs device file access Memory mapping types of mappings data structures mmap munmap msynch paging memorymappings Direct raw lO Reading Basically find file page pages containing requested bytes read into page cache copy to userspace buffer dogenericfilereadfilp buf count ppos direct io special case filpgtflag ODRECT readdescriptort read operation descriptor get addressspace object fipgtfdentrygtdinodegtimapping get owning inode in addressspace object derive file page index from ppos for each page in request check page cache if not in cache allocate page frame and add genericfiereadahead activate readaheads if necessary invoke addressspace specific readpage function copytouser read page addressspacespecific function eg ext2 version just a wrapper return blockreadfullpagepage ext2getbock ext2getbock maps file block to disk block via inode blockreadfupage allocate buffers if necessary for each buffer in page setup buffer head for transfer file holes may mean no block fill buffer with zeroes set io completion operation submitbh submit io request for device files special block device inode bdinode file block disk block blkdevgetblock ReadAhead read several file pages ahead if sequential sequential means anywhere in readahead window window grows and shrinks number of readahead pages increases decreases speedup or slowdown readahead stop readahead when small set of pages repeatedly accessed readahead group pages in last readahead request group is always a subset of window read outside window stop readahead read inside group readahead doubling group size devicespecific min and max values procsysvmminmaXreadahead min3 max31 Writing similar to reading but more complex copy to kernel from user mark page dirty file may grow requiring disk block allocation some details check for OAPPEND check user and filesystem limits update file and inode mod times page may not already be in page cache eg writing to a new file preparewrite copy commitwrite if OSYNC flush buffers preparewrite commitwrite wrappers for asspecific functions eg ext2 blockpreparewritepage from to ext2getbock blockcommitwritepage from to blockpreparewrite allocate buffer heads if necessary for each buffer head special case for simultaneous regular direct io if writing partial buffer read block from disk into buffer blockcommitwrite for each buffer mark dirty call balancedirty too many dirty buffers Memory File Mapping works for regular or device file mapping memory access effects file two types of mapping SHARED for readwrite updates disk a views PRIVATE for read more efficient write creates private copy copyonwrite mechanism Memory Mapping Data Structures complex mesh of data structures vma for mapping points to file object file object points to addressspace object addressspace object points to oinode all vmas in all processes mapping file pages in page cache via page descriptors special function for page fault of mapped page filemapnopage mmap mmapfd offset len flags perms address flags PRIVATE SHARED ANONYMOUS FIXED address start or hint or null wherever some details check permissions is file mapable setup linkages setup vma including vops filed filemapnopage munmap may destroy shrink or split mapping Memory Mapping Paging mapping pages faulted in when touched generic processing donopage mappingspecific processing filemapnopage in page cache if not readpage do file mapping readahead file mapping readahead writebehind madvise RANDREAD SEQREAD etc procsysvmpagecluster larger readahead window writebehind logic msync explicit flush of mapped pages Memory Addressing Segmentation and paging background Segmentation Intel segmentation hardware Linux use of Intel segmentation Paging lntel paging hardware Linux use of Intel paging Memory layout Kernel layout in physical memory Process virtual memory layout Kernel virtual memory layout Segmentation Background Motivated by logical use of memory areas Code heap stack etc Base offset Segment registers contain base Segments variable size usually large Process as collection of segments No notion of linear contiguous memory Similar to multistream files Mac Requires segment table Paging Background Motivated by notion of linear contiguous quotvirtualquot memory space Every process has it39s own quotzeroquot address Uniform sized chunks pages 4K on Intel Virtually contiguous pages may be physically scattered Virtual space may have quotholesquot Page table translates virtual quotpagesquot to physical quotpage framesquot Intel SegmentationPaging Intel address terminology Logical segment offset Linear virtual o 4GB 64GB with PAE Physical Logical gt Linear gt Physical Paging can be disabled Segmentation required Though you can just have one big segment Intel Segmentation Hardware Segment registers cs ss ds es fs gs Indices quotselectorsquot into quotsegment descriptor tablesquot Segment descriptor tables GDT LDT Global and local per process in theory Each holds about 8000 descriptors Segment descriptors 8 bytes each Baselimit codedata privileges type Cached in ro registers when seg registers loaded Task segment descriptor TSS Special segment for holding process context Segmentation in Linux History Early versions segmented now paged Using quotsharedquot segments simplifies things Some RISC chips don39t support segmentation Linux only uses GDT LDTs allocated for Windows emulation Each process has T88 and LDT descriptors T88 is perprocess LDT is shared Linux Descriptor Allocation GDT holds 8192 descriptors 4 primary 4 APM 4 unused 8180 left 2 gt 4090 processes limit in 22 Primary shared segments Kernel code kernel data User code user data Segments overlap in linear address space 24 removes 4K process restriction Intel Paging Hardware Page table quotmapsquot linear address space Some pages may be invalid Address space growsshrinks mapping New regions mapped by DLLs mmap brk Pages linear vs page frames physical Page may map to different frame after swap Page tables stored in kernel data segment lntel page size 4K or 4M quotjumbo pagesquot Jumbo pages reduce table size Paging quotenabledquot by writing crO register Intel TwoLevel Paging quotPage tablequot is actually a twolevel tree Page Directory root Page Tables children Pages grandchildren Linear address carved into three pieces Directory 10 Table 10 Offset 12 Entries frame bookkeeping bits Bookkeeping bits Present Accessed Dirty ReadWrite UserSupervisor Page Size Jumbo pages Map entire 4GB address space with just toplevel Directory No need for Tables Kernel uses this technique ThreeLevel Paging How big are page tables on 64 bit arch Sparc Alpha Itanium Assume 16K pages gt 32 M per process Better idea threelevel paging trees Page Global Directory pgd Page Middle Directory pmd Page Table pt Carve linear address into 4 pieces Conceptual paging model On Intel page middle directory is compiled out Aside Caching Exploit temporal and spatial locality L1 and L2 caches on chip Cache quotlinesquot 32 bytes on Intel Kernel developer goals Keep frequently used struct fields in the same cache line Spread lines out for large data structures to keep cache utilized The quotCache Policequot TLB Translation Lookaside Buffer Virtual to physical cache Must be flushed invalidated when address space mappings change A significant cost of context switch Paging in Linux Threelevel scheme Middle directories quotcollapsedquot w macro tricks No bits assigned for middle directory On context switch address space change Saveload registers in T88 Install new toplevel directory write cr3 Flush TLB Lot39s of macros for Allocatingdeallocating altering querying Page directories tables entries Examples setpte mkpte pteyoung newpagetables Kernel Physical Layout Kernel is quotpinnedquot nonswappable Usually about 1M mapped in first 2M Some quotholesquot in low memory Zero frame always invalid avoids null dereference ISA quotio aperturequot 640K 1M Kernel layout symbols Systemmap text code start usually about 1M etext initialized data start edata uninitialized data start end end of kernel data Remaining frames allocated as needed by kernel page allocator Process Virtual Layout 4GB virtual space on a 32 bit system Possible to install Linux on systems with more than 4GB of physical memory but you must use tricks to access quothighquot memory Kernel macro PAGEOFFSET Division between user and kernel regions Typically 3GB user 1 GB kernel adjustable We will look at details of user space later Kernel Virtual Layout Kernel page tables initialized in two stages during startup Phase one only maps 4MB provisional Phase two maps all memory Provisional Static PGD just one PT swapperpgdir Low 4M of user amp kernel map low 4M physical Allows kernel to be addressed virtually or physically Final Kernel Virtual Layout Kernel code Operates using linear virtual addresses Macros to go from physical to virtual and back pavirual vaphysical Kernel mapped using jumbo pages Kernel virtual address space Low physical frames containing kernel mapped to virtual addresses starting at PAGEOFFSET Remaining frames allocated on demand by kernel and user processes and mapped to virtual addresses CS 3210 Fall 2005 26 Scheduler Goals 0 01 scheduling 24 scheduler iterates through 0 run queue on each invocation o task queue at each epoch o quotperfectquot SMP scalability o perCPU run queues SMPammw Interactivity boost Fairness Optimize for one or two runnable processes Scale well on multiple processors CS 3210 Operating System Design 2 Hutto College of Computing Georgia Tech O O O O struct runqueue o perprocessor runnable processes on one runqueue 0 primary fields pointers to active and expired quotpriority arraysquot spinlock nrrunning nruninterruptible nrswitches currently running task struct idle task some timestamps last array quotswapquot0 migration fields related to queue balancing o macros for manipulating runqueues o cpurqprocessor taskrqtask rqocklunlock o douberqock locks two q39s in address order CS 3210 Operating System Design 3 Hutto College of Computing Georgia Tech Priority Arrays o 140 element array of taskstructs Iistheads a bitmap array array of 5 ints o to find highest priority runnable processes o schedfindfirstbit like ffz o nractive of runnable tasks CS 3210 Operating System Design 4 Hutto College of Computing Georgia Tech Recalculating Timeslices o No ON calculation at epoch required 0 Amortizes replenishment across expiration o On expiration recalculate new timeslice 0 Move replenished task from active to expired priority arrays o Inserting is 01 0 When last runnable expires swap arrays CS 3210 Operating System Design 5 Hutto College of Computing Georgia Tech schedue a get runqueue for current processor 0 find highest priority runnable process 0 contextswitch if prev next Hutto College of Computing Georgia Tech Calculating Priority o taskstruct fields staticprio base prio dynamic o effectivepriority o computes dynamic priority 0 base interactivity bonuspenalty 5 5 o interactivity heuristic sleep ratio 0 mostly sleeping io bound 0 mostly running cpu bound o sleep ratio approximation 0 taskstruct sleepavg O MAXSLEEPAVG 10 ms o add sleep time up to MAX when made runnable o decrement to zero for every tick while running CS 3210 Operating System Design 7 Hutto College of Computing Georgia Tech Calculating Timeslice o tasktimesice o scaling of static priority into timeslice range 0 min 19 5 ms MNTMESLCE o default 0 100 ms DEFTMESLCE 0 max 20 800 ms MAXTMESLCE o parent and child split timeslice on fork a one more interactivity boost 0 highly interactive tasks reinserted in active q o skipped if there are very old expired processes o if iTASKINTERATIVEtask EXPIREDSTARVINGrq enqueuetasktask rqgtexpired 0 else enqueuetasktask rqgtactive CS 3210 Operating System Design 8 Hutto College of Computing Georgia Tech Runqueue Load Balancer o periodically pull processes from loaded to empty runqueues o loadbalance o called by schedule if rq empty and via timer o every 1 ms when system is idle 0 every 200 ms otherwise 0 characteristics o only balances if some queue has 25 more processes 0 prefers processes in expired 0 prefers higher priority tasks o avoids processes that are running pinned or quotcache hotquot CS 3210 Operating System Design 9 Hutto College of Computing Georgia Tech Odds and Ends o needresched could be global but is per task for faster access 0 hard affinity o schedgetsetaffinity 0 used to define subset of allowed processors a child inherits parent affinity 0 support for NUMA machines a scheduler quotdomainsquot CS 3210 Operating System Design 10 Hutto College of Computing Georgia Tech Modules DESl lTlE BEING Mtworirnic 1N THF smsr or THE viiou kernel running in a sing protection domain the Linux kernel is modular allowing the dynamic insertion and removal otcode from the kernel at run time Related subroutines data and entry and exit points are grouped together in a single binary image a loadable kernel object called a module Support for modules allows 39stems to have only a minimal base kernel image with optional features and drivers supplied via module Modules also provide easy removal and reloading ofkeruel code facilitate debugging and allow for the loading ol new drivers on demand in response to the hotpluggiug ol39new ClCV39lCCS This Chapter looks at the magic behind module in the kernel and how you can write your very own module Hello World Unlike development on core subsystems ot the kernel much ot what has been dis cussed thus Far module development is more like writing a new application at least in that modules have entry points and exit points and live in their own les It might be trite and Cliche but it would be a travesty ol39tlie highest order to have the opportunity to write a HelloKX orldl program and not capitalize on the ocuasion Ladies and gentlemen the kernel module llelloWorld hello c Hello World As a Kernel Module include ltlinuxinithgt include lt1inuXmodulehgt lnclude ltlinuxkcrnelhgt k hello init the inlt function called when the module is loaded Returns zero if successfully loaded nonzero otherwise Chapter 16 Modules static int hellonir iit void printleERNALERT quotI bear a charmed life n return 0 i helloexit the exit function called when the module is removedi static void helloexitvoidl printkKERNALERT quotOut out brief candlenquot moduleinit helloinit moduleexitihelloexit MODULEALICENSE quotGPLquot MODULE AUTHOR l quot Shakespeare quot l This is as simple a kernel module as one can getThe helloginit function is regis tered via module init as this module s entry point It is invoked by the kernel when the module is loadedThe call to modulehinit is not really a function call at all but a macro that assigns its sole parameter as the initialization function for this module All init functions must have the form int myinit void Because init functions are typically not directly called by external code you need not export the function and it can be marked as static Init functions return an int If initialization or whatever your init function does was successful the function returns zero On failure it returns nonzero This init function morer prints a simple message and returns zero Init functions in real world modules typically register resources allocate data structures and so on Even if this file were compiled statically into the kernel image the init function would be kept and run on kernel boot The hellovexit function is registered as this module s exit point Via moduleAexi39t The kernel invokes helloexit when the module is removed from memory Exit functions might clean up resources ensure that hardware is in a consistent state and so on before returningAfter the exit function returns the module is unloaded Exit functions must have the form void myexit void As with the init function you probably want to mark it static Building Moritth iii lfthis file were compiled into the static kernel image the exit function would no llt included and it would never be invoked because ifit were not a module the code could never be removed from memory The MODULELICENSE macro speci es the copyright license for this file Loading a uon GPL module into memory results in the tainted flag being set in the kernelTliis ag is also just for informational purposes but many kernel developers give bug reports less credence when the tainted flag is set in the oops Further non GPL modules cannot invoke CPL only symbols see the section Exported Symbols later in this chapter Finally the MODULEAUTHOR i macro specifies this le s author The value of this macro is entirely for informational purposes Building Modules In 26 building modules is easier than ever thanks to the new kbuild build systemffhe rst decision in building modules is deciding where the module source is to liveiYou can add the module source to the kernel source proper either as a patch or by eventually merging your code into the official tree Alternatively you can maintain and build your module source outside the source tree At Home in the Source Tree Ideally your module is an official part of Linux and thus lives in the kernel source tree Getting your work into the kernel proper may require more maintenance at first but it is usually the preferred solution The first step is deciding where in the kernel source tree your module is to live Drivers are stored in subdirectories of the drivers directory in the root ofthc kernel source tree Inside they are further organized by class type and eventually speci c driver Character devices live in driverschar whereas block devices live in driversblock and USB devices live in driversusbLThe rules are not hard and fast because many USB devices are character devices But the organization is fairly understandable and accurate Assume you have a character device and want to store it in driverschari Inside this directory are numerous C source files and a handful of other directories Drivers with only one or two source files might simply stick their source in this directory Drivers with multiple source files and other accompanyingjunk might create a new sub directory There is no hard and fast rule Presume that you want to create your own subdirectory In this fantasy your driver is for a shing pole with a computer interface the Fish Master XL 2000 Titanium so you need to create a fishing subdirectory inside driverschad Now you need to add a line to the Makefile in driverschar So you edit driverscharMakefile and add objm fishing Chapter 16 Modules This causes the build system to descend into the fishing subdirectory whenever it compiles modules More likely your driver s compilation is contingent on a specific con guration option for example perhaps CONFIGFISHINGPOLE see the section Managing Con guration Options later in this chapter for how to add a new con guration option In that case you would instead add the line ObjSltCONFIGFISHINGPOLE fishing Finally inside driversCharfishing you add a new Makefile with the following line obj rm 4 fishingo The build system will now descend into fishing and build the module fishingko from fishingcYes you write an extension of 0 but the module is compiled as ko Again more likely your fishing pole driver s compilation is conditional on a configu ration option So you probably want to do the following objCONFIGFISHING POLE fishing0 One day your shing pole driver may get so complicated antodetection of shing line test is just the latest must have that it grows to occupy more than one source file No problem anglers You simply make your Makefile read objCONFIGFISHINGNPOLEl fishing 0 fishingobjs fishing maino fishinglineo Now fishing mainc and fishingline c will be compiled and linked into f i shing l ko Finally you may need to pass to gcc additional compile ags during the build process solely for your leTo do so simply add a line such as the following to your Make le EXTRAACFLAGS DTITANIUMPOLE If you opted to place your source files in driverschar and not create a new subdi rectory you would merely place the preceding lines that you placed in your Make le in driverscharfishing into driverscharMakefile To compile run the kernel build process as usual If your module s build was condi tioned on a con guration option as it was with CONFIGvFISHING POLE make sure that the option is enabled before beginning Living Externally If you prefer to maintain and build your module outside the kernel source tree to live the life of an outsider simply create a Make le in your own source directory like the single line objm fishingo This compiles fishing c into fishingko If your source spans multiple les then two lines will suffice Loading Modules 283 objm fishingo fishingobjs fishingmain o fishingline o This compiles fishing mainc and fishinglinec into fishingrko The main difference in living externally is the build process Because your module lives outside the kernel tree you need to instruct make on how to find the kernel source files and base Make leThis is also easy make C kernelsourCelocation SUBDIRSSPWD modules where kernelsource1ocation is the location of your configured kernel source tree Recall that you should not store your working copy ofthe kernel source tree in usrsrC1inux but somewhere else easily accessible in your home directory Installing Modules Compiled modules are installed into libmodulesversionkernell For example with a kernel version of 2610 the compiled fishing pole module would live at libmodules2610kerneldriverSCharfishingkoifyou nuckitdnecdy nldriverschar The following build command is used to install compiled modules into the correct location make modulesinstall Naturally this needs to be run as root Generating Module Dependencies The Linux module utilities understand dependenciesThis means that module Chum can depend on module bait and when you load the chum module the bait module is automatically loadedThis dependency information must be generated Most Linux dis tributions generate the mapping automatically and keep it up to date on each bootlTo build the module dependency information as root simply run depmod To perform a quick update rebuilding only the information for modules newer than the dependency information itself run as root depmod A The module dependency information is stored in the file 1ibmodulesversion modules idep Loading Modules The simplest way to load a module is via insmodThis utility is very basic It simply asks the kernel to load the module you specify The insmod program does not perform any dependency resolution or advanced error checking Usage is trivialAs root simply run Chapter 16 Modules insmod modul e where module is the name of the module that you want to loadTo load the shing pole module you would run insmod fishing In like fashion to remove a module you use the rmmod utility As root simply run mmod module For example rmmod fishing removes the shing pole module These utilities however are trivial and unintelligentThe utility modprobe provides dependency resolution intelligent error checking and reporting and more advanced fea tures and options Its use is highly encouraged To insert a module into the kernel via modprobe run as root modprobe module module parameters J where module is the name of the module to loadAny following arguments are taken as parameters to pass to the module on load See the section Module Parametersquot for a discussion on module parameters The modprobe command attempts to load not only the requested module but also any modules on which it depends Consequently it is the preferred mechanism for load ing kernel modules The modprobe command can also be used to remove modules from the kernelAgain as root run modprobe r modules where modules speci es one or more modules to remove Unlike rmmod modprobe also removes any modules on which the given module depends if they are unused Section eight of the Linux manual pages provides a reference on their other less used options Managing Con guration Options An earlier section in this chapter looked at compiling the shing pole module only if the CONFIGFISHINGPOLE con guration option was set Con guration options have been discussed in earlier chapters too but now let s look at actually adding a new one continuing with the shing pole example Thanks to the new kbuild system in the 26 kernel adding new configuration options is very easy All you have to do is add an entry to the Kconfig le responsible for the neck of the kernel source tree For drivers this is usually the very directory in Managing Con guration Options 2M which the source lives If the fishing pole driver lives in driverschar then you use driverscharKconfig If you created a new subdirectory and want a new Kconfig le to live there you h to source it from an existing KoonfigYou do this by adding a line such as source quotdriverscharfishingKconfigquot ZlVC to the existing Keonfig le Say driverscharKconfig Entries in Kconfig are easy to addiThe shing pole module would look like config FISHING POLE tristate Fish Master XL supportquot default n help If you say Y here support for the Fish Master XL 2000 Titanium with computer interface will be compiled into the kernel and accessible via device node You can also say M here and the driver will be built as a module named fishingko If unsure say N The first line defines what con guration option this entry represents Note that the CON FIG prefix is assumed and not written The second line states that this option is a m39stme meaning that it can be built into the kernel Y built as a module M or not built at all NTo disable the option of building as a module say if this option represented a feature and not a modulewuse the directive bool instead of tristateThe quoted text following the directive provides the name of this option in the various configuration utilities The third line speci es the default for this option which is oil The help directive signi es that the rest of the test indented as it is is the help text for this eritrszhe various con guration tools can display this text when requested Because this text is or users and developers building their own kernels it can be suc cinct and to the point Home users do not typically build kernels and ifthey did they could probably understand the con guration help as is There are other options tooThe depends directive speci es options that must be set before this option can be set lfthe dependencies are not met the option is disabled For example if you added the directive depends on FISHTANK to the con g entry the module could not be enabled until the CONFIG FISH TANK mod ule is enabled The select directive is like depends except it forces the given option on if this option is selected It should not be used as frequently as depends because it automatical ly enables other con guration options Use is as simple as depends select BAIT Chapter 16 Modules The con guration option CONFIGJEAIT is automatically enabled when CONFIG FISHINGWPOLE is enabled For both select and depends you can request multiple options via asiWith depends you can also specify that an option not be enabled by prefixing the option with an exclamation mark For example depends on DUMBWDRIVERS ampamp lNOFISHINGALLOWED speci es that the driver depends on CONFIG DU39MBDRIVERS being set and CONFIGNO FISHING ALLOWED being unset The tristate and heel options can be followed by the directive if which makes the entire option conditional on another con guration option lfthe condition is not met the con guration option is not only disabled but does not even appear in the con guration utilities For example the directive bool Deep Sea Mode if OCEAN instructs the configuration system to display this option only if CONFIGAXBG is set Presumably deep sea mode is available only if CONFIGOCEAN is enabled The if directive can also follow the default directive enforcing the default only if the conditional is met The con guration system exports several meta options to help make con guration easier The option CONFIGEMBEDDED is enabled only if the user speci ed that he or she wishes to see options designed for disabling key features presumably to save precious memory on embedded systemsThe option CONFIGBROKENONSMP is used to specify a driver that is not SMP safe Normally this option is not set forcing the user to explicitly acknowledge the brokenness New drivers of course should not use this ag Finally the CONFIGEXPERIMENTAL option is used to ag options that are experimental or otherwise of beta quality The option defaults to off again forcing users to explicitly acknowledge the risk before they enable your driver Module Parameters The Linux kernel provides a simple framework allowing drivers to declare parameters that the user can specify on either boot or module load and then have these parameters exposed in your driver as global variablesThese module parameters also show up in sysfs see Chapter 17 kobjects and sysfsquot Consequently creating and managing module parameters that can be speci ed in a myriad of convenient ways is trivial Defining a module parameter is done via the macro moduleparam moduleparamlname type perm where name is the name of both the parameter exposed to the user and the variable holding the parameter inside your module The type argument holds the parameter s data type it is one ofbyte short ushort int uint long ulong charp bool or inv boolThese types are respectively a byte a short integer an unsigned short integer an Module Parameters 287 integer an unsigned integer a long integer an unsigned long integer a pointer to a char a Boolean and a Boolean whose value is inverted from what the user speci es The byte type is stored in a single char and the Boolean types are stored in variables of type intThe rest are stored in the corresponding primitive C types Finally the perm argument speci es the permissions of the corresponding file in sysfsThe permissions can be speci ed in the usual octal format for example 0644 owner can read and write group can read everyone else can read or by ORing together the usual SIfoo defines for example SIRUGO SIWUSR everyone can read user can also writeA value of zero disables the sysamp entry altogether The macro does not declare the variable for youYou must do that bcforc using the macro Therefore typical use might resemble module parameter controlling the capability to allow live bait on the pole static int allowklivebait l default to on module paramallowlivebait bool 0644 a Boolean type This would be in the outermost scope of your module s source file In other words allowlive bait is global It is possible to have the internal variable named differently than the external parame ter This is accomplished via moduleparamnamed moduleparamnamedname variable type perm where name is the externally viewable parameter name and variable is the name of the internal global variables For example static unsigned int max test DEFAULTMAXLINETEST moduleparamnamedmaximumlinecest maxtest int 0 Normally you would use a type of eharp to de ne a module parameter that takes a string The kernel copies the string provided by the user into memory and points your variable to the string For example static char name moduleyarammame charp O Ifso desired it is also possible to have the kernel copy the string directly into a character array that you supply This is done via module pararthtring modulejaram stringlname string len perm where name is the external parameter name string is the internal variable name len is the size of the buffer named by string or some smaller size but that does not make much sense and perm is the sysfs permissions or zero to disable a sysfs entry altogether For example static char species BUFLEN moduleparamstringspecifies species BUFLEN O Processes Process descriptor taskstruct Static properties of processes State id relationships wait queue limits Process switching context switch Hardware context TSS switchto Saving floatingpoint registers Creating processes cone fork vfork Kernel threads vs user threads Destroying processes Termination vs removal Process Descriptor Process dynamic program in motion Kernel data structures to maintain quotstatequot Descriptor PCB control block taskstruct Larger than you think about 1K Complex struct with pointers to others Type of info in taskstruct Registers state id priorities locks files signals memory maps locks queues list pointers Some details Address of first few fields hardcoded in asm Careful attention to cache line layout Process State Traditional textbook view Blocked runnable running Also initializing terminating UNIX adds quotstoppedquot signals ptrace Linux TASKwhatever Running runnable RUNNING Blocked INTERRUPTIBLE UNINTERRUPTIBLE lnterruptible signals bring you out of syscall block EINTR Terminating ZOMBIE Dead but still around quotliving deadquot processes Stopped STOPPED Process Identity Users pid Kernel address of descriptor Pids dynamically allocated reused 16 bits 32767 avoid immediate reuse Pid to address hash 22 static taskarray Statically limited tasks This limitation removed in 24 currentgtpid macro Descriptor StorageAllocation Descriptors stored in kernel data segment Each process gets a 2 page 8K quotkernel stackquot used while in the kernel security taskstruct stored here rest for stack Easy to derive descriptor from esp stack ptr Implemented as union taskunion Small 16 cache of free taskunions freetaskstruct alloctaskstruct Descriptor Lists Hashes Process list inittask prevtask nexttask foreachtaskp iterator macro Runnable processes runqueue inittask prevrun nextrun nrrunning wakeupprocess Calls schedule if quotpreemptionquot is necessary Pid to descriptor hash pidhash hashpid unhashpid findhashbypid Process Relationships Processes are related Parentchild fork siblings Possible to quotreparentquot Parent vs original parent Parent can quotwaitquot for child to terminate Process groups Possible to send signals to all members Sessions Processes related to login Wait Queues Blocking implementation Change state to TASKUNINTERRUPTBLE Add node to wait queue All processes waiting for specific quoteventquot Usually just one element Used for timing synch device io etc Structure is a bit optimized struct waitqueue usually allocated on kernel stack sleepon wakeup sleepon sleeponinterruptible See code on LXR wakeup wakeupinterruptible See code on LXR Process can wakeup with event not true If multiple waiters another may have resource Always check availability after wakeup Maybe wakeup was in response to signal 24 wakeone Avoids quotthundering herdquot problem A lot of waiting processes wake up fight over resource most then go back to sleep losers Bad for performance very bad for bus cache on SMP machine Process Limits Optional resource limits accounting getrlimit setrlimit user control Root can establish rimmin rimmax Usually RLIMITINFINITY Resources RLI M Twhatever CPU FSIZE file size DATA heap STACK CORE RSS frames NPROC processes NOFILE open files MEMLOCK AS Process Switching Context Hardware context Registers including page table register Hardware support but Linux uses software About the same speed currently Software might be optimized more Better control over validity checking prev next taskstruct pointers Linux TSS threadstruct Base registers floatingpoint debug etc lO permissions bitmap Intel feature to allow userland access to io ports ioperm iopl Intelspecific Process Switching switchto Invoked by schedule Very Intelspecific mostly assembly code GCC magic makes fortough reading Some highlights Save basic registers Switch to kernel stack of next Save fp registers if necessary Unlock TSS Load ldtr cr3 paging Load debug registers breakpoints Return Process Switching FP Registers This is pretty weird Pentium on chip FPU Backwards compatibility ESCAPE prefix Not saved by default MMX Instructions use FPU FP registers Saved quoton demandquot reload quotwhen neededquot lazily TS Flag set on context switch FP instructions cause exception device unavailable Kernel intervenes by loading FP regs clearing TS unlazyfpu mathstateretstore Creating Processes clone fork vfork Fork duplicates most parent resources Exec tears down old address space and installs a new one corresponding to process image on disk Most forks are part of a forkexec sequence Wasteful to copy resources that are then ovenvritten Old solution hack vfork Parentchild share parent blocks until child ends New solution COW copyonwrite Share rw pages ro until write fault then copy Linux solution clone Specify what resources to shareduplicate CLONEVM FS FILES SIGHAND PID Processes vs Threads Traditional processes big quotheavyquot Lot39s of data takes time to startup Newer idea threads small quotlightweightquot Share address space files sockets etc Basically context stack Usually quotuserlevelquot many per process But kernel can benefit from threads as well Kernel threads in the kernel address space Benefits of Threading Lightweight context switch blocking Increases CPU quantum utilization Logically structures concurrent activities Avoids uglier quotasyncquot code eg sig handlers Possible to exploit parallelism SMP If you have kernel threads Kernel doesn39t quotknow aboutquot user threads Kernel threads are quotunit of schedulingquot Linux Processes or Threads Linux uses a neutral term tasks Traditional view Threads exist quotinsidequot processes Linux view Threads processes that share address space Linux quotthreadsquot tasks are really quotkernel threadsquot Thread Models Manytoone Userlevel threads kernel doesn39t know about them Onetoone Linux standard model each userlevel thread corresponds to a kernel thread Manytomany mton m gt n Solaris Next Generation POSIX Threads Large number of user threads corresponds to a smaller number of kernel threads More flexible better CPU utilization clone fork is implemented as a wrapper around clone with specific parameters conefp data flags stack means quotdon t call this directlyquot fp is thread start function data is params flags is or of CLONE flags stack is address of user stack clone calls dofork to do the work dofork Highlights alloctaskstruct Copy current into new findemptyprocess geLpidO Update ancestry Copy components based on flags copythread Link into task list update nrtasks Set TASKRUNNNG wakeupprocess Kernel threads Linux has a small number of kernel threads that run continuously in the kernel daemons No user address space only kernel mapped Creating kernelthread Process 0 idle process Process 1 Spawns several kernel threads before transitioning to user mode as sbininit kflushd bdflush Flush dirty buffers to disk under quotmemory pressurequot kupdate Periodically flushes old buffers to disk kswapd Swapping daemon kpiod No longer used in 24 Timing and Timers Accurate timing crucial for many aspects of OS Devicerelated timeouts File timestamps created accessed written Timeofday gettimeofday Highprecision timers code profiling etc Scheduling cpu usage etc Intel timer hardware RTC Real Time Clock PIT Programmable Interrupt Timer TSC TimeStamp Counter cycle counter Local APIC Timer percpu alarms Timer implementations Kernel timers dynamic timers User interval timers alarm setitimer Hardware Clocks Intel RTC battery backed packaged with CMOS RAM registers to access current datetime ports 0x70 0x71 includes programmable timer 28192Hz accessible as devrtc sampled by kernel only on startup set by clock command synched at shutdown TSC aka MSR microprocessor specific register 64 bit counter increments at CPU cycle speed accessible via userspace assembly instruction rdtsc provides highresolution timing capability kernel determines frequency at boot calibratetsc Hardware Clocks Intel PIT heartbeat timer drives timer interrupt tick 100 Hz on PC 1024 Hz on fast chips alpha itanium patches to change clock speed via proc jiffies of ticks since boot xtime struct with secs usecs since Jan 1 1970 epoch CPU Local APIC Timers when available does percpu timing eg quantum if not available driven by PIT 32 bit instead of PIT 16 bit so lower frequency possible decrements in multiples of bus cycles 1 2 4 8 128 Uniprocessor Architecture sample RTC on startup to init xtime timerinterrupt IRQO interrupt handler executes with interrupts disabled if TSC sample TSC lasttsclow sample PIT register to determine usecs since interrupt delaysincelastinterrupt call dotimerinterrupt dotimerinterrupt calls dotimer updates kernel profiling info if activated performs gradual clock adjustment adjustex Uniprocessor Architecture dotimer updatejiffies jiffies wraps after 467 days but kernel handles it check for quantum expiration mark TIMERBH for execution timerbh executes with interrupts enabled possibly delayed for several tophalf execution BHs execute only when last stacked interrupt completes updatetimes xtime load average runtimerlist dynamic timers Multiprocessor Architecture some percpu operations update quantum check for cpu resource limit kernel profiling use local APIC timer calibrateAPlCtimer BUT don t execute all at the same time stagger during interval also avoid PIT handler smplocaltimerinterru pt CPU Time Sharing Slicing currentgtcounter ticks remaining in current quantum initialized by default to 21 210 ms 15 second updateprocesstimes executed by PIT or APIC timer if current gtpid current gtcounter if current gtcounter lt 0 current gtcounter O current gtneedresched l why outer if Why lt 0quot Updating DateTime xtimetvsec xtimetvusec seconds since Jan 1 1970 updatetimes wajiffies time of last xtime update updatewalltimeticks handles usec wrap cacoadticks load average void updatetimesvoid unsigned long ticks writelockirqampxtimelock ticks jiffies walljiffies if ticks jiffies walljiffies updatewalltimeticks writeunlockirqampxtimelock calcloadticks Updating System Statistics checking cpu resource limits update user and kernel mode ticks for times percpuutime percpustime over cpu limit send SIGXCPU SIGKILL updating system load averages average tasks in run queue last 1 5 15 minutes includes UNINTERRUPTIBLE but not pid 0 less than 10 is good on uniprocessor kernel profiling samples eip on each interrupt activated by kernel option profile results exported via procprofile readprofile command NMI watchdogs detecting system freeze clever use of APIC to detect freezes failure to reenable interrupts broadcast NMI periodically check for increasing interrupt count Software Timers kernel requires lots of highperformance timers particularly networking code simple implementation linkedlist sorted by expiry in jiffies look at head of list on each timer interrupt too expensive for longish lists cau on handlers executed at some point AFTER expiry can t really bound delay up to 1003 of milliseconds not suitable for realtime processing Dynamic Timers struct timerlist linkage expires jiffies function handler data int or pointer each timer linked into one of 512 doubly linked lists various functions and macros addtimer modtimer deltimer synch versions to avoid SMP problems Dynamic Timers Race Conditions on uniprocessor if timer touches resource remove timer before disposing of resource timer could execute during after disposal on multiprocessor timer may be running on another CPU remove from list wait until no longer executing then return similar problems with updating expiration field Dynamic Timers Handling very clever data structure avoid Oog N heap insert priority queue struct timervec tv1 array of linked lists 0255 current list index list of timers thatwill expire in next 256 ticks increment index on each tick execute handlers what happens when index reaches end of array replenish with timers expiring in next 256 seconds these are maintained in the NEXT vector tv2 each element timers that expire in next 256 ticks includes it s own current index one global array of pointers to five vectors of lists tv1 replenish every 25 seconds tv2 every 3 minutes tv3 every 3 hours tv4 every 7 daysl tv5 never replenished Dynamic Timer Example sleep for 2 seconds in kernel setcurrentstateTASKNTERRUPTIBLE remaing scheduletimeout2 HZ scheduletimeout struct timerist timer expire timeout jiffies timerexpires expire timerdata unsigned long current timerfunction processtimeout addtimeramptimer schedule detimersyncamptimer timeout expire jiffies return timeout lt O O timeout void processtimeoutunsigned long data struct taskstruct p struct taskstruct data wakeupprocessp System Calls gettimeofday sec usec xtime plus delay since last bottom half xtime update delay since last interrupt jiffies update samples TSC if available for highprecision settimeofday update xtime not RTC requires root adjtimex gradual clock time change part of NTP implementation gradually spread out drift adjustments alarm setitimer user mode interval timers three different timers CS 3210 Spring 2005 Process Schedu ng Process Scheduling o scheduler who gets to run next o scheduler policy vs dispatcher mechanism 0 process abstraction scheduler vin uaizes CPU 0 other quotschedulersquot also exist in kernel o io schedulers packet scheduler disk scheduler o quotprotoOS switched between apps while waiting for io o goal maximize CPU utilization a lots of theory but o realworld schedulers need to be general purpose o must meet many competing demands job mixes o throughput responsiveness priorities no starvation CS 3210 Operating System Design 2 Hutto College of Computing Georgia Tech O 0 O 0 Process Scheduling 52 o specialized domain realtime scheduling o hard realtime people die if deadline not met o soft realtime try really hard to meet deadline o example playing audio video o hard realtime is very very difficult o apps must predeclare resource needs 0 admission control say no if necessary 0 UNIX keeps accepting jobs 0 degrades performance but o tries to be relatively fair spread the pain o must eliminate lots of standard 08 stuff like disk o modern UNIXLinux support soft realtime CS 3210 Operating System Design 3 Hutto College of Computing Georgia Tech Process Scheduling o preemption o higher priority process takes CPU from lower priority even if time slice has not yet expired 0 timesharing each process given a time slice or quantum once scheduled run until 0 blocking o termination o slice expires check at timer interrupt 0 higher priority process becomes runnable preemption select another task to run context switch CS 3210 Operating System Design 4 Hutto College of Computing Georgia Tech Process Priorities and 5555 Fairness 0 priority and fairness are antagonistic a Linux provides priorities but still tries to be fair 0 general rules 0 decrease priority of processes that run a lot o increase priority of processes that sleep a lot 0 avoid starvation aging o boost quotinteractivequot processes CS 3210 Operating System Design 5 Hutto College of Computing Georgia Tech Process Characterization 0 one way to look at things 0 io bound spends lots of time waiting for io o cpu bound little io quotnumber crunching 0 another way to look at things 0 batch c interactive c realtime o interactive processes tend to be io bound a batch processes tend to be cpu bound CS 3210 Operating System Design 6 Hutto College of Computing Georgia Tech O 0 O 0 Scheduling System Calls 35 0 nice o lower priority be nice to other people 0 root can increase 0 niceness is opposite of priority o increasing nice decrease priority o decreasing nice increases priority 0 getsetpriority o schedgetsetscheduler o schedgetsetparam o schedyied CS 3210 Operating System Design 7 Hutto College of Computing Georgia Tech Quantum Length 0 how long similar to timer interrupt frequency 0 too short too much time spent switching o like quotthrashingquot spending too much time swapping 0 too long processes feel unresponsive 0 you run longer but you also wait longer a priority interrupts also increase response 0 24 uses about 10 ticks 105 ms CS 3210 Operating System Design 8 Hutto College of Computing Georgia Tech Scheduling Algorithm 0 O Q 0 o scheduling epoch o can39tjust replenish quantum on expiration 0 lower priority processes would never get to run a only replenish when ALL quanta expired 0 base quantumpriority DEFCOUNTER o altered by nice setpriority 0 dynamic priority base remaining quantum o priority decreases towards end of quantum CS 3210 Operating System Design 9 Hutto College of Computing Georgia Tech Realtime Processes 0 O Q 0 0 higher priority than quotregularquot processes 0 can be created only by root 0 can disrupt system if poorly programmed 0 Windows has three levels 0 regular system realtime o realtime processes have priority too 199 0 so called quotstatic priority though can be changed o used to differentiate rt processes 0 always higher than regular process priorities CS 3210 Operating System Design 10 Hutto College of Computing Georgia Tech Scheduling Related taskstruct Fields 0 O Q 0 nice base quantum 20 to 19 counter ticks remaining in quantum rtpriority 0 regular to 99 policy regular or realtime o SCHEDOTHER regular 0 SCHEDFIFO realtime quotrun to completionquot 0 SCHEDRR realtime quantum but no epoch needresched o processor cpusaowed cpusrunnable o aligneddata perprocessor struct 0 cu rr taskstruct of current 0 lastschedule TSC value of last process switch CS 3210 Operating System Design 11 Hutto College of Computing Georgia Tech Law of Conservation of Quantum Avoiding Fork Bombs On fork split parent quantum with child 0 Avoids quotstealingquot cycles during the current epoch o Quanta of all processes replenished at epoch CS 3210 Operating System Design 12 Hutto College of Computing Georgia Tech Invoking schedule a Direct invocation o when a process blocks in the kernel 0 sometime by device drivers to yield CPU o Lazyinvoca on needresched bit checked on transition from kernel to user on quantum expiration for preemption wakeupcreation of higher priority on schedsetscheduler or schedyield CS 3210 Operating System Design 13 Hutto College of Computing Georgia Tech schedule preliminaries 0 goal 0 set next to taskstruct of next process a call switchtoprev next prev 0 acquire spinlock disable interrupts to examine runqueue 0 special case for exhausted SCHEDRR process 0 update state and needresched CS 3210 Operating System Design 14 Hutto College of Computing Georgia Tech schedue find max goodness o initialize next to idle just in case 0 iterate through runqueue o remember task with max goodness o skip tasks running on other processors o skips tasks disallowed from current processor 0 goodness returns 0 if quantum expired 0 epoch if all quantum expired 0 1000 CS 3210 Operating System Design 15 Hutto College of Computing Georgia Tech schedule epoch o iterate through tasklist not runqueue o replenish quantum if expired 0 add to nonzero quanta of blocked processes a io boost bounded CS 3210 Operating System Design 16 Hutto College of Computing Georgia Tech schedule after epoch 0 prev next special case 0 update some kernel statistics 0 address space switch for kernel threads 0 invoke switchto a code after switchto executed when prev is rescheduled again in the future o uniprocessor not much o multiprocessor try to reschedule prev CS 3210 Operating System Design 17 Hutto College of Computing Georgia Tech good ness 0 higher is better a for regular processes base counter o advantage if previously on same CPU o advantage if thread in same address space 0 for realtime 1000 rtpriority CS 3210 Operating System Design 18 Hutto College of Computing Georgia Tech rescheduleide 3 0 called in quottailquot of schedule on MP 0 tries to reschedule prev on another CPU o is last processor prev ran on idle o what39s the quotoldest idle processor o is there a process with lower priority running 0 uses lPl to initiate reschedule CS 3210 Operating System Design 19 Hutto College of Computing Georgia Tech CS 3210 Spring 2006 Page Cache Page Cache an o Fundamental disk cache o critical to system performance a included since earliest UNIX 0 Two functions a cache disk reads a defer disk writes o Unified systemwide cache CS 3210 Operating System Design 2 Hutto College of Computing Georgia Tech Disk Cache History 0 Originally block cache preVM 0 aka quotbuffer cache 0 Then page granularity cache for VM 0 Then unified page cache 2410 CS 3210 Operating System Design Hutto College of Computing Georgia Tech What39s in the Page Cache 555539 o quotPagesquot of regular file data 0 including memorymapped ro executables o quotPagesquot of directory file data 0 Pages read from block device files 0 Swap pages a Shared memory pages 0 Blocks containing inodes and super blocks CS 3210 Operating System Design 4 Hutto College of Computing Georgia Tech What39s a File quotPagequot 5553 0 Let39s say page size 4K block size 1K 0 File quotpagesquot are logical chunks of the file 0 starting at byte positions 0 4K 8K 12K 0 But files are composed of possibly discontiguous disk blocks 1 K chunks 0 So a file page is composed in this case of 4 logically related disk blocks that may be physically scattered across the disk CS 3210 Operating System Design 5 Hutto College of Computing Georgia Tech What is a Block Device Page as l 0 Block device files represent quotrawquot devices a eg cat devhda 0 reads device bits independent of disk structure 0 eg ignores any lesystem format a just returns consecutive blocks 0 So if page size 4K block size 1K o block device quotpagequot is just 4 contiguous blocks CS 3210 Operating System Design 6 Hutto College of Computing Georgia Tech What39s Not in the Page Cache an 0 Pages without any sort of disk backing o eg anonymous pages stack heap 0 Pages involved in quotdirect ioquot ODRECT o used by databases etc to bypass cache CS 3210 Operating System Design 7 Hutto College of Computing Georgia Tech Page Cache Operations an 0 Add remove pages descriptors o read page from disk into frame if necessary o write page from memory to disk when dirty o Quickly determine if a page is in the cache 0 Perform pagetype specific operations a eg read from swap vs read from regular file CS 3210 Operating System Design 8 Hutto College of Computing Georgia Tech Page Cache Structure an 0 Every cached page has an owning inode o owning inode quotbackingquot file or partition 0 Key data structure addressspace 0 unique addressspace page offset o inode ltgt addressspace ltgt page descriptors 0 Remember memmap o array of page descriptors a mapping gt addressspace o index logical page offset CS 3210 Operating System Design 9 Hutto College of Computing Georgia Tech addressspace an a host owning inode o pagetree root of radix tree of pages 0 nrpages o aops address space ops o backingdevinfo block device 0 immap o for memory mapping and a quotreverse mappingquot find everyone sharing a page CS 3210 Operating System Design 10 Hutto College of Computing Georgia Tech addressspace Details an o addressspace is actually embedded in inode o idata field 0 Normally imapping points to idata 0 Block device file inodes are different 0 different device files may reference same device a bdev special filesystem 0 contains quotmasterquot inode for each device a imapping of device file points to master idata CS 3210 Operating System Design 11 Hutto College of Computing Georgia Tech Address Space Ops an o writepage readpage writepages readpages o syncpage setpagedirty o preparewrite commitwrite o bmap o invalidatepage releasepage o directO CS 3210 Operating System Design 12 Hutto College of Computing Georgia Tech Radix Trees an 0 Each addressspace object has its own radix tree for fast acess to a potentially large number of pages 0 think of a terabytesized file o Pages are uniquely defined by page offset within a particular addressspace object o Radix tree multiway search structure for finding page descriptor given page offset 0 64 way branching at each node 6 bits 0 Possible to index up to 16 terabytes in six levels CS 3210 Operating System Design 13 Hutto College of Computing Georgia Tech How Do Radix Trees Work as l o Radix tree nodes contain 64 pointers o to other radix tree nodes or page descriptors o For files lt 64 pages just one node a use rightmost 6 bits to index into node array 0 For files lt 4096 pages two levels 0 consider rightmost 12 bits upper 6 lower 6 0 use upper 6 bits to index into toplevel node 0 use lower 6 bits to index into 2ndIevel node 0 Expand tree when necessary as file grows CS 3210 Operating System Design 14 Hutto College of Computing Georgia Tech Page Cache Functions as l o findgetpageaddressspace offset 0 radixtreeookup o addpagecachepagedescriptor addressspace offset a removefrompagecachepagedescriptor o readcachepageaddressspace offset 0 update page CS 3210 Operating System Design 15 Hutto College of Computing Georgia Tech Radix Tree Tags an o How to efficiently find all dirty pages 0 Page descriptors have dirty bits 0 Include an array of dirty bits in tree nodes 0 Set tree node bit dirty if ANY page in subtree is dirty CS 3210 Operating System Design 16 Hutto College of Computing Georgia Tech Blocks in the Page Cache an 0 Most disk access these days is pagebased 0 Some things still require block access 0 superblock o inode blocks 0 How to track both blocks and pages 0 buffer pages page of contiguous disk blocks o 4K pages 1K blocks read 4 contiguous blocks CS 3210 Operating System Design 17 Hutto College of Computing Georgia Tech Blocks Buffers Buffer Heads 0 buffer memory containing a disk block 0 buffer head buffer descriptor o bstate bcount bsize o bblocknr LBN bdata memory bbdev o bendio io completion method o bpage pointer to buffer page descriptor 0 flags BH o Uptodate Dirty Lock Mapped Ordered o bhcachep buffer head slab cache allocator CS 3210 Operating System Design 18 Hutto College of Computing Georgia Tech Block Device Buffer Pages an 0 Two types of buffer pages actually o discontiguous file pages Ch 18 Accessing Files 0 pages containing blocks from block devices 0 aka quotblockdev pagesquot 0 Linkages Fig 152 0 page descriptor gt circular list of buffer heads o buffer head gt buffer in page page descriptor o Remember higher structure 0 superblocks gt inode lists 0 inodes gt address space object o addressspace object gt page descriptors via radix tree CS 3210 Operating System Design 19 Hutto College of Computing Georgia Tech Managing Buffer Pages an o Allocation o kernel subsystems just ask page cache for block o if not present new page is allocated filled o growbuffers growdevpage o allocpagebuffers initpagebuffers o Deallocation o released under memory pressure o trytofreebuffers trytoreleasepage CS 3210 Operating System Design 20 Hutto College of Computing Georgia Tech Finding Blocks in Page Cache l 0 blocks identified by bdev block a get addressspace from bdev o bdevgtbdinodegtimapping o get block size compute page containing block 0 bdevgtbdblocksize 0 search in radix tree for block device by page 0 actually o bhrus array small percpu LRU block cache 0 sorted by access time 0 is most recent CS 3210 Operating System Design 21 Hutto College of Computing Georgia Tech 323 33 Finding Blocks 0 Three different functions o bh findgetbockbdev nr bsize 0 find or fail returns NULL o bh getblkbdev nr bsize 0 always succeeds o returns bh but data need not be valid o used for allocation o bh breadbdevnr bsize o like getblk but reads disk data as necessary cs 3210 Operating System Design 22 Hutto College of Computing Georgia Tech Generic Block Layer Interface 5 o submitbhbh direction o allocates io request descriptor o enqueues request 0 rwbockbh nblocks direction a multiple block transfers CS 3210 Operating System Design 23 Hutto College of Computing Georgia Tech Writing Pages to Disk an 0 Three reasons 0 memory pressure 0 old dirty pages 0 explicit request sync 0 Optimization o only dirty blocks in dirty buffer pages written CS 3210 Operating System Design 24 Hutto College of Computing Georgia Tech pdflush Kernel Threads an o 24 two daemons o bdflush memory pressure 0 kupdate old pages 0 26 dynamic pool of pdflush daemons o multiple threads allow parallel io o grows and shrinks based on idleness o one thread assigned different tasks o pdflushwork descriptor o backgroundwriteout memory pressure o wbkupdate old pages CS 3210 Operating System Design 25 Hutto College of Computing Georgia Tech 0 3 Looking for Dirty Pages an O Q o scan radix trees of all addressspace objects o do a little at a time then yield 0 avoid disk io storms 0 free a specific of dirty pages c or reach a threshold for fraction of dirty pages o procsysvmdirtyratio usually 40 a how to find addressspace objects 0 scan superblocks o scan inode lists in superblocks o inodes point to addressspace objects o addressspace objects point to page descriptors cs 3210 Operating System Design 26 Hutto College of Computing Georgia Tech Flushing Dirty Old Pages a timer based activation of pdflush o procsysvmdirtywritebackcentisecs o usually 500 centisecs 5 secs o continue until a fixed have been written a or until all dirty old pages have been flushed o syncs a superblocks on each activation a flush a dirty pages older than 30 seconds CS 3210 Operating System Design 27 Hutto College of Computing Georgia Tech CS 3210 Fall 2003 Virtual FileSystem VFS Ets 333 File Systems 3 O 0 o old days quotthequot filesystem 0 now many filesystem types many instances 0 need to copy file from NTFS to Ext3 0 original motivation NFS support Sun 0 idea filesystem op abstraction layer VFS o Virtual File System aka Virtual Filesystem Switch o filerelated ops determine filesystem type 0 dispatch via function pointers filesystemspecific op cs 3210 Operating System Design 2 Hutto College of Computing Georgia Tech O 0 File System Types 3 I I 0 lots and lots of filesystem types 0 26 has nearly 100 in the standard kernel tree 0 examples a standard ufs Solaris svfs SysV ffs Berkeley 0 network RFS NFS Andrew Coda Samba Novell o journaling Ext3 Veritas ReiserFS XFS JFS o mediaspecific jffs ISOQ66O cd UDF dvd 0 special proc tmpfs sockfs etc o proprietary o MSDOS VFAT NTFS Mac Amiga etc o new generation for Linux 0 Ext3 ReiserFS XFS JFS CS 3210 Operating System Design 3 Hutto College of Computing Georgia Tech O 0 333 Common File Model o standard api basically UNIX file semantics 0 doesn39t fit perfectly with NT etc 0 example directory is a file with specific structure o not true for some filesystems MSDOS etc 0 File Allocation Table FAT o VFS layerjust dispatches to fsspecific functions 0 libc read gt sysread 0 what type of filesystem does this file belong to 0 call filesystem fs specific read function 0 maintained in open file object file 0 example filegtfopgtread o similar to device abstraction model in UNIX CS 3210 Operating System Design 4 Hutto College of Computing Georgia Tech 0 333 VFS System Calls 55 0 fundamental UNIX abstractions 0 files everything is a file 0 ex devttySO device as a file o ex proc123 process as a file 0 processes 0 users 0 lots of syscalls related to files 100 0 most dispatch to filesystemspecific calls o some require no filesystem action o example Iseekpos change position in file o others have default VFS implementations cs 3210 Operating System Design 5 Hutto College of Computing Georgia Tech VFS System Calls cont o familiarize yourself with API Table 121 page 377 filesystem ops mounting info flushing chroot pivotroot directory ops chdir getcwd link unlink rename symlink file ops openclose readwrite stat permissions seek o chmod chown stat creat umask dup fcntl truncate o readwrite readvwritev preadpwrite NFS memory mapping files mmap munmap madvise mlock wait for input poll select flushing synch fsync msync fdatasync file locking flock CS 3210 Operating System Design 6 Hutto College of Computing Georgia Tech O 0 3333 VFSrelated Task Fields o taskstruct fields 0 fs includes root pwd o pointers to dentries o files includes file descriptor array fd 0 pointers to open file objects CS 3210 Operating System Design 7 Hutto College of Computing Georgia Tech 0000 0000 333 Big Four Data Structures 55 0 one open file object 0 information about an open file 0 includes current position file pointer 0 two dentry 0 information about a directory entry 0 includes name inode 0 three inode 0 unique descriptor of a file or directory 0 contains permissions timestamps block map data 0 inode integer unique per mounted filesystem 0 four superblock o descriptor of a mounted filesystem o ok one more filesystem type c pointer to implementing module 0 including how to read a superblock CS 3210 Operating System Design 8 Hutto College of Computing Georgia Tech Data Structure Relationships 3 inode cache fds open file object open file object open file object superblo ck dentry cache CS 3210 Operating System Design 9 Hutto College of Computing Georgia Tech Sharing Data Structures 3 o calling dup o shares open file objects 0 example 2gtamp1 0 opening the same file twice o shares dentries 0 opening same file via different hard links o shares inodes o mounting same filesystem on different dirs o shares superblocks CS 3210 Operating System Design 10 Hutto College of Computing Georgia Tech Superblock 3 o mounted filesystem descriptor 0 usually first block on disk after boot block a copied into similar memory structure on mount 0 distinction disk superblock vs memory superblock 0 dirty bit sdirt copied to disk frequently 0 important fields sdev sbdev device devicedriver sblocksize smaxbytes stype sflags smagic scount sroot sdquot sdirty dirty inodes for this filesystem sop superblock operations u filesystem specific data CS 3210 Operating System Design 11 Hutto College of Computing Georgia Tech
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'