COMP ORG AND DESIGN
COMP ORG AND DESIGN CIS 371
Popular in Course
Popular in Computer & Information Science
This 25 page Class Notes was uploaded by Jackson Will on Monday September 28, 2015. The Class Notes belongs to CIS 371 at University of Pennsylvania taught by A. Roth in Fall. Since its upload, it has received 7 views. For similar materials see /class/215416/cis-371-university-of-pennsylvania in Computer & Information Science at University of Pennsylvania.
Reviews for COMP ORG AND DESIGN
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 09/28/15
CIS 371 Computer Organization and Design Unit 10 Virtual Memory amp IO CIS 371 RomMartin Virtual Memory ampI0 This Unit Virtualization amp IO 0 The operating system OS SYStem 5 Ware o A superapplication 0 Hardware support for an OS Mem CPU IIO 0 Virtual memory 0 Page tables and address translation 0 TLBs and memory hierarchy issues 0 IO 0 Magnetic storage disks 0 Solidstate storage flash memory CIS 371 RomMartin Virtual Memory ampI0 Rest of the Semester Topics yet to cover 0 Virtual memory 0 DisksampIO 0 Power amp energy 0 Reliability 0 Datalevel parallelism SIMD Vectors and GPUs 0 XBOX 360 case study Perennial problem 0 Only three lectures remaining 0 By design we re actually on schedule 0 Will try to hit the highlights give some context 0 Try not to be overwhelmed CIS 371 RothMartin Virtual Memory ampI0 Readings o PampH 0 Virtual Memory 54 0 IO amp Disks 6165 68 CIS 371 RothMartin Virtual Memory ampI0 A Computer System Hardware 0 CPUs and memories 0 Connected by memory bus 0 IO peripherals storage input display network 0 With separate or builtin DMA 0 Connected by system bus which is connected to memory bus C15 371 RomMartin Virtual Memory 8110 5 A Computer System OS 0 Operating System OS virtualizes hardware for apps 0 Abstraction provides services eg threads files etc Simplifies app programming model raw hardware is nasty 0 Isolation gives each app illusion of private CPU memory 10 Simplifies app programming model Increases hardware resource utilization Memory bus System lO bus bndge CPU opu C15 371 RomMartin Virtual Memory 8110 A Computer System App Software 0 Application software computer must do something 39A lic atio n s39ofwa re Memory bus System lO bus bndge CPU opu C15 371 RomMartin Virtual Memory 8110 Operating System OS and User Apps o Sane system development requires a split 0 Hardware itself facilitatesenforces this split 0 Operating System OS a superprivileged process 0 Manages hardware resource allocationrevocation for all processes 0 Has direct access to resource allocation features 0 Aware of many nasty hardware details 0 Aware of other processes 0 Talks directly to inputoutput devices device driver software 0 Userlevel apps ignorance is bliss o Unaware of most nasty hardware details 0 Unaware of other apps and OS 0 Egtltplicitly denied access to resource allocation features 215 371 RomMartin Virtual Memory 8110 8 System Calls 0 Controlled transfers tofrom OS 0 System Call a userlevel app function call to OS 0 Leave description of what you want done in registers o SYSCALL instruction also called TRAP or INT 0 Can t allow userlevel apps to invoke arbitrary OS code 0 Restricted set of legal OS addresses to jump to trap vector 0 Processor jumps to OS using trap vector 0 Sets privileged mode 0 OS performs operation 0 OS does a return from system call 0 Unsets privileged mode C15 371 RomMartin Virtual Memory 8110 9 Typical IO Device Interface 0 Operating system talks to the IO device 0 Send commands query status etc 0 Software uses special uncached loadstore operations 0 Hardware sends these readswrites across IO bus to device 0 Direct Memory Access DMA o For big transfers the IO device accesses the memory directly 0 Example DMA used to transfer an entire block tofrom disk 0 Interruptdriven IO 0 The 10 device tells the software its transfer is complete 0 Tells the hardware to raise an interrupt door bell 0 Processor jumps into the OS 0 Inefficient alternative polling C15 371 RomMartin Virtual Memory 8110 11 Interrupts o Exceptions synchronous generated by running app 0 Eg illegal insn divide by zero etc o Interrupts asynchronous events generated externally o Eg timer IO requestreply etc o Interrupt handling same mechanism for both 0 Interrupts are onchip signalsbits 0 Either internal eg timer exceptions or connected to pins Processor continuously monitors interrupt status when one is high Hardware jumps to some preset address in OS code interrupt vector 0 Like an asynchronous nonprogrammatic SYSCALL o Tlmer programmable onchip interrupt o Initialize with some number of microseconds o Timer counts down and interrupts when reaches 0 C15 371 RomMartin Virtual Memory 8110 10 Virtualizing Processors 0 How do multiple apps and OS share the processors 0 Goal applications think there are an infinite of processors 0 Solution timeshare the resource 0 Trigger a context switch at a regular interval 1ms o Preemptive app doesn t yield CPU OS forcibly takes it Stops greedy apps from starving others 0 Architected state PC registers 0 Save and restore them on context switches 0 Memory state 0 Nonarchitected state caches branch predictor tables etc o Ignore or flush 0 Operating responsible to handle context switching 0 Hardware support is just a timer interrupt C15 371 RomMartin Virtual Memory 8110 12 Virtualizing Main Memory 0 How do multiple apps and the OS share main memory 0 Goal each application thinks it has infinite memory 0 One app may want more memory than is in the system c App s insndata footprint may be larger than main memory 0 Requires main memory to act like a cache 0 With disk as next level in memory hierarchy slow 0 Writeback writeallocate large blocks or pages o No notion of program not fitting in registers or caches why 0 Solution 0 Part 1 treat memory as a cache 0 Store the overflowed blocks in swap space on disk 0 Part 2 add a level of indirection address translation C15 371 RomMartin Virtual Memory ampIO 13 Virtual Memory VM 0 Virtual Memory VM Level of indirection like register renaming Application generated addresses are virtual addresses VAs 0 Each process thinks it has its own 2 bytes of address space Memory accessed using physical addresses PAs VAs translated to PAs at some coarse granularity 0 OS controls VA to PA mapping for itself and all other processes 0 Logically translation performed before every insn fetch load store 0 Physically hardware acceleration removes translation overhead OS App l App2 I II ll 08 controlled VAaPA mappings PAS physical memory 215 371 RomMartin Virtual Memory ampIO 15 Virtual Memory VM Program 0 Programs use virtual addresses VA code heap stack 0 02N 1 0 VA size also referred to as machine size 0 Eg 32bit embedded or 64bit server 0 Memory uses physical addresses PA 0 02quotquot 1 typically MltN especially if N64 0 2M is most physical memory machine supports 0 A gtPA at page granularity VP gtPP o By system 0 Mapping need not preserve contiguity 0 VP need not be mapped to any PP o Unmapped VPs live on disk swap C15 371 RothMartin Virtual Memory ampIO 14 VM is an Old Idea Older than Caches 0 Original motivation singleprogram compatibility 0 IBM System 370 a family of computers with one software suite Same program could run on machines with different memory sizes Prior programmers egtltplicitly accounted for memory size 0 But also fullassociativity software replacement 0 Memory tmES is high extremely important to reduce miss Parameter L2 Zns 10ns 10ns 30ns 8 64 KB 128 KB ZMB Block size 16 32B 32 256B N M RU N MRU C15 371 RothMartin Virtual Memory ampIO 16 Uses of Virtual Memory o More recently isolation and multiprogramming 0 Each app thinks it has 2N B of memory its stack starts 0gtltFFFFFFFF 0 Apps prevented from readingwriting each other s memory 0 Can t even address the other program s memory 0 Protection 0 Each page with a readwriteexecute permission set by OS 0 Enforced by hardware 0 Interprocess communication 0 Map same physical pages into multiple virtual address spaces 0 Or share files via the UNIX mmap call 08 App l App2 I C15 371 RomMartin Virtual Memory 8110 17 Address Translation Virtual Memory The Basics virtual address310 POFS 150 donttouch POFS 150 physical address250 o VA gtPA mapping called address translation 0 Split VA into virtual page number VPN amp page offset POFS 0 Translate VPN into physical page number PPN o POFS is not translated VAaPA VPN POFS a PPN POFS 0 Example above 0 64KB pages a 16bit POFS 0 32bit machine a 32bit VA a 16bit VPN 0 Maximum 256MB memory a 28bit PA a 12bit PPN C15 371 RomMartin Virtual Memory 8110 19 0 Programs use virtual addresses WA 0 VA size N aka machine size eg Core 2 Duo 48bit 0 Memory uses physical addresses PA 0 PA size M typically MltN especially if N64 0 2quotquot is most physical memory machine supports 0 VA gtPA at page granularity VP gtPP 0 Mapping need not preserve contiguity 0 VP need not be mapped to any PP o Unmapped VPs live on disk swap or nowhere if not yet touched OS App l App2 II lllll C15 371 RomMartin Virtual Memory 8110 18 Address Translation Mechanics I o How are addresses translated o In software for now but with hardware acceleration a little later 0 Each process allocated a page table PT 0 Software data structure constructed by OS 0 Maps VPs to PPs or to disk swap addresses 0 VP entries empty if page never referenced 0 Translation is table lookup PT struct int ppn int isvalid isdirty isswapped PTE struct PTE pagetab1eNUMVIRTUALPAGES int translateint vpn 39 pagetab1evpnisvalid return pagetab1evpnppn C15 371 RomMartin Virtual Memory 8110 Page Table Size 0 How big is a page table on the following machine 0 32bit machine 0 4B page table entries PTEs o 4KB pages 0 32bit machine a 32bit VA a 4GB virtual memory 0 4GB virtual memory 4KB page size a 1M VPs 0 1M VPs 4B PTE a 4MB How big would the page table be with 64KB pages How big would it be for a 64 bit machine Page tables can get big c There are ways of making them smaller C15 371 RomMartin Virtual Memory ampI0 MultiLevel Page Table PT 0 20bit VPN 0 Upper 10 bits indegtlt 1st level table 0 Lower 10 bits indegtlt 2ndlevel table 2ndlevel PTEs pt root struct int ppn int isvalid isdirty isswapped struct struct PTE ptes1024 LZPT struct LZPT pagetable1024 int translateint vpn indexl vpn gtgt 10 upper 10 bits index2 vpn amp exaff lower 10 bits struct L2PT 12pt pagetableindex1 if 12pt 1 NULL 5 12ptgtptesindex2 isvalid return 12ptgtptesindex2 ppn 23 MultiLevel Page Table PT 0 One way multilevel page tables 0 Tree of page tables 0 Lowestlevel tables hold PTEs o Upperlevel tables hold pointers to lowerlevel tables 0 Different parts of VPN used to index different levels Example twolevel page table for machine on last slide 0 Compute number of pages needed for lowestlevel PTEs o 4KB pages 4B PTEs a 1K PTEspage 0 1M PTEs 1K PTEspage a 1K pages 0 Compute number of pages needed for upperlevel pointers 0 1K lowestlevel pages a 1K pointers 0 1K pointers 32bit VA a 4KB a 1 upper level page 215 371 RomMartin Virtual Memory ampI0 MultiLevel Page Table PT 0 Have we saved any space 0 Isn t total size of 2nd level tables same as singlelevel table ie 4MB 0 Yes but 0 Large virtual address regions unused 0 Corresponding 2ndlevel tables need not exist 0 Corresponding 1stlevel pointers are null 0 Example 2MB code 64KB stack 16MB heap 0 Each 2ndlevel table maps 4MB of virtual addresses 0 1 for code 1 for stack 4 for heap 1 1st level o 7 total pages 28KB much less than 4MB C15 371 RomMartin Virtual Memory ampI0 A Computer System IO Subsystem 0 IO subsystem kind of boring kind of important 0 IO devices storage input display network 0 IO bus 0 Software 0 Virtualized by OS 0 Device drivers 0 Presents synchronous interface for asynchronous devices C15 371 RomMartin Virtual Memory ampIO 33 One Instance of IO Swap Space 0 Have brie y seen one instance of IO 0 Disk bottom of memory hierarchy 0 Use as swap space l W 0 Exploits large capacity of disks 0 Cheaper per bit than DRAM CPU o Other use of disk 0 Nonvolatile storage Retains state when power is off C15 371 RomMartin Virtual Memory ampIO 35 10 Devices 0 Primary characteristic data rate bandwidth 0 Latency really only an issue for disk and network 0 Contributing factors inputoutputboth partner 0 Interesting devices have high data rates C15 371 RomMartin Virtual Memory ampIO 34 Anatomy of a Disk Drive ad 0 Disk rotates like a CDDVD player 0 Except magnetic not optical 0 Collection of platters 0 Each with readwrite head 0 Platters divided into concentric tracks 0 Head seeks to track 0 All heads move in unison 0 Each track divided into sectors 0 More sectors on outer tracks 0 Sectors rotate under head 0 Controller 0 Seeks heads waits for sectors 0 Turns heads on o 0 May have its own cache a few M85 0 Egtltploit spatial locality C15 371 RomMartin Virtual Memory ampIO 36 Disk Latency 0 Disk readwrite latency has four components 0 Seek delay tmk head seeks to right track 0 Average of 5ms 15ms 0 Less in practice because of shorter seeks o Rotational delay twta on right sector rotates under head 0 On average time to go halfway around disk 0 Based on rotation speed RPM 0 10000 to 15000 RPMs o 3ms 0 Transfer time thansm data actually being transferred 0 Fast for small blocks 0 Controller delay twntmuu controller overhead on either side 0 Fast no moving parts tdisk Iseek trolalinn ttransfer tcnntrnller C15 371 RomMartin Virtual Memory ampIO 37 Disk Bandwidth Sequential vs Random 0 Disk is bandwidthinef cient for pagesized transfers 0 Sequential vs random accesses 0 Random accesses 0 One read each disk access latency 10ms o Randomly reading 4KB pages 0 10ms is 001 seconds a 100 access per second 0 4KB 100 accesssec a 400KBsecond bandwidth 0 Sequential accesses o Stream data from disk no seeks o 128 sectorstrack 512 Bsector 6000 RPM 0 64KB per rotation 100 rotationper sec 0 6400KBsec a 64MBsec o Sequential access is 10x or more bandwidth than random 0 Still no where near the lGBsec to loGBsec of memory 215 371 RomMartin Virtual Memory ampIO 39 Disk Latency Example Example time to read a 4KB chunk assuming 0 128 sectorstrack 512 Bsector 6000 RPM 10 ms geek 1 ms tcommller 6000 RPM a 100 Rs a 10 msRa tromth 10 ms 2 5 ms 0 4 KB page a 8 sectors a tIalnsfer 10 ms 8128 06 ms 39 tdisk tseek trolation thansfer tcommller 166 m5 tdi k105061166ms C15 371 RothMartin Virtual Memory ampIO 38 Some Old Example Disks Hitachi Diameter 300 40 Cache 8 2 RPM RPM 4200 RPM 3600 RPM Seek 45 12 12 Sustained Data Rate 100 40 10 Cost 0 Flash nonvolatile CMOS storage 0 The new disk replacing disk in many C15 371 RothMartin Virtual Memory ampIO 40 This Unit Shared Memory Multiprocessors Threadlevel parallelism TLP Shared memory model o Multiplexed uniprocessor Computer Organization and Design CPU 2 gjigggfegrgghreadmg Synchronization 0 Lock implementation Unit 10 Shared Memory Multiprocessors Locking gotchas Cache coherence o Busbased protocols 0 Directory protocols Memory consistency models C15 371 MartinRoth Shared Memory Multiprocessors 1 C15 371 MartinRoth Shared Memory Multiprocessors 2 Multiplying Performance Multicore Mainstream Multiprocessors Multicore chips IBM Power5 0 Two 2GHz PowerPC cores 0 Shared 15 MB L2 L3 tags AMD Quad Phenom 39quot 0 Four 25GHz cores 0 Percore 512KB L2 cache 0 Shared 2MB L3 cache Intel Core 2 Quad 0 Four cores shared 4 MB L2 0 Two 4MB L2 caches Sun Niagara Why multicore What else would 8 cores each 4Way threaded you do With 500 million tranSIstors Shared 2MB L2 Shared FP C15 371 MartinRoth Shared Memory Multiprocessors 0 For servers not dESktOp 4 o A single processor can only be so fast 0 Limited clock frequency 0 Limited instructionlevel parallelism 0 Limited cache hierarchy What if we need even more computing power 0 Use multiple processors 0 But how Highend example Sun Ultra Enterprise 25k o 72 UltraSPARC IV processors 1SGhz o 1024 GBs of memory 0 Niche large database servers C15 371 MartinRoth Shared Memory Multiprocessors Application Domains for Multiprocessors Scientific computingsupercomputing 0 Examples weather simulation aerodynamics protein folding 0 Large grids integrating changes over time 0 Each processor computes for a part of the grid Server workloads 0 Example airline reservation database 0 Many concurrent updates searches lookups queries 0 Processors handle different requests Media workloads o Processors compressdecompress different parts of imageframes Desktop workloads Gaming workloads But software must be written to expose parallelism C15 371 MartinRom Shared Memory Multiprocessors 5 Multithreaded Programming Model o Programmer explicitly creates multiple threads 0 All loads amp stores to a single shared memory space 0 Each thread has a private stack frame for local variables 0 A thread switch can occur at any time o Preemptive multithreading by OS 0 Common uses 0 Handling user interaction GUI programming 0 Handling IO latency send network message wait for response 0 Expressing parallel work via ThreadLevel Parallelism TLP C15 371 MartinRom Shared Memory Multiprocessors 7 But First Uniprocessor Concurrency Software thread 0 Independent flow of execution o Context state PC registers 0 Threads generally share the same memory space 0 Process like a thread but different memory space 0 Java has thread support built in CC supports Pthreads library Generally system software the OS manages threads 0 Thread scheduling context switching o All threads share the one processor 0 Hardware timer interrupt occasionally triggers 05 o Quickly swapping threads gives illusion of concurrent execution 0 Much more in CIS380 C15 371 MartinRom Shared Memory Multiprocessors 6 Hardware Multithreading Hardware Multithreading MT 0 Multiple threads dynamically share a single pipeline caches o Replicate thread contexts PC and register file 0 Coarsegrain MT switch on L2 misses Why 0 Simultaneous MT no explicit switching finegrain interleaving o Pentium4 is 2way hyperthreaded leverages outof order core MT Improves utilization and throughput 0 Single programs utilize lt50 of pipeline branch misses 0 MT does not improve singlethread performance Individual threads run as fast or even slower 0 C15 371 MartinRom Shared Memory Multiprocessors 8 Simplest Multiprocessor Shared Memory Implementations o Multiplexed uniprocessor 0 Runtime system andor OS occasionally preempt amp swap threads 0 Interleaved but no parallelism 0 Hardware multithreading o Tolerate pipeline latencies higher efficiency 0 Same interleaved sharedmemory model o Replicate entire processor pipeline 0 Instead of replicating just register file amp PC 0 Exception share caches we ll address this bottleneck later 0 Same shared memory or multithreaded model 0 Loads and stores from two processors are interleaved o Advantagesdisadvantages over hardware multithreading o Multiprocessing 0 Multiply execution resources higher peak performance 0 Same interleaved sharedmemory model 0 Foreshadowing allow private caches further disentangle cores 0 All have same shared memory programming model C15 371 MartinRom Shared Memory Multiprocessors 9 C15 371 MartinRom Shared Memory Multiprocessors 10 ThreadLevel Parallelism Example An Example Execution struct acctt int bal Thread 0 Thread 1 Mem shared struct acctt accts MAXACCT 39 addi r1acctsr3 500 addi r1acctsr3 ld 0r3 1 int id amt if acctsid bal gt amt ld 0r3r4 blt r4r26 sub r4r2r4 st r40r3 400 addi r1acctsr3 1d 0r3lr4 blt r4r26 sub r4r2r4 st r40r3 acctsid bal amt givecash 39 st r40r3 call givecash U11thl O III U H a H N H uh a U11thl O call givecash Threadlevel parallelism Clquot LP 0 Collection of asynchronous tasks not started and stopped together 0 Data shared loosely sometimes yes mostly no dynamically Example databaseweb server each query is a thread 11 give Cash o accts is shared can t register allocate even if it were scalar TWO 100 Withdrawas from account 241 at two ATMS 0 id and amt are private variables register allocated to r1 r2 Each transaction maps to thread on different processor Running example Track accts24l bal address is in r3 U11thl O C15 371 MartinRom Shared Memory Multiprocessors 11 C15 371 MartinRom Shared Memory Multiprocessors 12 aui A Problem Execution Thread 0 addi r1 accts r3 Thread 1 0 1 ld 0r3 r4 2 blt r4r26 3 sub r4r2r4 ltltlt Interrupt gtgtgt emu 0 addi r1acctsr3 1 1d 0r3lr4a 2 blt r4r26 3 sub r4r2r4 4 st r40r3 5 call givecash 4 st r40r3 400 5 call givecash 0 Problem wrong account balance Why 0 Solution synchronize access to account balance C15 371 MartinRom Shared Memory Multiprocessors 13 A Synchronized Execution Thread 0 Thread 1 W call acquire lock 500 m 0 addi r1acctsr3 1 ld 0r3 r4 2 blt r4r26 3 sub r4r2r4 ltltlt Interrupt gtgtgt call acquirelockSPinS ltltlt Interrupt gtgtgt 4 st r40r3 400 call releaselock 5 call givecash still in acquire 0 addi r1acctsr3 o Fixed but how do 1 1 0 r3quotr4 2 blt r4r26 we implement 3 sub irzlr n W 4 acquire amp release 5 c 215 371 MartinRom Shared Memory Multiprocessors l givecash Synchronization o Synchronization a key issue for shared memory 0 Regulate access to shared data mutual exclusion 0 Software constructs semaphore monitor mutegtlt o Lowlevel primitive lock 0 OpemUOnSacquirelockand releaselock 0 Region between acquire and release is a critical section 0 Must interleave acquire and release 0 Interfering acquire will bloc struct acctt int bal 39 shared struct acctt acctsMAXACCT39 shared Lnt lock int id amt acquire lock 39 critical section C15 371 MartinRom Shared Memory Multiprocessors 14 Strawman Lock Incorrect Spin lock software lock implementation 0 acquire lock while lock 0 lock 1 0 Spin while lock is 1 wait for it to turn 0 ld 0 amplock r6 A1 A2 addi r61r6 A3 st r60amplock 0 release lock lock 0 R0 st r00amplock r0 holds 0 C15 371 MartinRom Shared Memory Multiprocessors 16 Strawman Lock Incorrect Thread 0 Thread 1 Mem A0 1d 0amploek r6 1 0 A1 bnez r6AO A0 1d r60amploek A2 addi r61r6 A1 bnez r6AO A3 st r60amploek A239 addi r6 1 r6 1 CRITICALSECTION A3 5t 1550 amplDCk 1 CRITIQLSECTION 0 Spin lock makes intuitive sense but doesn t actually work 0 Loadsstores of two acquire sequences can be interleaved 0 Lock acquire sequence also not atomic 0 Same problem as before 0 Note release is trivially atomic C15 371 MartinRom Shared Memory Multiprocessors 17 Better Spin Lock Use Atomic Swap 0 ISA provides an atomic lock acquisition instruction 0 Example atomic swap swap r1 0 amplock mov rl gtr2 o Atomically executes ld rl0amplock st r2 0amplock 0 New acquire sequence value of rl is 1 A0 swap rl 0 amplock Al bnez rlA0 o If lock was initially busy 1 doesn t change it keep looping o If lock was initially free 0 acquires it sets it to 1 break loop 0 Insures lock held by at most one thread 0 Other variants exchange compareandswap testandset or fetchandadd C15 371 MartinRom Shared Memory Multiprocessors 19 A Correct Implementation SYSCALL Lock ACQUIRELOCK A l disableinterrupts atomic A2 1d r60amploek A3 bnez r6AO A4 addi r61r6 A5 st r60amploek A6 enable 39 A7 return 0 Implement lock in a SYSCALL 0 Only kernel can control interleaving by disabling interrupts Works Large system call overhead But not in a hardware multithreading or a multiprocessor C15 371 MartinRom Shared Memory Multiprocessors 18 Atomic UpdateSwap Implementation o How is atomic swap implemented 0 Need to ensure no intervening memory operations 0 Requires blocking access by other threads temporarily yuck o How to pipeline it 0 Both a load and a store yuck 0 Not very RISClike 0 Some ISAs provide a loadlink and storeconditional insn pair C15 371 MartinRom Shared Memory Multiprocessors 20 Lock Correctness Thread 0 Thread 1 A0 swap rl0amploek A1 bnez r1A0 A0 swap r10amploek CRITICALSECTION Al bnez rlA0 A0 swap r10amploek Al bnez r1 A0 Testandset lock actually works 0 Thread 1 keeps spinning C15 371 MartinRoth Shared Memory Multiprocessors 21 CoarseGrain Locks Correct but Slow Programming With Locks Is Difficult o Coarsegrain locks eg one lock for entire database Easy to make correct no chance for unintended interference No P in TLP no two critical sections can proceed in parallel struet aeett int bal shared struet aeett aeetsMAXACCT int id amt shared int lock acquire lock if aeetsid bal gt amt aeetsid bal amt C15 371 MartinRoth Shared Memory Multiprocessors 23 o Multicore processors are the way of the foreseeable future 0 TLP anointed as parallelism model of choice 0 Just one problem 0 Writing lockbased multithreaded programs is dif cult o More precisely 0 Writing programs that are correct is easy not really 0 Writing programs that are highly parallel is easy not really Writing programs that are both correct and parallel is difficult 0 Very difficult true 0 Unfortunate goal but that s the whole point after all 0 Locking granularity issues 215 371 MartinRoth Shared Memory Multiprocessors 22 FineGrain Locks Parallel But Difficult o Finegrain locks eg multiple locks one per record Fast critical sections to different records can proceed in parallel Difficult to make correct easy to make mistakes 0 This particular example is easy 0 Requires only one lock per critical section 0 Consider critical section that requires two locks struet aeett int balloek shared struet aeett aeetsMAXACCT int idamt acquire acets id loek if aeetsid bal gt amt aeetsid bal amt giveeash release acets id loek C15 371 MartinRoth Shared Memory Multiprocessors 24 Multiple Locks Multiple locks eg acctto acct transfer 0 Must acquire both idfrom idto locks 0 Running example with accts 241 and 37 o Simultaneous transfers 241 37 and 37 a 241 o Contrived but even contrived examples must work correctly too struct acctt int ballock shared struct acctt acctsMAXACCT int idfromidtoamt acquire accts idfrom lock acquire accts id to lock if acctsidfrom bal gt amt acctsidfrom bal amt acctsidto bal amt release accts id to lock release accts idfrom lock C15 371 MartinRom Shared Memory Multiprocessors 25 Correct Multiple Lock Program 0 Always acquire multiple locks in same order 0 Just another thing to keep in mind when programming 0 Ho hum struct acctt int ballock shared struct acctt acctsMAXACCT int id fromid toamt int idfirst minidfrom idto int idsecond maxidfrom idto acquire accts idfirst lock acquire accts idsecond lock if acctsidfrom bal gt amt acctsid from bal amt39 release accts idsecond lock s idfirst lock C15 371 MartinRom Shared Memory Multiprocessors 27 Multiple Locks And Deadlock Thread 0 Thread 1 idfrom 241 idfrom 37 idto 37 idto 241 acquire accts241 lock acquire accts37 lock wait to acquire lock wait to acquire lock 241 37 waiting waiting still waiting 0 Deadlock circular wait for shared resources 0 Thread 0 has ock 241 waits for lock 37 0 Thread 1 has lock 37 waits for lock 241 0 Obviously this is a problem 0 The solution is 215 371 MartinRom Shared Memory Multiprocessors 26 Correct Multiple Lock Execution Thread 0 Thread 1 idfrom 241 idfrom 37 idto 37 idto 241 idfirst min2413737 idfirst min3724137 idsecond max37241241 idsecond max37241241 acquire accts37 lock wait to acquire lock 37 acquire accts241 lock waiting do stuff release accts241 lock release accts37 lock acquire accts37 lock 0 Great are we done No 215 371 MartinRom Shared Memory Multiprocessors 28 More Lock Madness What if 0 Some actions eg deposits transfers require 1 or 2 locks o and others eg prepare statements require all of them 0 Can these proceed in parallel What if c There are locks for global variables eg operation id counter c When should operations grab this lock What if what if what if So lockbased programming is difficult wait it gets worse C15 371 MartinRom Shared Memory Multiprocessors 29 Research Transactional Memory TM And To Make It Worse o Transactional Memory Programming simplicity of coarsegrain locks Higher concurrency parallelism of finegrain locks 0 Critical sections only serialized if data is actually shared No lock acquisition overhead 0 Hottest thing since sliced bread 0 No fewer than 9 research projects Brown Stanford MIT Intel 0 Penn too C15 371 MartinRom Shared Memory Multiprocessors 31 0 Acquiring locks is expensive 0 By definition requires a slow atomic instructions 0 Specifically acquiring write permissions to the lock 0 Ordering constraints see soon make it even slower o and 99 of the time unnecessary 0 Most concurrent actions don t actually share data You paying to acquire the locks for no reason 0 Fixing these problem is an area of active research 0 One proposed solution Transactional Memory C15 371 MartinRom Shared Memory Multiprocessors 30 Transactional Memory The Big Idea 0 Big idea I no locks just shared data 0 Look ma no locks 0 Big idea II optimistic speculative concurrency o Execute critical section speculatively abort on conflicts 0 Better to beg for forgiveness than to ask for permission strucl acctl int bal shared strucl acctl acctsMAXACCT int idfromidloaml heginlransaclion if acctsidfrom bal gt amt acctsid from bal amt acctsidlo hal amt C15 371 MartinRom Shared Memory Multiprocessors 32 Transactional Memory ReadWrite Sets o Read set set of shared addresses critical section reads 0 Example accts37 bal accts24l bal 0 Write set set of shared addresses critical section writes 0 Example accts37 bal accts24l bal struct acctt int bal shared struct acctt acctsMAXACCT int idfromidtoamt hegintransaction if acctsidfrom bal gt amt acctsid from bal amt acctsidto hal amt C15 371 MartinRom Shared Memory Multiprocessors 33 Transactional Memory End Transactional Memory Begin 0 endtransaction 0 Check read set is all data you read still valid ie no writes to any 0 Yes Commit transactions commit writes o No Abort transaction restore checkpoint struct acctt int bal shared struct acctt acctsMAXACCT int idfromidtoamt hegintransaction if acctsidfrom bal gt amt acctsid from bal amt acctsidto hal amt C15 371 MartinRom Shared Memory Multiprocessors 35 begintransaction 0 Take a local register checkpoint 0 Begin locally tracking read set remember addresses you read 0 See if anyone else is trying to write it o Locally buffer all of your writes invisible to other processors Local actions only no lock acquire struct acctt int bal shared struct acctt acctsMAXACCT int idfromidtoamt hegintransaction if acctsidfrom bal gt amt acctsid from bal amt acctsidto hal amt C15 371 MartinRom Shared Memory Multiprocessors 34 Transactional Memory Implementation How are readsetwriteset implemented 0 Track locations accessed using bits in the cache Readset additional transactional read bit per block 0 Set on reads between begintransaction and endtransaction c Any other write to block with set bit triggers abort 0 Flash cleared on transaction abort or commit Writeset additional transactional write bit per block 0 Set on writes between begintransaction and endtransaction 0 Flash cleared on transaction commit o On transaction abort blocks with set bit are invalidated C15 371 MartinRom Shared Memory Multiprocessors 36 Transactional Execution Thread 0 Thread 1 idfrom 241 idfrom 37 idto 37 idto 241 begintransaetion begintransaetion ifaeets241 bal gt 100 ifaeets37 bal gt 100 aeets37 bal amt write aeets241bal aets241 bal i amt abort endtransaetion no writes to aeets241bal no writes to aeets37 bal commit C15 371 MartinRoth Shared Memory Multiprocessors 37 So Let s Just Do Transactions o What if c Readset or writeset bigger than cache 0 Transaction gets swapped out in the middle 0 Transaction wants to do 10 or SYSCALL notabortable o How do we transactify existing lock based programs 0 Replace acquire with begintrans does not always work 0 Several different kinds of transaction semantics 0 Which one do we want 0 That s what these research groups are looking at 0 Industry adoption 0 Sun s Rock processor has besteffort hardware TM 0 Speculative locking Azul systems and Intel rumor C15 371 MartinRoth Shared Memory Multiprocessors 39 Transactional Execution II More Likely Thread 0 Thread 1 idfrom 241 idfrom 450 idto 37 idto 118 hegintransaetion begintransaetion ifaeets241bal gt 100 ifaeets450 bal gt 100 aeets241bal amt aeets450 bal amt aets37 bal i amt aets118 bal i amt endtransaetion endtransaetion no write to aeets240bal no write to aeets450 bal no write to aeets37 bal no write to aeets118 bal commit commit 0 Critical sections execute in parallel C15 371 MartinRoth Shared Memory Multiprocessors 38 Roadmap Checkpoint System software CPU O o Cache coherence o Busbased protocols 0 Directory protocols 0 Memory consistency models 215 371 MartinRoth Shared Memory Multiprocessors 40 NoCache NoProblem CRUD cram Mem Processor 0 Processor 1 addi r3r1ampaccts 1w r40r3 blt r4r26 sub r4r4r2 sw r40r3 E 400 jal dispensecash addi r3r1ampaccts 1w r40r3 quotquotquotquotquot quot400 0 1 2 blt r4r26 3 4 musmNI o sub r4r4r2 sw r40r3 5 jal dispensecash Scenario I processors have no caches o No problem mm 215 371 MartinRoth Shared Memory Multiprocessors 45 WriteThrough Doesn t Fix It Processor 0 Processor 1 addi r3r1ampaccts 1w r40r3 blt r4r26 sub r4r4r2 5w r40r3 jal dispensecash addi r3r1ampaccts 1w r40r3 quotquotquotquotquot quot 400 0 1 2 blt r4r26 3 4 musmNI o sub r4r4r2 sw r40r3 5 jal dispensecash Scenario IIb processors have writethrough caches o This time only 2 different copies of accts 241 bal o No problem What if another withdrawal happens on processor 0 C15 371 MartinRoth Shared Memory Multiprocessors 47 Cache Incoherence PUV 39CRUJ Mem Processor 0 Processor 1 addi r3r1ampaccts 1w r40r3 blt r4r26 sub r4r4r2 5w mloma jal dispensecash addi r3r1ampaccts 1w r40r3 blt r4r26 sub r4r4r2 sw r40r3 5 jal dispensecash 0 Scenario IIa processors have writeback caches o Potentially 3 copies of accts 241 bal memory p0 p1 0 Can get incoherent inconsistent U11hLJNI O PmNi O C15 371 MartinRoth Shared Memory Multiprocessors 46 What To Do o No caches Slow 0 Make shared data uncachable Faster but still too slow 0 Entire accts database is technically shared 0 Definition of loosely shared 0 Data only really shared if two ATMs access same acct at once 0 Flush all other caches on writes to shared data 0 May as well not have caches 0 Hardware cache coherence 0 Rough goal all caches have same data at all times Minimal flushing magtltimum caching a best performance 215 371 MartinRoth Shared Memory Multiprocessors 48 VI a MSI I BRgtBWgt 0 VI protocol is inefficient Only one cached copy allowed in entire system Multiple copies can t exist even if readonly 0 Not a problem in example 0 Big problem in reality 0 MSI modifiedsharedinvalid o Figtltes problem splits V state into two states 0 M modified local dirty copy 0 S shared local clean copy 0 Allows either 0 Multiple readonly copies Sstate OR 0 Single readwrite copy Mstate BWgtSD WBgtSD C15 371 MartinRoth Shared Memory Multiprocessors 53 Exclusive Clean Protocol Optimization to o foPu i Mem Processor 0 Processor 1 addi r3rlampaccts lw r40r3 blt r4r26 sub r4r4r2 sw r40r3 Nomiss jal dispensecash addi r3rlampaccts 1w r40r3 0 l 2 blt r4r26 3 4 musmNHo sub r4 r4 r2 5w r40r3 M300 5 jal dispensecash Most modern protocols also include E exclusive state 0 Interpretation 1 have the only cached copy and it s a clean copy 0 Why would this state be useful C15 371 MartinRoth Shared Memory Multiprocessors 55 MSI Protocol WriteBack Cache ePuo oRjui Mem Processor 0 Processor 1 addi r3rlampaccts lw r40r3 blt r4r26 sub r4r4r2 sw r40r3 jal dispensecash 0 addi r3rlampaccts 1 1w r40r3 2 blt r4r26 3 4 U11hLJNI O sub r4r4r2 sw r40r3 m 5 jal dispensecash 1w by processor 1 generates a BR 0 Processor 0 responds by sending its dirty copy transitioning to S sw by processor 1 generates a BW 0 Processor 0 responds by transitioning to I 215 371 MartinRoth Shared Memory Multiprocessors 54 Cache Coherence and Cache Misses o A coherence protocol can effect a cache s miss rate miss 0 Requests from other processors can invalidate evict local blocks 0 4C miss model compulsory capacity conflict coherence o Coherence miss miss to a block evicted by bus event 0 As opposed to a processor event Cache parameters interact with coherence misses Larger capacity more coherence misses 0 But offset by reduction in capacity misses Increased block size more coherence misses 0 False sharing sharing a cache line without sharing data 0 Creates pathological pingpong behavior 0 Careful data placement may help but is difficult C15 371 MartinRoth Shared Memory Multiprocessors 56 Cache Coherence and Cache Misses o A coherence protocol can effect a cache s miss rate miss 0 Requests from other processors can invalidate evict local blocks 0 4C miss model compulsory capacity conflict coherence o Coherence miss miss to a block evicted by bus event 0 As opposed to a processor event 0 Example directmapped 4B cache 18 blocks 4bit memory Cache contenis Snooping Bandwidth Requirements Setoo Set01 Set10 Set11 Event Outcome S0000 M0001 S0010 S0011 Wr0011 Upgrade Miss S0000 M0001 S0010 M0011 Bust0000 Nothing S0000 M0001 S0010 M0011 BusWr0010 SI Invalidation S0000 M0001 I0010 M0011 Rd1011 Compulsory Miss S0000 M0001 10010 51011 Rdmom cnherence Miss S0000 M0001 50010 S1011 C15 371 MartinRom Shared Memory Multiprocessors 57 More Snooping Bandwidth Problems 0 Bus bandwidth is not the only problem 0 Also processor snooping bandwidth 0 001 eventsinsn 2 insncycle 002 eventscycle per processor 0 16 processors 032 busside tag lookups per cycle 0 Add 1 port to cache tags Sure o Invalidate over upgrade Tags smaller data ports less expensive 128 processors 256 bus side tag lookups per cycle 0 Add 3 ports to cache tags Oy vey Implementing inclusion L1 is strict subset of L2 helps a little 0 2 additional ports on L2 tags only 0 Processor doesn t use existing tag port most of the time o If L2 doesn t care 99 of the time no need to bother L1 Still kind of bad though 0 Upshot busbased coherence doesn t scale well C15 371 MartinRom Shared Memory Multiprocessors 59 Coherence events generated on 0 L2 misses and writebacks Some parameters 0 2 GHz CPUs 2 IPC 33 memory operations 0 2 of which miss in the L2 648 blocks 50 dirty o 033 002 15 001 eventsinsn o 001 eventsinsn 2 insncycle 2 cyclens 004 eventsns 0 Address request 004 eventsns 4 Bevent 016 685 0 Data response 004 eventsns 64 Bevent 256 GBs That s 25 GBs per processor 0 With 16 processors that s 40 685 0 With 128 processors that s 320 GBsll 0 You can use multiple buses but that hinders global ordering C15 371 MartinRom Shared Memory Multiprocessors 58 Scalable Cache Coherence I BRBW 0 Part I bus bandwidth 0 Replace nonscalable bandwidth substrate bus 0 with scalable one pointtopoint network eg mesh 0 Part II processor snooping bandwidth 0 Most snoops result in no action 0 Replace nonscalable broadcast protocol spam everyone 0 with scalable directory protocol only notify processors that care C15 371 MartinRom Shared Memory Multiprocessors 60 Scalable Cache Coherence o Pointtopoint interconnects o Glueless MP no need for additional glue chips Can be arbitrarily large 1000 s of processors 0 Massively parallel processors MPPs 0 Only government DOD has MPPs 0 Companies have much smaller systems 32 64 processors 0 Scalable multiprocessors 0 AMD OpteronPhenom pointto point glueless MP uses broadcast C15 371 MartinRoda Shared Memory Multiprocessors 61 MSI Directory Protocol 0 Processor side 0 Directory follows its own protocol obvious in principle 0 Similar to busbased MSI 0 Same three states 0 Same five actions keep BRBW names 0 Minus grayed out arcsactions 0 Bus events that would not trigger action anyway Directory won t bother you unless you need to act C15 371 MartinRoda Shared Memory Multiprocessors 63 Directory Coherence Protocols Observe address space statically partitioned Can easily determine which memory module holds a given line 0 That memory module sometimes called home Can t easily determine which processors have line in their caches o Busbased protocol broadcast events to all processorscaches i Simple and fast but nonscalable Directories nonbroadcast coherence protocol Egtlttend memory to track caching information For each physical cache line whose home this is track 0 Owner which processor has a dirty copy Ie M state 0 Sharers which processors have clean copies Ie S state Processor sends coherence event to home directory 0 Home directory only sends events to processors that care 215 371 MartinRoda Shared Memory Multiprocessors 62 Directory MSI Protocol P 0 P 1 P0 P1 Directory rocessor rocessor 0 addi r1acctsr3 1 1d 0 r3 r4 2 blt r4r26 s5ooso5oo 3 sub 154152154 4 st r40r3 5 call dispensecash addi r1 accts r3 MAOO I M2530 0 1 1d 0r3 r4 2 blt r4r26 3 sub 154152154 4 st r4 0 r3 5 call dispex158cas 1d by P1 sends BR to directory 0 Directory sends BR to P0 P0 sends P1 data does WB goes to S st by P1 sends BW to directory 0 Directory sends BW to P0 P0 goes to I 215 371 MartinRoda Shared Memory Multiprocessors 64 Directory Flip Side Latency Directory protocols Lower bandwidth consumption a more scalable Longer latencies 2 hog miss ho miss 3P Unshared get data from memory 39 o Snooping 2 hops POamemoryaPO Two read miss situations 9 0 Directory 2 hops PoamemoryaPO 0 Assume cachetocache transfer optimization 0 Snooping 2 hops POAPlaPO Directory 3 hops POamemoryaPlaPO 0 Common with many processors high probability someone has C15 371 MartinRom Shared Memory Multiprocessors Coherence on Real Machines 0 Many uniprocessors designed with onchip snooping logic 0 Can be easily combined to form multiprocessors o Eg Intel Pentium4 Xeon 0 Multicore o Eg Sun Wildfire NUMAQ IBM Summit 0 Eg CRAYT3DE 0 Shared data is uncachable Shared or exclusive get data from other processor P1 it 65 Larger scale directory systems built from smaller MPs Some shared memory machines are not cache coherent o If you want to cache shared data copy it to private data section 0 Basically cache coherence implemented in software 0 Have to really know what you are doing as a programmer C15 371 MartinRom Shared Memory Multiprocessors Directory Flip Side Complexity 0 Latency not only issue for directories o Subtle correctness issues as well 0 Stem from unordered nature of underlying interconnect 0 Individual requests to single cache must be ordered 0 Busbased Snooping all processors see all requests in same order 0 Ordering automatic 0 Pointtopoint network requests may arrive in different orders 0 Directory has to enforce ordering explicitly 0 Cannot initiate actions on request B 0 Until all relevant processors have completed actions on request A o Requires directory to collect acks queue requests etc 0 Directory protocols o Obvious in principle Complicated in practice C15 371 MartinRom Shared Memory Multiprocessors Best of Both Worlds 0 Ignore processor snooping bandwidth for a minute 0 Can we combine best features of snooping and directories c From snooping fast twohop cacheto cache transfers 0 From directories scalable pointtopoint networks 0 In other words 0 Can we use broadcast on an unordered network 0 Yes and most of the time everything is fine 0 But sometimes it isn t protocol race 0 Research Proposal Token Coherence TC 0 An unordered broadcast snooping protocol without data races C15 371 MartinRom Shared Memory Multiprocessors Roadmap Checkpoint o i lireaciwlevel parallelism TLP Shared memory model M quoto o Multiplexed uniproceeeor CPU 0 I iardwaremultihreacling o Multiprocessing Synchronization 0 Lock implementation 0 Locking gotchas Cache coherence o Busbased protocols 0 Directory protocols Memory consistency models 215 371 MartinRoth Shared Memory Multiprocessors 69 Recall Write Misses and Write Buffers o Read miss 0 Load can t go on without the data it must stall 0 Write miss o Technically no instruction is waiting for data why stall 0 Write buffer a small buffer W3 0 Stores put addressvalue to write buffer keep going 0 Write buffer writes stores to D in the background I o Loads must search write buffer in addition to D 395 Eliminates stalls on write misses mostly was 0 Write buffer vs writeback buffer 0 Write buffer in front of D for hiding store misses o Writeback buffer behind D for hiding writebacks C15 371 MartinRoth Shared Memory Multiprocessors 71 Nextlevel cache Hiding Store Miss Latency 0 Recall back from caching unit 0 Hiding store miss latency o How Write buffer 0 Said it would complicate multiprocessors 0 Yes It does 215 371 MartinRoth Shared Memory Multiprocessors 70 Memory Consistency 0 Memory coherence o Creates globally uniform consistent view 0 Of a single memory location in other words cache line Not enough 0 Cache lines A and B can be individually consistent 0 But inconsistent with respect to each other 0 Memory consistency o Creates globally uniform consistent view 0 Of all memory locations relative to each other 0 Who cares Programmers Globally inconsistent memory creates mystifying behavior C15 371 MartinRoth Shared Memory Multiprocessors 72 Coherence vs Consistency Aflag0 Processor 0 Processor 1 A1 while Eflag spin flagl print A o Intuition says P1 prints A1 o Coherence says absolutely nothing 0 P1 can see PO s write of flag before write of A How 0 Maybe coherence event ofA is delayed somewhere in network 0 Or P0 has a coalescing write buffer that reorders writes 0 Imagine trying to gure out why this code sometimes works and sometimes doesn t 0 Real systems act in this strange manner C15 371 MartinRoth Shared Memory Multiprocessors SC Doesn t Happen Naturally Why 0 What is consistency concerned with 0 P1 doesn t actually view PO s committed loads and stores 0 Views their coherence events instead 0 Consistency model how observed order of coherence events relates to order of committed insns o What does SC say 0 Coherence event order must match committed insn order 0 And be identical for all processors 0 Let s see what that implies C15 371 MartinRoth Shared Memory Multiprocessors Sequential Consistency SC Aflag0 Processor 0 Processor 1 A1 while Eflag spin flagl print A Sequential consistency SC 0 Formal definition of memory view programmers expect 0 Processors see their own loads and stores in program order Provided naturally even with outof order execution 0 But also processors see others loads and stores in program order 0 And finally all processors see same global loadstore ordering Last two conditions not naturally enforced by coherence Lampolt definition multiprocessor ordering o Corresponds to some sequential interleaving of uniprocessor orders 0 Ie indistinguishable from multiprogrammed uni processor CIS 371 MartinRoth Shared Memory Multiprocessors 74 SC Write Buffers 0 Store misses are slow 0 Global acquisition of M state write permission Multiprocessors have more store misses than uniprocessors 0 Upgrade miss I have block in S require global upgrade to M o Apparent solution write buffer 0 Commit store to write buffer let it absorb store miss latency 0 But a write buffer means 0 I see my own stores commit before everyone else sees them 215 371 MartinRoth Shared Memory Multiprocessors