Intro MachineOrganization w
Intro MachineOrganization w CS 240
Popular in Course
Popular in ComputerScienence
This 29 page Class Notes was uploaded by Jordon Hermiston on Thursday October 29, 2015. The Class Notes belongs to CS 240 at Wellesley College taught by Randy Shull in Fall. Since its upload, it has received 12 views. For similar materials see /class/230932/cs-240-wellesley-college in ComputerScienence at Wellesley College.
Reviews for Intro MachineOrganization w
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/29/15
Cold hard cache Measuring and improving cache performance 65240 Computer Organization Depar rmen r of CompuTer Science Wellesley College CPU Time CPU Time CPU execuTion cycles Memorys rall cycles x Clock cycle Time Memorys rall cycles Reads rall cycles WriTesTall cycles Reads Reads rall cycles x Read miss raTe x Read miss penalTy Program Wri res WriTesTall cycles X WriTe miss raTe x WriTe miss penalTy Program Wri re buffer s ralls Improving cache 212 This simplifies To CPU Time CPU eXecuTion cycles Memory sTall cycles X Clock cycle Time Mem accesses Memory sTall cycles X Read miss raTe X Read miss penalTy Program InsTrucTions Misses Memory sTall cycles 7 7 X Read miss penalTy Program InsTrucTion Improving cache 213 Nobody39s perfecT CPU Time CPU eXecuTion cycles Memory sTall cycles X Clock cycle Time InsTrucTions Misses 7 7 X Read miss penalTy Program InsTrucTion Memory sTall cycles For a given program we assume An insTrucTion cache miss raTe of 2 a daTa cache miss raTe of 4 a CPI of 2 wiThouT memory sTalls a miss penalTy of 100 cycles and The The frequency of loads and sTores is SPECinTZOOO of 36 o o o o o DeTermine how much fasTer The processor would run wiTh a perfecT cac e Improving cache 214 Amdahl39s aw sTr39ikes again 0 Suppose we speed up The compuTer by reducing The CPI from 2 To 1 wiThouT changing The clock raTe o The sysTem wiTh cache misses would Then have a CPI of 1 344 444 and The perfecT cache would be 444 Times fasTer o The amounT of execuTion Time spenT on memory sTalls would have one up by 344544 6 o Improving cache 215 WorsT sTiII 0 Suppose we double The clock raTe wiThouT changing memory speed 0 Measured in The fasTer clock cycles The new miss penalTy will be Twice The old ToTal miss cycles per insTr 2 x 200 36 x 4 x 200 39 6 88 FasT clock performance InsT counT x CPISIOW x Clock cycle Slow clock performance InsT counT x CPIfaSf x Clock cycle 2 544888 x 12 123 Improving cache 216 Fas rer39 CPI amp clock r39a re deliver a double whammy U E 100 8 6 E i lt 43 10 39039 Processor g E I DRAM k a 1 a E 01 8 D 001 VAX 1980 PPro 1996 2010 The processor vs DRAM speed dispari ry confinues To grow Improving cache 217 We seek To decrease miss ra re while not increasing hi r Time 0 We may be able To reduce cache misses by more flexible placemen r of blocks 0 In direct mapped cache a block can go in exac rly one place 0 Tha r makes i r easy To find bu r Improving cache 218 Memory references 0 1 2 3 4 3 4 15 0 miss 1 miss 2 miss 3 miss 4 miss 3 hi 15 miss Sfuri wifh an empty cache all blocks inifially marked as not valid Improving cache 219 Memory references 0 4 O 4 O 4 O 4 o miss 4 miss 0 miss 4 miss 0 miss 4 miss 0 miss 4 miss Sfuri wifh an empty cache all blocks inifially marked as not valid Improving cache 2110 Fully associa rive W T i 0 AT The o rher ex rreme Is a w scheme where a block can a mum My Tiiwiix Huh V AlN be placed anywhere In gs cache to j 39MEN NKK Am TNE HEfhltEE7X TFFVJERK 120sz AN 0 Bu r Then how do we find liJl s 535 ii 2223 3 a H k own m7 mil RURAN PEILALTHlHl REPEAT STEP l70 2 REPEAT THEM AGAIN REVEAT THEM time Airw W Lima 47 w Maw gtan We in was a Km Mm um Tug SnFA ae Sammie eug mm Hm W mtg 7 as Improving cache 2111 Middle ground nway se rassocia rive cache OHErWay set aSSDElaUVE direct mapped Black Tag Data Tvvuevvay set aSSDElaWE Set Tag Data Tag Data 1 2 3 4 5 E 7 Fuurevvay set aSSDElatWE Set Tag Data Tag Data Tag Data Tag Data u l Eightrvvay set assuciative fully assuciative Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data The set con l39aining a memory block is given by Block number modulo Number of sets in The cache Improving cache 2112 Memory references 0 1 2 3 4 3 4 15 0 miss 1 hi 2 miss i 00 i Mem1 iMemO i i 00 i Mem1 iMemO i i 00 i Mem1 iMemO i i i i i i i i i i 00 i Mama iMem2 i 3 hli m 4 miss 3h 00 Mem 1 Mem 0 Me J Mem a 01 Mem 5 Mem 4 00 Mem3 Mem2 00 Mem3 Mem2 00 Mem3 Mem2 4 PM 15 miss 01 Mem5 Mem4 l 01 Mem5 Mem4 00 Mem3 Mem2 SQ Me Mem Improving cache 2113 Performance of SeTassociafive caches 15 12 1 KB 3 9 2 KB E g 4 KB 5 6 8 KB 3 193 KB 32 KB 64 KB if 4128 KB n I I I I Oneway Twoway Fourway Eightway Associativity Improving cache 2114 Implementing a 4way sef associa ve cache Improving cache 2115 Choosing which block To replace a When a miss occurs in a directmapped cache The requested block can only go in one place a In an associative cache we oice o The most commonly scheme least r ecenfly used LRU Improving cache 2116 Reducing miss penalTy using mulTiple caches o A Twolevel cache allows The primary cache Ta focus on minimizing The hiT Time To yield a sharTer clock cycle 0 while The secondary mndwom a munAs cac focuses on miss quotmm mm he raTe Ta r d e uce The penalTy of long memory access Times Impruving cache 21717 Balancing cache accounTs 0 Suppose we have a processor wiTh a base CPI of 10 assuming all references hiT Assume main memory access Time of 100 ns including all miss handling 0 Suppose miss raTe per insTrucTion aT primary cache is 2 o How much fasTer would The processor 39 add a secondary cache wiTh a 5 ns access Time ThaT is large en ugh To reduce miss raTe To 057 7 Impruving cache 1413 If only Things were ThoT simple 2 so 25 Size x itemsm so 255 512 mu 2m was n 251532 so a 255 512 mu 2m was SiEKKnemsmm TheoreTical behavior of Radix sor r vs Quicksor r Observed behavior of Radix sor r vs Quicksor r Improving cache 2119 The real Scoop 0 Memory sys rem performance is offen cri rical facfor mm 3 s l i 3 2 i 1 o o A mm a a so 125 255 512 iuzozmamas 52 x mm mm o Mul rilevel caches pipelined processors make if harder To predicf ou rcomes Improving cache 2120 BeTTer living Through pipelining An overview 65240 CompuTer Organization m IDEA SN qu m4 Goad algsekeeping DeparTmenT of CompuTer Science 39M39rruz lfl lf fnm Wellesley College A n old saw H N A Place one load of cloThes in washer When The washer finishes place weT cloThes in dryer When dryer finishes place dry load on a Table and fold When folding is finished ask your roommaTe To puT The cloThes away An even older saw buT noT a pipeline 182 Pipelining The pipelining paradox o The Time from placing a single dirTy sock in The washer unTil iT lands up in you dresser drawer is exachy The same in boTh insTances o The s eedup comes from The ncT ThaT everyThing is done in parallel o If all The sTages Take The same Time and There39s enough work To do Then The speedup is equal To The number of sTages Pipelining 183 MIPS insTrucTions classically Take 5 sTeps FeTch insTrucTion from memory InsTr InsTr Reg ALU DaTa Reg ToTal class feTch read op access WriTe Time 2 Read re isTers while d d9 11 lw 200 100 200 200 100 800 quot 39quot9 quot sw 200 100 200 200 700 InsTrucTIons 3 ExecuTe The operaTion or calculaTe The address and 200 100 200 200 600 4 Access an operand in or daTa memory S 5 WriTe The resulTinToa beq 200 100 200 500 regIsTer Times in ps Pipelining 184 Singlecycle nonpipelined ver393us pipelined execuTion WWW a W mm m immaa M w mu m ps mm mm mm sun auu mm mm mm ngmm mman zuups zuups zuups zuups zuups Ideal speedup is number of sTages in The pipeline Do we achieve This Pipelining 185 More formally a If The sTages ar39e perfecle balanced Then Time beTween insTr ucTionsmPiPelned Time beTween insTrucTionsP Mined I Number of pipe sTages Thus under ideal condiTions and wiTh a large number of insTr ucTions The speedup from pipelining is approximaTely equal To The number of sTages 0 Pipelining 186 We didn39T do ThaT well in our example 0 o The formula suggesTs a fivesTage pipeline should offer nearly a fivefold speed improvemenT over 53mquot The 800 ps nonpipelined arm quot5 TIme or a 160 ps clock W W W quotm m E cycle l y W BuT as The example Mm m shows The sTages may be imperfechy balanced and quotW pipelining involves some overhead so speed is always less znn an m nu man man inn man mm man man i an Pipelining 187 Or even as well as ThaT o The raTio of clock cycle Times 4 00 800 ps 39 200 ps Egz gnm znn an nan m man iznn in man man owmms suggesTs fourfold l n mam 39 lw u 2WW 5 Mm D Speedupt u r mow example ITs 2400 ps versus 1400 ps which isn39T an even a Twofold speed up Wm i MW ii a 2mm a ll J e a Pipelining 188 MIPS was designed To be pipelined 1 2 All MIPS insTrucTions are The same lengTh making feTch is easy in The 1ST sTage and decoding easy in The 2quotd MIPS has only a few insTrucTion formaTs wiTh source fields locaTed in The same places 2quotd sTage can begin regisTer reads while iT is decoding mm my 6 mm M n W i u aim mp5 E i a mm vs mm vs znn an m nu man man inn man mm man um i an ii ii 1 mm Pipelining 189 MIPS was designed To be pipelined 3 4 Memory operands only appear in loads and sTores meaning ALU sTage can be used To calculaTe address and memory access can immediaTely follow Operands musT be aligned in memory Hence we need noT worry abouT a single daTa Transfer insTrucTion requiring Two daTa memory accesses Mm mgquot m in m m mm min iii m W W a i ii v m u mame mn H lw man T mm vs man um i an a H Tnn ps Tnn Tm v52nn v52nn p Pipelining 1810 However all is noT rosie in pipeland 0 There are siTuaTions when The nexT insTrucTion cannoT be execuTed in The following clock cycle 0 These evenTs are call hazards STrucTural hazards daTa hazards and conTrol hazards 1811 Pipelining STrucTural hazards o This occurs when The hardware cannoT supporT The combinaTion of insTrucTions o For example suppose we had a washer dryer combinaTion or our roommaTe was off sTudying c5235 0 MIPS would be Trouble is insTrucTion memory and daTa memory were one and The same Tmo Tath mum D G 01 P Task atom 30 6 PM Pipelining 1812 DaTa hazards o Occur when The pipeline cornered WMIKQMMquot musT sTall because one sTep musT waiT for anoTher To compleTe 0 Suppose you find only one sock aT The folding sTage o DaTa hazards occur when one insTrucTion depends on anoTher To compleTe add 50 t0 t1 5ub t2 50 t3 Pipelinmg 13713 Forwarding a We do noT always need To waiT for The insTrucTion To compleTe before Trying To resolve The daTa hazard o For example 7 n add 50 5w stl 32 i go 5ub t2 50 t3 as soon as The ALU creaTes The sum for The add we can supply iT as an inpuT for The subTracT E7 u ge ivm Pipelinmg 13714 Loaduse daTa hazard 0 Image The following code air lw 50 20stl sub t2 50 t3 The desired daTa is only available afTer The 4 1 sTagewhich is Too laTe 3 3 lvw lm for The ianIT of The Third 7 l l 7 sTage of sub 0 m m g Ml W En Pipelining 1815 Pipeline sTaII o Frankly There39s noT much we can do oTher Than sTall The pipe ine lm E7 in al E l b W h m Eh m7 Pipelining 1816 Of cour39Se a smurf compiler mighf help 0 Consider rheCcode segment A BE C BF o MIPSCode 1w tl 0t0 1w t2 4t0 add t3 tl t2 SW t3 12t0 lw t4 8t0 add 115 tl t4 sw t5 16t0 Pipelining 1817 By moving up The 339 d lw insTr39ucTion o Consider rheCcode 0 Consider TheCcode segmen r segment A BE ABE C BF CBF o MIPS Code 0 MIPS Code 1w tl 0t0 1w tl 0t0 1w t2 4t0 1w t2 4t0 add t3 tl t2 1w t4 8t0 sw t3 12t0 add t3 tl t2 1w t4 8t0 sw t3 12t0 add t5 tl t4 add t5 tl t4 sw t5 16t0 sw t5 16t0 Pipelining 1818 ConTr39ol hazards o ConTr39ol hazards arise from The need To make a decision based on The r39esulTs of one insTr39ucTion while oTher39s ar39e execuTing 0 You are washing The Wellesley Rugby Team39s uniforms buT are unsure of The amounT of bleach To use Pipelining 1819 STaII on branch 0 We could simple sTall unTil O The r39esulT is known However39 on average 13 of insTr39ucTion execuTed inSPEinTZOOO ar39e br39anches Exam 2 we ago 39071ng Moo nimmhrs OTher39 InsTr39ucTIons r39un amiswiwm M f law have a CPI of 1 and W s EggV H i lml br39anches Took one exTr39a y in clock cycle for39 The sTall A so The CPI r39ises To 113 inhale lwhu buLM39 um bum hrau m Kiwi Pipelining 1820 10 The Amazing Kr39eskin o If you39re pr39eTTy sure you have The r39ighT formula To wash uniforms Then JusT pr39edicT ThaT iT will work and wash The second load while waiTing for39 The fir39sT To dry 0 When you are wrong you need To r39edo The load ThaT was washed while guessing The decision Pipelining 1821 The road noT Taken 0 CompuTer39s use pr39edicTion To handle br39anches am 400 600 am 1030 INC 1401 uncutvi me ml 0 One Simple approach is To WWW 5 i 3 quot3973quot A1 1 always pr39edicT ThaT a r 220 branches Will be unTaken ms mm quot5 o If r39ighTgr39eaT If wr39ong 525 W m m ear sun m we um mgr 9 quot 39 quot 39 39 go back and do iT again wwva andSA SEjE Wl mu m quot1 n Mbszlu u mu m 3 him nuLva muaw Jimur but 11 w 5 lt Nuns ugh T l 7 u aw MU F0 Pipelining 1822 11 Branch predicTion o For example aT The 0 A more sophisTicaTed version would have some branches predicTed as Taken and some as unTaken boTTom of loops are branches ThaT Jump back To The Top of The loop They are very likely To be Taken many Times They are only noT Taken once Pipelining 1823 Dynamic hardware predicTors 0 0 One approach is To keep a o AmounT amp Type of hisTory make Their predicTions based on The pasT behavior of The branch hisTory for each branch as Taken or unTaken and Then use recenT behavior To predicT The fuTure kepT may be exTensive buT resulTs are impressive Cams SwANSoM G MUMVERSPICLHTXCDH Pipelining 1824 The CPU39s workhorse Constructing a basic ALU C5240 CompuTer Organization DeparTmenT of CompuTer Science Wellesley College STarTing ouT slow 0 Logical operaTions are easiesT becau5e They map direchy onTo The hardware 7 HIV1 U llII U The Unabridged Devi 39 Dictionarr Operation 0 Of courSe we will need To array This To a full 32 biTs 3s Addi rion is nex r ALU 1273 CarryOu r quotA CurryOut occurs as either as Cameeneme CurryProprogute ALU 1274 2bi rs 4bi rs 6bi rs a dollar a We have a 1bi1 ALU fhaf performs AND OR and if i on l emmm a We crzafz a full 32bif adder by arrayng The resulfs ALU 1275 The ripple car r y adder CD a mum 4 Ramm 4 mm f 4 Carrym m2 gnaw 24 wyom l asp Fawn a nap ALU 1276 Adding subTr39acTion o SubTr acTion is The same as adding The negaTive version of an operand This is how adders perform subTr acTion and w y TwoscomplemenT is so very nice Binven Ovammn Carryin i T ALU 1277 NOR and NAND o A MIPS ALU also needs a NquuncTion We implemenT This by noTing ThaT a b a b mm opmm i am am i ALU 1er We 96ng There o buf we39re no f home 2 0mm 24mm cwquot a One instrucfirm fhaf sHII produces 1 if alt b We sfar f by expanding D The inpuf To The mux for LESS A Fawn a nu 1 mm ALU 1279 Top 1bi r ALU ALU 12710 32bi r ALU Illustrates what we do with that blasted Set out an ow the Less values get set ALU 12711 Finally Adds support for MIPS brunch if eqqu operation equu 5 zero ALU 12712 Black box r39epr39esen ring ALU ALU npemiion Zam Remit overvaw cmm ALU 1213 Car39r39y lookahead o o o 0 Curry gener ufe 39 cii i Curry propugufe on Using These define ci1 gi pi ci Now unwind The recursion To gef a sequence of equations only four eep a ALU 1214
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'