Computer Architecture CS 3853
Popular in Course
Popular in ComputerScienence
verified elite notetaker
This 43 page Class Notes was uploaded by Mireya Heidenreich on Thursday October 29, 2015. The Class Notes belongs to CS 3853 at University of Texas at San Antonio taught by Richard Whaley in Fall. Since its upload, it has received 70 views. For similar materials see /class/231414/cs-3853-university-of-texas-at-san-antonio in ComputerScienence at University of Texas at San Antonio.
Reviews for Computer Architecture
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 10/29/15
1 Chapter 2 Exploiting ILP 0 Simple static techniques for finding ILP Basic scheduling renaming and loop unrolling data name 84 control Hardware dynamic techniques for reducing deps dep limit ILP 1 Branch result prediction control 2 Branch target prediction control 3 Dynamic scheduling with renaming data 84 name 4 Speculation control 0 Hardware techniques for finding ILP l Pipelining 2 Multiple instruction issue Approaches for multiple instruction issue 1 Superscalar 2 VLIW 0 Limitations of ILP 2 Dependences Limit ILP If two inst have dependence some ordering required for correctness 0 Types of dependence Data dependence inst requires data computed by preceeding ins ts Name dependence Two insts one or both of which write access same regmem Control dependence whether or not an inst is reached is deter mined by previous insts Caused by flow of data values amount insts 3 Data 84 Name Dependencies NOTE inst 1 before instj in program data dependence name dependence Caused when two inst use same storage location regmem o WAW output dep inst i Writes to same loc as instj o WAR antidependence inst j Writes loc Read by 1 Can be fixed by using differing loca tions for inst i and j o This is known as renaming 0 RAW true dep inst 1 produces result Write required as input Read for instj gt inst i must compute result before inst j is allowed to continue No figtlt value prediction but can be ameliorated by dy namic scheduling 4 Static Scheduling 0 Instead of using hardware to do dynamic sched Tomasulo have the compiler pre arrange inst stream so that hazards are miminized Wo dyn egtltec compiler must schedule code to avoid dep Means pipe stage information must be available to compiler When machine is updated programs must be recompiled o If a hazard is detected inst stalled in ID with no new inst fetched or issued until the dependence is cleared Some machines even rely on compiler to insert stalls in the face of hazards so hardware does not have to do checks 0 Compiler techniques are used statically schedule the inst to avoid or minimize stalls Schedule independent inst between inst with true dep Statically perform other transforms to tackle other dependence types eg egtltplicit register renaming to avoid name dependen cies 5 Assumed Effective Latencies We will examine some transforms that a compiler might make and unless otherwise noted we will assume the following latencies between dependent set and use 0 Simple 5 stage MIPS pipeline InStrUCtion Latency 0 effective latency dep on inst Producer User cycles pair Boazd tssre Cl Need to sched l inst between iop iop 0 Id and dep use to avoid stall iop branch 1 Branch computes cond in ID FP op FP op 3 PP op Store 2 ILP available in basic block is typically small variety of techniques egtlt ploit it across bblocks Most common way to increase ILP egtltploits parallelism amongst iterations of loop 6 Instruction Scheduling Original C Code Original With Stalls 1 fori999 lgt0 lquot 1LOOP LD F00R1 gtltigtltis 2 stall 3 ADDD F4F0F2 Original Assembly Code 4 Stall 1LOOP LD F00R1 5 stall 2 ADDD F4F0F2 6 SD F40R1 3 SD F40R1 7 DADDUI R1R178 4 DADDUI R1R178 s stall 5 ENE R1 R2 LOOP 9 ENE R1 R2 LOOP 10 stall Would need nop in delay slot Branch calc cond in ID so delay between dep iop and br Reordered Code 1LOOP L D 1 O O A EU H v 2 DADDUI R1 R178 a offset of SD changed to fill de 3 ADDD Hypo V F2 lay slot 4 stall i Cycles reduced from 10 to 6 5 ENE R1 39 RQ39LOOP 6 SD F48R1 i We Will see software plpellnlng can reduce further 7 Loop Unrolling Loop Unrolling duplicates the body of the loop multiple times which can reduce loop overhead and increase scheduling 84 other optimization opportunities 0 Must have loop cleanup and guard 0 After unrolling want to remove duplicate loop control Here39s an example diff frm asmbly at source level 1 fori0 iltn i 1 fori0 iltn4 i 2 aiais 2 aiais 3 for iltn 4 4 alilais 1 for i0 iltn4i 5 a391ai1s 2 iai 6 al2ai2s 3 for iltn i 7 ai3ai3sl 4 aiais l 8 5 alilallls i 6 alilallls i 7 alilas 8 8 Loop Unrolling at the Assembly Level for i999 igt0 i xi xi s We unroll reg rename each iter do induction var elimination to get ignoring cleanup and guard Unrolled Loop Loop LD F00R1 ADDD F4FOF2 Removed Name 84 Data Dep Loop LD F00R1 ADDD F4 F0 F2 ADDIU R1R178 L D so F878R1 LD F10716R1 ADDD F12 F10 F2 L F00R1 SD F12716R1 10 ADD D F4F0F2 10 LD F14724R1 11 S F40R1 11 ADDD F16F14F2 12 ADDIU R1R178 12 SD F16724R1 13 L D F00R1 13 ADDIU R1R178 14 ADD D F4 F0 F2 14 ADDIU R1 R178 15 F4 0 R1 15 ADDIU R1 R178 16 ADDIU R1 R178 16 ADDIU R1 R178 17 ENE R1R2Loop 17 ENE R1R2Loop 9 Loop Unrolling without Scheduling 10 Loop Unrolling with Scheduling Unrolled Loop Unrolled Loop wt Stalls Unrolled Loop wt Stalls Sched Loop no stalls 1 Loop LD F00R1 1 LOOP L D F00R1 1 LOOP L D F00R1 1 Loop LD F00R1 2 sta 2 sta 2 ADDD F4F0F2 3 DDDHVFOVFQ 3 DDDHVFOVFQ 2 LD F678R1 3 SD F40R1 4 stall 4 stall 3 LD F10716R1 4 L D F678R1 5 SstSII F R 5 stall F R 4 LD F14724R1 6 40 1 6 40 1 5 ADDD F8F6F2 7 Poison 7 F6V78R1 5 ADD D F4FO F2 6 SD F878R1 8 stall 8 stall 6 ADDD F8F6F2 7 LD F10716R1 9 ADDD F8F6F2 9 ADDAD F8F6F2 7 ADDD F12F10F2 s ADDD F12F10F2 10 Staquot 1 5 quot s ADDD F16F14F2 11 stall 11 stall 9 5D F12vi16R1 12 s D F878R1 12 s D F878R1 9 5D F410R1 1o LD F14724R1 13 L D F10716R1 13 L D F10716R1 10 3D F88R1 14 stall 14 stall 11 ADD39D F16 39 F1439 F2 15 ADD D F12F10F2 15 ADD D F12F10F2 11 ADDIU R1 39 R139 32 12 SD F16724R1 16 stall 16 stall 12 SD F1232716R1 13 ADDIU R1R1732 17 stall 17 stall 13 ENE R1R2Loop 18 S D F12716R1 18 S D F12716R1 7 14 ENE Rl39RQ39LOOp 19 L D F14724R1 19 L D F14724R1 14 S39D F1632 24R1 20 stall 20 stall 21 ADD D F16F14F2 21 ADDD F16F14F2 39 Moved 2 5D paSt ADDIU 22 stall 22 stall 0 Loop now takes 144 2 35 cy 23 stall 23 stall I 24 s D F16724R1 24 s D F16724R1 ces Ier ra er an 2 l t th th 28 4 7 25 ADDIU R1 R1 732 25 ADDIU R1 R1 732 26 stall 39 39 26 stall 39 39 a Took 6 cycles after sched 27 ENE R1R2L00p 27 ENE R1R2L00p 28 stall 28 stall before unrolling 11 Reducing Branch Costs with Prediction 0 Branches cause control or branch hazards Saw that need to compute branch target and condition can cause stalls in pipeline We have seen name hazards can be fixed by renaming We will see that data dep can be helped by scheduling 84 OOE We saw that the compilerprogrammer can help with control hazards via unrolihg 0 Hardware can further improve branch performance using prediction 1 Branch prediction predicts whether branch is taken or untaken In static prediction compilerprogrammer indicates if br is taken 12 Branch Prediction Buffers AKA Branch History Table Dynamic branch prediction done by special cache called branch predic tion buffers o Cache indexed by lower portion of branch address branches far apart may overwrite each other Unlike data cache don39t check if this was the right branch at all 0 Contains a bit indicating if the branch was taken or fell through last time Can increase bits used to capture more complex behavior or not In dynamic prediction hardware br predictor guesses if br is taken G39Ves decent accuracy W39th Very we memory or not 0 If branch is taken no help unless branch target is also known BPB alone help if we calc target address very early Schemes for branch target help discussed later 2 Branch Target Prediction predicts where a taken branch will go Only untaken easily done statically Can have front end precompute simple branch targets Dynamically predicted using branch target buffer a Need both if branches are not to cause pipe flushes 13 1 Bit Branch Prediction 1 bit branch prediction 0 Only 1 bit per entry 0 If prediction wrong invert bit 0 Will typically mispredict loop br twice possibly only once for initial example How often will these two branches be mispredicted for i0 i lt 100 i if i amp 1 A else B 14 2 Bit Branch Prediction Use egtlttra bits for prediction 0 Provides a stickiness diction doesn39t thrash Branches that favor one way mispredicted less 0 Must update bits every cycle so will read m write every cycle l bit only on mispred so pre Comp Arch Henn 81 Patt Fig 24 pg 83 Predict taken 10 ta ken 11 A Predict L I l egtltec depending on init Unknowable as it may be an other br entirely Essentially halves of mis predicts in example basically bounce back and forth h tally Now as good as static orizon for i0 i lt 100 i if i amp 1 A else B 15 Correlating Two Level Branch Prediction Buffers Try to improve prediction by examining recent behavior of preceeding branches An mn pred uses behavior of last m dynamic branches to choose between 2 predictors each of which uses n bits of stickiness 0 Records most recent m dynamic branch results using an m bit shift register Requires 2 gtlt n gtlt Ne bits of storage N5 entries Machines almost never use n gt 2 01 conventional l bit predictor 02 conv 2 bit predictor Require 20 gtlt n gtlt N8 n gtlt Ne bits storage In our example last two dynamic branches are if and loop control so m 2 will allow us to perfectly predict the if m 1 would capture only loop control which is not correlated with if behavior 16 mn Branch 22 predictor 0 N8 10 shirt 00 01 10 11 entr for i0 i lt 100 i if i amp 1 A else B ll Prediction Buffer with N8 Entries Each column is n bits wide Each column corresponds to a On local pre dictor select among m columns using branch history stored in shift register entry used e W mod Ne Said to have m bits of global history and n bits of local history During loop will flip between 11 84 Ol predic tors If branch changing loop branch always taken predictors which will remain in steady state of correct prediction 17 Advantages of Global Prediction Comp Arch Henn amp Patt Fig 27 pg 87 0 Little advantage to 1 18 Tournament Branch Prediction 4096enlries 2 bits per emry I D Unlimited enivies 0 y 2 predictor t 300 4 marix a Emsperentry 2Y2 pred beats 02 of l 1024 eniries I 239 same size Result of some branches correlate with prior branches but not always In which case local predictor may indeed be better Therefore can tomcaiv or a Beats 02 of in nite combine local and global predictor In a tournament predictor doduc 53 Si ze 0 Uses multiple predictors usually one local eg 02 and one global 0 eg 22 SPECEg We a 0 Uses sticky selector like 02 selector to choose between local amp benchmaris Imp global next slide 0 Makes better use of bits than just pumping up entries on strictly local or global predictors 0 Must read amp update all pred in parallel ems 5 0 Tournament predictors are the most accurate branch predictors in E 1133 use today 6 39W 1 0 eqnioii 5 6 8 10 12 14 3 18 Frequency oi misprediclions 39 State Diagram for Tournament PrediCtor 20 Misprediction Rate for Three Predictor Types reSlreS2 resx0i predX wrong gomp Arch Henn amp Patt Fig 28 pg 88 0 Based on SPEC89 resX1 predX right 3 benchmarks 0 Never change state if both right or both Tournament pred best wrong Small size 32K does 0 Strengthen pred if current right other WWW pretty well wrong a 0 Little improvement be Weaken pred if current wrong other WW WW yond 128K Why pumping Ne have so little effect Tolai Diediclur size 21 Is Branch Prediction Sufficient o The idea behind branch prediction is to know where to fetch next inst from to avoid draining the pipe If we predict taken still don t know where to fetch from until the branch target BT is computed Pred takenuntaken helps hide late stage condition detection but taken branches still penalized until BT computing stage Still helpful for long running cond like floating point comp 39 We need a way to predict the BT as well as whether the branch will be taken if we are to avoid branch penalties even assuming perfect prediction i Branch prediction alone may be very helpful with multi issue amp Instruction Fetch Unit which may precompute BT only indirect jumps require reading non PC register 22 Simple Branch Target Buffer branch target buff limii 0 Stores only taken br untaken behave like non br Replaces 01 predictor 0 Checked during IF stage for earliest resolution 0 Must store relevant bits of PC tag to avoid flushing the pipe on non br e r m EEEIKI milmm WWW m latlain cam Requires more bits than branch prediction PC has implied bits due to alignment and possibly entry Keepingchecking PC of inst high overhead extra bits make updates more expensive as well Some BT buffers store actual BT instructions rather than just 2 Hardware can do branch folding If keep pred bits n gt 1 in buff must store taken w untaken Can combine with indep br pred then keep BT buff small 23 Simple Branch Target Buffer Access amp Penalties Comp Arch Henn amp Patt Fig 223 pg 124 0 Without delay slot 2 CyCIe penalty for mispredict 1 cycle of fetch of wrong inst Discover error in ID stage 1 cycle to update BT buffer penalty table Send PC to memory and branchlarger buffer Entry found in branchtarget J In buff Pred Actual Penalty Yes ta ke n ta ke n 0 micro Yes ta ke n u nta ke n 2 mm N o ta ke n 2 D N o u nta ke n O N mal branch instruclion execution Mispredlcied branch Branch correctly kill etched insirucllon predicted Enter branch instruction EX u I 4 PC IMO branch wiih no stalls iargei buffer targel delete eniry from urge buffer 24 Integrated Instruction Fetch Unit To feed wide backends many architectures use an Integrated Instruction Fetch Unit which performs 0 Integrated branch prediction When inst is fetched it is decoded so that branches are identified If they are at least simple br targets are computed PCimmed Branch is predicted so that we can fetch correct instructions Instruction prefetch Using br pred instructions are fetched into buffer before they are accessed to provide a pool of schedulable inst for backend Instruction memory access and buffering multiple inst may cross cache boundaries so buffer them in known queue using prefetching and feed them to the backend as needed gt IFU even more important in x86 where x86 inst are of varying length and so getting one inst may take multiple reads 26 Dynamic Scheduling Out of Order Execution 25 Dynamic Scheduling o To support dyn sched in our pipe must split ID into two stages 1 Issue Decode inst check for structural hazards 2 Read Operands Wait until no data hazards then read operands o While issue is inorder execution and completion need not be Hardware is designed to reorder instructions on the fly as needed o Inst are issued in order but possibly executed out of order o Can schedule unrelated inst between RAW insts o Advantages avoids many data stalls allows greater use of multiple FU does not require good compiler easily combined with renaming to solve name dep Disadvantages Complicates hardware possibly slowing clock rate makes exec less predictable Different ways of doing dynamic scheduling we will use Tomasulo s Approach DIVD FOF2F4 o No dep so start SUB while DIV ADDD F10FOF8 still running why no work for SUBD F12F8F14 inorder SUBD F3F5F6 o If multiple fadders EX both SUB same time o Later inst may finish before DIVD OO completion gt OOEOOC complicates exception handling inorder commit dis cussed later 28 Tomasulo s Algorithm for MIPS FPU 27 Tomasulo Approach Comp Arch Henn 23 Patt Fig 29 pg 94 o Inst queue buffers inst FIFO from inst unit Each FPU has res stat that has multiple entries RS entry hold inst ops and flow info enema LD buffers hold EA compo b nants until calc EA while wait 1 ing for mem bus value while 1 I waiting for CDB sched r 7 Jam I ST buffers hlold EAlcomp until Store buffs can tell which EA are truly different dyn mem disamb amenuni i l mem calc EA Whlle awfaltmg Idlata to 39 ST hold both while waiting for allowing independent LD to complete before preceeding ST I Commendaiabusicaei mem unit Invented by Robert Tomasulo of IBM Hommsimawnn Uses lvl of indirection called reservation stations to track when l operands are available defeat RAW haz to enable DOE and reg mgjggg Fpiegisiml ister renaming to avoid WARWAW hazards As inst issued register specifiers are renamed to reservation sta 3333 tion or load buffer entry 1 Fioangpoim I operations Store buffers Res Statbuffers updated via common data bus CDB coming I 1 I Operationbus Inst wait until CDB provides data begin EX when available dynamic I 29 Tomasulo Algorithm Steps TA requires new way of viewing inst from our ipipe each step may take 30 Bookkeeping for Tomasulo s Algorithm In order to detect 84 eliminate hazards need bookkeeping info multiple Cycles and be pipelinedl l Reservation Stationbuffer fields 39 39 O o eration to erform on source 0 erands S1 and S2 Set 1 Issue Get inst from queue issue to appropriate res statbuffer duprin pissue p p if there is one available otherwise stall on structural hazard If Q Qg Res39 Stat entries that will roduce Si and 2 0 means 0 39 operands avail in reg write vals to res stat else add ptrs to res u heezed or actual value in V Setpin issue and exec statbuff that will have results Thus input is renamed defeating 39 39 V V The value of S1 and S2 I nored unless IS 0 Set In WAW and WAR hazards isis uekand exec g Q 2 Execute Monitor CDB until alloperands available RAW and write A EA info f0 IdSt Initially imm eld issue after EA calc them to res stat entry When all operands are available egtltecute e39xec EA 39 operation assuming prior branches completed Mult inst may be 39 Bus Is this reservation station currentl ein used Set In ready at same time LDST first compute EA then perform op issu and write results y b 9 ST require ordering with other ST and LD 39 I I 2 New field for register file 3 Write result Write result to CD8 and from there to registers reser Q I Res stat entry whose operation will produce a result for this 0 39 vation stations and ST buffers ST write data to mem when both rel If 0 no currentl active inst has 4th re as out ut Sim I 7 EA and value available they are sent to MU and ST completes ovgerwritten by later it to affect writebackg Squashp Dy 32 Computing Tomasulo Algorithm Information 31 Tomasulo Algorithm Information Tables Res Stat load5 store5 fadd3 fmu2 after only first load completed Fig 210 in book wrong Assume Iatencies loadzl faddZQ fmulzlo fdiv40 I t t iquot5trUCtIi quot hisgory Wt W 0 Computer steps of TA given in book but as human can easily n5 l UC IOFI ssue xec H e eg i L39D FG B MRZ Yes Yes Yes F0 Mum compute table below using res stat dep between Inst and cycle LD F245R3 Yes Yes No F2 Load2 times for ops MULD F0F2F4 Yes No No F4 0 SUBD F8F2F6 Yes No No F6 0 DIVD F10F0F6 No No No F8 Add1 Instruction ADDD F6F8F2 No No No F10 0 Stat39on at FU Reservation station info m Op V Vk Q Qk A 1 ngd2 yes LD 45RegsR3 1 ye su d Meml34Regis2H Load2 0 2 m 0 Mm g3 mu39d Regslm Loadz 0 0 Now we know that load 1 completes at cycle 4 Fill in TA state tables as they would be at cycle 4 to get prev slide CDB write adds 1 cycle latency to all inst 33 Tomasulo Algorithm Information Tables 2 Compute for when MUL is ready to write it39s result instruction history regfile tags Instruction Issue Exec Write Reg Q LD F634R2 F0 LD F245R3 F2 MULD F0F2F4 F4 SUBD F8F2F6 F6 DIVD F10F0F6 F8 ADDD F6F8F2 F10 34 Compute Tomasulo Algorithm Information for Loop Res stat load5 store5 fmul2 bul alul Assume cycles cachel fmul4 BU comp BT during issue verifies 39taken39 prediction in Exec Any xfer between units must use CD8 1 cycle latency Note inst from loop ALUBU not really res stat In stru ctio n 35 Hardware Based Speculation In order to get reasonable ILP must reduce time wasted due to con trol hazards which requires speculation This chapter concerned with hardware based speculation 0 Combines three key ideas 1 Dynamic branch prediction what inst past br 2 Speculation execute inst past br 3 Dynamic scheduling out of order execution work around data m control hazards o Inst on such a machine must 1 Issue in order 2 Can execute 84 produce results out of order 3 Update state of machine commit in order L 36 Speculative Tomasulo Approach Allows speculative execution with dynamic scheduling based on Tomasulo39s algorithm Like Tomasulo39s algorithm it allows instructions to complete out of order Unlike Tomasulo39s algorithm it forces the instructions to commit update machine state such as registers or memory in order This means Precise exceptions are supported Supports speculative execution Will add additional step instruction commit to Tomasulo39s alg Must have hardware support to buffer out of order results until ready for in order commit 37 Speculative Tomasulo s Algorithm for MIPS FPU Henn 84 214 1 0 Addition is that CDB writes to the Re order buffer ROB rather than directly to FP reg39SterS I Issue Get inst from queue issue to appropriate res statbuffer Holds results amp prOVIdes results to I if there is one available otherwise stall on structural hazard If Data SUbsequent Operat39ons as reIS Stat operands avail in reg write vals to res stat else add ptrs to res quoteg s e s used to d between COImpletlon Of statbuff that will have results Thus input is renamed defeating and mower commI39t WAW and WAR hazards L39KeI Inst queue may be Implemented Execute Monitor CDB until all operands available RAW and write as C39rCUIar queue I I them to res stat entry When all operands are available execute used to aCCOImpl39ShI renammg dIef operation assuming prior branches completed Mult inst may be namIe haz amp Insure 39morder comm ready at same time LDST first compute EA then perform op Reg39SterS have reorder DUfr nOt res ST require ordering with other ST and LD d Stat I Write result Write result to CDB and from there to registers reser MIPS FP mt 5mg Store DUfrerS replaced W39th reorderen vation stations and ST buffers ST write data to mem when both Specmat39ve Tomasu39o 5 try ROE hows amp d ta EA and value available they are sent to MU and ST completes algorithm Res stat stlll buff op until operands amp FU available data amp struct hazards 38 Repeat Outof order Commit Tomasulo Algorithm Steps Load buffers 39 Speculative InOrder Commit Tomasulo Steps Issue Get inst from queue Issue it if there is an empty res stat 40 Speculative Tomasulo Approach empty slot in ROB otherwise stall issue wt struct haz Mark both as busy If operands avail in regs or ROB write them to res res stationsrob entries id5I fpadd3I fpmui2I robzg stat Write output s ROB entry to res stat Oniy one of each izu ALU handles EA Execute Monitor CDB for operands to become available avoids iiviUiI FADE pipelinedI division done by iiviUi RAW hazards Then execute op when FU available struct haz 1 cycle in EX iop including EA branch or cache Stores need only base reg and imm EA calc Loads have additional o 2 cycles for ADDD 10 for MULD amp 40 for DIVD MEM step in this stage Write result Write result to CDB including ROB tag provided during issue From CDB write to indicated ROB entry and any res stat waiting for result Mark res stat as available Store monitors res mem MUL D F0 F2 4 39 39 39 39 SUB D F8 F6 F2 CDB for value writes it to ROB when it arrives DIVD F10 F0 F6 Commit When result at head of ROB is available commit it If ADM F6 F8 store or op write mem or reg free ROB entry If mispredicted branch flush ROB restart at correct branch successor A A 41 Speculative Tomasulo Information Tables When MULD ready to commit 4 end of cycle 16 Only way inst stays in issue is unavailable ROB or res stat Egtltecute of timing table 7b egtltecute of steps which happens cycle after successful issue Write step of alg cont until inst reaches head of ROB write of CDB only on cycle A A A 41 Speculative Tomasulo Information Tables When MULD ready to commit 4 end of cycle 16 Only way inst stays in issue is unavailable ROB or res stat Egtltecute of timing table 7b egtltecute of steps which happens cycle after successful issue Write step of alg cont until inst at ROB head Book may assume V written cycle 17 but we won39t 42 Speculative Tomasulo Example 2 Res stat oad5 store5 fadd2 bul aul Assume cycles cachel fadd4 single issuecommitCDBcache 16 entry ROB Stores now do MEM during commit 43 Issuing Multiple Instructions in Parallel Must have multi issue to achieve CPI lt 1 IPC gt 1 Requires simultaneous fetching decoding and executing of multiple instructions Two approaches to multi issue l Superscalar hardware egtltamines inst stream for parallel inst 2 VLIW Compiler puts parallel inst in VLIWpacket Intel modified VLIW idea slightly call it EPIC VLIWEPIC covered in Appendix G Most of this chap concentrates on multi issue using superscalar 44 TOD Five Mul ssue Approaches Issue Hazard Dmmgmsm g ADWaacn structure meteman scheam charamenmc Imwemenzauan sn n7aynmc nzmwzve m womevexe sunu visvzrclv ARMMIPS WWW ammo executmn mums WWW specuwtwe 5 MIPS am we speeumwe supersa zr 2y pp e Wm Dne y WeMeW VLlWEPIC m next wup e efsnaes 45 Stan Munnssue with Very Long xnsuucuon Word VLIW Tne meHer n25 respensnpmty to package mump e mdeD mst m7 Qether m vuw c manymstmcuonsm pzrzuex pastmzmnesnzyeyznea my I by 2 cunt o LP meHerczn msoeyer wnn many funcuonz urns to feed ms reqwes very eempneatea c mmg support 7 Superb otx sdweduwvg 46 vuw cnauenges As mzcmne QeE men 10 must keep pace mump e ports to reg me must be zb e to ssue mump e ozds at once etc 7 Too mudw wmtn may restrzm mzxnnmn dockSDeed Any mncuonz umtstzH eg meacnewm causeennreproctostzH 7 Czdwe mnsses are not as premeupxe so dyan 2 2 Word xengtn wmdw mean correcuy May sn nmcznuy mcrezse ode snze nd 29 essnye comDHer oDmmzzUons eg n 2 you must recomDHe executab es to run a a due to msertmg MP W yuw onQ ooD am my Pzr2He Instrucnon Computer aesngn at the cost efsnnpnaty 47 All Aboard the Good 5th ltamc wum n mm a Mum C aanuw 1Apnnaus 7 mm 4quot nm 77mmquot 44mm mm mm m nu ma zaps mu m 48 EPICItanium Remarks I Itanium does not appear to be a major factor in the marketplace Still increasing perf of DSS seems to indicate hardware complex ity has not yet hit a ceiling Loss of Mhz a slow CMPs a opportunity to late Itanium has not shown that software oriented approach necessar ily yields simpler or faster processors Egtltcellent compiler support has remained elusive i IA 64 arch does Lot appear to have major architectural advan tage as did caches and RISC in its day I Software 84 hardware approaches are instead intermingling gt I would bet on hardware with some software tricks Software tricks in hardware centric Hardware tricks in software centric I Conditional inst eg move I Scoreboard scheduling of inst I Prefetch inst amp cache hints I Dynamic br pred I Br pred hints I Rollback or trapeandefixeup support for I Spec noneexcepting loads speculation I Hardware for checking spec load core rectness 49 Static Superscalar Approach I Issue multiple inst that are quickly detected as independent Most SS allow 4 inst at most dep checking available ILP If one inst in set has some dependency only prior inst are issued maintains in order execution Must increase inst fetch rate widen icache bus pipeline IF often by having integrated inst fetch unit Longer pipeline and multi issue result in more hazards Q Why does multi issue result in more hazards I No increase in code size and less dependence on compiler than VLIW I Programs compiled for nonsuperscalar machines may still get benefit 50 Simple Static Superscalar Pipeline in Operation I Most superscalar procs have issue restriction which limits the types of instructions that may be issued simultaneously Parallel inst called issue packet May have O n 2 gn g 4 inst in issue packet I An easy issue rest for dual issue is fp comp and other ldstiop Dependencies limited since they use different WB and EX ex cept for fp loads Will become single issue for all integerfp code Since fp computations are long running will need pipelined and perhaps multiple FPUs I Need for multiple inst Inst type Pipe stage intinst IF ID EX M B cycle often requires rp inst IF ID FEX FWB Superpipelining IFID Int inst IF ID EX MEM we stages fp Inst IF ID FEX FWB I intinst IF ID EX MEM we New FUS Instruc fp inst 1F ID FEX FWB tion Fetch Unit Issue Unit etc 51 Simple Dynamic Superscalar with Tomasulo Other than multiple instruction fetch and issue Tomasulo39s Algorithm works unchanged for dynamic superscalar and defeats some issue probs I fp cycles 2 3 iop cyclesl I CDB write adds 1 cycle lat I Branches req In stru ctio n single issue I BCU ALU spec for br cond 84 BT I ldst use ALU for EA Must watch for struct haz a ADDIU delayed by SD s use of ALU in EX stage gt Later inst delayed until branch is resolved 52 Resource Usage Table fig 326 o ADDD only shows up in 1St EX cycle since unit pipelined c Any cycle without entry in both ALU 84 FPU is missed opportunity 0 Any resource including CDB can cause struc tural hazard 53 Dynamic Superscalar with Tomasulo Ex 2 0 Had repeated struct stalls on ALU ADDIUSD If we add ALU alone will then stall on CD8 Assume 2 d ALU for EA calculation 84 2 d CDB o fp cycles 2 3 iop cycles 2 1 Insm Ct39O stat at EU BE 0 CD8 write adds 1 cycle lat o Branches req single issue 0 ECU ALU spec for pr cond 84 BT o EAU ALU for EA A Last inst done at end of cycle 16 rather than 19 54 Resource Usage Table fig 328 Assume 2nd ALU for EA calculation 84 2nd CDB 0 So we see that widening one stage ALU may require us to widen others CD8 0 With less dep or more ALUFPUs might need more D cache ports 55 Timing Table for Dual Issue Pipeline wo Speculation A Sep ALU 84 EA units allow parallel egtltec eg cycle 3 A 2nd SW egtltec on cycle 9 since LW using EA unit on 8 i final instruction finished at cycle 19 1 Appendix A Concepts 2 Pipelining 0 Basics of pipelining o Hazards o Pipelining like an assembly line Structural Fine grained parallelism between pipe Data stages Control 0 Each step pipe step takes 1 cycle 0 Pipeline implementation 0 Different steps from different instructions are Data path processed in parallel Control 0 Pipelining improves throughput Exceptions Individual instructions will take slightly 0 Multicycle operations longer due to pipeline overhead 0 Precise exceptionsinterrupts 4 Simple RISC Pipeline 3 Simple Integer Pipeline 0 Time moves left to right inst order top to bottom 1 IF instruction fetch fetch inst 3 EX executioneffective 0 If we handle each inst sequentially unpipelined frm i cache and increment PC 0 For mem calc EA 0 Notice at clock 5 we have 5 inst issuing in parallel 2 ID inst decode regimm Ideally will give 5 fold speedup over unpipelined code 0 Decode inst 0 Perform ALU oper on two in o Read vals from register file put regs 0 check for equal on those vals 0 Perform ALU oper on reg Inst 0 C numzer for branCh and 39mm inst 1 IF ID EX MEM we 0 Sign egtlttend Immed val 4 MEM memory access ld val Calc pc rel target address us from or store val to d cache ing immed WB write back update reg file with result of op or Id 01 5 Pipeline as series Of time shifted data paths Cycle 5 shows what happens when pipe is full 0 Each pipe stage processing diff inst 0 Running all stages in parallel may require duplicate hardware 0 Reg read in ID stage occurs in last half of cycle reg write in WB occurs in first so no conflict Computer Architecture Henn amp Patt Fig A2 pg A 8 Time iin Clock Dvclesl CC CC 2 C l l 4 22 7 r 3 lM i Reg IC 1 CC 7 CC a F i ii 4l 6 Reducing Branch Stall Understanding Pipelineing o MUX multiplexor Used to decide which of N inputs gets passed thru 0 Data amp control must be passed between stages a Internal regs pipeline registers or latches are used Named for they connect Computer Architecture Henn amp Patt Fig A24 pg A 38 stages 0 Pipe regs are state elements amp retain vals between pipe stages Other state elts include memory general regs the PC etc Any retained info must be passed thru pipe regs as long as needed Dest reg passed through to last pipereg 0 Branch haz red by moving zero test amp branch targ calc into ID stage Requires extra hardware adder zero test 7 Pipelining Limitations lower bounds on cycle time 0 Overhead from info flow in piperegs o Imbalance between stages IF MEM EX often fat 0 Minimum realizable cycle time due to clock skew N deep pipe give NfOId speedup pipedepth 1 p lpe zstallszperzmst speedup o Pipeline hazards Structural resource con flicts Data result dependencies Control changing PC branch 8 Structural Hazards Structural hazard Resource conflict due to hardware not supporting all possible combinations of overlapping instructions 0 Why do structural hazard occur FU not fully pipelined div A resource not duped enough mem 0 Why not design hardware so they never occur If case is rare not worth expense For rare inst pipe will be empty so non pipelined inst better 0 We will see that MIPS detects all stalls in ID which requires extra hardware 9 Memory Port as a Structural Hazard Computer Architecture HampP Fig A4 pg A14 O Stl UC haz from LD S MEM amp CLi DIP CL J CN I39le CCC 37 ill 3 s A Hji 0 Why most pipelined machines 2 have separate ID caches quoti i 141 2 Mg a J i E i Del i it ENE117 9 Enforcing stage execution avoids structural hazards LW R1 4R2 IF ID EX MEM WB Liiii R1 4R2 IF ID EX MEM WB ADD R4 R2 R3 IF ID EX MEM WB ADD R4 R2 R3 IF ID EX WB 10 Effects of Memory Structural Hazard on Pipeline Clocknumber Inst 1 2 3 4 5 6 7 8 9 10 Load IF ID EX MEM WB inst i l 1 IF ID EX MEM WB inst i l 2 IF ID EX MEM WB inst il 3 stall IF ID EX MEM WB inst i l 4 IF ID EX MEM WB insti5 IF ID EX ME inst il 6 IF ID EX o Stalls called bubbles as in liquid bearing pipe 0 Nothing useful goes through in bubble No inst started on cycle 4 No inst finished on cycle 8 11 Dependences Data amp control hazards arise from dependences between inst 0 Types of dependence described in detail in Ch 2 Data dependence inst requires data computed by preceeding insts Name dependence Two insts one or both of which write access same regmem Control dependence whether or not an inst is reached is deter mined by previous insts 0 Dependence indicate Possibility of hazard Constraint on ordering of results Constraint on parallelism 12 Data Hazards Data hazards overlapping of piped inst causes different writeread access patterns 0 Access patterns of interest RAW read after write most common hazard true dependence gtllt In our ipipe only loads can cause RAW stalls assuming for warding RAR never a hazard WAW hazard cannot occur in our ipipe since all writes occur at same pipe stage MEMWB WAR hazard cannot occur since all reads take place before any writes in our pipe 13 Example of RAW Data Hazard Which of these insts have data hazard with DADD DADD R1 R2 R3 DSUB R4 R1 R5 AND R6 R1 R7 0R R8 R1 R9 XOR R10 R1 R11 Computer Architecture HampP Fig A 6 pg A716 mnW ni 39 lall 14 Ameliorating Data Hazards through Forwarding Forwarding AKA bypassingsnort circuiting overcomes many data hazards o Vals forwd frm piperegs di Comp utELmhitecture HampP Fig A7 pg A718 rectly after computation simple egtlt forwarded only I I a I I mm v a to same unit I I1 39 cum1m W n I131ng i 0 Data paths forward frm 5 I I g w EXMEM 84 MEMWEB n 39 I1 lgquot piperegs s 39 B I g quotaquot h o Comparators check if reg n I I III Er src of inst matches dest of a prev inflight inst If 50 control logic mm V III chooses forwarded val Why fwdng not needed for 15 Forwarding from MEM 0 Can forward between any Sta 9 es 0 If fwd from MEM succ data access does not stall DADD R1 R2 R3 LD R4 0R1 SD R4 12R1 Would DADD stall LD R4 0R1 IF ID EX DM WB DADD R5 R4 R2 IF ID EX DM WB Computer Architecture HampP Fig A 8 pg A719 iii ll I ll rl I OR 16 Forwarding Limitations LD R1 0R2 IF ID EX DM WB Computernfrcuhitecture HampP Fig A9 pg A720 DSUB R4 R1 R5 IF ID EX DM WB quot 39 AND R6 R1 R7 IF ID EX DM wa 39 quot quot39 39 0R R8 R1 R9 IF ID EX DM wa M n I 39 I I 0 Cannot forward into the past I I 39 mm a I39 umuw 39 End cycle 4 to beg 4 will not work DSUB must stall 7Mnnmmmimm 17 stalling for Correct Execution 18 Forwarding Adds Complexity to Hardware Forwarding to ALU requires 3 extra inputs V MUX and 3 new paths to these inputs 0 Must stall DSUB Q 4 to produce LD result Why does stall affect subsequent insts Why does stall not affect prior insts 0 value comes from Computer Architecture HampP Fig A23 pg A 37 number reg frm reg m LHlLH quot 5quot 1 2 3 4 5 6 7 8 9 2 EXMEM ALU result of quot LD R10R2 IF ID EX MEM WB prev inst DSUB R4R1R5 ID MEM WB AND 363137 EX MEM WB MEMWB ALU or load re sult of prev inst 0R R8R1R9 EX oControl of MUX reQLnres LD R10R2 DSUB R4 R1 R5 AND R6R1R7 0R R8R1R9 compare src reg of current inst with dest regs of prev inst in IDEX EXMEM and MEMWB piperegs stall ID stall IF 19 Load RAW Interlock for Integer Pipeline Pass frm ID to EX inst is issued All data haz det in ID Comparators det if two reg the same Code LD R145R2 DADD R5R6R7 DSUB R8R6R7 DR R9R6R7 LD R1 45 R2 DADD R5 R1 R7 Result No dep Action R1 not used after EX so no action 20 Instruction Scheduling Some dependency stalls can be defeated by instruction scheduling o Simple scheduling may be done in hardware 0 More complex scheduling like this may be done by compiler Stall for depend comparators det use of R1 in DADD stall DADD Only prob comes with load in EX and DSUB R8 R6 R7 R9R6R7 and succ inst before DADD enters EX use in ID as shown in table gt Insert bubble if read LD R145R2 R5R6R7 R8R1R7 R9R6R7 Depend defeated by for warding Comp detect use of R1 in DSUB forward Id val in time for DSUM to enter EX in ID load in EX and read matches dest R145R2 R5R6R7 R8R6R7 R9R1R7 Depend but accesses in order Read of R1 by OR in 2nd half of ID while write oc cured in 1St WB of LD LW LW DADD SW DADDI R4 R1 R2 R3 R3 0 R4 4R4 R2 R1 8R4 R4 4 0 R4 4R4 R4 4 R2 R1 4R4 LW R1 LW R2 DADDI R4 DADD R3 SW R3 21 Control Branch Dependencies An inst is control dependent on a branch if the inst will only be executed when the branch has a specific result o An inst that is control dep cannot be moved before the branch Execute unnecessary statement Why bad An inst that is not control dep cannot be moved after branch Fail to execute required inst Which statements are control dep DADD R1R2R3 1b DIFF DADD R9R3R8 2b SUB R5 R1 R6 BNE R4R5DIFF 3b AND R6 R1 R8 SUB R5 R1 R6 4b DONE 0R R6 R1 R8 5b XOR R7 R7 R7 B DONE 22 Control Branch Hazards Occur when CPU does not know soon enough Whether branch is taken Target of the jump To avoid stalling CPU due to control hazard can Determine above info early in pipe we do so in ID Delay branch execution until info is known delay slots Branch hazards typically impact integer performance worse than delay or structural Our machine uses delay slot amp early ID detection 23 Extra Hardware Needed for Early Branch Discovery hardware costs of early branch computation Note extra adder amp zero test in ID Zero test all bits same i easier than magnitude we may complete by end of ID compute immPC V inst Due to complexity many archs resolve branches later Computer Architecture Henn amp Patt Fig A24 pg A 38 24 Techniques for Addressing Branch Hazards Predict not taken 0 Can t change machine state until prediction verified Predict taken 0 Can t change machine state until prediction verified 0 Must know target Q Delayed branches 0 Place non control dep inst in delay slots Why do we need only 1 delay slot in MIPS pipe Branch pred amp target buffers gt More complex techniques disscussed in Chapter 2 25 Pipeline for Predict Untaken wo Delay Slot 0 If pred wrong change op to noop before state change 0 Predict 39taken39 requires delay slot or computing branch target in IF Why is this better than always delaying What are advantages of predicting takenuntaken Clocknumber 5 6 Inst 1 2 3 4 7 8 9 untaken br IF ID EX MEM WB inst i1 IF ID EX MEM WB inst i2 IF ID EX MEM WB inst i3 IF ID EX MEM WB taken br IF ID EX MEM WB inst i1 IF nop nop nop nop targ IF ID EX MEM WB targl IF ID EX MEM WB 26 Delay Slots Instruction in delay slot executed regardless of branch outcome 0 If delay slot fu no bubble in pipe regardless of prediction 0 Compiler must find valid insts Best choice is to use independent inst from above branch Otherwise use inst from fall thru or target but only if inst doesn39t change program behavior when executed uselessly no interrupt no side effect 0 Some archs allow for a cancelling or nulli ing branch which nops delay slot inst if branch contained prediction is wrong 0 Additional restrictions on delay slot inst branch 0 Can cause complications return from interrupt o of delay slots pipeline detail visabIe in ISA 0 Less common these days 27 Exceptionsinterrupts Exceptions AKA interrupt Unscheduled events that change the nor mal execution order of instructions eg by calling an exception handler 0 Examples IO call to system space int overflow FP anomaly memory protection violation etc 0 Must save machine state handle exception 84 if possible restart normal execution o Exceptions more complex with pipelining as mult inst in flight o Precise exceptions guarantee inst before the fault are completed and inst following it are not allowed to change machine state until after exception is handled 28 Classifying Exceptions 4 Within Excep occurs during Synchronous event occurs at the internal exec of an inst not same place every execution i 2 User requested Prog directly in between separate inst asks for exception 5 Resume execution resumes af 3 Maskabe User can override ex ter handler else terminate ception Sync User User Within Resume Exception Async Coerced Mask Between Term IO dev req async coerced NO between resume OS call sync user NO between resume Inst trace sync user YES between resume breakpoint sync user YES between resume ioverflow sync coerced YES within resume fp over und sync coerced YES Within resume page fault sync coerced NO Within resume misalign mem sync coerced YES within resume mem prot viol sync coerced NO within resume Undef inst sync coerced NO within resume Hardware malfunc async coerced NO within terminate within terminate Power failure async coerced NO 29 Exceptions in MIPS Integer Pipe o Most exceptions involve memory segfault 0 Exception raised by inst i I a could occur before the exception from 30 Multiple Cycle Operations 0 Many arithmetic operations traditionally performed in multiple cy cles lnSt i Integer amp floating point multiplies Npipelined gt Start to see why precise exceptions hard Integer amp floating point divides Nunpipelined fp adds and subtracts Npipelined o Multicycle ops tend to require too much hardware andor too long a clock cycle to be worth making single cycle If expense no object can we make all ops single cycle without slowing down clock 0 Multicycle uhpipelined ops even more so and often relatively rare Stage Possible Exceptions IF Page fault inst fetch misaligned mem access mem prot viol ID Undefinedillegal opcode EX Arithmetic exception MEM Page fault data misaligned mem access mem prot viol WB None 31 Adding the Floating Point Pipes o latency of intervening cycles between an inst that produces a result and inst that uses it Since ops consume data in EX 2 of stages in EX 1 pipelenfu 1 o repeatinitiation interval cycles that must elapse between issuing two inst of same type c Fully pipelined muladd unpiped div why How many new preg 32 Pipelining of Independent FP Instructions o No stalls indep operands units 0 Stage where operands consumed in italics gt Operands consumed in order 0 Stage where result available forwarding is underlined Add result before mul load before add gt Operands available out of order What data does store need in EX amp MEM Comp Arch Henn amp Patt Fig A31 pg A 50 FU Lat Repeat iALU o 1 if DM 1 1 fadd 3 M1 M4 M5 M6 W MEM WB ID A3 MEM WB fmu39 6 1 IF MEM WB 33 Complications from Multicycle Operations More structural hazards on FPU when not fully pipelined Multiple inst can attempt to write the register file at same time WAW hazards possible since inst may not reach MEMWEB in order Out of order completion makes precise exceptions even harder More time wasted in RAW stalls The longer the multicycle pipeline the more complicated stall and P P FWN forwarding logic becomes 34 Additional RAW stalls from FP Pipes 0 Pipelined units help with structural hazards but do nothing for data 0 Mul stalled on load delay add 84 store stalled on RAW hazards c To simplify hardware many mach would stall store in ID stage until data available ie don39t issue the inst until data available Why does ADDD stall in IF 84 ID Why can store execute EX before result available Other than WB fp ldst use ipipe why Instruction 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 mm W 0 With differing length inst can have mult inst writing at same time as shown Fig A34 pg A 52 which can cause structural 84 WAW hazards and complicate exceptions a Our pipelines stall even store in ID but let it go so it hits MEM at same cycle 35 Multiple Exceptions Even in ipipe exceptions complicate things 1 2 3 4 5 6 7 8 Lin IF ID EX MEM WB ADD IF ID EX MEM WB AND IF ID EX MEM WB SUB IF ID EX MEM NB 0 Mult exceptions could occur in clock cycle 4 integer overflow in ADD 84 page fault in LW 0 Exceptions could occur out of order AND could page fault in IF cycle 3 while prior instruction LW doesn39t page fault in MEM until cycle 4 36 Precise Exceptions Supporting Precise Exceptions l Buffer results of op until all pre ceeding ops have completed 0 Expensive grows with pipe Precise exceptions require 1 Exceptions addressed in pipe or der 2 All inst preceeding faulting inst are allowed to complete len 84 variance in inst len 3 Faulting and following inst are 2 Allow out of order comp keep flushed from pipe state so that trap handling rout 4 After exc handler appropriate can complete preceeding inst 3 Allow issue iff prior inst are sure inst restarted assuming not ter minated to complete without exceptions 0 Must det if excep poss early in EX esp for PP 37 Actual Pipeline R4000 38 R4000 Pipe Causes 2 cycle Loaduse Delay IF PC select init ifetch DF init dfetch IS complete ifetch DS complete cache access 39 Data aVai39ab39e at end Of D5 caChe hit RF inst decode reg fetch TC tag check det if cache hit 39 If TC ShOWS miSS DiDe mUSt be baCked UD icache hit det hazard check WB write back to regfile Comp Arch Henn amp Patty Fig A38 pg A59 EX EA calc ALU op branch gt We see caches slower than ALU target cond eval on this arch Comp Arch Henn amp Patt Fig A37 pg A 58 40 Three cycle Branch Delay in R4000 R4000 predicts hot taken computes branch targ in EX 39 Two cycle LDUse Delay in R4000 Br targ forwarded from EX 4th stage gt 3 cycle delay Compiler can reduce taken branches 0 Forwarding from DS stage Why not have 3 delay slots Clock number Clock number Inst 1 2 3 4 5 6 7 8 9 Inst 1 2 3 4 5 6 7 8 LD R1 IF IS RF EX DF DS TC WB taken branch 139 IF IS RF EX DF DS TC WB DADD R2R1 IE IS RF stall stall EX DF delay slot 139 1 IS RF DF DS TC DSUB R3R1 IE IS stall stall RF EX stall z 2 hop hop 0R R4R1 IF stall stall IS RF stall 03 hop hop branch targ 1 Is forwarding needed to defeat hazard on DSUB 2 Is forwarding needed to defeat hazard on GR untaken branch 1 delay slot 139 1 inst i 2 inst i 3 1 Chapter 1 Concepts Terminology Trends Measures of Computer Performance Principles of Computer Design poop 2 Basic Terminology ILP inst lvl parallelism o CMP Chip multiprocessor CPI clocks per inst o multi issue ability to perform IPC inst per clock more than one inst at once FPU floating point unit 0 pipelining breaking single inst ALU arithmetic logic unit up into sub processes PE processing element eg FPU ALU etc ISA instruction set architecture CISC complex instruction set computer RISC reduced instruction set computer VLIW very long instruction word 3 Microprocessor Performance Growth iSPEC Computer Architecture Henn amp Patt Fig 11 pg 3 O Perf relative to VAX 11780 0 Diff between 25 and 50 due to arch 84 org ideas caches Increased ILP pipelining mul tiple inst issue gt FP speedups even V7 v mm greater Software trends reducing arch inertia 0 Standard operating systems uniX dos windows 0 Less use of assembly In 2002 hit wall 20 Power dissipation little ILP left memory latency Old classes based loosely on size 4 Classes of Computers New classes based on usage 1 Desktop iface directly wt user includes PCs and workstations o Emphasize both price and performance Microcomputer Computer in which entire CPU is on one chip microprocs now used types of computers in all Mincomputer 2 Servers provides computation Mainframes or data to other devices Supercomputer o Emphasize availability scala bility and throughput 3 Embedded systems computer lodged in other devices 0 Variable but tend to empha size price minimizing mem ory and power usage and of ten require predictability 5 Three aspects to architecture 1 Instruction Set Architecture programmer visable machine instruc tions 0 Really binary numbers but can think of it as assembly code 0 Architecture used to imply only ISA design but modern ISAs fairly uniform except in embedded 2 Organization high level aspects of machine 0 memory system including caches bus structure CPU etc 3 Hardware low level aspects of machine 0 detailed logic design and packaging technology i We will emphasize application of 2 6 Goals for Computer Architect Functional requirements usage 0 Application area main jobs desktop scientific database embedded 0 software compatibility ISA compat needed or not 0 OS requirements address space memory management etc 0 standards IEEE fp IO de vices networking etc to 40 find correct balance of Price 0 lowered through less cus tomization and greater vol ume Performance Power constraints 0 Increasingly important even in desktopservers Time to market 0 New features must be worth additional devel time 7 Trends hardware increase per year design trends 0 Transisters on a chip 55 1980 CISC gives way to RISC 0 Program memory usage 15 2 0 1995 PentiumPRO Revenge per year of the CISC o DRAM density 40 60 Frontend translates 0 disk density more than 100 inst to RISC microcode o semiannually newest gen of VLIW fails in marketplace o 2004 PowerPC790FX VLIW CISC software trends 0 Languages become higher level assembly F77C C kinda JaVaDython Frontend translates pseudo Backend optimization con RISC to pseudo VLIW inst o 2005 Dual core on one chip AMDInteIIBM 0 now heat wall refocuses arch from clock rate to CMP trolled at higher levels userassmblr compiler VM 8 Results of Trends Use for transisters Put entire CPU on one chip Put fancy front end on chip Can design backend indepen dantly make ISA almost ir relavent put caches on chip too Perform compiler like forms in hardware inst reordering dep analy sis reg renaming OOE speculative execution etc Put multiple cores on chip Mem controllers GPU more cache etc on chip trans ll Some observations 0 Software slow to adapt hard ware quick VLIW exposed at assembly level requires software to adapt with hardware Clock speed is hitting powerheat walls but so far transister counts are not 0 Egtltpect hardware support to continue to expand VLIWSS multicore VM 0 Egtltpect ISA to less and less de scribe machine Egtltperts predict a reversal of these trends I39m dubious o DieChip 9 Computer Manufacturing Terms 0 Wafer a slice of silicon crystal ingot chopped into dies a portion of the wafer that is an independent component 0 Yield the of manufactured devices that work Improve yielddecrease price a learning curve experience with particular cess volume tizes RampD costs lowers margins commoditization competition rewarding best of breed processes margins manufacturing pro speeds curve amor allow for and minimal Computer Architecture 3rd ed Henn amp Patt Fig 19 pg 21 10 Approximate cost breakdown for 1000 PC in 2001 System Subsystem Liliincl Sl cl39l Ilk39lilll1lll39 Fraction of total Hin39ll Al le EB DVD lmu Suhlntul 5 Blhlk Ulllcc Suilc Est costs of 1K Subsystem has little room for improvement Mem video disks ex pand costs dec slower Processor amp monitor biggest hardware costs Software costs almost greatest despite lowest manufacturing costs 39 MS resists loss of monopoly gt com modization gt lower margins software costs are really in maintainence and support not manufacturing o Wall clock time a CPU time 1 1 Performance Terms Indiv process metrics actual time between start and finish of task AKA walltime elapsed time AKABI response time time CPU spends processing task Doesn t include IO processes etc Further split into 1 user CPU CPU time in user space of process 2 system CPU time spent by OS perf tasks requested by process other Server metric throughput total amount of work done in given time mul tiple tasks Questions CPU or wall have greater reso lqun CPU or wall have greater non res variability Why are timings not repeat able Which metrics make sense for what tasks 12 Relative performance Finding speedup value n timey timegc perfgc Perfy y takes 30 sec to run a routine while 13 takes 10 1 is three times faster Time only absolute measure 39 For performance metric to be useful must standardize conver sion performance metrics MIPS depends on too many factors compiler ISA imple mentation etc MFLOPS most useful when you standardize FLOP count regardless of algorithm Usefulness of performance Performance is a measure of speed Speed less affected by size You know the top possible speed for each arch Bigger being better is intuitive 13 Benchmarking type best to worst gotchas 14 Amdahl s Law States that the perf improv from using some faster mode of egtltec is limited by the fraction of time the faster mode can be used 1 User applications most impor o Compilersarchs may tune for o Allows user to understand limits of a considered optimization tant apps used by end user benchmark wtout generality 0 May be applied to any method of speedup hardware improvement 2 Kernels key kernels extracted 0 Benchmark may not use kernel software optimization parallelization etc from popular applications lin in same way as application 0 Urges us to optimize common case pack livermore loops gemm 0 Hard to capture all im 0 More rigorous form of 9010 rule etc portant info compil 0 Can get rough idea by asking quotwhat if this egtltecution were free 3 toy benchmark small prog that erscachesOSsettings 0 TH new time To original time Fe frac of To egtltploiting enhanc produce known result puzzle Worsens problems of repro Se speedup from enhanc St speedup of total program after enhanc sieve quicksort arraymerge ducibiity o Tn 2 Opt time non opt time TUSEFE To 7 To gtlt F5 2 etC Hard to tell what is causing T057 1 7 1 X Fe 2 To 7Fe 6 4 synthetic benchmarks at scores Tn T01 7 Fe tempts to match freq of ops o Benchmarks will vary on relative E stz z T0 l17F5 and operands from real progs score for same machine Tn m SE a drhystone whetstone St 11 7 Fe if 15 Measuring clock speed 1639 CPUt39me CPI etc Clock Rates Clock Periods o CPUtIme CPUCycles ClockPerIod duh o CPUtime InstCount CPI Period clock ticks In a second tick Interval CPUC ales I I I 1 73 CPI y IC Instruction Count 0 hertz Hz one cycle per sec 0 millisecond ms m 10 10 I I 1 InstCount ISA 84 compiler o KHz one thousand cycsec o microsecond us m 3 76 ClockPeriod hardware tech and org lO 10 I I I I 0 MHz one million cycsec 106 o nanosecond ns billionth 03139 COmPquoter 39InSt seleFtlon org ISA I I I 9 1 79 CPI varIes by Instruction machIne state and Inst mIgtlt 0 GHz one bIllIon cycsec lO W O I I I o Thz on trillion cycsec 1012 o picosecond ps trillionth Gwen program Wt Ct mSts Ofn types eaCh W39th and 101quot 1 10712 CPL 2 271 of Inst of type i gtltCPIi 2 271 10 gtlt 01311 1012 0 Book posits various scen increase CPI reduce period what 0 rate cyclessec period 2 seccycle they are inverses SpeeduD39 I I I In practIce each affects the other and varIes Widely by program I I I and so this is not as useful as they make it sound 139 What IS the peHOd Of a 500 Mhz Chlp E uations become more useful when considerin im rovements 2 What is the clock rate of a chip with a 50 picosecond cycle time q 9 p on subsets of instructions 1 Chapter 5Appendix C Memory Hierarchy 0 General Principles of Memory Hierarchies 0 Understanding Caches and their Design 0 Main Memory Organization 0 Virtual Memory 2 Memory Hierarchy What Is It 0 Key idea Use layers of increasingly large cheap and slow storage Try to keep as much access as possible in small fast levels Try to keep cost per byte almost as low as the slow cheap levels 0 Effectiveness tied to two properties of mem access Spatial locality data located close in memory to the present reference is likely to be used soon Temporal locality Recently accessed data is likely te be used again soon If at least one of these not present random access memory hierarchy will provide no benefit 0 We will see that the memory hierarchy is used to implement protec tion schemes in multitasking OS as well i M 3 Memory Hierarchy overview I Composed of four general Classes managed by varying means 7 k IS 1 Registers come 10M TB piler 2 Caches Harde Main Memory ware 7 usually 5000 25 32 GB several levels 3 Main memory Level 2 08 150 e 500 Cache 256KB 7 8MB 4 Virtual memory 08 Level 1 2 e 25 Cache 8 e 64 KB I Nearer to CPU leve els are small fast 84 1 Registers 8732 words expensive 4 DDOT Naturally Exploits Registers o dot assigned to a register 4N references still needed by DDOT dot product of vectors X 84 Y for dot00 i0 i lt N i dot xi yi algorithm 0 2N FLOPS 2N access of dot now from o 4N references register 0 2N data 1 Init main memory read not needed zeroed 2 N reads and N writes of dot in same register 0 Main memory access reduced to 2N of X 84 Y X 84 Y ref irreducable be cause no reuse Cache still possibly provides benefit line fill 84 prefetch ll 5 Memory Hierarchy Why Processor perf increasing much more rapidly than mem log scale After 25 yrs main mem is lt lOgtlt while CPUs are gt 10 OOOgtlt Without hier faster procs won39t help all time waiting on mem Try to design system where most references satisfied by small fast storage systems and use these sys to hide the cost of total storage Henn 84 289 100 000 Ionou L000 Processor Periormance WES 1990 1995 2000 2005 20m Year 6 Typical Layers of Memory Hierarchy 0 Access time Time from request to access a Each component has its own sys access time may be addition of all or only some some systems run in parallel o Bandwidth how fast data can be streamed from given address CMOS CMOS SRAM DRAM or cache 0 Huge drop in hitting disk gt disk is a mechanical device 7 Memory Hierarchy Terminology Each level of the hierarchy has some defining terms Hit item is found in that level of the hierarchy Miss item is not found in that level of the hierarchy Hit Time time to access item in level including time to determine if access was a hit Miss Penalty time to fetch block from further levels of hierarchy Varies widely even for given level eg may be found in Level 2 cache or on disk giving hugely diff values for diff accesses Miss Rate fraction of access that are not in that level Block the amount of information that is retrieved from the next further level on a miss 8 Memory Hierarchy Equations Hit rate nhit nref Miss rate nmiss nref Average access time hittime missrate misspenalty Total access time hitstime nref missspenaltynmiss In 00 machines some of this time may be hidden 0 Decrease mem cost three ways 1 Decrease hit time hardware 2 Decrease nmiss hardwaresoftware 3 Decrease miss penalty hardwaresoftware 9 Memory Hierarchy Evaluation Methods 0 Software approach requires two components 1 Generate traces of memory accesses Instrumentation Simulation Traps 2 Use memory hierarchy simulator Offline On the fIy 0 Hardware approach use hardware performance counters Available on most machines not all May not have capability of capturing all data Usually not repeatable like real world 10 Memory Hierarchy Questions Since a level does not contain entire address space must ask ourselves l M 40 4s Where can a block be placed in the current level a Block placement associativity caches How is block found if it is in the current level a Block identification Which block should be replaced on miss a Block replacement replacement policy What happens on a write a Write policy 1 1 Cache Basics Cache smaller faster area where data may be temporarily stored and operated on Five basic properties key in understanding caching l Inclusiveexclusive 2 block cache line size 0 Strongly correlated with bus width 0 Simplify book keeping 0 Usually filled in order 0 Optimizes for spatial locality 3 Associativityquot 0 Direct mapped cache l way assoc lots of conflicts 0 Fully associative o N way associative cache 4s 01 Replacement policy 0 Least recently used LRU 0 First in first out FIFO 0 Random or pseudo random Write policy 0 Write through 0 Write back Separateshared Most Ll caches separate in structiondata Most further level caches shared 12 Write Policy Write Miss Policy write through data is written to both cache 84 main mem Simpler to implement Can use write buffers to reduce stalls Cache 84 main mem kept consistant Usual write miss policy no write allocate block is not loaded on write miss but is updated in main memory Write only operands do not hog up cache write back data is written only to cache Modified block has dirty bit set so cache knows to write value back to mem when it is ejected from cache Reduces bus traffic but mem does not always have good values Usual write miss policy write allocate block is loaded on write miss 13 Understanding Cache Lookup and Associativity Cache is like a table with rows being number of sets a loc where a blllt with a particular address may reside and columns containing blocks 0 Set row index is determined byimplies certain bits of mem 0 l way assoc has 1 column given maps to only 1 loc in cache 0 N way associative has N could go N different locations in cache Given 5 must search tags ofN blks for hit Fully associative has only 1 row any placed arbitrarily in cache Column index is undetermined and so all columns must be searched to discover if we have a cache miss High associativity is expensive dup hardware or slower access Low associativity leads to more non capacity conflicts Conceptual Neway Associative Cache associa IVI y I NS cacheszblksz gtlt assoc s address17mg mod NS Kebyte long block in cache yte2 I 14 Tradeofrs in Cache Design 0 Make block size larger Pro More cache hits for contiguous access gt spatial loc less overhead in block spec info eg tag last access Con For non cont access load 84 store more useless data more bus contention less indep blks in cache Increase associativity Pro Less conflicts 2 mapping to same entry Con Egtlttra hardware comparison 84 replacement e Most small caches use 2 3 way of ops lrg caches l way 0 Use write back Pro Less bus traffic Con Memory coherence prob esp for multiprocs write only data eg copy may hog up cache e Ll can be either further levels usually write back 15 Addressing the Cache Revenge of the Tag 0 Memory address is decomposed into three pieces for cache access block offset offset within the block to be accessed index Bits imply what entry in table to examine for match tag The part of the memory address not implied by index or block size this value from address must match the cache39s stored tag for there to be a hit 0 Block address tagindegtlt is the block number in memory offset is index into block for element you want 0 Assuming power of two of bits for offsetindegtlt N INi N log2blksz N log21f5 tag 2 address gtgt N Ni 0 h P4 messing 2 MO in Me block offset 100D 4 Offset I 1 2 3 I U I I I I gt Alignment restrictions prevent cache block splits 16 Direct Mapped 1 Way Associative Example 0 Assume 2K8 cache with 16 byte linesize How many blocks and how many sets rows How many bits for tag set and block offset How much storage for tags 32 bit addresses addr T Set Off 17 2 Way Set Associative Example 18 Write Through No Write Allocate Example Assume 2way setassociative 64 cache sets 16byte cache line and LRU replacement policy o How big is cache Assume 2way setassociative 64 cache sets 16byte cache line and Binary addr Tag Set Off Way Way Refs R Binary addr Tag Set Off Way Way 1 1 11 1 12 1 11 11 1 4 1 11 1 11 11 11 1 11 o How many main memory reads o How many main memory writes 20 L1 cache of AMD Opteron 19 Write Back Write Allocate Example o 2way assoc L1 Assume 2way setassociative 64 cache sets 16byte cache line and h Comp Arch Henn amp Patt Fig C5 pg 9 13 0 Valid amp dirty bit V blk mockaddress giggleD CPU o LRU bit for each set if 1 Brk into 3 parts Binary addr Tag Set Off Way Way Refs in M 64 byte blksz e 6 bits Off 1 1 11 1 12 ltgt 512 entry gt 9 bits index 25bit tag 40 bit Index determ proper set Check if tag match amp valid bit set 1 11 i Mux selects which way to 39 1 pass out o Put in Upd Way if that way is still dirty On hit output gt CPU o How many reads and writes and why On miss output gt vic buff Lowerlevel memory 1111 4 111 21 Improving Cache Performance Assume main and virtual memory implementations are fixed how can we improve our cache performance 0 Reduce the cache miss penalty 0 Reduce cache miss rate 0 Use parallelism to overlap operations improving one or both of above Fetch from all levels simult reduces miss penalty Doing hardware prefetch in parallel with normal mem traffic can reduce miss rate 0 Reduce cache hit time gt Will discuss each of these in turn 22 Ideas for Reducing Cache Miss Penalty Use early restart Allow cpu to continue as soon as required bytes are in cache rather than waiting for entire block to load Critical word first Load accessed bytes in block first Load remaining words in block in wrap around manner Status bits needed to indicate how much of block has arrived Particularly good for caches wt large block sizes Give memory reads priority over writes Merging write buffer Victim caches Use multilevel caches 23 Giving Memory Reads Priority Over Writes Reducing Cache Miss Penalty Assume common case of having a write buffer so that the CPU does not stall on writes as it must for reads 0 The CPU can check the s in write buffer on read miss If Q presently in write buff load from there If Q not in write buff load from mem before prior writes Advantages Since read stalls CPU amp write does not we min stalls May avoid mem load if in write buff 0 Write buffers can make write back more efficient as well 1 For dirty bit eviction copy the dirty blk frm cache to write buff 2 Load the evicting block from mem to cache CPU unstalled 3 Write dirty block from buff to memory 24 Merging Write Buffer Reducing Cache Miss Penalty Due to lat multiword writes more effic than writing words sep Mult words in write buff may be associated with same blk Q Valid bits used to indicate which words to write Reduces the of mem accesses Reduces the of write buff stalls for a given buff size Top buff wo write merge Don t need valid tags in write back Assume 32 byte blk for further cache For seq acc 4 fold red in of writes amp buff e In practice must handle 124 as well as 8 byte words Larger blksz more help Comp Arch Henn amp Patt Fig 512 pg 421 Write address V V V V 1 O O 0 1 O O 0 1 O O 0 39te address TOO 25 Victim Caches Reducing Cache Miss Penalty Victim cache small eg 1 8 blocks fully associative cache that con tains recently evicted blocks from a primary cache 0 Checked in parallel with primary cache 0 Available on following cycle if the item in the victim cache Victim block swapped with block in cache 0 Of great benefit to direct mapped L1 cache Less popular today gt Most Lls are at least 2 way assoc Comp Arch 3rd ed Henn amp Patt Fig 513 pg 422 Lmlar lml mummy 26 Multi Level Caches Reducing Cache Miss Penalty Becomes more popular as miss penalty for primary cache grows Further caches may be off chip but still made of SRAM Almost all general purpose machine have at least 2 lvls of cache most have 2 on chip caches Further caches typically have larger blocks and cache size Equations locaLmisszrate misseszinzthiszcache accessesztozthiszcache globalzmisszrate misseszinzthiszcache accessesztolezcache avgzaccztime lehitztime lemisszrate lemisszpenalty Llmisspenalty L2hittime L2Lmisszrate L2Lmisspenalty L2Lmisszpenalty mainmemzaccessztime 0 L1 miss penalty is average access time for L2 0 Local miss rate of this cache s refs that hit next lvl 0 Global miss rate of all cache s refs that hit next v 27 Reducing Cache Miss Rate Just discussed techniques for reducing the cost of a cache miss now want to investigate ways to increase our chances of hitting in the cache Use larger block size bet on spatial locality Use larger cache size less capacity and conflict contention Use higher associativity less conflict contention Way prediction and pseudo associativity mix of simple and asso ciative caches o Compiler optimization reorder accesses to reduce of misses Cache Miss Categories 0 Compulsory First access to block must always miss Calc total of blocks accessed in program 0 Capacity Blocks that are replaced and reloaded because the cache cannot contain all the used blocks needed during execution Sim fully assoc cache sub compulsory miss from total miss 0 Conflict Occurs when too many blocks map to same cache set Sim desired cache sub comp amp capacity miss from total miss 28 Miss Rate vs Cache Size 0 Top figure shows total miss rate Compulsory tiny 1st line misses stay constant Only way to dec comp is to Mme ifch increase blksz which may WW increase miss penalty Capacity lrg blk area go down with size Conflict misses dec wt size Since of conflicts go down wt size assoc pays off less for large caches 0 Bottom figure shows distribu tion of misses of compuls misses increase wt size since other types of misses decrease wt size ComEDArch Henn amp Patt Fig C9 pg C724 1 Miss rate per lyDe 29 Reducing Miss Rate wt Larger Blocks Advantages 0 Exploits spatial locality o Reduces compulsory misses Disadvantages o Increases miss penalty 0 Can increase conflicts 0 May waste bandwidth SPEC92 blksz analysis 0 If linesz lrg comp to cachesz conflicts rise increasing miss rate 0 64 byte linesz reasonable V stud ied cache sizes p Arch Henn 81 Patt Fig C 10 pg C726 m Disdvantages 30 Reducing Miss Rate wt Larger Caches 84 Higher Associativity Larger Caches Higher Associativity Advantages Advantages o Reduces capacity 84 conflict 0 Reduces conflict misses misses Disadvantages 0 May increase hit time Tag check done before data can be sent 0 Req more space 84 power More logic for comparitors More bits for tag Other status bits LRU 0 Uses more space 0 May increase hit time 0 Higher cost power die 0 Will have fast hits and slow hits Way Prediction Each set has bits indicating which block to check on next access 0 A miss requires checking the other blocks in the set on sub sequent cycles Reducing Miss Rate wt Way Prediction 84 Pseudo Associativity Hit time as fast as direct mapped and req only 1 comparitor Reduces misses like a set associative cache Pseudo Associativity Accesses cache as in direct mapped cache wt 1 less indgtlt bit On miss chks sister blllt in cache eg by invert most sig indgtlt bit May swap two blks on an init cache miss wt a pseudo way hit 32 Reducing Cache Miss Penalty andor Miss Rate via Parallelism o Nonblocking caches Allow cache hit accesses while a cache miss is being serviced Some allow hits under multiple misses req a queue of outstand ing misses Could use a status bit for each block to indicate blllt currently being filled 0 Hardware prefetching 84 Software prefetching Idea is that predicted mem blocks are fetched while doing com putations on present blocks Requires nonblocking caches Most prefetches do not raise egtltceptions If guess is right data in cache for use If wrong wasted some bandwidth we weren39t using anyway Helps with latency by exploiting unused bandwidth If bus saturated prefetch won39t help and most archs ignore Can help with throughput if usage is sporadic Could egtltpand capacity misses if prefetch is wrong 33 Hardware and Software Prefetching software prefetching hardware prefetching o Compiler or handetuner inserts o 2 blks fetched on miss ref blk pref inst in loop to cache negtltt blk to pref buffer for i0 i lt N 1 o If blk refis in pref buff 1 cycle access to move to cache and pref of next blk in stream issued A series of strided addresses is a o Adds egtlttra overheads inst fetch stream hardware best at find amp decode so may cause slow ing lowestride up down streams down 0 Can have multiple stream buffs to allow pref of mult arrays prefetchampx16 sum XEi 34 Reducing Cache Hit Time 0 Small 84 simple caches Less storage gt afford more SISbyte Small caches gt less propogation delay Direct mapped gt overlap tag chk amp data sending Some designs have tags onechip data off 0 Avoiding address translation Virtual caches as Avoids virtualephysical trans step but problematic in practice Virtually indexed physically tagged as Indgtlt cache by page offset but tag with physical Q as Can get data frm cache earlier 0 Pipelined cache access Allows fast clock speed but results in greater br mispred penalty 84 load latency 35 Increasing Cache Bandwidth wt Multibanked Caches Increase bandwidth by sending address to 17 banks simultaneously 0 17 banks lookup address 84 write to bus at same time o Increases bandwidth by b in best case 0 Usually use sequential interleaving 5 Figure 56 shows b4 Block Bank 0 Block Bank Block Bank 2 Block Bank 3 address address address address 0 l l i l l 2 l l 3 l l 4 l l 5 l l 6 l l 7 l l a l l 9 l l 10 l l H l l Comp Arch Henn 84 Patt Flg 56 09 299 36 Summary of Cache Optimizations 37 Static Random Access Memory SRAM Caches are built out of SRAM main memory DRAM Requires 4 6 transisters per bit Does not need to be refreshed static Always available for access unlike DRAM Can consume less power if not often accessed SRAM cycle time 8 16 times faster than DRAM SRAM is 8 16 times more expensive 38 Dynamic Random Access Memory DRAM o Requires only 1 transistor a capacitor per bit Expect 6 fold greater density 84 cost adv over 6 trans SRAM Charge leaks away wt time requiring periodic refresh read before charge below threshhold rewrite Not all accesses take same amount of time since mem loc may be busy wt refresh when access request arrives Designers try to maintain 5 or less refresh time ie mem avail 2 95 of time Reading the value drains charge requiring a refresh better not want to access same item again amp again 0 Logically layed out as 2 D square or rectangle Since pins 515 use half as many pins amp send Q in two parts 1 Row Access Strobe RAS row to store in latches 2 Column Access Strobe CAS bits to output 0 Rated by two types of latency 1 Cycle time min time between succ req to given DRAM module 2 access time time between read req amp ouput data on pin 39 DRAM continued Comp Org Patt amp Henn 3rd ed Fig 896 0 512KB DRAM stored 2048x2048 array 2048 211 so need 11 Q pins decoder ay win204a On RAS entire row is written to l column latches Address104 On CAS MUX passes selected bits to output For refresh entire row read and rewritten Row 2048 X 2048 Column latches There is a delay between RAS CAS and data output while electrical signals stabilize fast page mode allows CAS to vary wt RAS constant allowing successive reads of column latches gt Pump out multiple words frm latch using one row read gt We say RAS related to latency CAS to bandwidth Sync DRAM SDRAM has clock signal in DRAM so that mem cont does not need to sync wt DRAM after each word in row Double data rate DDR xfers data on rising amp falling edge of clock 40 DRAM Packaging and Pins 0 Each DRAM module is rated by of bits of storage and output Prior graph was 4M x 1 4Mb x 1 bit output 4Mb mega 512KB kilobytes 4Mb 4 1024 1024 4194304 x4194304 2048 Are often rectangular wt more columns than rows Less time in refresh 0 Multiple DRAM modules are packaged together Older systems use single in line memory modules SIMMS 72 pins 32 bits of output Newer systems use double in line memory modules DIM MS 168 pins 64 bits of output Internally DRAM modules are organized into banks allowing single DIMMS to be accessed during recovery access bank 2 while bank 1 is still in cycle time from prior access 42 DIMM Naming and Timing DRAM DIMM 41 DRAM Timing by Year RAS time DRAM faSt CA5 CyC39e RAS39 row access time Year Size ns ns time ns 39 1980 bit 150 75 250 gt oc latency 1983 120 220 Improves s50oyear 1986 bit 100 190 CA8 I 1989 bit 80 165 co umn access time 1992 bit 60 120 gt oc bandwidth 1996 bit 50 110 o 1998 bit 50 Improves gt1Qoyear 2000 bit 45 Cycle time min time be 2002 W 40 tween access to same array I I 2004 bit 35 0 o DDRxxx describes DRAM BW Pnyyy describes DIMM BW 0 2006 bit 30 39 fOHOWS RAS N5 0 o DIMMs usually have 4 16 DRAM units unit with 16 DRAM arrays wt 4 bit outputs can produce 64 bits gt such a DIMM s BW is 64 times greater than DRAM gt 64 bits 8 bitsbyte 8x faster in bytes gt DIMM MBs 8gtlt DRAM s Mbs 0 processor speeds no longer increasing but mem demand still rising 0 Advertisers use BW it goes up faster than Other measures 1 gt Two procs running full out need twice the mem capacityspeed name o Latency helped by compiler amp arch eg prefetch amp OOE o Cycle time helped by caches amp multiple arraysbanks gt mem designers concentrate on increasing BW 43 Memory Bus 44 Improving Main Memory Bandwidth Comp Arch 3rd ed Henn amp Patt Fig 527 pg 450 a Fetch 1 WOl Cl Le 32 Ol Irgmmn bWIdememoryorgIniu on alr iommmm 64 at a time cacheY O Wider Main Memory busy mem PRO Increases bandwidth more bits sent in parallel reducing Wide busy matching miss penalty assuming blksz gt wordsz wider L2 blksz CON bus min increment for main mem expansion increased 1Word cache amp busy a Simple Interleaved Memory with 4 banks PRO Banks accessed in parallel ideal for sequential access CON Which banks are accessed depends on the 9 fig below Comp Arch Henn amp Eliaktt Fig 56 pg 299 Bloc Bank 0 03k Bank 1 0C Bank 2 B 00quot Bank 3 address address address address I l l7l l l ml l i l l o Independent memory banks like interleaved mem but wt no 9 rest PRO Each bank may be addressed independently very useful for nonblocking caches CON Each bank needs separate 9 lines amp possibly data bus l 4 l l l 8 l9 a Modern systems combine b amp c with all levels of cache having multiword cache blocks a Not sure why multiplexor is big deal most modern machines have it 45 Virtual Memory Managing mem needs in excess of physical mem automatically requires Virtual Memory Originally developed for single prog wt rg mem needs Now critical for multiprocessingtime sharing Divides physical memory into blocks and allocates these blocks to different processes Provides a mapping between blks in physical mem amp blks on disk Allows a process to exec wt only part of process resident in memory Reduces startupswitch time Provides protection to prevent processes from accessing blocks in appropriately Critical for multiprocessing why win9x so crash prone 46 Virtual Memory Terms Page virtual memory s block Page fault a main memory miss page is on disk not in memory Physical address used to access main mem and typically cache as well Page table data structure that maps between virtual and physical addresses Translation Lookaside Buffer cache that contains a portion of the page table 47 Program is Virtually Contiguous Not so Physically Small prog consumes minor amount of virtual space virtual Virtual Physical space bigger than physical on address address most procs 4 A 4 Think of unused mem blks 8K 0 belonging to other users W D 12K Some blks in mem fully assoc amp some on disk Each bk loaded into any bk of physical mem called relocation Without relocation prog is diff every time you load it Comp Arch Henn amp Patt Fig C18 pg C740 Vlnual memory 48 A Process s Virtual Address Space Each process has full virtual address as shown on right 0 Exact organization varies by architecture Typically heap begins at low Q and grows upward Stack begins at high Q and grows downward Memory between heap amp stack is unused and will not be allocated in mem or on disk until created 0 Not only is entire address space not allocated may start running wt only initial page of code eg 1st page of text segment in mem fast load 0 Each page can be protected to allow only certain accesses but entire page treated same Code amp part of data section may be read only seg fault if you try to overwrite inst or read only data Unallocated memory protected fault on read or write Can still write beyond heap on same page 49 Typical Ranges of Parameters for Caches amp Virtual Memory Parameter L1 Cache Virtual Memory Block size 16128 bytes 4K 64K 4M Hit time 1 3 cycles 100 200 cycles Miss penalty 8 200 cycles 1000000 10000000 cycles access time 6 160 cycles 800000 8000000 cycles transfer time 2 40 cycles 200000 2000000 cycles Miss rate O110 00000l 000l Address mapping 25 45 bit phys 32 64 bit virtual to to 14 20 cache 25 45 bit physical gt Huge miss penalty due to physical mechanism seekrotatetransfer o miss penalty 2 access time transfer time 0 access time time to get lst byte from next level 0 transfer time time to stream into this level 50 OS amp Hardware Cooperate to Achieve Efficient Virtual Memory Because miss penalty so large need very good mem man policy and can afford to run in software So hardware amp OS cooperate hard ware generates page faults OS supplies the brains to implement good policies Where can a block go in memory OS places blocks anywhere fully associative How is a block found in memory Page table is indexed by page number which is found by virtual pagesz Page table estab lishes mapping between virtual addresses and either a physical mem or a disk location more later Which bk should be replaced on page fault OSes tend to use very good approx of LRU since misses are so expensive What happens on write A virtual mem systems use write back write allocate due to cost of disk access 51 Mapping of a Virtual to Physical Address via a Page Table Virtual split in two parts 1 Virtual pg mapped to Virtual address Comp Arch Henn 84 Patt Fig C22 pg C 43 physical pg via page table W01quot 2 pg offset indexes physical pg Each process has its own page tabe 32er Usually dedicated reg to hold beginning of pg table Page table itself stored in mem vmuai page number Physical address ory and may be on disk Every ref inst or data requires translation 52 Page Tables of entries 2 of virtual pages virtual pages sizeofvirt space pagesz Each process has its own page table A page entry will typically contain physical page number phys 2 ppn pagesz valid bit table entry used dirty bit pg needs to be written to disk on swap use bit set when pg is accessed helps det LRU protection field eg read only disk where on disk this page is stored Some archs use hashing to allow the of entries to be the number of physical pages usually much smaller than virtual This is known as an inverted page table 53 Virtual to Physical Mapping Example no TLB Virtual address virtual page number page offset page table ndex 92payesz 092W985 Assume 256 byte page size unrealistic use above to find physical assume next physical page number to be used is 8 o How can page be in memory 84 not on disk 0 How can page be on disk but not in memory 0 How can invalid page table entry become valid 54 Translation Lookaside Buffer TLB TLB Special caches that contain a portion of the entries in a page table 0 Each entry in TLB has a tag portion of virtual page number and most of info in a page table entry TLBs are typically cleared on context switch TLBs are typically quite small to provide for very fast translation Typically separate inst and data TLBs Wo TLB every mem ref requires at least 2 mem refs One reference to page table one to address Most archs avoid context switch driven TLB flush by storing PID in TLB U000 55 TLB and Page Table Example PageTable HIndleglRes Dirtleisk lll H0 l0 l0 l0 l1 lH ll71l9 lYeslo l16 lll I Page size 512 bytes I TLB direct mapped wt 64 sets PgTab Page PgTab Page H Hex Q Binary Address index bits offset bits index offset HoxsFDFl1000 1111 1101 1111i l l l H H 0x10DF l 0001 0000 1101 1111 l l l l H TLB TLB TLB PgTab Physical H Addr Binary Addr Indx Tag Hit Indx address HoxsFDFl1000 1111 1101 1111i l l l l H H 0x10DF l 0001 0000 1101 1111 l l l l l H 56 Selecting a Page Size 0 Advantages of using large page size The larger the page size the fewer entries in page table and thus the less memory used by the page table Virtually indexed physically tagged caches can do translation and cache read in parallel normally must lSt do translation then cache access but of bits in page offset sets upper limit on cache size Therefore increasing page size increases the maxi mum cachesz in a virtually indexed cache Transferring larger pages tofrom memory more efficient since fewer seeks are required Will probably result in fewer TLB misses since there are fewer distinct pages in the same address range 0 Advantages of using smaller page size Will waste less space Startup time for processpage allocation quicker 57 Hypothetical Memory Hierarchy from Virtual Q to L2 Comp Arch Henn amp Patt Fig c 24 pg 047 0 Direct mapped L1 L2 TLB 0 Virtually indegtlted physically tagged Ll TLB 84 L1 accessed simult If Ll set valid physical compared wt Ll tag Physically indegtlted L2 64 bit virtual address 41 bit physical address nbitsTLB tag 2 nbitsvirtual page number nbitsTLB in dex 58 Multiprogramming Multiprogramming means that several processes share a computer con currently A process is the code data and any state information used in the execution of a program 0 A context switch means transfering control of the machine from one process to another Must saverestore state regs PC TLB page tab loc 0 Proper protection must be provided to prevent a process from in appropriately affecting another process or itself 0 Virtual mem sys keep bits in page table 84 TLB indicating the type and level of access that the process has to each of the pages Most sys provide user and kernel modes kernel has greater access and user must transfer control via syscall to enter Sharing mem accomplished by having multiple processes39 pgtab entries point to the same physical mem page 59 Crosscutting Issues Multi issue machines require more ports to inst and data caches Speculative egtltecution often results in invalid addresses which raise exceptions that must be suppressed Some machines such as Pentium 4 read in 80x86 inst but save decoded RISC like inst in the inst cache Some caches are designed to save power Eg MIPS 4300 can power only half its chking hardware in its 2 way assoc cache by using way prediction Cache coherence problem 84 10 next slide 60 10 and the Cache Coherence Problem We may ask ourselves Where does IO occur 1 Between IO device and the cache 0 No cache coherence prob but IO blocks cache and thus CPU 2 Between IO device and main mem this is the usual case 0 Min interference wt CPU write thru cache has up to date output 0 For input or write back machine must either a Not allow certain pages used as 10 buff to ever load to cache b Invalidate cache blocks for IO pages before starting IO and disallow mem access until IO done Cache Coherence Problem With data in mem 84 caches data can be stale 1 Data in cache may be more up to date due to write back cache 0 This is problem for multiprocessors 84 output devices 2 Data in mem may be more up to date due to input frm egtltt device 0 This is problem for CPU handled as described above
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'