Advanced Microprocessor Systems Design ECE 463
This 35-page set of class notes is for ECE 463 at North Carolina State University, taught in Fall; uploaded by Miracle Jaskolski on October 15, 2015.
Lecture 5: Cache Operation
ECE 463/521, Fall 2002
Edward F. Gehringer
Based on notes by Drs. Eric Rotenberg & Tom Conte of NCSU

Outline
- Review of cache parameters
- Example of operation of a direct-mapped cache
- Example of operation of a set-associative cache
- Simulating a generic cache
- Write-through vs. write-back caches
- Write-allocate vs. no-write-allocate
- Victim caches

Cache Parameters
- SIZE: total amount of cache data storage, in bytes
- BLOCKSIZE: total number of bytes in a single block
- ASSOC: associativity, i.e., the number of blocks in a set

Equation for the number of cache blocks in the cache:

  # cache blocks = SIZE / BLOCKSIZE

Equation for the number of sets in the cache:

  # sets = # cache blocks / ASSOC = SIZE / (BLOCKSIZE x ASSOC)

Address Fields

An address is divided into three fields: tag, index, and block offset. The index field is used to pick a set: it indexes one row of the cache, whose lines contain one or more blocks; the number of blocks per set is the associativity. The tag field is compared to the tags of the indexed cache lines. If it matches, the block is present (a hit), and the requested byte or word of data is returned to the processor. If it doesn't match, the block is not there (a miss).

Widths of the address fields, assuming 32-bit addresses:

  index bits = log2(# sets)
  block offset bits = log2(block size)
  tag bits = 32 - index bits - block offset bits

Example of Cache Operation

Example: A processor accesses a 256-byte direct-mapped cache, which has 32-byte lines, with a given sequence of addresses. Show the contents of the cache after each access, and count the number of hits and of replacements.

  # sets = SIZE / (BLOCKSIZE x ASSOC) = 256 / (32 x 1) = 8
  index bits = log2(# sets) = log2(8) = 3
  block offset bits = log2(block size) = log2(32 bytes) = 5
  tag bits = total address bits - index bits - block offset bits
           = 32 bits - 3 bits - 5 bits = 24

Thus the top 6 nibbles (24 bits) of the address form the tag, and the lower 2 nibbles (8 bits) form the index and block-offset fields. [The per-access table of addresses, tags, and comments, e.g., "Get block from memory (slow)," has not survived extraction.]
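The field arithmetic above is easy to check mechanically. Below is a minimal sketch in Python (helper names are mine, not part of the course's simulator), assuming 32-bit addresses and power-of-two sizes:

```python
from math import log2

def cache_fields(size, block_size, assoc, addr_bits=32):
    """Return (# sets, index bits, offset bits, tag bits) for a cache."""
    sets = size // (block_size * assoc)     # sets = SIZE / (BLOCKSIZE x ASSOC)
    offset_bits = int(log2(block_size))
    index_bits = int(log2(sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, index_bits, offset_bits, tag_bits

def decompose(addr, index_bits, offset_bits):
    """Split an address into its (tag, index, block offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

For the direct-mapped example, `cache_fields(256, 32, 1)` reproduces the numbers worked out above: 8 sets, 3 index bits, 5 offset bits, 24 tag bits.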
Set-Associative Example

Example: A processor accesses a 256-byte 2-way set-associative cache, which has a block size of 32 bytes, with a given sequence of addresses. Show the contents of the cache after each access, and count the number of hits and of replacements.

  # sets = SIZE / (BLOCKSIZE x ASSOC) = 256 / (32 x 2) = 4
  index bits = log2(# sets) = log2(4) = 2
  block offset bits = log2(block size) = log2(32 bytes) = 5
  tag bits = total address bits - index bits - block offset bits
           = 32 bits - 2 bits - 5 bits = 25

[The per-access table, with data omitted "for convenience," has not survived extraction.]

Generic Cache

1. Direct-mapped = 1-way set-associative
2. n-way set-associative
3. Fully associative = n-way set-associative where there is only 1 set, containing n blocks

Every cache is an n-way set-associative cache. For a fully associative cache, index bits = log2(1) = 0, so the equation still works. The same equations hold for any cache type:

  # cache blocks = SIZE / BLOCKSIZE
  # sets = SIZE / (BLOCKSIZE x ASSOC)

Fully associative: ASSOC = # cache blocks.

What this means: you don't have to treat the three cache types differently in your simulator. Support a generic n-way set-associative cache, and you don't have to specifically worry about the two extremes (direct-mapped and fully associative). E.g., even a direct-mapped cache has LRU information in the simulator, even though it is not used.

Replacement Policy

Which block in a set should be replaced when a new block has to be allocated? LRU (least recently used) is the most common policy. Implementation:
- Real hardware maintains LRU bits with each block in the set.
- A simulator can cheat, using sequence numbers as timestamps. This makes simulation easier and gives the same effect as hardware LRU.

Example: Assume that blocks A, B, C, D, and E all map to the same set. Trace: A B C D D D B D E. Assign sequence numbers, incrementing by 1 for each new access; the block with the smallest sequence number is the LRU block.
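The sequence-number trick can be sketched as a single generic cache set (a simulator sketch under my own naming, not the course's reference code):

```python
class SetLRU:
    """One cache set of an n-way set-associative cache, with
    sequence-number ("timestamp") LRU, as suggested for a simulator."""
    def __init__(self, assoc):
        self.assoc = assoc
        self.blocks = {}      # tag -> sequence number of last access
        self.clock = 0        # global sequence counter

    def access(self, tag):
        """Return (hit, victim), where victim is the evicted tag or None."""
        self.clock += 1
        if tag in self.blocks:                  # hit: just refresh timestamp
            self.blocks[tag] = self.clock
            return True, None
        victim = None
        if len(self.blocks) == self.assoc:      # set full: evict LRU
            victim = min(self.blocks, key=self.blocks.get)
            del self.blocks[victim]
        self.blocks[tag] = self.clock
        return False, victim
```

Running the trace A B C D D D B D E through a 4-way set gives 4 hits, and the final access to E evicts A, the block with the smallest sequence number.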
Handling Writes

What happens when a write occurs? Two issues:
1. Is just the cache updated with the new data, or are lower levels of the memory hierarchy updated at the same time?
2. Do we allocate a new block in the cache if the write misses?

The Write-Update Question

Write-through (WT) policy: writes that go to the cache are also written through to the next level in the memory hierarchy.

Write-back (WB) policy: writes go only to the cache, and are not immediately written through to the next level of the hierarchy.

What happens when a line previously written to needs to be replaced? We need to have a dirty bit (D) with each line in the cache, and set it when the line is written to. When a dirty block is replaced, we need to write the entire block back to the next level of memory (a writeback). With the write-back policy, replacement of a dirty block triggers a writeback of the entire line.

The Write-Allocation Question

Write-allocate (WA): bring the block into the cache if the write misses; this is handled just like a read miss. Typically used with the write-back policy (WB/WA).

Write-no-allocate (NA): do not bring the block into the cache if the write misses. Must be used in conjunction with write-through (WT/NA).

[The diagrams of the WT/NA and WB/WA write-miss scenarios have not survived extraction.]
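The two policy combinations can be summarized as a small decision table (a sketch; the function and field names are mine, not from the notes):

```python
def write_actions(hit, policy):
    """Which actions a write triggers under the two usual combinations:
    WB/WA (write-back, write-allocate) or WT/NA (write-through, no-allocate)."""
    assert policy in ("WBWA", "WTNA")
    if policy == "WTNA":
        # Write-through: always update the next level; never allocate on a miss,
        # so the cache itself is updated only if the write hit.
        return {"update_cache": hit, "update_next_level": True, "allocate": False}
    # Write-back/write-allocate: a miss allocates (like a read miss); only the
    # cache is updated, and the block's dirty bit would be set.
    return {"update_cache": True, "update_next_level": False, "allocate": not hit}
```

E.g., a write miss under WT/NA touches only the next level, while under WB/WA it allocates a block and defers the memory update until the dirty block is replaced.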
Victim Caches

A victim cache is a small fully-associative cache that sits alongside the primary cache; it holds, e.g., 2-16 cache blocks. Its management is slightly unusual compared to conventional caches:
- When the main cache evicts (replaces) a block, the victim cache takes the evicted block. The evicted block is called the victim.
- When the main cache misses, the victim cache is searched for recently discarded blocks. A victim-cache hit means the main cache doesn't have to go to memory for the block.

Victim-Cache Example

Suppose we have a 2-entry victim cache. It initially holds blocks X and Y, and Y is the LRU block in the victim cache. The main cache is direct-mapped, and blocks A and B map to the same set in the main cache. Trace: A B A B A B.

1. B misses in the main cache and evicts A.
2. A goes to the victim cache and replaces Y (the previous LRU).
3. X becomes LRU.

A then misses in L1 but hits in the victim cache, so A and B swap positions: A is moved from the victim cache to L1, and B goes to the victim cache, where A was located. Note that we don't replace the LRU block (X) in the case of a victim-cache hit.

Victim Caches: Why?

Direct-mapped caches suffer badly from repeated conflicts. A victim cache provides the illusion of set associativity, a poor man's version of set associativity. A victim cache does not have to be large to be effective: even a 4- to 8-entry victim cache will frequently remove more than 50% of conflict misses.
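The example above can be replayed in code. The sketch below (my own structure, not the course's simulator) models one direct-mapped L1 plus a tiny fully-associative victim cache; on a victim-cache hit the two blocks swap, so the victim cache's LRU block is left alone:

```python
class VictimCachedDM:
    """Direct-mapped L1 with a small fully-associative victim cache (LRU)."""
    def __init__(self, victim_entries=2):
        self.l1 = {}        # set index -> resident tag
        self.victim = []    # tags, front of list = LRU
        self.cap = victim_entries

    def access(self, set_idx, tag):
        if self.l1.get(set_idx) == tag:
            return "l1-hit"
        evicted = self.l1.get(set_idx)      # the victim, if the set was occupied
        self.l1[set_idx] = tag
        if tag in self.victim:              # victim-cache hit: swap the blocks
            self.victim.remove(tag)
            if evicted is not None:
                self.victim.append(evicted)
            return "victim-hit"
        if evicted is not None:             # true miss: victim cache takes evictee
            if len(self.victim) >= self.cap:
                self.victim.pop(0)          # drop the victim cache's LRU block
            self.victim.append(evicted)
        return "miss"
```

Replaying the trace A B A B A B (with the victim cache pre-loaded with X and Y, Y as LRU) gives two misses and then four victim-cache hits, and X is never displaced, matching the walkthrough above.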
Lecture 15: Advanced Microprocessor Design

Modeling Stalls

Let us develop a theory for how much the processor is slowed by branches.

  Speedup = (Avg. execution time, unpipelined) / (Avg. execution time, pipelined)
          = (CPI_unpipe x CT_unpipe) / (CPI_pipe x CT_pipe)

CPI_unpipe is just n, since in the absence of pipelining an instruction takes n cycles.

  CPI_pipe = CPI_no-stall + stall cycles per instruction
           = 1 + stall cycles per instruction

Assuming the two CTs are equal,

  Speedup = CPI_unpipe / CPI_pipe = n / (1 + stall cycles per instruction)

Now let's assume the two CPIs are equal, but the CTs are not:

  Speedup = (1 x CT_unpipe) / ((1 + stall cycles per instruction) x CT_pipe)
          = (n x CT_pipe) / ((1 + stall cycles per instruction) x CT_pipe)
          = n / (1 + stall cycles per instruction)

(c) 2002 Edward F. Gehringer, ECE 463/521 Lecture Notes, Fall 2002. Based on notes from Drs. Tom Conte & Eric Rotenberg of NCSU. Figures from CAQA used with permission of Morgan Kaufmann Publishers. (c) 2003 Elsevier Science (USA).

Let's take an example. Assume:
- n = 5 (e.g., the pipeline from the last lecture).
- 20% of instructions are branches.
- 60% of branches are taken.
- Penalties: taken branches, 1 stall cycle; not-taken branches, 0 stall cycles.

How many stall cycles per instruction are there, on average? What is the speedup?
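The arithmetic for this example follows directly from the n / (1 + stalls) formula above (a worked sketch; the notes leave the answer as an exercise):

```python
def pipeline_speedup(n, stalls_per_instr):
    # Speedup = n / (1 + stall cycles per instruction), equal CTs assumed
    return n / (1 + stalls_per_instr)

# 20% of instructions are branches, 60% of those are taken,
# and each taken branch costs 1 stall cycle:
stalls = 0.20 * 0.60 * 1            # 0.12 stall cycles per instruction
speedup = pipeline_speedup(5, stalls)
```

With n = 5 this gives a speedup of 5 / 1.12, roughly 4.46, rather than the ideal 5.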
Control Hazards

Suppose that we have a sequence of instructions where instruction A is BLTZ R1, G, and the branch target is not known until the MEM stage. What is the branch penalty here?

Methods of Handling Control Hazards

There are various ways of dealing with control hazards. Let's first assume we make no prediction as to whether the branch is going to be taken, but wait until we know. One of the simplest is to cancel the fetch and stall, waiting for the target instruction to become available:

  Instr A (BLTZ R4, X):  IF  ID  EX  MEM  WB
  Instr B:                   IF  stall stall IF  ID  EX  MEM  WB

What is the branch penalty? Now suppose that the branch is taken:

  Instr A (BLTZ R4, X):  IF  ID  EX  MEM  WB
  Instr X:                                IF  ...

What is the branch penalty?

Another strategy is to stall, but not refetch, the instruction after the branch:

  Instr A (BLTZ R4, X):  IF  ID  EX  MEM  WB
  Instr B:                   IF  stall stall ID  EX  MEM  WB
  Instr C:                                IF  ID  EX  MEM  WB

What is the branch penalty? Now suppose that the branch is taken:

  Instr X:                                IF  ...

What is the branch penalty?

A third strategy is to always predict that the branch will not be taken:

  Instr A (BLTZ R4, X):  IF  ID  EX  MEM  WB
  Instr B:                   IF  ID  EX  MEM  WB
  Instr C:                       IF  ID  EX  MEM  WB
  Instr D:                           IF  ID  EX  MEM  WB
  Instr E:                               IF  ID  EX  MEM  WB

What is the branch penalty? Suppose the branch is taken:

  Instr A (BLTZ R4, X):  IF  ID  EX  MEM  WB
  Instr B:                   IF  (canceled)
  Instr C:                       IF  (canceled)
  Instr D:                           IF  (canceled)
  Instr X:                               IF  ...

What is the branch penalty?

In the above examples, we know when we are going to branch while we are in the ID stage (i.e., we know we have a branch instruction), but we don't know whether or where we are going to branch until the MEM stage. Suppose we have the pipeline we considered in the last lecture, where we knew all three (when, whether, where) in the ID stage. Then our branch penalty is 1 cycle:

  Instr A (BLTZ R4, X):  IF  ID  EX  MEM  WB
  (canceled fetch)           IF
  Instr X:                       IF  ID  EX  MEM  WB

Issues in Eliminating Stalls
- When: detect that an undecoded instruction is a branch.
- Where: predict where the branch will go, if taken.
- Whether: for conditional branches, predict whether it will branch or not, before execution.
- Optimal: try to determine all three in the IF stage. It won't work perfectly (it is prediction), but we can try our best.

Issues with When and Where

What does the IF stage know? Only the address of the instruction (the PC). How can this information help us decide whether the branch might be taken? We keep a buffer (a cache) of last known branch targets around; the buffer is written to by the WB stage. The traditional name for this is a branch-target buffer (BTB). Each BTB entry contains a tag and a precomputed target.

- Hit: we know it is a branch (when), and the BTB returns the branch target (where).
- Miss: assume it is not a branch.

How do we index into the BTB? We could make the BTB fully associative, but instead let's index it by the least-significant bits of the instruction address (not counting bits 1:0, of course). So if we have a 64-entry BTB, which address bits are used to index it?
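For a 64-entry BTB the answer works out to bits 7:2 of the PC, which a one-line helper makes concrete (a sketch; the function name is mine, and a power-of-two entry count is assumed):

```python
def btb_index(pc, entries=64):
    """Index a direct-mapped BTB by low-order instruction-address bits,
    skipping bits 1:0 (instructions are word-aligned). For 64 entries,
    this keeps bits 7:2 of the PC."""
    return (pc >> 2) % entries
```

Note how instructions 64 words apart alias to the same entry: `btb_index(0x1000)` and `btb_index(0x1100)` both map to entry 0, which is why a BTB hit must still compare the tag.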
Predicting Where with Returns

Problem: a lot of branches are returns from procedures, and holding the last target address is a poor predictor for returns. Solution: keep a hardware stack of return addresses.
- Push the return address when a call is executed.
- Pop the stack on returns to get the prediction.
- The bottom of the stack is filled with an old value on a pop.
- Approximately 4-8 entries are needed for integer code.

Issues with Whether

We want to predict conditional branches (and sometimes unconditional branches, if needed) before decode. Two approaches:
- Hardware to supply the prediction.
- Software: heuristics, profiling.

Hardware Branch Prediction (H&P 3.4)

Let's add a bit to each BTB entry that says whether the branch was recently taken or not: set the prediction field to 1 if the branch was taken, 0 if it was not taken. At IF, check the branch-prediction buffer:
- If the prediction field = 1, predict taken.
- Else, predict not-taken.

Of course, the bit may predict incorrectly. It may even have been put there by another instruction at an address that maps to the same BTB entry.

Problem: some branches don't do what they did last time.

Exercise: Consider a doubly nested loop:

  for (i = 0; i < 5; i++)
    for (j = 0; j < 5; j++)
      ...

Supposing the branches at the end of these loops don't conflict in the buffer, how often will they be predicted correctly? (The first time a branch is encountered, it will automatically be predicted "not taken.")

In general, for branches used to form loops, a 1-bit predictor will mispredict at twice the rate at which the branch is not taken. We would like a predictor that at least matches the frequency at which the branches are taken. In a 2-bit scheme, a prediction must miss twice before it is changed. One complication is that a 2-bit counter updates the prediction bits more frequently than a one-bit counter. Why?

Smith's n-bit Counter Predictor

Here is one way of using a two-bit counter. The initial state is "not taken" (using the NT heuristic); the two upper states predict taken, and the two lower states predict not-taken. [The state diagram has not survived extraction.] Unfortunately, it can do quite badly on certain types of branches:

- A run of taken branches followed by a run of not-taken branches: 6 mispredictions out of 19 branch executions.
- Alternating taken/not-taken (the infamous "toggle" branch): 19 mispredictions out of 19, as the predictor oscillates between states 01 and 10.

Another 2-bit predictor is shown on p. 198 of CAQA. [Its state diagram has not survived extraction.]

What kind of accuracy can we expect using a 2-bit predictor? Here are the results from the SPEC89 benchmarks, with a 4096-entry branch-prediction buffer: the frequency of mispredictions for the integer programs (espresso, eqntott, and li) is substantially higher (11% avg.) than for the floating-point programs (4% avg.). Unfortunately, the integer programs also have a higher branch frequency. We can attack this problem in two ways:
- By increasing the size of the buffer.
- By improving the accuracy of the prediction scheme.
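The loop-branch claim above can be checked by simulating a saturating counter (a sketch assuming the standard saturating-counter formulation; the exact state diagrams in the notes' figures may differ in their hysteresis details):

```python
def run_predictor(outcomes, state, n_states, threshold):
    """Generic saturating-counter predictor: predict taken when
    state >= threshold; return the misprediction count for a trace."""
    miss = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            miss += 1
        # Saturating update: count up on taken, down on not-taken.
        state = min(n_states - 1, state + 1) if taken else max(0, state - 1)
    return miss

# A 5-iteration loop branch (4 taken, then 1 not-taken) executed 4 times:
loop = ([True] * 4 + [False]) * 4
one_bit = run_predictor(loop, 0, 2, 1)   # 1-bit predictor
two_bit = run_predictor(loop, 0, 4, 2)   # 2-bit predictor
```

On this 20-branch trace the 1-bit predictor mispredicts 8 times (40%, twice the 20% not-taken rate, as claimed), while the 2-bit predictor mispredicts 6 times, settling into one miss per loop execution after warm-up.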
Lecture 14: Pipelining

We've already covered the basics of pipelining in Lecture 1. We saw that cars could be built on an assembly line, and that instructions could be executed in much the same way (H&P A.1). In the ideal situation, this could give a speedup equal to the number of pipeline stages:

  Time per instruction, pipelined
      = (Time to execute instruction on unpipelined machine) / (Number of pipe stages)

However, this assumes perfectly balanced stages, i.e., that each stage requires exactly the same amount of time. This is rarely the case, and anyway, pipelining does involve some extra overhead.

Three aspects of RISC architectures make them easy to pipeline:
- All operations on data apply to data in registers.
- Only load and store operations move data between memory and registers.
- All instructions are the same size, and there are few instruction formats.

An Unpipelined RISC

For our examples, we'll work with a simplified RISC instruction set. In an unpipelined implementation,
instructions take at most 5 clock cycles. One cycle is devoted to each of the following.

Instruction fetch (IF): Fetch the current instruction (the one pointed to by the PC), and update the PC by adding the instruction size (4 bytes):

  IR  <- Mem[PC]
  NPC <- PC + 4

Instruction decode / register fetch (ID): Decode the instruction, and read the source registers from the register file:

  A <- Regs[IR 6..10]
  B <- Regs[IR 11..15]

Sign-extend the offset (displacement) field of the instruction:

  Imm <- sign-extend(IR 16..31)

Check for a possible branch by comparing the values read from the source registers:

  Cond <- (A rel B)

Compute the branch-target address by adding the sign-extended immediate to the NPC:

  ALUOutput <- NPC + Imm

If the branch is taken, store the branch-target address into the PC:

  if (Cond) PC <- ALUOutput else PC <- NPC

What feature of the ISA makes it possible to read the registers in this stage?
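The sign-extension step in ID is worth seeing concretely. The sketch below (my own helper, assuming the 16-bit immediate field described above) mimics what the hardware does when it widens IR 16..31 to a full 32-bit signed value:

```python
def sign_extend16(value):
    """Sign-extend a 16-bit immediate to a signed 32-bit value:
    if the top bit (bit 15) is set, the value is negative."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value
```

So an immediate of 0xFFFF decodes as -1, and 0x8000 as -32768, which is what the branch-target addition ALUOutput <- NPC + Imm relies on for backward branches.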
Execute / effective-address computation (EX): The ALU operates on the operands, performing one of three types of functions, depending on the opcode:
- Memory reference: the ALU adds the base register and the offset to form the effective address: ALUOutput <- A + Imm
- Register-register instruction: the ALU performs the operation on the values read from the register file: ALUOutput <- A op B
- Register-immediate instruction: the ALU performs the operation on the register value and the immediate: ALUOutput <- A op Imm

In a load-store architecture, execution can be done at the same time as effective-address computation, because no instruction needs to both compute a data address and operate on register data.

Memory access (MEM):

  Load:  LoadMemData <- Mem[ALUOutput]
  Store: Mem[ALUOutput] <- B

Write-back (WB): If the instruction is register-register, register-immediate, or a load, the result is written into the register file at the address specified by the destination operand:

  Reg-reg ALU operation:        Regs[IR 16..20] <- ALUOutput
  Reg-immediate ALU operation:  Regs[IR 11..15] <- ALUOutput
  Load instruction:             Regs[IR 11..15] <- LoadMemData

In this implementation, some instructions require 2 cycles (branches), some require 4 (stores), and some require 5 (all others). Assuming the instruction frequencies from the integer benchmarks mentioned in the last lecture, what's the CPI of this architecture?

Pipelining Our RISC

It's easy to pipeline this architecture: just make each clock cycle into a pipe stage:

  Instr i:    IF  ID  EX  MEM  WB
  Instr i+1:      IF  ID  EX   MEM  WB
  Instr i+2:          IF  ID   EX   MEM  WB
  Instr i+3:              IF   ID   EX   MEM  WB
  Instr i+4:                   IF   ID   EX   MEM  WB

[The datapath diagram of the pipeline, with the instruction cache, register file, ALU, multiplexers, and sign-extend unit, has not survived extraction.]

In this pipeline, the major functional units are used in different cycles, so overlapping the execution of instructions introduces few conflicts. Separating the instruction and data caches eliminates a conflict that would otherwise arise between the IF and MEM stages. Of course, we have to access these caches faster than we would in an unpipelined processor.

The register file is used in two stages; thus we need to perform two reads and one write each clock cycle. To handle reads and writes to the same register, we write in the first half of the clock cycle and read in the second half.

Something is incomplete about our diagram of the IF stage: we need to save values between pipeline stages; otherwise the different instructions in the pipeline would interfere with each other. So we insert latches ("pipeline registers") between stages. Of course, we'd need latches even in an unpipelined multicycle implementation.

What is our pipeline speedup, then? Of course, we have to allow for latch-delay time. We also need to allow for
clock skew, the maximum delay between when the clock arrives at any two registers:

  T_overhead = T_latch + T_skew

Defining

  Speedup = (Avg. unpipelined execution time) / (Avg. pipelined execution time)
          = T_unpipe / T_pipe,

the speedup approaches n in the ideal case, where T_overhead = 0.

Example: Consider the unpipelined processor in the previous example. Assume:
- The clock cycle is 1 ns.
- Branch instructions (20% of the total) take 2 cycles.
- Store instructions (10% of the total) take 4 cycles.
- All other instructions take 5 cycles.
- Clock skew and latch delay add 0.2 ns to the cycle time.

What is the speedup from pipelining?
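The numbers for this example can be worked through directly (a sketch; it assumes the pipelined CPI is the ideal 1, since the example specifies no stalls):

```python
ct = 1.0  # unpipelined clock cycle, ns

# Average unpipelined time per instruction = CPI x CT:
# 20% branches at 2 cycles, 10% stores at 4 cycles, 70% others at 5 cycles.
avg_unpipelined = (0.20 * 2 + 0.10 * 4 + 0.70 * 5) * ct   # 4.3 ns

# Pipelined cycle = base cycle + 0.2 ns of clock skew and latch delay.
pipelined = ct + 0.2                                       # 1.2 ns

speedup = avg_unpipelined / pipelined
```

This gives a speedup of 4.3 / 1.2, roughly 3.58, well short of the ideal factor of 5 because of both the overhead on the cycle time and the fact that some unpipelined instructions finished early.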
How Can Pipelining Help?

How can pipelining improve performance?
- If we keep CT constant, by improving CPI: e.g., with a 50-ns instruction, pipeline the IF/ID and MEM/WB pairs while ID, EX, and MEM remain unpipelined, so CPI_pipe < CPI_unpipe. [The accompanying diagram has not survived extraction.]
- If we keep CPI constant, by improving CT: e.g., an unpipelined 50-ns cycle (IF ID EX MEM WB) becomes a pipelined 25-ns cycle, so CT_pipe < CT_unpipe.
- Usually, we improve both CT and CPI.

Pipeline Hazards

A hazard reduces the performance of the pipeline. Hazards arise because of the program's characteristics. There are three kinds of hazards:
- Structural hazards: not enough hardware resources exist for all combinations of instructions.
- Data hazards: dependences between instructions prevent their overlapped execution.
- Control hazards: branches change the PC, which results in stalls while branch targets are fetched.

Structural Hazards

Consider a pipeline with a unified instruction/data cache:

  Load instr:  IF  ID  EX  MEM  WB
  Instr i+1:       IF  ID  EX   MEM  WB
  Instr i+2:           IF  ID   EX   MEM  WB
  Instr i+3:               stall  IF  ID  EX  MEM  WB

Instruction i+3 has to stall, because the load instruction steals an instruction-fetch cycle. In this pipeline, what kinds of instructions (which opcodes) cause structural hazards?