ADV COMPUTER ARCH I (ECE 587)
The Microarchitecture of Superscalar Processors (PSU ECE 587/687)

Program Representation
- An application begins as a high-level program.
- A program is compiled into static machine-level code, or program binary.
- Implicit in the program is the sequencing model.
- The sequence of executed instructions forms a dynamic instruction stream.
- The address of the next dynamic instruction is either:
  - the incremented program counter, or
  - the target of a taken branch.

Sequential Execution Model
- Inherent in instruction sets and program binaries.
- Led to the concept of precise architecture state:
  - interrupt and restart,
  - exceptions,
  - branch mispredictions.
- Out-of-order issue deviates from sequential execution:
  - How to maintain binary compatibility and retain the appearance of sequential execution?

Dependences and Parallel Execution
- One can view the program as a collection of basic blocks separated by branches.
- There is a limited number of parallel instructions, on average, within basic blocks.
- To execute more instructions in parallel, control dependences due to branches have to be overcome.

Dependences and Parallel Execution (cont.)
- Instructions have to be executed according to true data dependences.
  - A true dependence appears as a read-after-write (RAW) sequence.
- Ideally, we would like to eliminate output dependences and antidependences.
  - An output dependence appears as a write-after-write (WAW) sequence.
  - An antidependence appears as a write-after-read (WAR) sequence.

Elements of Superscalar Processing
- Strategies for fetching multiple instructions every cycle.
  - Supported by predicting branch outcomes and fetching beyond conditional branch instructions well before the branches are executed.
- Methods for determining true register dependences and eliminating artificial dependences.
  - Register renaming.
  - Mechanisms to communicate register values during execution.

Elements of Superscalar Processing (cont.)
- Methods for issuing multiple instructions in parallel.
  - Based upon availability of inputs, not program order.
- Parallel execution resources.
  - Multiple pipelined functional units.
  - Memory hierarchies capable of simultaneously servicing multiple memory requests.

Elements of Superscalar Processing (cont.)
- Methods for communicating data through memory via load and store instructions, potentially issued out of order.
  - Memory interfaces have to allow for the dynamic and often unpredictable behavior of memory hierarchies.
- Methods for committing architecture state in order.
  - Maintain an outward appearance of the sequential execution model.

Typical Superscalar Microarchitecture
- Fig. 3 illustrates the parallel execution model used in most superscalars.
- Fig. 4 illustrates the microarchitecture, or hardware organization, of a typical superscalar processor.

Instruction Fetch
- The instruction fetcher reads instructions from the instruction cache and supplies them to a queue.
  - The number of instructions fetched per cycle should at least match the peak decode rate.
  - The fetcher must be told the address of the next block of instructions to fetch.
  - An instruction cache is usually organized as lines of several instructions.
  - It is typical to fetch a line in a single cycle from an on-chip cache.
  - A cache line starts on a fixed boundary, regardless of which instruction is needed from the line.
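Because fetch is line-aligned, a target in the middle of a line delivers fewer useful instructions than the line holds. The following is a minimal sketch of that alignment arithmetic; the 4-byte instruction size, 8-instruction line, and function names are illustrative assumptions, not parameters from the notes.

```c
#include <stdint.h>

/* Hypothetical fetch parameters: 4-byte instructions, 8 instructions per
 * cache line.  These numbers are illustrative, not from the notes. */
#define INSTR_BYTES  4
#define LINE_INSTRS  8
#define LINE_BYTES   (INSTR_BYTES * LINE_INSTRS)

/* Address of the cache line that contains 'pc'.  The fetcher always reads
 * a whole line starting at this fixed boundary. */
static uint64_t line_address(uint64_t pc) {
    return pc & ~(uint64_t)(LINE_BYTES - 1);
}

/* Number of useful instructions delivered this cycle: only the slots from
 * 'pc' to the end of the line, since earlier slots in the line are discarded. */
static unsigned useful_instructions(uint64_t pc) {
    unsigned offset = (unsigned)((pc - line_address(pc)) / INSTR_BYTES);
    return LINE_INSTRS - offset;
}
```

For example, a branch target at the fifth slot of an 8-instruction line yields only 4 useful instructions in that fetch cycle.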
Instruction Fetch (cont.)
- Calculating the next address to fetch:
  - For non-branch instructions, the program counter is incremented by the number of instructions in a cache block.
  - For branch instructions, the fetch unit has to:
    - recognize the branch,
    - determine its outcome (taken or not taken), and
    - fetch the next block using either the next sequential address or the branch target address.

Instruction Fetch (cont.)
- Branch prediction is used to avoid having to wait for the branch execution to complete.
  - The target comes from a branch target buffer (BTB).
  - The outcome comes from:
    - static prediction, based on branch type or profile, or
    - dynamic prediction, based on the results of previous branches.
- If we guess wrong, we must be able to undo the work and fetch the correct instructions.
  - This incurs a significant misprediction penalty.

Instruction Fetch (cont.)
- Transferring control to the target address on a taken branch could cause pipeline bubbles.
  - Stockpile instructions in the instruction queue.
  - Or keep the next address in the cache block.
  - Or use delayed branches.
- The instruction queue also helps smooth fetch irregularities caused by cache misses, and allows for cycles when fewer than the maximum number of instructions can be fetched.

Instruction Fetch (cont.)
- Superscalar machines also pay a penalty for instruction misalignment.
  - Branches and targets don't always fall on cache line boundaries.
  - Fetched instructions that are not executed waste fetch bandwidth.
  - This is sometimes called instruction cache fragmentation due to branches.

[Figure: I-cache fragmentation. Within a fetched line, instructions before the branch target and instructions past a taken branch are discarded.]

Instruction Fetch (cont.)
- Cache fragmentation caused by branches places a severe limit on very wide superscalars.
  - Sequential runs of instructions are easy.
  - But the average sequential run length is about six for general integer programs.
  - The distribution is very broad, with a few long runs raising the average.
  - Six instructions corresponds to 3 decode cycles for a two-instruction decoder, but only 1.5 decode cycles for a four-instruction decoder.

Instruction Fetch (cont.)
- Given enough fetch bandwidth, the fetcher can realign or merge instructions from multiple lines to make more efficient use of the decoder.
  - For a branch target in the middle of a cache line, the fetcher combines that cache line with the one following it, presenting the decoder with a decode packet in which the first instruction is useful.
  - Decoder "lines" are then not aligned with cache lines.
  - It becomes harder to find the program counter associated with any one instruction.

Instruction Decode
- Instructions are removed from the instruction queue.
- An execution tuple is set up for each decoded instruction, containing:
  - the operation to be executed (opcode),
  - the identities of the storage elements where the inputs reside, and
  - the identity of the storage element where the result must be placed.

Instruction Decode (cont.)
- In the static program, the input and output identifiers represent storage locations in the logical register file or storage locations in memory.
- To overcome WAR and WAW hazards, register renaming maps the logical register identifiers onto physical storage locations.
- Allocation logic assigns each instruction physical storage for its result, as well as entries in all required instruction buffers.
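As a concrete illustration of the renaming step described above, here is a minimal sketch of a rename map table plus free list. The 32 logical / 64 physical register counts, the structure names, and the stall-on-empty-free-list behavior are illustrative assumptions, not details from the notes.

```c
#include <stdbool.h>

/* Illustrative sizes; real designs differ. */
#define NUM_LOGICAL   32
#define NUM_PHYSICAL  64

typedef struct {
    int map[NUM_LOGICAL];           /* logical reg -> physical reg        */
    int free_list[NUM_PHYSICAL];    /* pool of unused physical registers  */
    int free_count;
} RenameState;

/* One decoded instruction (execution tuple) before and after renaming. */
typedef struct {
    int src1, src2, dst;            /* logical register identifiers        */
    int psrc1, psrc2, pdst;         /* physical locations after renaming   */
    int prev_pdst;                  /* old mapping of dst, freed at retire */
} ExecTuple;

/* Rename one instruction: read source mappings from the map table, then
 * allocate a fresh physical register for the destination.  Returns false
 * if no physical register is free (the rename stage would stall). */
static bool rename_instruction(RenameState *rs, ExecTuple *t) {
    t->psrc1 = rs->map[t->src1];
    t->psrc2 = rs->map[t->src2];
    if (rs->free_count == 0)
        return false;                      /* stall: free list is empty     */
    t->prev_pdst = rs->map[t->dst];        /* remembered for later reclaim  */
    t->pdst = rs->free_list[--rs->free_count];
    rs->map[t->dst] = t->pdst;             /* later readers see new mapping */
    return true;
}
```

Under the simpler reclamation rule mentioned in the slides below, prev_pdst is the register returned to the free list when this instruction commits.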
Instruction Rename
- The decoder looks at one or more instructions and releases them to scheduling stations after renaming.
- Register values created by an instruction are assigned physical locations and recorded in a map table.
  - The map table has as many entries as there are logical registers.
- Source register mappings are read from the map table and attached to the instruction.

Instruction Rename (cont.)
- Renaming happens sequentially, and map table bypass is sometimes necessary.
- Subsequent stages in the pipeline use the mappings attached to an instruction tuple to read and write the physical locations of register values.

[Figure: rename map table. Logical source registers index the register map table to obtain physical source registers; logical destination registers are allocated physical registers from the free pool.]

Renaming Methods
- Two methods are commonly used:
  - renaming with a physical register file larger than the logical register file, and
  - renaming using a reorder buffer and a physical register file equal in size to the logical register file.

Renaming with a Physical Register File
- A free list of unused physical registers is kept.
- New register results are assigned physical registers from the free list.
- Reclaiming of physical registers into the free list:
  - when the usage count is 0 and the logical register has been renamed to another physical register, or
  - a simpler method: free a physical register when the subsequent instruction with the same logical result register is committed.
- The map table is checkpointed at conditional branches.

[Figure: freeing physical registers at retirement. I1 maps R5 to P3; I4 later remaps R5 to P5, so P3 is freed when I4 retires; a subsequent reader of R5 reads P5.]

Renaming with a Reorder Buffer
- Physical registers are allocated sequentially in the reorder buffer.
- Physical registers are freed, and their values are copied to the register file, at retirement.
- The mapping table maps logical registers to entries in the reorder buffer or in the register file.
- Branch handling options:
  - map table checkpoints, or
  - resume renaming from the correct path after the mispredicted branch has retired.

Instruction Issue
- After instructions are fetched, decoded, and renamed, they are placed in instruction buffers where they wait until issue.
- An instruction can be issued when its input operands are ready and a functional unit is available.
- Fig. 8 is an example of a parallel execution schedule.

Instruction Issue (cont.)
- All out-of-order issue methods must handle the same basic steps:
  - identify all instructions that are ready to issue,
  - select among the ready instructions to issue as many as possible,
  - issue the selected instructions (e.g., pass operands and other information to the functional units), and
  - reclaim the instruction window storage used by the now-issued instructions.

Methods of Organizing Instruction Issue Buffers
- A single shared queue (only for in-order issue).
- Multiple queues, one per instruction type.
- Multiple reservation stations, one per instruction type.
  - Fig. 10 shows a typical reservation station.
- A single central reservation station buffer (central window).

Multiple Queues
- Requiring instructions to be issued in order at a functional unit greatly simplifies the identification and selection logic.
- Instructions from different queues could be allowed to issue out of order.
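The four basic issue steps listed above (identify, select, issue, reclaim) can be sketched for a small central window as follows. The window size, issue width, oldest-first selection, and all names are illustrative assumptions; tag matching during wakeup is deliberately elided.

```c
#include <stdbool.h>

#define WINDOW_SIZE 16
#define ISSUE_WIDTH 4
#define NUM_FU_CLASSES 4

typedef struct {
    bool valid;          /* entry holds a waiting instruction          */
    bool src1_ready;     /* set by wakeup when the producer executes   */
    bool src2_ready;
    int  fu_class;       /* which functional unit type it needs        */
} WindowEntry;

/* Wakeup: mark one source operand of a waiting entry as ready once its
 * producing instruction has executed (tag broadcast/compare elided). */
static void wakeup(WindowEntry *w, int entry, int which_src) {
    if (which_src == 1) w[entry].src1_ready = true;
    else                w[entry].src2_ready = true;
}

/* Select: pick up to ISSUE_WIDTH ready instructions whose functional unit
 * is free (oldest-first in this simple sketch) and reclaim their slots. */
static int select_and_issue(WindowEntry *w, bool fu_free[NUM_FU_CLASSES]) {
    int issued = 0;
    for (int i = 0; i < WINDOW_SIZE && issued < ISSUE_WIDTH; i++) {
        if (w[i].valid && w[i].src1_ready && w[i].src2_ready &&
            fu_free[w[i].fu_class]) {
            fu_free[w[i].fu_class] = false;  /* arbitrate the functional unit */
            w[i].valid = false;              /* reclaim window storage        */
            issued++;
        }
    }
    return issued;
}
```

A per-type reservation station design would split this single window into several small ones, which is exactly the simplification discussed in the next slide.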
Reservation Stations
- Benefits:
  - The logic to identify and select ready instructions is simpler, since it need only consider a few locations.
  - Storage can be optimized for each type of functional unit (e.g., stores need not have storage for two source operands).
- Drawback:
  - Storage is statically allocated to functional units; this can result in either wasted storage or a resource bottleneck for some programs.

Central Window
- Benefits:
  - Only one copy of the identification and selection logic.
  - Only one copy of the storage reclamation logic.
  - Dynamically allocated storage.
- Drawbacks:
  - Complex identification and selection logic.
  - Complex storage reclamation logic.
  - Each storage location must be as big as the largest instruction.
  - Functional unit arbitration must be handled.

Memory Ordering
- Stores consist of address and data uops.
- Store addresses are buffered in a queue.
- Store addresses remain buffered until:
  - the store data is available, and
  - the store instruction is committed in the reorder buffer.
- New load addresses are checked against the waiting store addresses. If there is a match:
  - the load waits, or
  - the store data is bypassed to the matching load.
- Fig. 11 shows typical memory ordering logic. (A small store-queue check sketch follows the reading list and homework below.)

Reading Assignment
- T.-Y. Yeh and Y. N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," ISCA 1992.
- S. McFarling, "Combining Branch Predictors," Technical Report TN-36, Digital Western Research Laboratory, June 1993.
- E. Rotenberg et al., "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching."

Homework (due November 14)
1. Out-of-order superscalar architecture is one approach for achieving high performance on general-purpose, non-numeric programs. Another approach is to use very long instruction word (VLIW) architectures, a recent reincarnation of which is explicitly parallel instruction computing (EPIC), implemented in Intel Itanium processors.
   (a) Describe the ideas and the technology behind VLIW and EPIC architectures. Write a comparative analysis of VLIW/EPIC vs. out-of-order superscalar architectures. Compare the mechanisms used for overcoming various performance bottlenecks in the pipeline in both architectures, emphasizing the pros and cons of each approach. Consider technical as well as business factors (e.g., deployment cost, time to market, etc.) in your analysis.
   (b) Out-of-order superscalar processors achieve performance in the presence of long-latency operations, such as loads that miss the first-level cache and floating-point operations, by issuing independent instructions that can execute immediately while buffering instructions that depend on the long-latency operations. The out-of-order hardware looks ahead in the dynamic instruction stream for parallel instructions that can issue, thus keeping the pipeline busy. Conventional out-of-order processors, however, have limited hardware capacity to buffer waiting instructions, and they fetch and commit instructions in program order. As a result, a severe limitation on the performance of many applications is frequently observed due to branch mispredictions and extremely long latency to memory. Various researchers have recently proposed new architectures that relax the in-order fetch and in-order commit restrictions of conventional superscalars. Write a survey of these new architectures and describe how they propose to improve performance in the presence of branch mispredictions and cache misses.
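As referenced in the Memory Ordering slide above, here is a minimal sketch of checking a new load address against buffered store addresses. The queue size and names are illustrative assumptions, and age ordering and partially overlapping accesses are deliberately ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORE_QUEUE_SIZE 16

typedef struct {
    bool     valid;         /* store address has been computed */
    bool     data_ready;    /* store data uop has executed     */
    uint64_t addr;
    uint64_t data;
} StoreQueueEntry;

typedef enum { LOAD_PROCEED, LOAD_WAIT, LOAD_FORWARD } LoadAction;

/* Check a new load address against all buffered (older) store addresses.
 * If a matching store's data is ready, it can be bypassed to the load;
 * if the data is not yet ready, the load must wait; otherwise the load
 * can go to the data cache. */
static LoadAction check_load(const StoreQueueEntry sq[], int n,
                             uint64_t load_addr, uint64_t *forwarded) {
    for (int i = 0; i < n; i++) {
        if (sq[i].valid && sq[i].addr == load_addr) {
            if (sq[i].data_ready) {
                *forwarded = sq[i].data;   /* store-to-load forwarding  */
                return LOAD_FORWARD;
            }
            return LOAD_WAIT;              /* match, but data not ready */
        }
    }
    return LOAD_PROCEED;                   /* no conflict with older stores */
}
```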
Branch Prediction and Trace Cache (PSU ECE 587)

Reducing Branch Costs with Dynamic Hardware Prediction
- Branch prediction basics:
  - We need to predict the conditional branch outcome to select the address of the next instruction fetch: PC + 4, or the branch target address.
  - We also need to quickly determine the branch target address, for:
    - direct branches,
    - register-indirect branches, and
    - returns.

Predicting Conditional Branch Outcomes
- The simplest dynamic branch prediction scheme is a branch prediction buffer, or branch history table.
- It is a small memory (e.g., a 1K-entry prediction buffer) indexed by the lower portion of the branch address.
- It stores previous branch outcomes, which are used to predict the next outcome.
- The memory is not tagged, so a prediction may have been put in the entry by a different branch.

Predicting Conditional Branch Outcomes (cont.)
- A 1-bit prediction buffer stores the last executed branch outcome and uses it to predict the next outcome.
  - If the bit is 1, the branch is predicted taken.
  - If the bit is 0, the branch is predicted not-taken.
- A simple 1-bit scheme has a performance shortcoming. A series of branch outcomes and the corresponding predictions:

    outcomes:    1111011110111101
    predictions: 1111101111011110

  Every not-taken outcome causes two mispredictions: one on the not-taken branch itself and one on the taken branch that follows it.

Predicting Conditional Branch Outcomes (cont.)
- A 2-bit saturating counter is often used instead:
  - Branch taken: increment the state (the maximum state, 11, stays at 11 when incremented).
  - Branch not-taken: decrement the state (the minimum state, 00, stays at 00 when decremented).
- 11 and 10 are predict-taken states; 01 and 00 are predict-not-taken states.

[Figure: 2-bit saturating counter state machine. States 11 and 10 predict taken, states 01 and 00 predict not-taken; a taken outcome moves the state toward 11, a not-taken outcome moves it toward 00.]

Predicting Conditional Branch Outcomes (cont.)
- Assuming the initial state is 11 (3), the branch outcomes and corresponding predictions now look as follows:

    outcomes:    1111011110111101
    states:      3333323333233332
    predictions: 1111111111111111

  Only the not-taken outcomes are mispredicted, so each not-taken outcome now costs one misprediction instead of two.

Correlating Branch Predictors
- 2-bit prediction schemes use the recent behavior of a single branch to predict the future behavior of that branch.
- The behavior of a longer sequence of branch execution history often provides a more accurate prediction.
- The behavior of other branches, rather than just the branch we are trying to predict, is sometimes important:
  - because the outcomes of different branches often correlate,
  - this is global branch history.
- For some branches, the prior execution history of the branch itself is important:
  - because of loops,
  - this is local branch history.

Correlating Branch Predictors: Code Example

        DSUBUI R3,R1,#2
        BNEZ   R3,L1        ; branch 1: skip if (aa != 2)
        DADD   R1,R0,R0     ; aa = 0
    L1: DSUBUI R3,R2,#2
        BNEZ   R3,L2        ; branch 2: skip if (bb != 2)
        DADD   R2,R0,R0     ; bb = 0
    L2: DSUBU  R3,R1,R2
        BEQZ   R3,L3        ; branch 3: taken if (aa == bb)

- If branches 1 and 2 are both not taken, then aa and bb are both set to 0, so branch 3 is always taken; a global history of the earlier branches therefore determines the outcome of branch 3.

[Figure: a correlating branch predictor. The branch address selects a set of 2-bit per-branch predictors, and a 2-bit global branch history register selects which predictor in the set supplies the prediction. (© 2003 Elsevier Science, USA. All rights reserved.)]

Adaptively Combining Branch Predictors
- Some branches are predicted more accurately with global predictors.
- Other branches are predicted better with local predictors.
- It is possible to combine both types of predictors and dynamically select the right predictor for the right branch.
- The selector is yet another predictor, with a 2-bit state machine per entry.

[Figure: state machine transitions for a selector; its states choose between predictor 1 and predictor 2. (© 2003 Elsevier Science, USA. All rights reserved.)]
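To make the prediction-buffer slides concrete, here is a minimal sketch of an untagged table of 2-bit saturating counters indexed by low-order PC bits. The 1K entry count, 4-byte instruction alignment, and names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* 1K-entry branch prediction buffer of 2-bit saturating counters, indexed
 * by low-order bits of the branch address.  The table is untagged, so two
 * branches can alias to the same entry, as the notes point out. */
#define BHT_ENTRIES 1024

static uint8_t bht[BHT_ENTRIES];   /* states 0..3; states 2 and 3 predict taken */

static unsigned bht_index(uint64_t branch_pc) {
    return (branch_pc >> 2) & (BHT_ENTRIES - 1);  /* assumes 4-byte instructions */
}

static bool predict_taken(uint64_t branch_pc) {
    return bht[bht_index(branch_pc)] >= 2;        /* states 10 and 11 */
}

/* Update after the branch resolves: saturate at 3 (11) and at 0 (00). */
static void update_predictor(uint64_t branch_pc, bool taken) {
    uint8_t *ctr = &bht[bht_index(branch_pc)];
    if (taken  && *ctr < 3) (*ctr)++;
    if (!taken && *ctr > 0) (*ctr)--;
}
```

A gshare-style correlating variant would XOR some global-history bits into bht_index, and a combining predictor would keep two such tables plus a selector table of 2-bit counters.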
Branch Target Buffer (BTB)
- A cache that stores branch targets.
- It is accessed by the address of the instruction currently being fetched.
- It allows the branch target to be read in the IF stage.
  - When a branch is predicted taken, the fetch of the instruction at the branch target address can proceed immediately in the next cycle.
  - The stall cycles that would have been needed to wait for the branch to be decoded and its target computed are saved.

[Figure: BTB lookup. The fetch PC is looked up in the BTB. If there is no match, the instruction is not predicted to be a branch and fetch proceeds normally. If there is a match, the instruction is predicted to be a branch and the predicted PC is used as the next fetch address, according to whether the branch is predicted taken or untaken. (© 2003 Elsevier Science, USA. All rights reserved.)]

Predicting Return Addresses Using a Return Address Stack (RAS)
- Indirect branches have multiple potential targets, since the target address comes from a register which can hold many possible values.
- Branch target buffers could be used for indirect branch target prediction.
  - However, many mispredictions can happen, because the BTB can store only one target per branch.
- Most indirect branches come from return instructions.

Return Address Stack
- A small address buffer organized as a stack.
- When a call is encountered, the return address (call address + 4) is pushed onto the RAS.
- When a return instruction is encountered, the address at the top of the RAS is popped and used as the target.

Trace Cache
- Future high performance requires increased instruction fetch bandwidth with minimum latency.
- The paper proposes supplementing a conventional instruction cache with a trace cache.
- Non-contiguous instructions appear contiguous in the trace cache.

Trace Cache (cont.)
- Performance issues emerging as processor issue rates increase:
  - branch throughput: predicting multiple branches, including taken branches, every cycle;
  - non-contiguous instructions and alignment;
  - fetch unit latency.

Trace Cache (cont.)
- One possible solution: allow fetch from multiple non-contiguous blocks.
  - Multiple addresses have to be generated before fetch begins; this implies a level of indirection and additional fetch-stage latency.
  - A multiported or interleaved I-cache is required.
  - Instruction merging and alignment increase fetch latency.

Trace Cache (cont.)
- Basic idea:
  - A conventional instruction cache holds instructions in static program order.
  - The trace cache holds instructions in dynamic program order.
  - See Figure 2 in the paper.

Trace Cache (cont.)
- The core fetch unit:
  - Interleaved sequential: two consecutive blocks can be accessed in the same cycle.
  - Fetch up to the limit, or up to the next taken branch.
  - Limit of 16 instructions and 3 branches.
  - 16-way interleaved BTB.
  - Multiple-branch GAg predictor:
    - global history register,
    - rearranged PHT organization with 8 state machines per pattern.

Trace Cache (cont.)
- Figure 4 shows the core fetch unit and the trace cache.
- Trace cache organization:
  - Up to 16 instructions wide and 3 branches per trace.
  - Contains a fill buffer, the instruction traces, and control information.

Trace Cache (cont.)
- Control state for each trace line:
  - valid bit;
  - tag to identify the trace starting address;
  - branch flags to indicate taken/not-taken directions;
  - branch mask, which defines the number of branches in the trace;
  - trace fall-through address;
  - trace target address;
  - end-of-trace marker.
- The fill buffer has to stop the trace at indirect branches.
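A rough sketch of what one trace cache line's control state might look like, together with a simple full-match hit test. The field widths, struct layout, and names are illustrative assumptions, not the paper's exact encoding.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRACE_INSTRS   16   /* up to 16 instructions per trace */
#define MAX_TRACE_BRANCHES 3    /* up to 3 branches per trace      */

typedef struct {
    bool     valid;                     /* valid bit                          */
    uint64_t tag;                       /* trace starting address             */
    uint8_t  branch_flags;              /* taken/not-taken bit per branch     */
    uint8_t  branch_mask;               /* number of branches in the trace    */
    bool     ends_in_branch;            /* end-of-trace marker                */
    uint64_t fall_through_addr;         /* next fetch addr if last branch NT  */
    uint64_t target_addr;               /* next fetch addr if last branch T   */
    uint32_t instrs[MAX_TRACE_INSTRS];  /* the trace itself                   */
} TraceCacheLine;

/* A hit requires the fetch address to match the tag and the predicted
 * branch directions to match the stored branch flags (full matching;
 * partial matching is one of the design options in the slide below). */
static bool trace_hit(const TraceCacheLine *line, uint64_t fetch_addr,
                      uint8_t predicted_dirs) {
    if (!line->valid || line->tag != fetch_addr)
        return false;
    uint8_t mask = (uint8_t)((1u << line->branch_mask) - 1);
    return (predicted_dirs & mask) == (line->branch_flags & mask);
}
```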
Trace Cache (cont.)
- Trace cache design space:
  - address associativity and path associativity;
  - partial trace matching;
  - indexing method;
  - fill issues (e.g., the number of fill buffers);
  - trace selection (some committed traces are never reused);
  - a victim trace cache;
  - redundancy.

Trace Cache (cont.)
- Simulation processor model:
  - very large instruction window (2048 entries);
  - unlimited renaming;
  - unlimited functional units;
  - perfect data cache;
  - perfect memory disambiguation.
- Results are shown for the trace cache and for multiple-fetch-block cache organizations with 1, 2, and 3 cycles of latency.

Reading Assignment
- J. E. Smith and A. R. Pleszkun, "Implementation of Precise Interrupts in Pipelined Processors," ISCA 1985.
- G. S. Sohi and S. Vajapeyam, "Instruction Issue Logic for High-Performance, Interruptable Pipelined Processors."