COMP ARCHITECTURE CS 538
Popular in Course
Popular in ComputerScienence
This 21 page Class Notes was uploaded by Orrin Rutherford on Wednesday September 2, 2015. The Class Notes belongs to CS 538 at Portland State University taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials see /class/168301/cs-538-portland-state-university in ComputerScienence at Portland State University.
Reviews for COMP ARCHITECTURE
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/02/15
An Introduction to Branch Prediction Page 1 5252007 Branch Prediction Introduction and Survey Dave Archer Portland State University Prediction is very dyficult especially about thefuture Niels Bohr 1885 1962 Exploitation of instructionlevel parallelism depends on two signi cant architectural choices pipelining and parallel superscalar functional units As you ve seen data hazards can reduce the speedup obtained via these mechanisms by forcing pipeline stalls Branches also create hazards known as control hazards A conditional branch redirects the path of execution based on the outcome of a comparison between a dynamic runtime datum and a xed compiletime datum or perhaps between two dynamic data This makes it impossible to know the address of the next instruction until the comparison and target address computations are nished If conditional branches were rare their impact might be tolerated However conditional branches tend to occur very frequently A survey of the ve SPECint92 programs shows that for an 80x86 architecture implementation 20 of all instructions are conditional branches 7 second only to LOAD instructions at 22 in overall frequency This implies that in a pipeline of ve stages or more at least one such conditional branch would be in ight at any time Why does this cause a problem Consider the simple vestage model for a RISC processor A simple unconditional branch injects one stall cycle in order to discard the linear successor instruction to a branch and fetch the true branch target In a more realistic pipeline such as the MIPS R4000 three or more cycles are lost before a branch target calculation can be completed and one more cycle is lost waiting for branch condition evaluation if the branch is conditional To get a mathematical feel for potential impact on pipeline performance let s review some basics Average instruction time Clock cycle time X CPI and Average instruction time pipelined Speeduppipelining Average instruction time pipelined CIOCk cyele tilne unpipelined x CPI unpipelined Clock cycle time pipelined X CPI pipelined CPI pipelined Clock cycle unpipelined X CPI pipelined Clock cycle pipelined When we discuss pipelined architectures we make the assumption that pipelining can ideally achieve a throughput of 1 instruction each cycle in exchange for the complexity of running all the stages of the pipeline in parallel In other words the bene t is that CPI pipelined ideal 1 An Introduction to Branch Prediction Page 2 5252007 In the real world we must account for hazard impacts to pipeline performance so we modify the above to be CPI pipelined 1 Pipeline stalls per instruction Now we can rewrite our speedup equation as CPI unpipelined Clock cycle unpipelined Speeduppipelining x 1 Pipeline stalls per instruction Clock cycle pipelined Pipelined architectures get their speedup by being able to accelerate the clock cycle since the critical path is shortened to a single pipe stage In other words the ideal CPI of an unpipelined processor is also 1 it just runs slower by a factor equal to the pipeline depth This allows us to further simplify our equation by substituting 1 for CPIunpipelined resulting in 1 Clock cycle unpipelined Speeduppipelining x 1 Pipeline stalls per instruction Clock cycle pipelined Given this goal of making the clock cycle faster by the same factor as the pipe depth we can write Clock cycle unpipelined Pipeline depth Clock cycle pipelined and so we get 1 x Pipeline depth Speeduppipelining 1 Pipeline stalls per Instruction For this discussion we restrict ourselves to discussing pipeline stalls due to control hazards or Pipeline depth Speeduppipelining 1 Pipeline stalls from branches and since Pipeline stalls from branches branch penalty X branch frequency we can say for this purpose that An Introduction to Branch Prediction Page 3 5252007 Pipeline depth Speeduppipelining 1 branch penalty X branch frequency Now if we assume the 20 branch frequency noted above and a branch penalty of say 4 cycles in a MIPS R4000 CPU Hennessy2003 then it is easy to see that pipeline speedup can be easily reduced by almost half from the ideal case 1 18 in this example If we take into account that the branch penalty in other real processors may be worse eg the minimum branch misprediction penalty on a PentiumTM 4 is 17 cycles it becomes clear that control hazards particularly re conditional branch instructions must be addressed with high priority Otherwise we waste CPU hardware resources risk fragmentation and higher miss rates in instruction caches and spend time ushing and cleaning up the one or more execution pipelines in our machine We need two things to address the problem time to develop insight into decisions about the thread of control and time to determine where these decisions will take us in the instruction stream Approaches There are several mechanisms we can use to buy the time we need We could stall the pipeline and wait but the impact as shown above would erase a signi cant amount of the gain from ILP exploitation techniques already discussed the ostrich approach restructure the code so as to leave time where we need it the branch slot delay approach run the code in advance to gain knowledge of which way the branches are likely to go in the production run the pro ling approach have the compiler study the code and insert hints about branch direction the analysis approach make some educated guesses and hardwire these into the CPU assuming they will help all code to some degree the static branch prediction approach add hardware to profile the direction that branches take during execution and hope that this gives us insight into what will happen next time a branch is encountered the dynamic branch prediction approach The branch slot delay approach Early generations of RISC processors used 11 approach called delayed branch in order to avoid pipeline stalls due to control hazards In this approach we buy the time required to evaluate the branch by executing one or more instructions after the branch instruction that can safely be executed regardless of the branch outcome These instructions might originally have come before the branch from the branch target ow or from the branch fallthrough ow The compiler uses knowledge of the CPU s branch delay arrangement in order to reschedule instructions at compile time to take advantage of this capability Let s examine how this rescheduling might work The preferred method is to reschedule an instruction from before a branch into the branch slot For example XOR R5 R8 R9 BNZ R8 foo ltdelay slotgt An Introduction to Branch Prediction Page 4 5252007 might simply be rescheduled as BNZ R8 foo XOR R5 R8 R9 This gives the pipeline an extra cycle to analyze the branch and redirect the instruction stream because the XOR can execute in what would otherwise be a wasted slot in the pipeline This method is preferred because it is safe ie the XOR will be executed regardless of the branch outcome It also minimizes code bloat because it requires no replication of instructions copied from either branch successor ow In addition constraints on whether a predecessor instruction can be used in the branch slot are relatively easy to analyze What happens if one of these constraints poses a problem For example suppose the above code were XOR R8 R8 R9 BNZ R8 foo ltdelay slotgt In this case the XOR cannot be moved after the branch since the branch depends on the result of the XOR If we cannot identify a predecessor instruction to move into the branch slot due to constraints like this we can attempt to use an instruction from the branch target location bar ROTL R9 4 R9 XOR R8 R8 R9 BNZ R8 bar ltdelay slotgt could be rescheduled as ROTL R9 4 R9 bar XOR R8 R8 R9 BNZ R8 bar ROTL R9 4 R9 This is true only if we can safely discard the value of R9 when the branch nally falls through That is selecting a branch delay instruction from either successor path can only be done safely when overall execution of the program is unchanged as a result Note that both the analysis required to know that the extra execution of the branch slot instruction is safe along with the replication of the instruction require extra resources In addition the compiler must have a fairly good idea of whether the branch will be taken or not in order to decide which destination instruction to use to ll the branch delay slot As a result this approach is generally considered a backup for the rst approach above An Introduction to Branch Prediction Page 5 5252007 Regardless of how the branch slot is lled there are some limitations on which instructions can be used In addition to the data dependencies discussed above certain classes of instructions are prohibited from lling branch delay slots For example branch instructions are usually disallowed in this role Can you explain why There are other limitations to this approach as well Many real pipelines have much more than a single branch delay slot 7 the MIPS R4000 has three Unfortunately there is a limit to the number of safe branch delay instructions that can be identi ed by a compiler One nal note what happens when the branch direction predicted by the compiler when choosing instructions to ll the delay slot is incorrect In this case execution of the instruction in the delay slot if selected from the fallthrough or branch target path would lead to incorrect results For this reason the compiler often encodes the predicted branch outcome into the branch instruction and hardware is often added to nullify or cancel the branch delay slot instruction if the branch outcome does not match the direction favored by the choice of branch delay slot instructions Branch prediction by hardware In order to predict a change in thread of execution during runtime we must predict in hardware the presence of a branch the outcome of a branch and what the target PC will be Unfortunately our guesses can be wrong in several ways Not recognizing the branch early enough discoverable at decode time Predicting the wrong target discoverable at address generation time Taking the wrong path on a Jquot 39 branch quot 39 39 at 39 time Prediction can be wrong about all of these because the hardware we use must be relatively simple in order to t into the allowed physical space Some of the constraints facing predictors include Virtual addresses are almost always used as indices making the hardware insensitive to page replacement by the virtual memory system Limited storage often results in aliasing which results in interference in branch history information Interference between threads happens because branch prediction hardware is not sophisticated enough to distinguish threads Lack of time to do the prediction work a singlecycle operation can only have a certain level of complexity Why can t we pipeline more complex predictions What crisis happens on the rst wrong guess The result of these constraints and other e ects is some level of inaccuracy in branch prediction no matter how good our algorithms may be In each of the following branch prediction approaches hardware resources are applied to these problems with results that vary from being slightly worse than a random guess to being as much as 9975 accurate At the low end of the range static schemes make very simple hardwired assumptions At the high end the most recent results have not been implemented and some make simplifying assumptions that may not be implementable in hardware yet they give us some indication of what the upper bound may be in prediction accuracy An Introduction to Branch Prediction Page 6 5252007 The static branch prediction approach Static branch prediction is cheap easy to implement correctly and easy to understand In this category are ve approaches which require almost no hardware support and may require varying degrees of compiler support Branch Never Taken7 Just as the name implies this approach assumes that branches in the code are never taken The pipeline continues to fetch instructions from the successor PC If the branch is actually taken the incorrectly fetched instructions must be killed by the hardware This approach was wed in the Intel 80486 processor and displayed accuracy of 40 or so Recall that 50 accuracy is equivalent to a random guess Branch Always Taken7 In this case every conditional branch is assumed taken This makes sense in the context of frequent WHILE structures in the source code As soon as the branch is decoded and the target is known prefetching continues at the target address The Sun SuperSPARCTM used this approach and displayed accuracy of 60 or so BTFN stands for Backward Taken Forward Not 7 In this approach backward branches those with negative branch offsets are assumed to be taken while forward branches are assumed not taken This approach assumes that program execution is dominated by loops especially WHILE constructs This method has produced results in the 65 accuracy range and was used in the HP PA7x00 series processors Software branch prediction hinted to the hardware The above three static approaches assume relatively simple compiler support7 simple organization of the code to take advantage of the hardware s assumptions The following two approaches require relatively more software support Static analysis 7 In this approach the compiler analyzes code structure to predict on a static basis the likelihood that a branch will be taken or not The resulting prediction is then typically encoded into the branch instruction to give the hardware the necessary hint to choose a predicted path Reported results show that static analysis can deliver prediction accuracy in the range of 70 Some of the key work in this area is discussed in Ball95 Branch predictions are largely driven by a set of heuristics in this approach Some branches are predicted based simply on opcode For example since many programs use negative integer values to represent rare error conditions and since some architectures provide conditional branches that test for negative values an example heuristic would specify that a branch testing for a negative value would be predicted not taken Another heuristic proposed by Ball and Larus determines whether a branch chooses between executing or avoiding a loop and predicts for the direction in which the loop executes Yet another heuristic looks for pointer comparisons and makes assumptions on branches when a pointer variable is compared to a null value Dynamic pro ling analysis 7 In this approach the application can be compiled and run while recording history about branch outcomes This information can then be compiled into hints similar to those in static analysis The assumption implicit here is that future runs of the application will have similar behavior to the pro ling run On some applications pro ling has delivered up to An Introduction to Branch Prediction Page 7 5252007 roughly 75 accuracy in branch prediction The cost of pro ling is high however It requires two compilations and at least one execution to gather data and set predictions Software branch prediction requires some measure of hardware support in most cases For example the IBM PowerPCTM architecture includes hint bits in branch instructions to facilitate communication between compiler and hardware on a perbranch basis The Intel ItaniumTM processor also includes hint capabilities which we ll see an example of later in this paper Dynamic branch prediction If 75 is not a good enough guess it isn t 7 we ll see why later we can begin to explore more signi cant hardware additions to dynamically predict changes in the thread of execution At this point a number of complex tradeoffs must be made regarding silicon real estate and power dissipation in order to balance the bene ts of higher pipeline utilization with the costs of additional hardware Per cache line dynamic predictors 7 The simplest form of branch prediction is a single bit that reminds the hardware of how a branch turned out the last time it was encountered This approach was used in the DEC AlphaTM 21064 and in the AMD K5 processor A simple lowcost way of storing this bit that also allows rapid prediction is to store the bit with the relevant line in the instruction cache When a branch resolves the outcome is compared to the prediction bit and a new prediction is written back to the Icache There are a number of problems with this approach First there may be more than one branch instruction in a cache line Branches occur on average every 4 or 5 instructions in real applications Instructions in many RISC machines are 4 bytes in length and most instruction cache implementations in recent memory have cache lines at least 32 bytes in length On average then each line in the Icache might contain 2 branches leading to signi cant aliasing on the prediction bit and poor prediction as a result In addition a single prediction bit will tend to mispredict at least twice on every loop Speci cally the prediction will be wrong on the last iteration of the loop and on the rst It s simple to see that on the last iteration we need different behavior than on the prior iterations so the branch outcome must be different and because of this will be mispredicted by a simple history scheme To understand why the rst iteration will mispredict let s consider that the large majority of loops are nested constructs This means that as we begin iteration of a loop we retain history of the nal Actual outcome Actual outcome Not Taken Taken State ll Prediction Taken State 10 Prediction Taken Actual outcome Not Taken Actual outcome Actual outcome Taken Taken Actual outcome Not Taken Actual outcome ot Taken State 01 Prediction Not taken State 00 Prediction Not taken Actual outcome Taken An Introduction to Branch Prediction Page 8 5252007 iteration of the loop from last time it was executed Since the prior execution terminated the loop and this new rst iteration will most o en continue the loop you can see that each new rst iteration will mispredict with high probability Overall this singlebit misprediction problem will cause the hardware to mispredict branches at twice the rate at which the branch is taken For long loops this may yield acceptable performance However for loops in the range of 10 iterations branch prediction accuracy with this approach falls into the 80 range 7 far below what is necessary to keep a pipeline fully utilized in a modern processor Overcoming the single bit misprediction problem While we cannot be completely accurate using a simple history scheme we ought to at least be able to reduce the mispredict rate to match the rate of loop termination To accomplish this we can introduce the concept of hysteresis into the prediction scheme For example suppose we introduce a counter instead of a single history bit Suppose we increment this counter when a branch is taken and decrement the counter when a branch is not taken Suppose we select the highorder bit of the counter as our branch predictor Then in the steady state this prediction bit will display hysteres39s that is up to a certain level of difference from consistent prior behavior will not change the prediction In practice consider a 2bit counter with the state diagram shown in Figure 1 As you can see if the actual outcome of the branch has been Not taken for a while the predictor is tolerant of a single misprediction without changing its predicted value Similarly if the steady state has been Taken for a while the predictor is tolerant of a single mispredict In general this scheme can be extended to an nbit saturating counter where the prediction rolls over at half of the counter s maximum value However studies Leel984 have shown that a 2bit counter works nearly as well as a much larger counter so 2bit counters are usually chosen for this scheme Branch history table 7 Armed with this improvement to a simple branch prediction bit we can construct the next natural evolution in branch prediction the branch history table This table simply consists of a set of 2bit saturating counters indexed by the PC and located in the Instruction Fetch pipeline stage If a branch is decoded the prediction from the indexed counter is used to fetch the successor instruction whether from the taken or fallthrough path as soon as the target PC is computed This approach yields a prediction accuracy on SPEC89 benchmarks of between 82 eqntott and well over 99 matrix300 on the PowerPC architecture Panl992 Two level dynamic branch predictors aka correlating predictors 7 At this point we have extracted all available information about a speci c branch instruction In order to take the next step in maximizing prediction accuracy we need to look at the branch of interest in the context of surrounding code Speci cally we should expect that there be some correlation between behavior of a branch and others nearby Since eqntott from the SPEC benchmarks appears to be most resistant to a simple prediction scheme we can investigate what could be done to improve performance on this concrete case regarding correlation of branch behaviors The following code sample from eqntott will motivate our case If aa 2 aa 0 If bb 2 bb 39 0 ifaa 7 bb An Introduction to Branch Prediction Page 9 5252007 Note that each ifthen construct corresponds to a conditional branch You can see that the outcome of the third ifthen is tightly related to the rst two For example if both aa and bb are originally 2 then the code block at the bottom will not be executed However a branch prediction scheme that Lower bits ofPC Array of 2bit local branch predictors Most recent branch outcome Shift regiSteI Multinleon Saturation counter logic Branch outcome Branch prediction Figure 2 A Correlating Branch Predictor focuses only on a single branch instruction has no way of recognizing or exploiting this correlation How do we construct a correlating predictor The prediction of a speci c branch outcome should be based on the outcomes of one or more recent branches We can term this an interest in global branch history while our interest in a speci c branch can be termed local branch history We can record the global branch history in a Branch History Register BHR7 simply a shift register into which we shift a l or 0 based on the outcome of every branch encountered Then we can use as many bits of this BHR as we like to help our predictions One way to do this is to use the history information to choose between a set of local predictors for a particular branch The chosen local predictor can then be used for prediction and updated as already discussed The resulting hardware is sketched in Figure 2 The global BHR is used to select one of four BHTs The PC value of a branch is then used to index a speci c local predictor from the selected BHT This local predictor is used for the required prediction and then updated based on the outcome of the branch Bottom line we maintain several local predictors for a particular branch each one related to a global history Pan So and Rameh Panl992 proposed that we might maintain a set of 4 BHTs selected by 2 bits from a BHR and indexed by the PC of the branch An Introduction to Branch Prediction Page 1 0 5252007 In order to compare performance of this approach to that of the local branch predictor we need rst to establish grounds for a fair comparison Since the trade off is between silicon area and prediction accuracy we can use hardware cost as the independent variable and let prediction accuracy be the dependent variable Since most of the silicon real estate in both schemes is consumed by the required storage cells we can use the number of stored branch prediction bits as the basis for fair comparison Let s assume a 4096bit branch history table using 2bit saturating counters and a 4096bit correlating predictor using 2 bits of history and 2bit local saturating counters An extension of the study noted above showed that on the SPEC89 benchmark suite using the PowerPC architecture the correlating predictor at worst matches the performance of the simple history table and at best cuts the mispredict rate by a factor of 3 The correlating predictor shows prediction accuracy on these benchmarks of from 89 to well over 99 The poorest performing benchmark for the local predictor eqntott improves the most with prediction rate rising from 82 to 94 The Intel PentiumTM Pro and the AMD K6 used this approach Yale Patt and TseYu Yeh Yehl993 generalized the correlated predictor and produced some very interesting results They considered a scheme where the history shift register could be either global as in the original correlated predictor scheme or local an array of history shift registers indexed by the PC For the branch predictor counters they also considered a global array indexing by history shift register only vs a local array indexing by PC and history register The nomenclature they developed speci ed the choice of branch history shift register as either P shift register per branch or G one global history shift register In addition they speci ed the choice of pattern Most recent branch outcome History register Branch outc ome Branch prediction GAg Predictor An Introduction to Branch Prediction Page 1 1 5252007 Lower bits ofPC History register History register History register History register History register History register History register S aturation c ounter logic Branch outcome Branch prediction PAg Predictor table which we have referred to so far as saturating branch predictor as either p predictor per branch or g one global array of predictors The four possible combinations are GAg G and g 7 one global history shift register indexing one global array of branch predictors PAg P and g 7 an array of history shift registers indexed by PC which in turn indexes one global array of branch predictors PAp P and p 7 an array of history shift registers indexed by PC which selects one of a set of branch predictors indexed by PC An Introduction to Branch Prediction Page 1 2 5252007 The fourth combination GAp G and p corresponds to the correlating branch predictor already discussed Figure 2 Results on performance were nonintuitive It turns out if sufficient storage is allocated say a total of 128kb including branch predictors and shift registers the PAg scheme wins out delivering in the range of 965 accuracy running on a Motorola 88100 instruction level simulator for 9 of the SPEC89 benchmarks Yehl993 This is particularly interesting because of the obvious risk of interference due to many local history registers potentially selecting the same global predictor Tournament predictors 7 Additional accuracy can be achieved by implementing both global and local branch predictors and using other levels of branch predictors to select between them Most current tournament predictors use a 2bit saturating counter a predictor to choose between a global and local prediction scheme based on prior prediction accuracy These predictors show a slight improvement over correlating predictors asymptotically achieving 975 or so on the SPEC89 benchmark suite overall compared to 965 for correlating predictors and around 93 Array of 2bit local branch predictors Lower bits ofPC HistorV register HistorV register HistorV register HistorV register turation counter logic HistorV register HistorV register HistorV register Branch outcome Branch prediction PAp Predictor An Introduction to Branch Prediction Page 1 3 5252007 for local predictors The AlphaTM 21264 processor used a multi level tournament predictor 7 the most complex of its kind implemented up to this time so far as we know A description of this predictor can be found in Hennessy2003 Benchmark results for branch predictors vary from study to study and from benchmark to benchmark so it s not possible or meaningful to report a general result that matches any speci c study Instead we summarize branch prediction techniques by saying that over a set of workloads meant to be representative of a broad range of business and scienti c computation the techniques we ve discussed have roughly the following accuracies Helping the Hardware Even the best existing hardware prediction can still use help from someone with broader context or speci c knowledge Recognizing this some architectures allow the programmercompiler to explicitly hint in the instruction stream about how to handle a branch For example the Intel Itanium processor architecture supports a quotbranch whether hint bwh The compiler can indicate several hints to the CPU predict the branch statically s or dynamically d by the hardware predict p if a branch would be taken tk or not taken nt the rst time The following code shows a test of register r32 followed by a branch if not equal to zero The branch is indicated as br dptkfew ltlabelgt The hint quotdptkquot tells the CPU to use dynamic prediction hardware and assume that the branch is taken the rst time The sample code block is as follows An Introduction to Branch Prediction Page 1 4 5252007 if r32 0 then quottake branch pathquot branchitest mov r8 aritc start time cmpeq p0p6 r32 r0 p6 brcondspntfewclr labelia change hint here mov r9 aritc end time not taken sub r8 r9 r8 brretsptkfew rp a labelia Future work Are we there yetquot Our tour of branch prediction has taken us from 40 accuracy to 975 accuracy on real benchmarks and real machines Even this is not enough to keep up with superscalar deeppipeline architectures To understand why let s return to our pipeline speedup equation Pipeline depth Pipeline speedup 1 branch penalty X branch frequency and plug in some numbers Earlier we assumed that branch frequency was simply the rate of branches in the code However we need to be more accurate branch frequency is the rate at which the pipeline encounters branches In the case of a single issue pipeline these rates are the same However in a superscalar pipeline issuing 3 or 4 instructions per cycle the pipeline encounters branches at arate 3 or 4 times the 20 we mentioned earlier For the sake of example let s assume branch frequency 06 a machine effectively issuing 3 instructions per cycle Now branch frequency in this context is really the frequency of branches not compensated for by our predictiontarget cache scheme That is Branch frequencyeff Branch frequency X 1 prediction accuracy Or Branch frequencyeff 06 X 0025 0015 This doesn t seem so bad However remember that the branch penalty in complex pipelines can be large Let s assume the 17 cycles from the PentiumTM 4 example Now we can write our equation as Pipeline delth Pipeline depth Pipeline speedup 1 17 X 0015 1255 An Introduction to Branch Prediction Page I 5 5252007 The result A 20 performance impact even with 975 accurate prediction A quick survey of future possibilities in branch prediction can show the way to prediction accuracy in the range of 9975 another order of magnitude decrease in misprediction So far these approaches have not been implemented to our knowledge in real hardware due to excessive hardware cost and or impractical complexity Neural pattern recognition approaches are the focus of most branch prediction research today The perceptron Block62 predictor JimenezZOOl is a classi er that uses a simple linear neuron but has latency too high to be practical in the effort to improve pipeline performance This approach has been extended by Loh and Henry Loh2002 and used as part of a hybrid predictor Intel includes a perceptron predictor in an Itanium architecture simulator for future research In order to understand how a classi er like a perceptron can help the branch prediction problem consider the set of all program paths leading to a branch say B Each of these paths can be represented by a branch history a binary string recording the taken or not taken outcome of branches along the program path Suppose we look at histories of length 2 That is we have a history recording the outcome of the last 2 branches before B for all program paths leading to B We can plot these histories on a 2 dimensional graph They might look like this In this case we have two program paths leading to B for which we ve recorded a history of length 2 The data are colorcoded to indicate which paths lead to B ultimately being taken black and which lead to B not being taken white In this simple example it is easy to see the induced decision surface A new history which lies above the line shown will predict the branch taken a new history which lies below the line will predict the branch not taken This simple classi er is called linear because the decision surface induced is linear Branch B 2 T aken Not taken Induce d decision surface Not taken Branch Bl An Introduction to Branch Prediction Page 1 6 5252007 Now that we have a simple mechanism it can be extended Each additional branch recorded in the history causes us to add an additional dimension to the above picture Each additional program path will add one colorcoded datum in each of these dimensions The resulting data will be seen as two clouds in nspace 7 one black and one white These clouds may overlap or may be distinct The trick now is to draw an n dimensional decision surface a hyperplane to best divide these data into the two classes Taken and Not Taken The complexity and potential shortfall of the approach is that the hyperplane must be built in advance by training the system on a number of benchmark programs The accuracy of branch prediction by this method is strongly impacted by the similarity between the training benchmarks and the target applications This linear percepton approach can be further extended Consider that a linear perceptron cannot learn an XOR function That is if we swap either of the black dots above with either of the white dots it is no longer possible to draw a line to separate classify the outcomes However a piecewise linear model can be applied allowing us to build nonlinear discriminators out of piecewise linear elements 7 a relatively tractable problem This concept has been developed by Daniel Jimenez at Rutgers University Jimene22005 and shows signi cant improvement over simple linear perceptron modeling A nal extension to consider is motivated by the fact that even a piecewise linear perceptron model is only as good as the training it gets It is often the case that a trained perceptron is very accurate at classifying some datasets but its training does not prepare it for signi cantly different datasets It can also be the case that data from a large and varied benchmark set may result in poor classi cation One way to solve this problem is to train several perceptrons on different likely datasets and then use another simpler classi er to choose between them In the signal processing domain this approach is termed mixture modeling because analysis is accomplished by applying a mixture of models depending on the data To date this mixture model approach has produced the best results in simulation though none have been implemented in real processors Best in class today At the Micro37 conference in December 2004 a branch predictor contest was held to baseline the current state of the art Micr02004 Most of the entrants were variants on perceptron neural predictors The contestants were given a set of traces for the initial rounds of the contest and a secret set of traces was reserved for the fmalist round Details of the implementations are beyond the scope of this paper The workload was varied and included integer heavy and oatingpoint heavy benchmarks No limit was imposed on complexity of the predictors The six nalists produced results that varied between 9919 and 9974 accuracy on average for the secret benchmark suite The winners were Hongliang Gao and Huiyang Zhou from the University of Central Florida using a perceptronbased predictor that adapts using a workload detector a mixture model as described above to classify the running program into one of four categories and changes the historygathering mechanism accordingly Branch target computation and caching So far we have focused on predicting whether or not a branch is taken As you can guess it isn t enough simply to predict the correct outcome of a branch In order to fetch the right instruction without a pipeline stall we also need to know the destination of the branch For example in the classic vestage pipeline model the new target PC must be known by the end of the instruction An Introduction to Branch Prediction Page 1 7 5252007 Branch Branch Target Address Hint PC Tag Hit L ogic Simple Branch Target Buffer BTB fetch stage in order to fetch the next instruction in time This implies that we need to know whether an as yetundecoded instruction is a branch or not and if so what its target PC will be We can address this issue with you guessed it more hardware this time a Branch Target Buffer The branch target buffer is a cache addressed by the PC of the upcoming instruction The tags of this cache are the PCs of known branch instructions The data in the cache are predicted PCs on the branch taken path Note that there is no need to cache target PCs for the fallthrough case because the instruction fetch stage will automatically compute the fallthrough address as it would for any successor instruction The presence of a given tag in the BTB implies that the associated instruction is a branch and that the branch is predicted taken Performance of a 4way set associative target cache and a table size of 1024 entries or greater shows a negligible tag miss rate 1 or less A BTB is shown below However for indirect branches that can resolve to more than one address there is a substantial risk that the wrong target address is cached which we ll discuss below An interesting variant on the branch target buffer is to cache not the target PC but the instructions at the target This approach can be used if the btb access cannot be accomplished in a single pipeline cycle enabling larger btb s than would otherwise be possible This approach also offers the possibility of executing some branches in zero cycles via a technique called branch folding Brie y if the btb buffers the instruction at the target of the branch the instruction decode unit can simply be handed the target instruction of the branch instead of the branch instruction This works because the only function of the branch is to update the PC leaving the pipeline otherwise free for computation This works for unconditional branches and may be made to work for conditional branches in some architectures An Introduction to Branch Prediction Page 1 8 5252007 One additional issue with regard to branch target caching What if the branch instruction is an indirect branch eg an implementation of a SWITCH statement or an indirect call to a dynamically linked library As seen above the branch target cache has a good chance of caching the right branch PC However it may have the wrong branch target if it is an indirect branch that can resolve to more than one address How often does this occur L12005 shows us that for C language SPECINT95 benchmarks about 9 of branches are indirect For interpreted languages however the number can be worse The same study shows that for a crosssection of SPEC JVM98 benchmarks running in interpretive mode about 20 of branches are indirect This implies that indirect branches are frequent enough that they may show noticeable impact on the performance of a BTB As we might expect Li shows that the reason for visible impact is not the hit rate for an entry pertaining to the PC but rather the hit rate of the right target address For SPEC JV M98 benchmarks and a reasonably sized BTB the miss rate for the branch instruction is negligible as discussed above but the miss rate for the right target PC for the branch instruction is very high 7 ranging from 20 to near 100 for Java in interpretive mode Though this miss rate falls to around 10 when Java code is run in JIT mode the prevalence of Java running in interpretive mode makes this issue a cause for concern An Introduction to Branch Prediction Page 1 9 5252007 What this means is that it is not enough to use a simple BTB approach to track the most recently used branch target address This should feel somewhat familiar In branch prediction the direction of the branch was not xed at runtime and we found that it was not enough to track the most recent branch direction However we found a solution the prior branch history correlated strongly to the actual branch direction The same should intuitively be true here we should expect that the execution path history leading up to an indirect branch would have a strong correlation to the target address of the branch This approach has been proposed with at least two distinct implementations the target cache and the rehashable BTB The target cache approach is illustrated below In this approach the BTB is used normally unless the branch type eld indicates that the branch is indirect In this case a parallel lookup is done by hashing the PC value and the contents of the target history register and indexing into a separate target cache While this approach offers Branch Branch Target Address Hint PC Tag Hit L ogic Target History Hashed Branch Target Hit L ogic Branch Target Buffer with Target Cache An Introduction to Branch Prediction Page 20 5252007 signi cant bene t in branch target mispredict reduction see discussion below it has the disadvantage that the size of the target cache is xed at hardware design time The rehashable BTB approach in Li2005 obviates the need for a potentially large separate target cache by maintaining a small list of critica mispredicting indirect branches which can be used to improve hit rates on a BTB It improves on the BTB TC scheme by not requiring a large dedicated resource to address indirect branch target prediction Results from SPEC JVM98 trials using a standard BTB approach and a BTB with additional target cache TC for indirect branches is shown here from Li2005 A traditional BTB is simulated with 2048 entries The corresponding BTB with target cache is split so that the overall resource cost is roughly the same with half of the resource dedicated to improving indirect branch target prediction rates type entries The reduction in misprediction on interpreted Java benchmarks is substantial across the board indicating that the correlation principle applies here as is seemed it should In terms of overall branch target misprediction these results contribute to a signi cant improvement as well On a set of benchmarks particularly sensitive to indirect branches overall branch target misprediction can be reduced to under 10 for interpreted code and under 2 for compiled code Summary We have followed the evolution of branch prediction and discussed potential future improvements and seen branch prediction rates improve from 40 to over 997 as a result We have also taken a look at branch target buffers and shown that for nonpolymorphic branches branches with only one possible target PC hit rates can exceed 99 Taking into account polymorphic multitarget branches and ways of addressing these we can see compiled code getting accurate branch and target prediction in the 98 range There is still work to be done particularly in the area of branch target prediction However we ve seen a path to limit the inef ciencies of pipeline hazards due to branches to a penalty of 20 or less from the ideal 7 a signi cant improvement in the past 10 years of work An Introduction to Branch Prediction Page 21 5252007 References Ball95 Thomas Ball and James R Larus Branch Prediction for Free Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation 1993 Block62 H D Block The perceptron A model for brain functioning Review of Modern Physics 34123135 1962 Gao2005 Hongliang Gao and Huiyang Zhou Adaptive Information Processing An Effective Way to Improve Perceptron Predictors Presented at Micro37 December 2004 Hennessy2003 John L Hennessy and David A Patterson Computer 39 quot A Quantitative Approach 3rd edition Morgan Kaufmann 2003 Jimenez2005 Daniel A Jimenez Piecewise Linear Branch Prediction In Proceedings of the 32nd Annual International Conference on Computer Architecture June 2005 Leel984 J Lee and A J Smith Branch Prediction Strategies and Branch Target Bilfer Design IEEE Computer January 1984 Li2005 Tao Li Ravi Bhargava and Lizy Kurian John Adapting Branch Target Bu er to Improve the Target Predictability of Java Code ACM Transactions on Architecture and Code Optimization June 2005 Loh2002 Gabriel H Loh and Dana S Henry Predicting conditional branches with fusion based hybrid predictors In Proceedings of the 11th Conference on Parallel Architectures and Compilation Techniques September 2002 Micro2004 Proceedings of the 37th Annual IEEEACM International Symposium on Microarchitecture December 2004 Pann1992 ST Pan K So and J T Rameh Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation In Proceedings of the 5111 International Conference on Architecture Support for Programming Languages and Operating Systems Oct 1992 Yeh1993 TY Yeh and Yale N Patt A comparison of dynamic branch predictors that use two levels of branch history In Proceedings of the 20111 Annual International Symposium on Computer Architecture May 1993
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'