New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here

Computer Architecture

by: Furman Breitenberg

Computer Architecture EEC 270

Furman Breitenberg
GPA 3.93


Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Class Notes
25 ?




Popular in Course

Popular in Engineering Electrical & Compu

This 35 page Class Notes was uploaded by Furman Breitenberg on Wednesday September 9, 2015. The Class Notes belongs to EEC 270 at University of California - Davis taught by Staff in Fall. Since its upload, it has received 60 views. For similar materials see /class/191947/eec-270-university-of-california-davis in Engineering Electrical & Compu at University of California - Davis.

Similar to EEC 270 at UCD

Popular in Engineering Electrical & Compu


Reviews for Computer Architecture


Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/09/15
Branch Prediction and Multiple Issue Processors Venkatesh Akella EEC 270 Winter 2004 Based on Material provided by Prof Al Davis and Prof David Culler Branch Prediction Size of basic blocks limited to 4 7 instructio s Delayed branches not a solution in multiple issue processors Why Hard to find independent instructions and remem er e mess they create or precise exceptions resolve a branch need two things a branch target address and b branch direction Prediction deals with b Ie getting the direction Branch Penalty is governed by a Deeper pipeline bad news as BP is higher Static Branch Prediction Let the compiler figure out the branch direction for each branch instruction Three strategies a Always Predict Taken Misprediction is 34 b Forward Not Taken Backward Taken Misprediction is 10 40 c Profile driven using realistic benchmarks and real data and for each branch determine the direction Hennessey at McFarling and Lar39us and Ball Dynamic Branch Prediction Run Time Hardware assisted Intuition branches direction is not random they are DAL ie either strongly taken or not taken Onebit Branch Prediction Buffer or Branch History Table our Smith 1931 ken 0 Not Taken Update BHT when v Past a good a good indicator of the future on make a mistake what are the problems a Aliasing due to limited size of the BHT tag can be stored to avoid this problem b 1bit history may not be sufficient Eg oonsider a loop that iterates 10 times You will mispredict 210 so accuracy is EOX Page ltgt Dynamic Branch Prediction Jim Smith 1931 Enter Solution 2bit scheme where change prediction only if gut mispradiction twice Predict Taken Predict Not Taken Adds hysteresis to decision making process 2 bit Counters Kbits 2k 2 bit saturating counters Upto 9351 accuracy If K is sufficiently large each branch maps to a unique counter Can store tags if you want to avoid aliasing How about mbit counters 1 Doesn39t bencfit much How do you improve further Can we capture the actual history of the specific branch and use that to make our prediction LOCAL HISTORY Can we capture the sequential correlation between branches GLOBAL HISTORY do both Make multiple predictions and choose the right prediction based on the context of the particular branch TOURNAMENT predictors Using Local History Consider the simple for loo FOR I 1 Ilt5 I something If the branch is at the end of the loop body it has following pattern 1110 N The sequence of the branch history is 11101110111011101110 Basically if we know wint the branch did the last three times we can predict EXACTLY what it will do next 1 accuracy 3bit local history accesses Page ltgt Global History Correlated Branch HYbf id Predictor Prediction If x 1 Bl I if x gt 1 32 GSelect Predictor Gshare McFa IIng Observation if BI is taken then 32 is not taken kbit counters This is a characteristic of structured programming neste proce re calls and nested conditionals 50 PC whether 32 is taken or not is related to the previous 39quot branch Bl global history mum n 1bit or 9H3 2bit or mbit quot lobal history Buffer K39 5quot counter shift register Used in Pentium Athlon So it may not be Ultras arc p The current branch 981 CUWY Processor From End Correlating Branches Idea takennot taken of recently execu branches is related to behavior of next branch as well as the history of calem 351 i that branch behavior E E E E Then behavior of recent E E E E mssi39mm E E EE39 branch updating just that E E E E pr ictwn E E E E 22 predictor 2 bit l39 l l39 global 2 bit local Page ltgt Accuracy of Different Schemes Flgure 315p 206 4096 Entries 2bit BHT Unlimited Entries 2bit BHT 1024 Entries 22 BHT in x 2 2 1 a i a 2 2 0 gt u x a 3 cr a i u 4096 armies 27m Der eniry a Urlimilsl mines zbisemry mm was 22 i Reevaluating Correlation Several of the SPEC benchmarks have less than a dozen branches responsible for 90 of taken branches program branch static 90 compress 14 236 13 14711011 15 M i gcc 15 9531 2020 mpeg 10 5598 532 real gcc 13 17361 3214 Real programs 05 more like gcc Small benefits beyond benchmarks for correlation problems with branch aliases BHT Accuracy Mispredict because either Wrong guess for that branch Got branch history of wrong branch when index the table 4096 entry table programs vary from 1 misprediction nasa7 tomcatv to 18 eqntott with spice at 9 and gcc at 12 For SPEC92 4096 about as good as infinite table Tournament Predictors Motivation for hybrid branch predictors is 2 bit predictor failed on important branches by adding global information performance improved Tournament predictors use 2 predictors 1 based on global information and 1 based on local information and combine with a selector Hopes to select right predictor for right branch or right context of branch Page ltgt Tournament Predictor in Alpha 21264 4K 2bit counters to choose from among a global predictor and a local predictor Global predictor also has 4K entries and is indexed by the history of the last 12 branches each entry in the global predictor is a standard 2bit predictor 12 bit pattern ith bit 0 gt ith prior branch not taken ith bit 1 gt ith prior branch taken Local predictor consists of a 2level predictor Top level a local history table consisting of 1024 10 bit entries each 10 bit entry corresponds to the most recent 10 branch outcomes for the entry 10 bit history allows patterns 10 branches to be discovered and predicted Next level Selected entry from the local history table is used to index a table of 1K entries consisting a 3 bit saturating counters which provide the local prediction Total size 4K2 4K2 1K10 1K3 29K bits 180000 transistors of predictions from local predictor in Tournament Prediction Scheme of Branch Prediction III 7 I 2bit counter Tournament 39 fig 340 I I lllllll Profile branch profile from last execution static in that in encoded in instruction but profile Accuracy v ize PECS9 imil il li dqeaiq a Page ltgt Need Address Branch Target Buffer at Same Time as Prediction Branch Target Buffer BTB Indexed using the branch I 50513 in The IF Stage initruftion Address to get prediction AND branch address if This is a cache Need The fags as we to en Need to look up whole PC last bits won39t do because this stage we do not know the opcode yet Accessed in IF stage Branch PC Predicted PC Need to keep only predict taken branches only others follow normal fetch sequence HDLBA uogpnuIsu I0 34 Yes instruction me is branch and prediction state No branch not bits predicted proceed normally use Predided PC Next PC PC4 5 quot2 quot PC 39 9 How Branch Target Buffer Is Used Predicafed Execu on Avoid branch prediction by turning branches into conditionally executed instructions if x then A B op C else NOP If false then neither store result nor cause exception Expanded ISA of Alpha MIPS PowerPC SPARC have conditional move PA RISC can annul any following instr This transformation is called quotif conversionquot Drawbacks to conditional instructions Still takes a clock even if quotannulledquot Stall if condition evaluated late Complex conditions or condition becomes known late mm ammmmmm WWW in pipeline gt reduces effectiveness Page ltgt Branch Folding Branch Folding Instead of storing Next PC or B ow about storing t e target instruction itself or multiple instructions if it is a multi issue processor Ea L2 b L m L add R1 R2 R3 At address corresponding to L2 you store the add instruction instead of the unconditional branch iristructio L ZERO cycle BRANCH eliminated one instruction all together Advanced Approaches Trace Caches aggressive prefetching Return Address Caches jr Ra when Ra is return address of a procedure 85 of indirect jumps are due to procedure returns BTB does not work very well because procedure is called from many different laces So you a separate stack cache to push Ra and pop them off Special Case Return Addresses Register Indirect branch hard to predict a ress SPECB9 85 such branches for procedure return Since stack discipline for procedures sa r ad 39n ve small buffer that acts like a stack 8 to 16 entries has small miss rate Pitfall Sometimes bigger and dumber 21264 uses tournament predictor 29 Kbits Earlier 21164 uses a simple 2 bit predictor with 2K entries or a total of 4 Kbits SPEC95 benchmarks 22264 outperforms 7 21264 avg 115 mispredictions per 1000 instructions 7 21164 avg 165 mispredictions per 1000 instructions Reversed for transaction processing TP 7 21264 avg 17 mispredictions per 1000 instructions 7 21164 avg 15 mispredictions per 1000 instructions TP code much larger 51 21164 hold 2X branch predictions based on local behavior 2K vs 1K local predictor in the 21264 Page ltgt Dynamic Branch Prediction Summary Prediction becoming important part of scalar execution Branch History Table 2 bits for loop accuracy Conelation Recently executed branches correlated with next branch 7 EithI diffInt branches 7 different executions of son branches Tournament Predictor more resources to competitive solutions and pick between them Branch Target Buffer include branch address a prediction Predicated Execution can reduce number of branches number of mispredicted branches Return address stack for prediction of indirect jump Multiple Issue Goal how to reduce CPI below 10 Consider two consecutive blocks of instructions Gj i1 i2 i3 i4 and Si i5i6 i7 i8 Gj is already in execution 1 Fetch 6i Check for all structural hazards that instructions in Gj may 39 Check for data hazards between 6i and between instructions in Si and SJ N all a Read operands and execute Flavors of Multiple Issue Processors Vector execute a loop in parallel directly on array data structures Slperscalar 7 Static inorderexecution if 15 has a problem HALT ls sun A sun 11111 7 Dynamic outoforder execution let 16 if 15 has a resource conflict N Speculation If i5 is a branch do not allow 16 till branch is resolved IaM law 2 With Speculation Allow 15 but be prepared to rol back Pentium 4 Pentium 4 Alpha 21264 MIPS R10K VLIW e Compiler determines what to execute in parallel Trimedia e EPIC basis for Itanium Multiple Issue Headaches Increased I Cache Fetch BW Alignment problems may not allow 4 instructions to be fetched Need to check for more hazards Branches 25 of instructions are branches so you need to resolve a branch every cyc e Increased ports on register file and memory So how do we proceed 1 Pipeline the Issue unit into 2 stages 2 Restricted Issue eg one int and one FP Getti CPI in Multiple InstructionsCycle Superscalar MIPS 2 instructions 1 FP a 1 anything Fetch 64bitsclodlt cycle Integer on left FP on right Can only issue 2nd instruction if 1st instruction issue More ports for FF registers to do FP load a FP op in a pair Pipesfagz Int Instruction IF ID EX MEM WB FP instruction IF ID MEM WB Int instruction IF ID EX MEM WB FP instruction IF ID EX MEM t instruction F F instruction 1 cycle load delay emands to 3 instructions in 55 h s instruction in right alf can39t use it nor instructions in next slot Multiple Issue Issues issue packet group of instructions from fetch unit that could potentially issue in 1 clock 7 If instruction causes str due to issue p uctural hazard or a data ha uction in execution or to earlier i instruction does not issue earlier instr acket then s o to N instruction issues per clock cycle for Nissue zard either nstruction in Performing issue checks in 1 cycle could limit clock cycle time 0n2 n comparisons e s gt issue stage usually split and pipelined 7 1st stage decides how many instructions from within this packet can issue 2nd stage examines hazards among selected instructions and those already been iss e s gt higher branch penalties gt prediction accuracy important Multiple Issue Challenges While IntegerFF split is simple for the HW get CPI of 05 only for programs with s Exactly 50 FP operations AND No hazards If more instructions issue at same time greater difficulty of decode and issue 7 Even 2scalar gt examine 2 opoodes 6 register specifiers amp decide if 1 or 2 instructions can issue Nissue oN2N comparisons 7 Register file need 2x reads and 1x writescycle Multiple Issue Headaches e logic must be able to rename s me register multiple times in one cycle For instance consider 4 way issue add 1 2 3 sub 4 1 2 w 1 44 1 2 add p11 p4 p7 i 911 p4 4p22 add p12 p23 p4 Imagine doing this transformation in a single cycle PM 2 o 3951 NN u Result buses Need to complete multiple instructionscycle So need multiple buses with associated matching logic at every reservation station Or need multiple forwarding paths Page ltgt Dynamic Scheduling in Superscalar eas How to issue two instructions and keep inorder instruction issue for Tomasulo r Assunu 1 intlgI e 1 floating point r 1 Tomasulo control for integer 1 for floating po39nt Issue 2x Clock Rate so that issue remains in order anly locusstores might cause dependency between integer and F issue 7 Replace load reservation station with a load queue operands must be read in the order they are fetched 7 Load chedlts addresses in Store Queue to avoid RAW violation 7 Store checks addresses in Load Queue to avoid WARWAW Register renaming virtual registers versus Reorder Buffers Alternative to Reorder Buffer is a larger virtual set of registers and register renaming Virtual registers hold both architecturally visible registers temporary values 7 replace functions of reorder buffer and reservation station Renaming process maps names of architectural registers to registers in virtual register set 7 changing subset of virtual registers contains architecturally visible registers Simplifies instruction commit mark register as no longer speculative free register with old value Adds 40 80 extra registers Alpha Pentium 7 Size limits no instructions in execution used until commit How much to speculate Speculation Pro uncover events that would otherwise stall the pipeline cache misses Speculation Con speculation costly if exceptional event occurs when speculation was incorrect Typical solution speculation allows only low cost exceptional events 1st level cache miss When expensive exceptional event occurs 2nd level cache miss or TLB miss processor waits until the instruction causing event is no longer speculative before handling the event Assuning single branch per cycle future may speculate across multiple branches Limits to ILP Conflicting studies of amount B nc marks vectorized Fortran FP vs integer c programs 7 Hardware sophistication s Compiler sophistication How much ILP is available using existing mechanisms with increasing HW budgets Do we need to invent new HWSW mechanisms to keep on processor performance curve 7 Intel MMX SSE Streaming SIMD Extensions 64 bit int 7 Intel SSEZ 12E bit including 2 64bit Fl Pt per clock 7 Motorola AltaVec 12E bit int and FPS s Snpersparc Multimedia ops etc Page ltgt Limits to ILP Initial HW Model here MIPS compilers Assumptions for idealperfect machine to start 1 Register renamiry infinite virtual registers gt all register WAW amp WAR hazards are avoided 2 Branch predictan perfect no mispredictions 3 Jump predicton all jumps perfectly predicted 3 gt machine with perfect speculation amp an unbounded buffer of instructions available 4 Memoryaddress aias analysis addresses are known amp a store can be moved before a load provided addresses not equal lso unlimited number of instructions issuedclock cycle perfect caches 1 cycle latency for all instructions FP IPC Instruction Issues per cycle on D Upper Limit to ILP Ideal Machine Flgure 335 p 242 an 0 1501 FP 75 150 1187 4 o N o Integer 18 60 o o o gcc espresso Ii fPPDP doducd tomcatv Programs More Realistic HW Branch Impact Figure 337 Change from Infinite pp 15 45 so window to examine to a as 2000 and maximum 507 issue of 64 u As 45 As 45 45 Instructions per clock 40 cycle as Integer 6 12 u 207 a 9 12 l 5 l3 l4 9 in mi 5 e 7 s a 7 2 2 2 A II gcc espresso IPC inuuucuun Issues per cycle 2 I mm doducd tnmcatv Program I Perfect l Selective predictor lStandar Z39blt Static Nane Perfect Tournament BHT 512 Profile No prediction IISECQ Issues per cycle More Realistic HW Renaming Register Impact Figur234l 39 Change 2000 instr window 64 instr FP 11 45 39 issue 8K 2 level Prediction I II Integer 5 15 qumm limnme manually32w Infinite 256 128 64 32 None Page ltgt IPC More Realistic HW Memory Address Alias Impact Change 2000 instr II in wmdow 64 instr issue FF 4 45 gquot 7 362 level Prediction Fortran m T renaming registers no heap ii lnteger 4 9 E II 239 i I I I I I I I II Prngram I Perfect Globalstack Perfect l lnspectlon Cl None Realistic HW Window Impact Flgure 346 quot erfect disambiguation lHW 1K Selective rediction 16 entry return 64 registers issue 5 many as window E I it cycl i 3 U Equot a I I II Ilnl39lmleZSE liza C E4 I32 Die la 4 39 niteZSG 128 64 32 16 8 4 How to Exceed ILP Limits of this study WAR and WAW hazards through memory eliminated WAW and WAR halards on registers through renaming but not in memory usage Unnecessary dependences compiler not unrolling loops so iteration variable dependence Overcoming the data flow limit value prediction predicting values and speculating on prediction Address value prediction and speculation predicts addresses and speculates by reordering loads and stores could provide better aliasing analysis only need predict if addresses Use multiple threads of control Workstation Microprocessors 32001 Alpha Inlel MD 12643i Adlloll 1 PA ll1Eulluiinl clocklme a 31mm new i 552MHz i 45mm 10012 15GH1 400MHz i 480MHz i so Cache llDLZ e4iue4K saxsame 392l2KlM 32we4elt lt l JbK I s til m 32mm irK ex 31 Issue Rate 4 issue 3 x86 I sl I 4 ism I 4 issue 3 xasiiim 3 x Rap 4 issue up 4 Pipeline Stages 79ltlxgrs 91l ldgu 7i my 75 slagos 1714 sing liIdsmgrs mgrs N95mgcs Hl olovd aoi 7 ROPS 5 l I 32 mm R0 P5 18l no l Rename legs illAll io l3 5K l lal llG Inlltl lp 1U lolal UH lnlal 125 Nl me I am mules l 4K x9 bit 4K x2 hit 2K xzbil IZK x242quot gt 5 2 4K xzbit 2K xzbil 512 xzhll 16K m Enlnes 122123 18028 iza ulllllt d mime 32 754D izaiasb a miiiisd l sin 1 Mammy 31w 2 66685 z ices 1 5466 l 6683 1 OSCEs 3 2mm 39 MBs 1 CHIS l 4 age I immaa Hume cmoma ramu MANlt23 rm 2 iiiHa iisa lCPmizess l o isii 6M a mi 6M luzsii 2M l a 221 6m 0 laii SM a lap M a zsii 4M 29 6M I 01 me She 1lLiiiii i iiiin 4Iiiiiii msuiiir luciiiiii 111nm minim l3 iiviiquot zl T sl Ms 4 million 27 nlllllan I130 million I 23 mllllo 24 million 42 million 7 2 mllllon a a mllllnn l 29 Esl ml 051 160 so 5550 El m 539 ii S125 53970 owed 7 w 76W i saw I 36W 30W 55WTDP ZSW OW Avallablllly iQU I 4000 20m 4000 ZQD J mm 7000 300 l Max issue 4 instructions many CPUs Max rename registers 128 Pentium 4 Max BHT 4K x 9 Alpha 212643 16Kx2 Ultra III Max Window Size 000 126 intructions Pent 4 Max Pipeline 2224 stages Pentium 4 Source Microprocessor Report wwwMPEorlirlecom Page ltgt How to Improve CPI Instruction Level Parallelism and Its Dynamic Exploitation Venkatesll Akella EEC 270 Winter 2005 Based on Orginal Lecture Material from Prof David Cllller UC Berkeley amp Prof Al Davis University of Utah a Inordcr issue and Inordcr compktion is rather limited b Allow outoforder issue c Allow outofordur completion d Allow both e Multiple Issue 0 Basic block is small 47 instructions before we nit a control flow instruction g Look for ILP across basic blocks n Predict Branches a Speculation Ex Ioit instruction level parallelism HIV multiplt functional units to uxacutc morn than on instruction at thI sum tini chapter 3 ILP chapter 3 ILP Basic Problems to address Roadmap Detecting data hazards with larger window sixes becomes more complex Deal witn control hazards Deal witn outotorder execution Deal witn outotorder completion ts Static Let tne compiler figure these out and deal with tnem cnapter 4 Dynamic Let tne nardware figure it out cnapter s Both yes tne familiar codesign again Itanium approacn chapter 3 chapter 4 chapter 3 ILP chapter 3 ILP Page ltgt Instruction Level Parallelism ILP Data quot J and Hazards Basic Block BB ILP is quite small 7 as a straightline code sequence with no branches in except to the entry and no ranches out except at the exit 7 average dynamic branch frequency 15 to 25 gt 4 7 39nstructions execute between a pair of branches 7 Plus instructions in as likely to depend on each other To obtain substantial performance enhancements we must exploit ILP across multiple basic blocks Simplest looplevel parallelism to exploit parallelism among iterations of a loop 7 Vector is one way 7 If not vector then either dynamic via brmch prediction or static via loop unroll39ng by conpiler InstrJ is data dependent on InstrI InstrJ tries to read operand before InstrI writes it I add r1r2r3 a sub r4 r1 r3 or InstrJ is data dependent on InstrK which is dependent on InstrI Caused by a True Dependencequot compiler term If true dependence causes a hazard in the pipeline it is called a Read After Write RAW hazard chapter 3 ILP chapter 3 ILP Name Dependence 1 A 39 A A Data quot r J and Hazards Dependences are a property of programs Presence of dependence indicates potential for a hazard but actual hazard and length of any stall is a property of the pipeline Importance of the data dependencies 1 indicates the possibility of a hazard 2 determines order in which results must be calculated 3 sets an lpper bound on how much parallelism can possibly be exploited Name dependence when 2 instructions use same register or memory location called a name but no flow of data between the instructions associated with that name 2 versions of name dependence Instr writes operand be m Instrx reads it C1 sub r4r1r3 J add r1r2r3 K mul r6r1r7 Called an antidependencequot in compiler work This results from reuse of the name r1 It antidependence caused a hazard in the pipeline called a Write After Read WAR hazard chapter 3 ILP chapter 3 ILP Page ltgt Name Dependence 2 Memory Dataflow nib J J my r Instrg writes operand before InstrI writes it K mul r5r1r7 Called an output dependencequot by compiler writers This also results from the reuse of name r1 If anti dependence caused a hazard in the pipeline called a Write After Write WAW hazard When instructions communicate through memory Memory aliasing is a problem Eg store val 100r4 load r5 20r6 How do you know if 100r4 and 20r6 are the pointing to the same physical memory location or not Inpossible to determine statically if two pointers are pointing to the same physical memory location Dynamic memory disambiguation is needed which adds overhead runtime check Chapter 3 ILP Chapter 3 ILP ILP and Data Hazards Control quot I J program oralor oralor instructions would ammo in if oxocmd squcntially 1 at a that as determined by original source rograrl HWSW goal oxplon paralklism by prasarvirg appurancc of program oralor r modify order ill mam than cannot be obswad by program 7 must not affact tho outcome of the program Ex Instructions irvolml ir a ram dependence can lxlcu39l simunaraously if ram usad ir instructions is chungad instructions do not conflict 7 Rg39 quotnaming mow ram dogmam for Mgists 7 Elli by conpilor or by HW 2 r and r3 r2 1 Every instruction is control dependent on some set of branches and in general these control dependencies must e preserved to preserve program order if p1 sl if p2 52 Si is control dependent on p1 and 52 is control dependent on p2 but not on p1 Chapter 3 ILP Chapter 3 ILP Page ltgt Control Dependence CAN Ignored Exception Behavior Control dependence need not always be preserved willing to execute instructions that should rlot have been executed thereby violatirlg the control dependences if can do so without affecting correctness of the program 2 properties critical to program correctness are exception behavior and data flow reserving exception behavior gt a y changes in instruction execution order must rlot h rlge how exceptions are rais 39n program gt no new exceptions Example DADDU R2 Ra R4 BEQZ R2 L1 Lw R1 0 R2 L1 Problem with moving Lw before BEQZ Chapter 3 ILP Chapter 3 ILP Data Flow lJl Dynamic Data flow actual flow of data values among instructions that produce results and those that consume them 7 branches make flow dynamic determine which instruction is supplier of d to Example DD R1 R2 R3 BEQZ 4 L nsuau R1 R5 R6 L 0R R7 R1 R8 OR depends on DADDU or DSUBU Must preserve data flow on execution llandles cases when dependences unknown af compile fime e eg because they may inrolre a memory reference If simplifies fhe compiler Allows code fhaf compiled for one pipeline fo run efficiently on a differenf pipeline ll rdware specuiafion a fechniaue wifh significanf performance advanfages fhaf builds on dynamic scheduling Chapter 3 ILP Chapter 3 ILP Page ltgt HW Schemes Instruction Parallelism Dynamic quot L J quot Step 1 Key idea Allow instructions behind stall to proceed DI D r0r2r4 RDDD r10r0ra SUBD F12E EE 14 Enables out of order execution and allows out of order completion Will distinguish when an instruction begins execution and when it completes execution between those two times the instruction is in execution In a dynamically scheduled pipeline all instructions pass through issue stage in order in order issue Simple pipeline had 1 stage to check both structural and data hazards Instruction Decode ID also called Instruction Issu Split the ID pipe stage of simple 5 stage pipeline in stages Issue Decode instructions check for structural hazards Read operands Wait until no data hazards then read operands chapter 3 ILP chapter 3 ILP A Dynamic Algorithm Tomasulo39s Algorithm Tomasulo Alaorithm For IBM 36091 before caches Goal High Performance without special compilers Small number of floating point registers 4 in 360 prevented interesting compiler scheduling of operations 7 This led Tomasulo to try to figure out how to get more effective registers renaming in hat ware Why Study 1966 Computer The descendants of this have flourished 7 Alpha 21264 HP a000 MIPS 10000 Pentium 111 PowerPC 604 Control di buffers distributed with Function Units FU Fu buffers oaiied reservation stationsquot have pending operands Registers in instructions replaced by values or pointers to reservation sta ionsRS 7 form of register renaming avoids WAR WAW hazards o e reservation stations than registers so can do optimizations compilers can39t Results to FU from RS not through registers over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well Integer instructions can go past branches allowing FP ops beyond basic block in PP queue chapter 3 ILP chapter 3 ILP Page ltgt woo Que us Load Buffers stations Reservation Station Corn ponenfs Op Opuration to punform in quotis unit g or v5 Vk Valuc of Soilcc o Irands a buffcrs has V ficld Mail to b stard Qj Qk IlsaVatican stations producing soilcc rugisttrs nus to b writttn 7 Note 39Qko gt r 7 Star bufflrs only ium Qi for IE producing mun Busy Indicat rascrvation station on FU is busy Rugistur r sult status Indicatus which functional unit will writ uach rugisttr if on Ixists Blank whun no Funding instructions that will wit that rugisttr Chapter 3 ILP i both ooonmds noody ii mom if not mdy watch Cannon Data Bus for quotsuit Write result finish sxscuvion WB Wrm on Common Data Bus to all auditing units mark r 39abk Squot lsuvation smion avali Normal data bus data datination go toquot b 4t Common d a bus data source a s fromquot bus 7 64 b39 of data 4 bits of Funct a Unit scum oddrss 7 Writ if nutchcs 7 Dogs til broaden Exninpi spud s clocks for Fl or 10 for 39 40 clks for oxoomd Functional Unit onoduoos mun si Chapter 3 ILP Three Stages of Tomasulo Algorithm T m SUI Example 1 Issue gut instruction from F Op Qusus mmquot mm Em We If MsIvation smion no no structural hazard lnsirumnn k 155m Cam Resuh control issuc insin amp sonds opIands nsnanss ragistIs LD 2 Regtster resultstam c1 m 52 F4 F6 FEFIO F12 F30 FUI L Chapter 3 ILP Page ltgt Tomasulo Example Cycle 1 Tomasulo Example Cycle 2 Instructtan 52am Exec Wme mm Cam Result Ragtster resultstams Clock F8 F10 F12 F30 Instmcttanstaas Exec ng mu m J 55149 CamRexu LD F6 34 F2 45 1 F0 F2 mm F2 5 own no Fa ADDD F6 F2 Mm Regtster rem 52am Clock F F8 F10 F12 F30 Lmn Note Can have multiple loads outstandina Chapter ILP Chapter ILP Tomasulo Example Cycle 3 5m Wme g Cm Rexu Register rem 52am Clock 32 F2 F6 F8 F10 F12 F30 3 Mum me me Mm rugis39trs Hum m Mmovad Venumcd39 in mumquot smions MULT issued m m Tamasula Example Cycle 4 IrLsthttan mas mu Clock F0 F2 F4 F6 F8 F10 F12 F30 FU LoadZ completing what is waiting for LoadZ u r Chapter ILP Chapter ILP Page ltgt Tnmnsuln Example Cycle 5 Tnmnsuln Example Cycle 6 Instruction 52am Ragtster resultstams Clock F0 F2 F4 F6 F8 F10 F12 F30 FE Mum Mm WA 1 u Multl Timer starts down for Addi Mulfi Insmmzmn mas mu m LD a v Vk k 1mm Adel Yes ADDD MAI Adm A663 a 9Mnm Yes MLYLTD NKAZ RF4 Mum Yes am A Mum Register rem slam c1 F0 F2 F4 F6 F8 F10 F12 F30 Mum NKAZ Advil Kddl Mum Issue ADDD here despite name dependency on F6 Chapter 3 ILP Chapter 3 ILP Tnmnsuln Example Cycle 7 Tnmnsuln Example Cycle 8 Instruction 52am nsxu Na MUL39KD Mmz Rm Yes DND Al Mn 0 F F4 F6 F8 F10 F12 F30 Register resulzstams Clock F 2 m Mum MAI AddZ Add Mum Addi SUBD completing what is waiting for if Insmmzmn mas mm MULTD NKAZ Rm Yes DIVE Al Mm Register rem 52am Clock F0 F2 F4 F6 F8 F10 F12 F30 F0 Chapter 3 ILP Chapter 3 ILP Page ltgt Tnmnsuln Example Cycle 9 Tnmnsuln Example Cycle 10 Instruction 52am nsxu Instructtan mas mu m LD Na ADDD 1Mer NKAZ MULTD NKAZ Rm Yes am A Mu Register result 52am Register result 52am Clock F0 F2 F4 F6 F8 F10 F12 F30 Clock F0 F2 F4 F6 F8 F10 F12 F30 m m m AddZ ADDD completing what is waiting for if Chapter ILP Chapter ILP Tnmnsuln Example Cycle 11 Tnmnsulo Example Cycle 12 Instruction 52am mm Insmmzmn mas mm m Na MULTD NKAz Yes DIVE Register rem 52am Register rem 52am Clock F0 F2 F4 F6 F8 F10 F12 F30 Clock F0 F2 F4 F6 F8 F10 F12 F30 W M 12 F0 Mum MAI 39MMW KMM NW1 Write resqu of ADDD here Emigll gylg lpsfmc ors complete In this cycle chapter 3 m Page ltgt Tnmnsuln Example Cycle 13 Tnmnsuln Example Cycle 14 Instruction 52am m n 1mm Ragtster resultstams Clock F0 F2 F4 F6 F8 F10 F12 F30 Instructtan mas mm m LD MUL39KD 51151 i 3 rs om F10 ro ADDD F6 F2 Na MULTD NKAZ Rm 1 s DIVD Al Mu Register rem 52am Clock F2 F8 F10 F12 F30 F0 F4 F6 F0 Mum NKAZ MiamiMM Mum Chapter 3 ILP Chapter 3 ILP Tnmnsuln Example Cycle 15 Instruction 52am ane nsxu 1mm Register resulzstams Clock F0 F2 F4 F6 F8 F10 F12 F30 F0 Mulfi MULTD completing who is waiting for if Insmmzmn mas Exec ng mu mu F8 F10 F12 Mm J us waiting for Mule DIVD to conplefe F30 Chapter 3 ILP Chapter 3 ILP Page ltgt Tamasula Example Cycle 55 Insmmzmn mas Exec ng mu m c m Rexu LD lm Faster than light compufafion 33 skip a couple of cycles Mm Register r2341 52am Clock F0 F2 F4 F6 F8 F10 F12 F30 m Chapter ILP Chapter ILP Tamasula Example Cycle 56 Tamasula Example Cycle 57 Instructtan 52am IrLsthttan mas nsxu mu Mm Register r2341 52am Register r2341 52am Clock F0 F2 F4 F6 F8 F10 F12 F30 Clock F0 F2 F4 F6 F8 F10 F12 F30 F0 56 W M NW 39MVWM39MVM Ream Once again IIIorder issue outofordu cxuu39ion and out Mule DIVD IS completing what IS wamng for If ofordu complainquot Chapter39 3 ILP Chapter 3 ILP Page ltgt Tomasulo Drawbacks Tomasulo Loop Example Complexity 7 delays of 36091 MIPS 10000 Alpln 2 PPC 620 in CAAQA 2e but Many associative stores CDB at high speed Performance limited by Common Data Bus 7 Each CDB must go to multiple functional units ahigh capacitance high wiring density 7 Number of limited to o 1254 not in silicon functional units that can complete per cycle nel Multiple CD85 a more FU logic for parallel assoc stores Non precise interrupts e We will address this later Loop LD F0 0 R1 MULTD F4 F0 F2 51 F4 0 R1 5031 R1 R1 8 BNEZ R1 Loop This time assune Multiply takes 4 clocks Assume ist load takes 8 clocks L1 cache miss 2nd load takes 1 clock hit To be clear will show clocks for SUBI BNEZ 7 Reality integer instructions ahead of Fl Pt Instructions Show 2 iterations chapter 3 ILP chapter 3 ILP Loop Example Loop Example Cycle 1 Instruction swims Exgg ng Instmctxan stains Exec Write TE instruction J k 5142 Compkcmu I instruction J k 5142 Compkemu I ll F0 0 El l LD Fl U R 1 t mm to n n quotfquot 1 n Rl anon 1 MULID F4 F0 Ft Count 1 r U Rl Rcwmtmn Stanans SI 52 RS39 ReservatxanStanans 52 3539 m Mm an W e m NW 131 Addl 0 Adam F0 AddK so a Multl sum Rl Rl as nit Na BNEZ Rl Loop Registcncsultsmtus Instruction Loop Jock in F0 F2 F4 F6 F8 F10 F12 F30 0 so chapter 3 ILP chapter 3 ILP Page ltgt Loop Example Cycle 2 Loop Example Cycle 3 Wrzze 11150112111211 5m 11 Em ITER 1mm Camp Rama 1 z Instrucan J k 1 111 11 1 R1 1 1111111 H 11 1 2 Reservazmnsmms 51 a R9 w N W 11151111511011 stams Exec ane If structmn J Ca 11 1 1 LD r11 11 r 1 1 111111 11 1 11 1 1 gt1 1 11 F 1 Reservatum Stations Tm 511 No Yes Mum F0 F7 3 so F11 1111 F4 F6 F8 F10 F12 F30 111111 Chapter ILP ml llClY renumin sers u mm mm r39u n Chapterg ILP g P g P Loop Example Cycle 4 Loop Example Cycle 5 11150115111211 shuns Em Wm ITER Instrucan J k Hue CampRmu 1 11 m R1 1 1 MULID W1 R11 n 2 1 3n 1 R1 RewrvatmnSmtmns s1 9 as W N 131 V k cm LD F0 0 R1 11111111 F4 F0 F2 so FA 0 R1 SL131 R1 R1 3134 1111 41 1mm R1 11 Regxster quot211111 501111 Jack R F0 F2 F4 F6 F8 F10 F12 F30 1 so F11 11111 111111 11151111211011 5m ane m5 Exec R lnstmctmn Hue CampRmu 1 1 J k 1 1 1 1111111 174 um z 2 1 31 174 11 1 3 ReservatmnStanans s1 52 Rs Tm 131 0 R1 F0 F2 4 11 R1 11111 arm111131 R1 we M11112 Loop Regxder 129111 smms ac R F0 F2 F4 F6 F8 F10 F12 F30 Lum M11111 Chapter uisguvcning sum msrr39ucrion 7107 in rr queue ILP Ana DNtL insYr39ucrion mm in rr queue Chapter ILP Page ltgt Loop Example Cycle 6 Loop Example Cycle 7 11150115111111 5mm Em Wm ITER 11151711151111 J k Hue CampREmH 1 111 m 1 R1 1 r11 1 2 R1 3 R1 5 RIFJ MR1 Mm fhnf Fn never no land fmn lnrnfum an lnstmctwn stams Exec ane lnstmctmn J k Hue Campkmu 1 LD 11 11 1 1 F2 FA 12 R1 R1 R1 we R1 Loop F6 F8 F10 F12 Rugis39u file complmly detached from computation Chapter ILP ChLP Feirsg un PSacond iteration my uvu39upplu Loop Example Cycle 8 Loop Example Cycle 9 Insmmum shuns Em Wm ITER nslxuctmn J k Hue Camp R2qu 1 m R1 1 1 1 111 m 2 11 p 1 3 11 R1 5 N n 7 11 R1 2 V W n k cm 0 R1 F0 F2 NR 12 R1 19 111111111 Rm W111 R1 we Yes 11111111 RrleLnadZ Loop Regxster quot2511 5mm Jack R F0 F2 F4 F6 F8 F10 F12 F30 72 F LoadZ M11112 Inszmczmn stams39 Exec ane Rem lnstmctmn J Hue Cam 1 1 1 1 MEL m m Irn gt73 i1 m 1 39 LD Pb 11 R1 1 mm 11 F0 F2 2 51 FA 0 R1 Reservatmn Stations 5 52 R9 2m NW 13 V W 1 am Add a 0 R1 A112 1111111 F4 F0 F2 AddK N1 5 1 11111 1111111 mam 511131 R R1 wagg 17072 an m2 R1 mp M11112 Regxder 1292115121111 Jack R F0 F2 F4 F6 F8 F10 F12 F30 9 72 F Loudl compll39 w o is waitinu Chapter ILP in h chhpm39sgtjsruvching SUBI Page ltgt Loop Example Cycle 10 Insvuctum 5mm Em Wm ITER lnslxuctmn J k 1mm CampRmu l LD 0 n m 1 9 10 l mu H F r2 2 l so u Rl 3 2 10 F0 0 Rl a 2 F2 7 2 N gt2 0 RI F0 F2 0 Rl R we Loop Register result smtus 1 no In F0 F7 F4 F6 F8 F10 F12 LoudZ complaint who is waitinu Loop Example Cycle 11 E Instruction 5mm 5 R Instructan l m lIUL n m n Cl pm39SLIhiE39u39chivg BNEZ Next land in sequence Chapter 3 ILP Loop Example Cycle 12 Loop Example Cycle 13 InsmActum smzus39 Em Wrzze ITER 1 Re u L nslxuctmn J D m Add No 2 m Mum Yes k in Register result 5mm Jack R F0 F2 F4 F6 F8 F10 F12 12 m F LoadK Munl Why not issue third multiply Instruction 5mm 5 R Instructan J l U FD No 1 Mum Velt Ammmml 2 MM res Mulwl Mm Regxder result 5mm Jack R F0 F2 F4 F6 F8 F10 F12 3 m F LoadK Mum Whv not issue third stare Chapter ILP Chapter ILP Page ltgt Loop Example Cycle 14 Em Wrzze Insmmmns z 5 ITER nslxuctmn J k 1mm CampREmH LD an m 1 m Loop Example Cycle 15 E Instructh status If structan J L m 11 NHL 15 7 Fl Fd w LD Fu 0 MUL 11 F4 Fa s F o A Rewmnan Smtnms Reservation Stations w N s n V W a k cm Tm N Bu in Add F0 F2 Q Addz F4 0 m AddK MX m w R1 83 Mum Nu anw y 0 MM m Regxder result 5mm F0 F2 F4 F6 F8 F10 F12 F30 Jack in F4 F6 F8 F10 F12 F30 u 6 Fu LnadK Mum 15 6 F LnadK Mum Multl completing Who is waifim Mulf2 campletim Who is waiting Chapter39 3 ILP Chapter 3 ILP Loop Example Cycle 16 Loop Example Cycle 17 Insmmm 5m us E r112 Instmctxansta 5 5 mg 1 Instruction J c u ITER instructan J Result 1 u m u l Ll F0 n m l was a wn m m n 1 LD F0 2 MUL in F4 F9 1 1 F4 Reservame Smtnms Wm Mm Add Addl Na Register result 5mm Jack R No Yes Mum RF2 LnadK we m2 an3 BNEZ RA lWP Regxder result 501015 F F2 F4 F6 F8 F10 F12 F30 Jack in F0 F2 F4 F6 F8 F10 F12 F30 16 N F n m F max Mum Chapter ILP Chapter ILP Page ltgt Loop Example Cycle 18 Insvuctum smzus Em Wrzzc ITER lnslxucuun J Cam Re an MA an RF2 LoadK F0 F7 F4 8 so Fu LoadK Mum F6 F8 F10 F12 Instruction st m Loop Example Cycle 19 E s slmcltun N l R R1 Loop F0 F7 F4 F6 F8 F10 19 56 Fu LoadK Mum cltapter 3 ILP chapter 3 ILP Loop Example Cycle 20 Why can Tomasulo overlap iterations n I InsmActum smias Em Wnla ITER Instruction J ompmu i u m i mum on Wll i m m r 1 LD F0 u 1 MULID 54 Fl 1 an F4 u Reservation Smtnms lime NW at RF2Lnau13 F0 P7 F4 F6 F8 F10 F12 20 56 Fu Load Mum r issue out of order execution Register renaming 7 Multiple iterations use different physical destinations for registers dynamic loop unrolling Reservation stations 7 Permit instruction issue to advance past integer control flow operations 7 Also buffer old values of registers totally avoiding the w stall that we saw in the scoreboard Other perspective Tomasulo building data flow f dependency graph on the ly Once agai In orde Ch d gutIlgf order completion chapter 3 ILP Page ltgt Tomasulo39s scheme offers 2 major A A What about Precise Interrupts 1 the distribution of the hazard detection logic 7 distributed reservation stations and the CD8 7 If multiple instructions waiting on single result a each instruction as other operand then instructions can be released simultaneously by broadcast on CD8 a centralized reglter file were used the units would have a r ad their results from the registers when re i r buses are available 2 the elimination of stalls for WAW and WAR hazards State of machine looks as if no instruction beyond faulting instructions has issue Tomasqu had In order issue out of order execution and out of order completion Need to quotfixquot the out of order completion aspect so that we can find precise breakpoint in instruction stream chapter 3 ILP chapter 3 ILP 39 between precise interrupts HW support for precise interrupts and speculation Speculation guess and check Important for branch prediction 7 Need to take our best If we speculate and are wrong need to back up and restart execution to point at which we predicted inoorrectl 7 This is exactly same as precise exceptions Technique for both precise interriptsexoeptions and speculation i DMEI39 completion or commit shotquot at predicting branch direction Need HW buffer for results of uncommitted instructions reorder buffer 7 3 fields instr destination value 7 Use reorder buffer number instead of der buffer c n be operand source gt more registers like RS Res 5 Instructions commit F39 Add 7 As a result easy to undo speculated instructions on mispredicted branches ns tatlu FF39 Adder o reg39ster39 chapter 3 ILP or exception chapter 3 ILP Page ltgt Four Sfeps of Speculafive Tomasulo u nigur mm 11ssue get instruction from F Queue If rscwation station and reorder bumr siot roe issue instr d sd opemnds d MordI buffer no for destination this stage somtimc calld dis men 2 Execution operate on operands When both opIands roady then execute if not Mody watch CDB for result when both in momtion station execute chucks RAW ssul sWrite result finish execution ws Write on Common an s to all awaiting FUs d MordI buffI new resmtion station avoiiobie 460mmit updu39l Mgis39u with Mordu result wnen instr at hood of reorder buffI d res oment update register witi quotsuit or store to memory aid remove instr from world buffer Mispiedimd brunch ushes Mord buffer sonetmes calld grurination What are the hardware complexities with reorder buffer RUB nest neg HASH Exceptions Valid Program Counter Reorder Table FF39 Adder FF39 Adder How do you find the latest version of a register s As specified by Smith paper need associative comparison nc39wolk I I i ich spcific Mord buffer Inas received tine value u i o on o quotI chapter 3 ILP chapter 3 ILP Summary neservations stations iMficir register mowing to larger set of registers buffering source operands s Prevents registers as bottleneck 7 Avoids WAR WAW azards of Scoreboard e Allows loop iinrolling in HW Not limited to basic blocks onteger units gets ahead beyond branches Today helps oaeiie misses as well 7 Don39t stall for L1 Dom cache miss nsifficicnt ILP for L2 miss s Register renaming s Loadstore disambiguation 86091 dcsctndunis art Pan39ium III PawnPC 604 MIPS R10000 HPPA 3000 4th 21264 chapter 3 ILP Page ltgt Project Topics Computer Architecture Venkatesh Akella EEC 270 Winter 2005 1 Trading Rtllablllty for Enugy Voltage avsnscaiing in u39u Casias sspsciany in madia applications P A plication Spacich Instruction Proussor for a givpn domain s39nsainingpaamixag39 n N 3 an 3 v E a n 5839 a o n 30 s v v o 1 n a multimedia or massags passing algorithms decoding focus on programmablt memory unit and smalkr bitwdith operations 4 Network pmcsssons for multimadia mn winsisss ad hoc nsmanks LDPC accsss Proccssovs for Scnsor Networks Low bansivy Parity Oicck Codss Pnognaininabis Ancnimmss o Chapter 2 7 Simultaneous Multithmding and dynamic Msourct munugcmcnv Chapter 2 Project Topics Emblddld Proclssors s ASIP r ASLP Voltage spaculation to say anangy r Dvmclockin cocks 7 Reliability vs lnIgy Multithnading for dynamic Msourcu managtmtn39 Low Density Parity Chuck owns 7 High Spud innrconncction n works r HWSW Codosign Networking Pnocsssons s Sansons s Multimdio Paclm Schduling Instruction Set Design Principles and xam les XECquOVI me IC Dynamic Instruction Count Instruction Set influences IC CPI Desktop IntFP poweroodesize not important Server Integer No FP string manipulation imp Embedded Cost Power Real time smaller bitwdith Chapter 2 Chapter 2 Page ltgt Overview of the Chapter Anatomy of an Instruction What are the alternatives and tradeoffs Taxonomy of ISA and quantitative assessment ISA of embedded and DSP processors Role of Compilers and Highlevel Languages TriMedia and MIP564 Cases Study Operation Arithmetic Logical Control Flow Procedure Call Operands Type of operands bit byte char string float int Addressing the operands Addressing modes Representation in the memory encoding Fixed vs van39able Aligned vs unaligned Lilliputian Wars Compressed vs Uncompressed Chapter 2 Tm39l39u e M eiqlalgnalg iiIn l Chapter 2 Classifying Instruction Set Architectures Why did Register Register ISA survive Ell icienl code is dillicull lo Registers are faster than memory Take advantage of principle of locality 39 Flexibility Consider the expression AB 3C AD There are several ways of evaluating this on a RR machine but on a stack based machine it is restricted to one order Code density registers can be specified with fewer bits than memory addresses Reduces memory traffic locality Amenable for automatic compilation Chapter 2 Chapter 2 Page ltgt Addressing Modes Note the relationship to high level constructs More Addressing Modes Mndr Fxnmplr lnslrurlinn Manning I size of an element Chapter 2 Use of Memory Addressing Modes X86 Measurements TEX 15quot Mammy indium wine x80 inslrurlinn not W instructions 0 5mm spice h Mme 8 T Reglnarlndlmx a 13 an 11 Tax 43 17 um 39 Tax 32 Displacement wlm gm 40 0 10 M 50 m 50 80 menwmmmmmnmm 95 Vax Measurements based on SPECBQ benchmarks small regisfer file bloafs number of loads and tnrwc Chapter 2 Chapter 2 Page ltgt Mix of Instructions on TI DSP C54x Interesting Questions Instruction Percent Store mem16 322 Load mem16 94 Add m em 16 68 CALL 5 Push mem 16 5 Subtract mem 16 49 Move memmem16 40 MAC 4 6 V V How many bits for the immediate field How many registers do I need What addressing modes to support What operations to support Amadahl39s law Make the common case fast Can the compiler use it Don39t forget ultimately Execution Time matters so aways ook at the impact on IC CPI and Tc Measure on Real Benchmarks Chapter 2 Chapter 2 Role of a Compiler Anatomy of a compiler Function Toda ma orit of re rammin is in hi h level Vanguagsdeuenuemn Wiende Wis39wawmw h at common IrilEiri39iedi39ils larm languages mpen em anguw H 7e772d3Li So the Instruction set should be amenable for a quot quotquot Somewhat ianguage dependeni H ghrlevel Fm example lucp compller as a fargef aigelymachmeindeperidenl ggsgggneam r ijnzna 7 7 7 lalsacalled Case in pomt DSP micro controllers are not WMWEM EQ39E Dquot which makes software development a nightmare i223li li i s EJI LEiS nZTKZSilZf egegiswmnwesi mm Remember architecture is a codesugn issue so a smart compiler can help if the architecture llghly machine dependent Detailed lnslmminn seimian exposes some aspects gamma anfmacnnineraepenae n W W may u a ID ill 1 mm r Eg instruction scheduling to avoid pipeline stalls W m 5 hide long latency of memory use the registers beuHer avoid recompufafl on code Size ZDDZEIseiieiSciencelUSAAllrigtilsreserved optimization cache optimization Chapter 2 Chapter 2 Page ltgt


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Jim McGreen Ohio University

"Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

Allison Fischer University of Alabama

"I signed up to be an Elite Notetaker with 2 of my sorority sisters this semester. We just posted our notes weekly and were each making over $600 per month. I LOVE StudySoup!"

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"


"Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.