Computer Architecture COSC 6385
These class notes (187 pages) for COSC 6385 at the University of Houston, taught by Edgar Gabriel, were uploaded by Lowell Harris on September 19, 2015.
COSC 6385 Computer Architecture
Instruction Set Architectures
Edgar Gabriel, Fall 2008

Instruction Set Architecture (ISA)
Definition (Wikipedia): the part of computer architecture related to programming. It defines the set of operations, the instruction format, the hardware-supported data types, named storage, addressing modes, and sequencing. It includes a specification of the set of opcodes (machine language), the native commands implemented by a particular CPU design.

Instruction Set Architecture (II)
The ISA is to be distinguished from the microarchitecture, the set of processor design techniques used to implement an ISA. Example: the Intel Pentium and the AMD Athlon support a nearly identical ISA but have completely different microarchitectures.

ISA Specification
Relevant features for distinguishing ISAs:
- Internal storage
- Memory addressing
- Type and size of operands
- Operations
- Instructions for flow control
- Encoding of the instruction set

Internal storage
- Stack architecture: operands are implicitly on the top of the stack.
- Accumulator architecture: one operand is implicitly the accumulator.
- General-purpose register architectures: operands have to be made available by explicit load operations. This is the dominant form in today's systems.

Internal storage (II)
Example: C = A + B

  Stack      Accumulator    Load/Store
  Push A     Load A         Load R1, A
  Push B     Add B          Load R2, B
  Add        Store C        Add R3, R1, R2
  Pop C                     Store R3, C

Internal storage (III)
Advantages of general-purpose register architectures vs. stack architectures:
- Registers are fast.
- It is easier and faster to evaluate complex expressions, e.g. (A*B) - (B*C) - (A*D).
- Registers can hold temporary variables, which reduces memory traffic.
- A register can be named with fewer bits than a main-memory location.

Addressing modes
How does an ISA specify the address of an object it will
access?

  Addressing mode     Example instruction   Meaning
  Register            Add R4, R3            Regs[R4] <- Regs[R4] + Regs[R3]
  Immediate           Add R4, #3            Regs[R4] <- Regs[R4] + 3
  Register indirect   Add R4, (R1)          Regs[R4] <- Regs[R4] + Mem[Regs[R1]]
  Displacement        Add R4, 100(R1)       Regs[R4] <- Regs[R4] + Mem[100 + Regs[R1]]
  Memory indirect     Add R4, @(R3)         Regs[R4] <- Regs[R4] + Mem[Mem[Regs[R3]]]

Addressing modes (II)
Addressing modes must match the ability of compilers to use them and the hardware characteristics. The most commonly used modes are displacement, immediate, and register indirect. The size of the address for displacement mode is typically 12-16 bits; the size of the immediate field is 8-16 bits.

Internal storage (IV)
Two major GPR architectures: 2 or 3 operands for ALU instructions.
- 3 operands: 2 source operands, 1 result
- 2 operands: 1 operand is both source and result
How many operands can be memory addresses?

  No. of memory addresses   Max. no. of operands   Architecture
  0                         3                      Register-register (load/store)
  1                         2                      Register-memory
  2                         2                      Memory-memory
  3                         3                      Memory-memory

Memory alignment
Memory is typically aligned on a multiple of word boundaries. Best case: accessing a misaligned address leads to performance problems, since it requires accessing multiple words. Worst case: the hardware does not allow misaligned access at all.

Type and size of operands
How is the type of an operand designated? Either encoded in the opcode or annotated by tags. Common operand types:
- Character: 8 bits
- Half word: 16 bits (2 bytes)
- Word: 32 bits (4 bytes)
- Single-precision floating point: 32 bits (4 bytes)
- Double-precision floating point: 64 bits (8 bytes)

Type and size of operands (II)
Characters are encoded as ASCII or Unicode; integers as two's-complement binary numbers; floating-point numbers according to IEEE standard 754. There is no uniform representation of the data type long double.
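The alignment rule above can be checked with a bitmask, since word sizes are powers of two. A minimal sketch; `is_aligned` is a hypothetical helper, not part of the notes:

```c
#include <stdint.h>

/* An address is aligned to a power-of-two word size exactly when its
   low-order bits are zero; a misaligned access straddles a word
   boundary and is slow or illegal. Hypothetical helper for illustration. */
int is_aligned(uintptr_t addr, uintptr_t word_size) {
    return (addr & (word_size - 1)) == 0;
}
```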
Operations in the Instruction Set

  Operator type            Examples
  Arithmetic and logical   integer arithmetic: add, subtract, and, or, multiply, divide
  Data transfer            load, store, move
  Control                  branch, jump, procedure call, return, traps
  System                   OS call, virtual memory management
  Floating point           floating-point add, multiply, divide, compare
  Decimal                  decimal add, multiply
  String                   string move, string compare, string search
  Graphics                 pixel and vertex operations, compression

Flow control instructions
Four types of control flow changes: conditional branches, jumps, procedure calls, and procedure returns. How to specify the destination address of a flow-control instruction:
- PC-relative: a displacement relative to the program counter (PC)
- Register indirect: name a register containing the target address

Flow control instructions (II)
Register-indirect jumps are also required for:
- case/switch statements
- virtual functions in C++
- function pointers in C
- dynamically shared libraries (.dll in Windows, .so in UNIX)
Procedure invocation: global variables could be accessed by multiple routines, so the location of each variable needs to be known. Options for saving registers: caller saving or callee saving. Due to the possibility of separate compilation, many compilers store any global variable that may be accessed during a call.

Encoding an Instruction Set
How are instructions encoded into a binary representation? The encoding affects the size of the compiled program and the implementation of the processor, and the decision depends on the range of addressing modes supported. With variable encoding (operation and number of operands, followed by address-specifier/address-field pairs), individual instructions can vary widely in size and in the amount of work to be performed. Fixed encoding is easy to decode.

Example architecture: MIPS64 (I)
- Load/store architecture
- 32 64-bit GPR registers R0-R31; R0 always contains 0
- 32 64-bit
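The C constructs listed above (switch statements, function pointers) are what compilers lower to register-indirect jumps: a target address is loaded from a table into a register and jumped to. A minimal sketch with a hypothetical dispatch table:

```c
/* Dispatch through a function-pointer table: the compiler implements the
   indirect call as a register-indirect jump. Names are illustrative. */
typedef int (*op_fn)(int, int);

static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }

static op_fn jump_table[] = { op_add, op_sub };

int dispatch(int opcode, int a, int b) {
    return jump_table[opcode](a, b);  /* indirect call through a register */
}
```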
floating-point registers F0-F31. When using 32-bit floating-point numbers, the other 32 bits of the FP registers are not used; alternatively, instructions are available for performing two 32-bit FP operations on a single 64-bit register. Data types: 8-bit bytes, 16-bit half words, 32-bit words, 64-bit double words, and 32-bit and 64-bit floating-point data types.

Example architecture: MIPS64 (II)
Addressing modes: immediate and displacement. The displacement field and the immediate field are both 16 bits wide. Register-indirect addressing is accomplished by using 0 in the displacement field; absolute addressing is accomplished by using R0 as the base register.

Example architecture: MIPS64 (III)
All instructions are 32 bits wide (fixed encoding) with a 6-bit opcode; the addressing modes are encoded in the opcode.

  LD R1, 30(R2)   (load double word):  Regs[R1] <-64 Mem[30 + Regs[R2]]
  LW R1, 60(R2)   (load word):         Regs[R1] <-64 (Mem[60 + Regs[R2]]_0)^32 ## Mem[60 + Regs[R2]]

Notation:
- <-n: load n bits
- Regs[R2]_d: bit-field selection, e.g. Regs[R2]_0 is the sign bit of R2 and Regs[R2]_56..63 is the last byte of R2
- (x)^n: replicate the bit field x n times, e.g. (0)^24 sets the high-order three bytes to 0
- ##: concatenate two fields

Example architecture: MIPS64 (IV)
Thus Regs[R1] <-64 (Mem[60 + Regs[R2]]_0)^32 ## Mem[60 + Regs[R2]] replicates the sign bit of the word at memory address 60 + Regs[R2] into the first 32 bits of Regs[R1].

COSC 6385 Computer Architecture
Instruction Level Parallelism with Software Approaches
Edgar Gabriel, Fall 2006

Regarding the homework
The two machines for code development and for running the benchmarks are up and running.
- Machine for your code development: papi01.cs.uh.edu, available through ssh from within the CS department. The papi headers and libraries are available in /usr/local/include and /usr/local/lib, respectively.
- Please note: the machines have 512 MB memory, so do not overextend papi01 more than necessary.
People needing
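The LW semantics above are ordinary sign extension. A sketch in C; the function name is mine, not from the notes:

```c
#include <stdint.h>

/* LW-style sign extension: replicate the sign bit (bit 0 in the notes'
   big-endian bit numbering) of a 32-bit word across the upper 32 bits
   of a 64-bit register. C's widening of int32_t does exactly this. */
int64_t lw_sign_extend(uint32_t word) {
    return (int64_t)(int32_t)word;
}
```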
an account should stop by my office starting from tomorrow.

Regarding the homework (II)
For executing your benchmark, once the code is correct, use papi02.cs.uh.edu. It shares the home directory with papi01, so you do not have to copy your files/executables. Access is disabled by default: you can not log in. To execute your benchmark on papi02, you have to type from papi01:

  gabriel@papi01> srun_batch <myexec>

This submits your job into the batch queue; once your job is at the head of the queue, it will be executed. It is guaranteed that two jobs can not interfere. Maximum job length limit: 30 min.

Limitations of ILP in hardware
Limits of dynamic ILP:
- Need to be able to look arbitrarily far ahead to find instructions to issue
- Predict all branches perfectly
- Rename all registers to avoid WAR and WAW hazards
- Avoid data dependencies among instructions in an issue packet
- Handle memory dependencies among issuing instructions
- Provide enough replicated functional units to issue all instructions which are ready

Limitations of ILP in hardware (II)
- Finite number of registers, both for a single instruction stream and when considering thread-level parallelism
- Imperfect data-dependence analysis, depending on whether the data is located on the stack, on the heap, or is statically declared

Structure of modern compilers
- Front end (per language): transforms the language into a common intermediate form. Language dependent, machine independent.
- High-level optimizations: loop transformations, procedure inlining, etc. Somewhat language dependent, mostly machine independent.
- Global optimizer: local and global optimizations, register allocation, etc. Small language dependencies, minor machine dependencies.
- Code generator: machine-dependent optimizations, instruction selection. Language independent, strongly machine dependent.

Loops: an example
  for (i = 1000; i > 0; i--)
      x[i] = x[i] + s;

  Loop:  LD      F0, 0(R1)
         ADDD    F4, F0, F2
         SD      F4, 0(R1)
         DADDUI  R1, R1, #-8
         BNE     R1, R2, Loop

Assumptions

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU                         FP ALU                     3
  FP ALU                         Store                      2
  Load FP                        FP ALU                     1
  Load FP                        Store                      0

plus a 1-cycle branch delay.

An example (III)
With these latencies, the unscheduled loop takes 9 clock cycles per iteration: one stall after the LD, two stalls after the ADDD, one stall after the DADDUI, and one after the BNE.

Rescheduling the code

  Loop:  LD      F0, 0(R1)      1
         DADDUI  R1, R1, #-8    2
         ADDD    F4, F0, F2     3
         (stall)                4
         BNE     R1, R2, Loop   5
         SD      F4, 8(R1)      6   (delayed branch slot)

Each loop iteration consists of 3 instructions of actual work (load, add, store) and 3 cycles of loop overhead.

Loop unrolling

  Loop:  LD      F0, 0(R1)
         ADDD    F4, F0, F2
         SD      F4, 0(R1)
         LD      F6, -8(R1)
         ADDD    F8, F6, F2
         SD      F8, -8(R1)       (dropped DADDUI & BNE)
         LD      F10, -16(R1)
         ADDD    F12, F10, F2
         SD      F12, -16(R1)     (dropped DADDUI & BNE)
         LD      F14, -24(R1)
         ADDD    F16, F14, F2
         SD      F16, -24(R1)     (dropped DADDUI & BNE)
         DADDUI  R1, R1, #-32
         BNE     R1, R2, Loop

Loop unrolling (II)
- Eliminates three branches and three decrements, reducing the loop overhead.
- The previous code sequence still contains many stalls, since many operations are still dependent on each other.
- Requires a large number of registers.
- Increases the code size.
For general application of loop unrolling, use two consecutive loops: the first executes n mod k times (not unrolled), the second executes n/k times (unrolled), where n is the number of loop iterations and k the unroll depth.

Scheduled version of the unrolled loop

  Loop:  LD      F0, 0(R1)
         LD      F6, -8(R1)
         LD      F10, -16(R1)
         LD      F14, -24(R1)
         ADDD    F4, F0, F2
         ADDD    F8, F6, F2
         ADDD    F12, F10, F2
         ADDD    F16, F14, F2
         SD      F4, 0(R1)
         SD      F8, -8(R1)
         DADDUI  R1, R1, #-32
         SD      F12, 16(R1)
         BNE     R1, R2, Loop
         SD      F16, 8(R1)

Scheduled version of the unrolled loop (II)
Non-trivial transformations in the previous code: adjust the offsets of the SD instructions; determine
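The n mod k / n/k scheme for general loop unrolling can be written directly in C. A sketch; `scale_add` is an illustrative name, not from the notes:

```c
/* Loop unrolling in C with unroll depth k = 4: a prologue loop runs
   n mod k iterations, then the unrolled loop runs n/k times. */
void scale_add(double *x, double s, int n) {
    int i = 0;
    for (; i < n % 4; i++)       /* first loop: n mod k iterations */
        x[i] = x[i] + s;
    for (; i < n; i += 4) {      /* second loop: unrolled by k = 4 */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```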
that it is legal to move the SD instructions after the DADDUI and BNE; determine that loads and stores of different iterations can be interchanged.

Effects of register renaming
Without renaming, every unrolled loop body would reuse the same registers (F0, F4), creating name dependencies between iterations. With each iteration renamed to its own registers (F0/F4, F6/F8, F10/F12, F14/F16), there are no dependencies between the loop bodies of two iterations:

  Loop:  LD      F0, 0(R1)
         ADDD    F4, F0, F2
         SD      F4, 0(R1)
         LD      F6, -8(R1)
         ADDD    F8, F6, F2
         SD      F8, -8(R1)
         LD      F10, -16(R1)
         ADDD    F12, F10, F2
         SD      F12, -16(R1)
         LD      F14, -24(R1)
         ADDD    F16, F14, F2
         SD      F16, -24(R1)
         DADDUI  R1, R1, #-32
         BNE     R1, R2, Loop

Limitations of loop unrolling
- Code size limitations.
- A lower limit on the overhead that can be saved by loop unrolling: consider that you need to prepend another, non-unrolled loop of length n mod k.
- Register pressure: a shortfall of registers generated by aggressive unrolling.

Static branch prediction by the compiler

         LD      R1, 0(R2)
         DSUBU   R1, R1, R3
         BEQZ    R1, L        ; branch b1
         OR      R4, R5, R6
         DADDU   R10, R4, R3
  L:     DADDU   R7, R8, R9

- Suppose we know that branch b1 is nearly always taken and the fall-through path does not touch R7: move DADDU R7, R8, R9 after the load and remove a stall.
- Suppose we know that branch b1 is nearly never taken and R4 is not needed on the taken path: move OR R4, R5, R6 after the load instruction.

Static branch prediction by the compiler (II)
Static branch prediction schemes:
- Predict all branches as taken
- Predict backward-going branches as taken and forward-going branches as not taken
- Use profile information from previous runs

Detecting loop-level parallelism
Loop-carried dependencies are dependencies between operands across multiple iterations. The analysis is typically done at the source-code level, or close to it.

  for (i = 1000; i > 0; i--)
      x[i] = x[i] + s;

The dependence on x[i] is within a single iteration. There are no loop-carried
dependencies.

Another example

  for (i = 1; i < 100; i++) {
      A[i+1] = A[i] + C[i];    /* statement S1 */
      B[i+1] = B[i] + A[i+1];  /* statement S2 */
  }

Assuming that A, B, and C are distinct and non-overlapping arrays: S1 uses a value computed by S1 in an earlier iteration; S2 uses a value computed by S2 in an earlier iteration; and S2 uses a value computed by S1 in the same iteration. This requires S1 and S2 to be executed in order.

Another example (II)

  for (i = 1; i < 100; i++) {
      A[i]   = A[i] + B[i];    /* statement S1 */
      B[i+1] = C[i] + D[i];    /* statement S2 */
  }

S1 uses a value computed by S2 in an earlier iteration, but there is no dependence from S1 to S2. Since there is no circular dependency between S1 and S2, the loop can be executed in parallel. How?

  A[1] = A[1] + B[1];
  for (i = 1; i < 99; i++) {
      B[i+1] = C[i] + D[i];
      A[i+1] = A[i+1] + B[i+1];
  }
  B[100] = C[99] + D[99];

Another example (III)

  for (i = 2; i < 100; i++)
      A[i] = A[i-k] + A[i];

A recurrence with a dependence distance of k. Consider unrolling this loop: if k == 1, successive statements are dependent on each other; if k > 1, there is a sequence of k statements that have no dependence on each other.

Finding dependencies
A complex task, even when ignoring indirect accesses etc. Imagine the following case: two loop indices j and k, and two accesses of the form

  X[a*j + b] = ...
  ...        = X[c*k + d]

Do the two accesses to X create a dependence? There is no general method to determine dependence at compile time. A sufficient test for the absence of a dependence is the greatest common divisor (GCD) test.

Finding dependencies (II)
Greatest common divisor (GCD) test: a loop-carried dependency can exist only if GCD(c, a) divides (d - b). Please note that the test can succeed and still no dependence exists. Example:

  for (i = 1; i < 100; i++)
      A[2*i + 3] = A[2*i] * 5.0;

Here a = 2, b = 3, c = 2, d = 0, so GCD(a, c) = 2 and d - b = -3. Since 2 does not divide 3, no dependence is possible.

COSC 6385 Computer Architecture
Pipelining
Edgar Gabriel, Fall 2006
Some of the slides are based on a lecture by David
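The GCD test can be sketched in a few lines of C. The helper names are mine; the test is only sufficient for absence, so a "possible" answer does not prove a dependence exists:

```c
#include <stdlib.h>

/* GCD dependence test: a write to X[a*i + b] and a read of X[c*i + d]
   can be loop-carried dependent only if gcd(a, c) divides d - b.
   Returns 1 when a dependence is possible, 0 when it is ruled out. */
static int gcd(int x, int y) {
    x = abs(x);
    y = abs(y);
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

int dependence_possible(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}
```

For the example in the notes, dependence_possible(2, 3, 2, 0) reports that no dependence is possible, since 2 does not divide -3.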
Culler, University of California, Berkeley
(http://www.eecs.berkeley.edu/~culler/courses/cs252-s05)

Instruction Set Architecture (recap)
Relevant features for distinguishing ISAs: internal storage, memory addressing, type and size of operands, operations, instructions for flow control, encoding of the instruction set.

Pipelining
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution: split an expensive operation into several sub-operations and execute the sub-operations in a staggered manner. Real-world analogy: an assembly line in car manufacturing, where each station is doing something different and each station is working on a separate car. Pipelining increases the throughput but does not reduce the latency of an operation.

Classes of instructions
- ALU instructions: take either 2 registers as operands, or 1 register and one 16-bit immediate offset; results are stored in a 3rd register.
- Load and store instructions
- Branches and jumps

Typical implementation of an instruction (I)
1. Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction, and update the PC to the next sequential PC (+4 bytes).
2. Instruction decode/register fetch cycle (ID): decode the instruction, read the registers corresponding to the register source specifiers from the register file, sign-extend the offset fields if needed, and compute a possible branch target address.

Typical implementation of an instruction (II)
3. Execution/effective address cycle (EX): the ALU adds the base register and the offset to form the effective address, or the ALU performs the operation on the values read from the register file, or the ALU performs the operation on a value read from a register and the sign-extended immediate.
4. Memory access cycle (MEM): if the instruction is a load, read memory using the effective address computed in step 3; if the instruction is a store, write the data from the second
register read of the register file to the effective address.
5. Write-back cycle (WB): write the result into the register file, from memory for a load instruction or from the ALU for an ALU instruction.

Typical implementation of an instruction (III)
The five stages: Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, Write Back.

Datapath I
Fetching instructions and incrementing the program counter (PC): the PC drives the read address of the instruction memory, which delivers the instruction.

Datapath II
ALU instructions, e.g. add R1, R2, R3. The register-number inputs are 5 bits wide for 32 = 2^5 registers. Control signals: the ALU operation (4 bits) and the RegWrite write control signal. Two source registers are read from the register file, the ALU computes the result (and a Zero flag), and the result is written back into the register file.

Datapath III
Load/store instructions, e.g. LW R1, offset(R2). Basic steps for a load/store operation: sign-extend the offset from 16 to 32 bits; add the sign-extended offset to R2; load the content of the resulting address into R1, or store the data from R1 into the resulting memory address. Control signals: MemRead and MemWrite.

Datapath IV
Combining load/store and ALU instructions: a multiplexer controlled by ALUSrc selects between the second register value and the sign-extended immediate as the second ALU input, and a multiplexer controlled by MemtoReg selects between the ALU result and the memory read data for the register write-back.

Datapath V
Branches, e.g. beq R1, R2, offset. Basic steps for a branch-equal instruction: compute the branch target address from the sign-extended
offset field (shift the offset field by 2 bits in order to ensure a word offset, then add the shifted, sign-extended offset to the PC), and compare registers R1 and R2.

Datapath VI
Implementation of branches: PC+4 comes from the instruction datapath; an adder produces the branch target from PC+4 and the shifted, sign-extended offset; the ALU compares the two registers read from the register file, and its result goes to the branch control logic.

Visualizing pipelining
With the five stages overlapped, a new instruction enters the pipeline every clock cycle; in any one cycle, up to five instructions are in different stages.

Effects of pipelining
- A pipeline of depth n requires n times the memory bandwidth of a non-pipelined processor at the same clock rate.
- Separate data and instruction caches eliminate some memory conflicts.
- The register file is used in stage ID and in WB. This is usually not a conflict, since writes are executed in the first half of the clock cycle and reads in the second half.
- Instructions in the pipeline should not attempt to use the same hardware resources at the same time.
- Pipeline registers are introduced between successive stages of the pipeline, named after the stages they connect, e.g. IF/ID, ID/ALU, etc.

Pipeline hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions.
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
- Control hazards: caused by the delay between the
fetching of instructions and decisions about changes in control flow (branches and jumps).

One memory port: structural hazards
With a single memory port, a load accessing memory in its MEM stage conflicts with the instruction fetch of a later instruction, which must stall for one cycle (a "bubble" in the pipeline).

Speed-up equation for pipelining

  CPI_pipelined = ideal CPI + average stall cycles per instruction

  Speedup = (ideal CPI x pipeline depth) / (ideal CPI + pipeline stall CPI)
            x (cycle time unpipelined / cycle time pipelined)

For a simple RISC pipeline with ideal CPI = 1:

  Speedup = pipeline depth / (1 + pipeline stall CPI)
            x (cycle time unpipelined / cycle time pipelined)

Example: dual-port vs. single-port
Machine A: dual-ported memory ("Harvard architecture"). Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate. Ideal CPI = 1 for both; loads are 40% of the instructions executed.

  SpeedupA = pipeline depth / (1 + 0) x 1 = pipeline depth
  SpeedupB = (pipeline depth / (1 + 0.4 x 1)) x 1.05 = 0.75 x pipeline depth

  SpeedupA / SpeedupB = pipeline depth / (0.75 x pipeline depth) = 1.33

Machine A is 1.33 times faster.
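The dual-port vs. single-port example can be checked numerically with the speedup formula. A sketch; `pipeline_speedup` is an illustrative helper, with the clock-rate effect passed in as a relative ratio:

```c
/* Pipelined speedup for ideal CPI = 1:
   speedup = depth / (1 + stall CPI) * (unpipelined cycle time /
   pipelined cycle time), the last factor given here as clock_ratio. */
double pipeline_speedup(double depth, double stall_cpi, double clock_ratio) {
    return depth / (1.0 + stall_cpi) * clock_ratio;
}
```

With a depth of 5, machine A gives 5.0 and machine B gives 5 / 1.4 * 1.05 = 3.75, reproducing the 1.33x ratio from the notes.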
Data hazard on R1

  add  r1, r2, r3
  sub  r4, r1, r3
  and  r6, r1, r7
  or   r8, r1, r9
  xor  r10, r1, r11

The sub, and, and or instructions need r1 before the add has written it back.

Three generic data hazards: Read After Write (RAW)
Instruction J tries to read an operand before instruction I writes it:

  I: add r1, r2, r3
  J: sub r4, r1, r3

Caused by a "dependence" in compiler nomenclature. This hazard results from an actual need for communication.

Three generic data hazards: Write After Read (WAR)
Instruction J writes an operand before instruction I reads it:

  I: sub r4, r1, r3
  J: add r1, r2, r3
  K: mul r6, r1, r7

Called an anti-dependence by compiler writers; it results from the reuse of the name r1. It can't happen in our 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.

Three generic data hazards: Write After Write (WAW)
Instruction J writes an operand before instruction I writes it:

  I: sub r1, r4, r3
  J: add r1, r2, r3
  K: mul r6, r1, r7

Called an output dependence by compiler writers; this also results from the reuse of the name r1. It can't happen in the DLX 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5. We will see WAR and WAW in more complicated pipelines.

Forwarding to avoid data hazards
In the sequence above (add r1, r2, r3 followed by sub, and, or, xor using r1), the ALU result is forwarded directly from the pipeline registers to the ALU inputs of the dependent instructions, avoiding the stalls.
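The three hazard classes reduce to register-overlap checks. A simplified sketch (names are mine), with each instruction modeled as one destination and two source registers:

```c
/* RAW, WAR, and WAW checks between an earlier instruction I and a
   later instruction J. */
int is_raw(int i_dest, int j_src1, int j_src2) {   /* J reads what I writes */
    return i_dest == j_src1 || i_dest == j_src2;
}
int is_war(int i_src1, int i_src2, int j_dest) {   /* J writes what I reads */
    return j_dest == i_src1 || j_dest == i_src2;
}
int is_waw(int i_dest, int j_dest) {               /* both write the same reg */
    return i_dest == j_dest;
}
```

For add r1, r2, r3 followed by sub r4, r1, r3, only the RAW check fires.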
Data hazard even with forwarding

  lw   r1, 0(r2)
  sub  r4, r1, r6
  and  r6, r1, r7
  or   r8, r1, r9

The load delivers r1 only at the end of its MEM stage, too late for the EX stage of the immediately following sub; even with forwarding, one stall cycle is unavoidable.

Branches: pipelined datapath
The branch target address calculation and the register comparison happen deep in the pipeline, after subsequent instructions have already been fetched.

Four branch hazard alternatives
1. Stall until the branch direction is clear.
2. Predict branch not taken: execute the successor instructions in sequence and squash the instructions in the pipeline if the branch is actually taken. This has the advantage of a late pipeline state update; 47% of branches are not taken on average, and PC+4 is already calculated, so use it to fetch the next instruction.
3. Predict branch taken: 53% of branches are taken on average, but the branch target address has not been calculated yet, so this still incurs a 1-cycle branch penalty. (On other machines the branch target may be known before the branch outcome.)
4. Delayed branch: define the branch to take place AFTER a following instruction:

     branch instruction
     sequential successor 1
     sequential successor 2
     ...
     sequential successor n    (branch delay of length n)
     branch target (if taken)

A 1-slot delay allows a proper decision and branch target address calculation in a 5-stage pipeline.
Delayed branch (II)
Where to get instructions to fill the branch delay slot:
- From before the branch instruction
- From the target address (only valuable when the branch is taken)
- From the fall-through path (only valuable when the branch is not taken)
Compiler effectiveness for a single branch delay slot: about 60% of branch delay slots are filled, and about 80% of the instructions executed in branch delay slots are useful in the computation.

COSC 6385 Computer Architecture
Memory Hierarchy Design
Edgar Gabriel, Fall 2006
Slides are based on a lecture by David Culler, University of California, Berkeley
(http://www.eecs.berkeley.edu/~culler/courses/cs252-s05)

Recap: who cares about the memory hierarchy?
The processor-DRAM memory gap (latency): CPU performance improves about 60% per year ("Moore's Law", 2x every 1.5 years), while DRAM latency improves only about 9% per year (2x every 10 years), so the processor-memory performance gap grows at roughly 50% per year.

Levels of the memory hierarchy (upper to lower level):
- Registers: 100s of bytes, < 10s of ns; the program/compiler moves 1-8-byte operands
- Cache: 10-100 ns, 1-0.1 cents/bit; the cache controller moves 8-128-byte blocks
- Main memory: M bytes, 200-500 ns, 0.001-0.0001 cents/bit; the OS moves 512-4K-byte pages
- Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; the user/operator moves files
- Tape: "infinite" capacity, the lowest level

The principle of locality
Programs access a relatively small portion of the address space at any instant of time. Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g. loops, reuse).
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g. straight-line code, array access).
For the last 15 years, hardware has relied on
locality for speed. Locality is a property of programs which is exploited in machine design.

Memory hierarchy terminology
- Hit: the data appears in some block in the upper level (block X). Hit rate: the fraction of memory accesses found in the upper level. Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
- Miss: the data needs to be retrieved from a block in the lower level (block Y). Miss rate = 1 - hit rate. Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor.
- Hit time << miss penalty (a miss can cost on the order of 500 instructions on the Alpha 21264).

Cache measures
- The hit rate, the fraction of accesses found in that level, is usually so high that one talks about the miss rate instead. Miss rate fallacy: as MIPS is to CPU performance, the miss rate is to average memory access time.
- Average memory access time = hit time + miss rate x miss penalty (in ns or clock cycles)
- Miss penalty: the time to replace a block from the lower level, including the time to replace it in the CPU: access time (the latency to the lower level) plus transfer time (the time to transfer the block, a function of the bandwidth between the upper and lower levels).

Simplest cache: direct mapped
With a 4-byte direct-mapped cache, cache location 0 can be occupied by data from memory locations 0, 4, 8, etc.; in general, by any memory location whose 2 LSBs of the address are 0s. Address<1:0> gives the cache index. Which one should we place in the cache, and how can we tell which one is in the cache?
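The average-memory-access-time formula above, as a one-line helper (the function name is mine):

```c
/* Average memory access time:
   AMAT = hit_time + miss_rate * miss_penalty
   (all times in the same unit, e.g. clock cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty give an AMAT of 2 cycles, i.e. misses double the effective access time.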
N entries for each Cache Index N direct mapped caches operates in parallel N typically 2 to 4 Example Twoway set associative cache Cache Index selects a set from the cache The two tags in the set are compared in parallel Data is selected based on the tag result Cache Index Valid Cache Tag Cache Data Cache Data Cache Tag Valid Cache Block 0 Cache Block 0 SCIO COSC 6385 Computer Architecture Cache Block Edgar Gabriel Disadvantage of Set Associative Cache Nway Set Associative Cache v Direct Mapped Cache N comparators vs 1 Extra MUX delay for the data Data comes AFTER HitMiss In a direct mapped cache Cache Block is available BEFORE HitMiss Possible to assume a hit and continue Recover later if miss Cache Index Valid Cache Tag Cache Data Cache Data Cache Tag Valid Cache Block 0 Cache Block 0 SCIO COSC 6385 Computer Architecture Edgar Gabriel Cache Block 4 Questions for Memory Hierarchy Q1 Where can a block be placed in the upper level Block placement 02 How is a block found if it is in the upper level Block identi cation QB Which block should be replaced on a miss Block replacement Q4 What happens on a write Write strategy COSC 6385 Computer Architecture 1 Edgar Gabriel J Q1 Where can a block be placed in the upper level Block 12 placed in 8 block cache Fully associative direct mapped 2way set associative SA Mapping Block Number Modulo Number Sets Dir39ec r Mapped 2Way Assoc FU Mapped 12 mod 8 4 12 mod 4 0 01234567 01234567 01234567 Cache 1111111111222222222233 01234567890123456789012345678901 Me mory COSC 9 P 39 l rc1it 1 u 3 Edgar Gabriel 0 QZ How is a block found if it is in the upper level Tag on each block No need to check index or block offset Increasing associativity shrinks index expands tag IIll gt p Block Address Bock Offset Index Tag COSC 6385 Computer Architecture Edgar Gabriel QB Which block should be replaced on a miss Easy for Direct Mapped Set Associative or Fully Associative Random LRU Least Recently Used Assoc 2way 4way 8way Size LRU Ran LRU Ran LRU 
Ran 16 KB 52 57 47 53 44 50 64 KB 19 20 15 17 14 15 256 KB 115 117 113 113 112 112 COSC 6385 Computer Architecture Edgar Gabriel 0 Q4 What happens on a write Write through The information is written to both the block in the cache and to the block in the lowerlevel memory Write back The information is written only to the block in the cache The modified cache block is written to main memory only when it is replaced is block clean or dirty Pros and Cons of each WT read misses cannot result in writes WB no repeated writes to same location WT always combined with write buffers so that don t wait for lower level memory COSC 6385 Computer Architecture Edgar Gabriel Write Buffer for Write Through Processor Cache Write Buffer DRAM A Write Buffer is needed between the Cache and Memory Processor writes data into the cache and the write buffer Memory controller write contents ofthe bufferto memory Write buffer is just a FIFO Typical number ofentries 4 Works ne if Store frequency wrt time ltlt 1 DRAM write cycle Memory system designer s nightmare Store frequency wrt time Write buffer saturation COSC 6385 Computer Architecture IIll Edgar Gabriel gt 1 DRAM write cycle Impact of Memory Hierarchy on Algorithms Today CPU time is a function of ops cache misses vs just fops What does this mean to Compilers Data structures Algorithms The Influence of Caches on the Performance of Sorting by A LaMarca and RE Ladner Proceedings of the Eighth Annual ACM SIAM Symposium on Discrete Algorithms January 1997 370379 Quicksort fastest comparison based sorting algorithm when all keys fit in memory Radix sort also called linear time sort because for keys of fixed length and fixed radix a constant number of passes over the data is suf cient independent of the number of keys For Alphastation 250 32 byte blocks direct mapped L2 2MB cache 8 byte keys from 4000 to 4000000 COSC 6385 Computer Architecture Edgar Gabriel 0 Quicksort vs Radix as vary number keys Instructions Radix sor39139 Radix Instrkey 700 
600 500 400 300 200 100 Instruchons key 0 1000 10000 100000 1000000 1E07 Set size in keys COSC 6385 Computer Architecture Edgar Gabriel 0 Quicksort vs Radix as vary number keys Instrs amp Time Radix sor39139 Quick Instrkey 39 Radix Instrkey Quick Clockskey N Radix clockskey 800 700 600 500 400 300 200 I 1 00 I39IS h UC I39IOI39IS 0 1000 10000 100000 1000000 1E07 Set size in keys COSC 6385 Computer Architecture Edgar Gabriel 0 Quicksort vs Radix as vary snumber keys Cache misses Radix sor39139 Quickmisslkey Radixmisslkey 0 Quick o o o o o o o Cache misses sort I 1000 cosc 638 EEEEE Ga 39 5 rrrrrrrrrrr hi ttttt re brlel I I I I 10000 100000 1000000 10000000 Set size in keys A Modern Memory Hierarchy By taking advantage of the principle of locality Present the user with as much memory as is available in the cheapest technology Provide access at the speed offered by the fastest tech oloov Processor Control S d T em My econ ary Storage Second Main i321 DiSkTape g O Level Memory 0 Da aPa h a g Cache DRAM E a E SRAM Speed ns 1s 10s 100s 10000000s 10000000000s 10s ms 10s sec Size bytes 100s COSC 6385 Computer Architecture KS MS GS Edgar Gabriel What is virtual memory rtual Physical Address Space Address Space Vlrtual Address 4 10 gt t Page Table Page Table Base Reg V Access index Riqhts into I faabglg table located In phySIcal offset I memory 10 Physical Address Virtual memory gt treat memory as a cache for the disk Terminology blocks in this cache are called Pages Typical size of a page 1K 8K Page table maps virtual page numbers to physical frames PTE Page Table Entry COSC 6385 Computer Ar itecture I Edgar Gabriel 0 UH Three Advantages of Virtual Memory Program can be given consistent view of memory even though physical memory is scrambled Makes multithreading reasonable now used a lot Only the most important part of program Working Set must be in physical memory Contiguous structures like stacks use only as much physical memory as necessary yet still grow later Different 
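The page-table mapping just described can be sketched in a few lines. This is an illustrative sketch, not code from the lecture: the 4 KB page size and the page-table contents are made-up example values.

```python
# Hedged sketch of virtual-to-physical translation: split the virtual
# address into (virtual page number, offset), look the page number up in
# the page table, and recombine with the physical frame number.
PAGE_SIZE = 4096                      # example 4 KB pages

page_table = {0: 7, 1: 3, 2: 9}       # virtual page number -> physical frame

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise RuntimeError("page fault")   # the OS would fetch the page from disk
    return page_table[vpn] * PAGE_SIZE + offset
```

For example, virtual address 4097 (page 1, offset 1) maps to frame 3, i.e. physical address 3 * 4096 + 1.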
Issues in Virtual Memory System Design
- What is the size of the information blocks transferred from secondary to main storage (M)? Contrast this with the physical block size on disk.
- Which region of M is to hold the new block? How do we find a page when we look for it?
- If a block of information is brought into M and M is full, then some region of M must be released to make room for the new block.
- What do we do on a write?
- A missing item is fetched from secondary memory only on the occurrence of a fault: pages move between memory and disk, frames hold them in memory.

Translation Look-Aside Buffers (TLB)
- Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
- TLBs are usually small, typically not more than 128-256 entries even on high-end machines; this permits a fully associative lookup on these machines. Most mid-range machines use small n-way set-associative organizations.
- Flow: the CPU presents a virtual address (VA) to the TLB; on a hit the physical address (PA) goes to the cache; on a miss the translation is fetched from the page table and the TLB is updated.

Summary #1/2: Caches
- The principle of locality: programs access a relatively small portion of the address space at any instant of time. Temporal locality (locality in time) and spatial locality (locality in space).
- Three major categories of cache misses: compulsory misses (sad facts of life, e.g. cold-start misses); capacity misses (increase the cache size); conflict misses (increase cache size and/or associativity; nightmare scenario: the ping-pong effect).
- Write policy: write through (needs a write buffer; nightmare: write buffer saturation) vs. write back (control can be complex).

Summary #2/2: The Cache Design Space
- Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation.
- The optimal choice is a compromise: it depends on access characteristics (workload; use: I-cache, D-cache, TLB) and on technology and cost.
- Simplicity often wins.

Summary #3: TLB, Virtual Memory
- Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1. Where can a block be placed? 2. How is a block found? 3. Which block is replaced on a miss? 4. How are writes handled?
- Page tables map virtual addresses to physical addresses; TLBs are important for fast translation.
- TLB misses are significant in processor performance: funny times, as most systems cannot access all of the second-level cache without TLB misses.

Summary: Memory Hierarchy
- Virtual memory was controversial at the time: can software automatically manage 64 KB across many programs? 1000x DRAM growth removed the controversy.
- Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy itself.
- Today CPU time is a function of (ops, cache misses), not just of ops: this matters for compilers, data structures, and algorithms.

COSC 6385 Computer Architecture: File I/O. Edgar Gabriel, Fall 2006.

The I/O problem
- Current processor performance (e.g. Pentium 4, 3 GHz): ~6 GFLOPS; memory bandwidth: 133 MHz * 4 * 64 bit = 4.26 GB/s.
- Current network performance: Gigabit Ethernet: latency ~40 us, bandwidth 125 MB/s; InfiniBand 4x: latency ~5 us, bandwidth ~1 GB/s.
- Disk performance: latency 7-12 ms, bandwidth 20-60 MB/s.

Basic characteristics of storage devices
- Capacity: the amount of data a device can store.
- Transfer rate (bandwidth): the amount of data a device can read/write in a certain amount of time.
- Access time (latency): the delay before the first byte is moved.
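The two device characteristics combine into a simple time model, analogous to the miss-penalty formula from the cache slides. A minimal sketch, using the disk numbers from the I/O-problem slide as example values:

```python
# Time to move a block of data from a storage device:
# roughly access time (latency) plus size divided by bandwidth.
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Reading 6 MB from a disk with 7 ms latency and 60 MB/s bandwidth:
t = transfer_time(6 * 10**6, 0.007, 60 * 10**6)   # 0.007 s + 0.1 s
```

Note how quickly the fixed latency term is dwarfed by the transfer term for large requests, which is why file systems prefer large sequential transfers.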
  Prefix      | Abbreviation | Base ten | Base two
  kilo / kibi | K / Ki       | 10^3     | 2^10 = 1024
  Mega / mebi | M / Mi       | 10^6     | 2^20
  Giga / gibi | G / Gi       | 10^9     | 2^30
  Tera / tebi | T / Ti       | 10^12    | 2^40
  Peta / pebi | P / Pi       | 10^15    | 2^50

UNIX File Access Model I
- A file is a sequence of bytes. When a program opens a file, the file system establishes a file pointer: an integer indicating the position in the file where the next byte will be written/read.
- Multiple processes can open a file concurrently; each process has its own file pointer.
- No conflicts occur when multiple processes read the same file. If several processes write at the same location, most UNIX file systems guarantee sequential consistency: the data from one of the processes will be available in the file, but not a mixture from several processes.

UNIX File Access Model II
- Disk drives read and write data in fixed-size units (disk sectors). File systems allocate space in blocks, which are a fixed number of contiguous disk sectors.
- In UNIX-based file systems, the blocks that hold data are listed in an inode. An inode contains the information needed to find all the blocks that belong to a file. If a file is too large for the inode to hold the whole list of blocks, intermediate nodes (indirect blocks) are introduced.

Write operations
- On a write, the file system copies bytes from the user buffer into a system buffer; when the buffer fills up, the system sends the data to disk.
- System buffering allows the file system to collect full blocks of data before sending them to disk, and to send several blocks at once (delayed write / write behind). The data is not really saved in the case of a system crash.
- For very large write operations, the additional copy from user to system buffer could/should be avoided.

Read operations
- On a read, the file system determines which blocks contain the requested data, reads those blocks from disk into a system buffer, and copies the data from the system buffer into user memory.
- System buffering: the file system always reads a full block (file caching).
- If an application reads data sequentially, prefetching (read ahead) can improve performance; prefetching is harmful to performance if the application has a random access pattern.

File system operations
- Caching and buffering improve performance by avoiding repeated accesses to the same block and by allowing the file system to smooth out I/O behavior.
- Non-blocking I/O gives users control over prefetching and delayed writing: initiate read/write operations as soon as possible, and wait for their completion only when absolutely necessary.

Distributed File Systems vs. Parallel File Systems
- Distributed file systems offer access to a collection of files on remote machines; typically a client/server-based approach, transparent to the user.
- Concurrent access to the same file from several processes is considered an unlikely event, in contrast to parallel file systems, where it is considered a standard operation.
- Distributed file systems assume different numbers of processors than parallel file systems, and have different security requirements.

NFS (Network File System)
- A protocol for a remote file service; client/server-based approach; stateless server (v3); communication based on RPC (Remote Procedure Call).
- NFS provides session semantics: changes to an open file are initially only visible to the process that modified the file.
- File locking is not part of the NFS protocol (v3); it is handled by a separate protocol/daemon. Locking of blocks is often supported.
- Client caching is not part of the NFS protocol (v3) and depends on the implementation, e.g. allowing cached data to be stale for 30 seconds.
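The inode scheme from the UNIX file access model above can be sketched as an offset-to-block lookup. This is an illustrative sketch only: the block size, the number of direct pointers, and the helper names are hypothetical, and real file systems differ in detail.

```python
# Map a byte offset in a file to a disk block number, using an inode with
# direct block pointers plus one indirect block of additional pointers.
BLOCK_SIZE = 4096    # assumed block size
NUM_DIRECT = 12      # assumed number of direct pointers in the inode

def block_for_offset(inode, offset, read_block):
    """inode = {'direct': [...], 'indirect': blk_no}; read_block(n) returns
    the list of block pointers stored in indirect block n."""
    logical = offset // BLOCK_SIZE
    if logical < NUM_DIRECT:
        return inode['direct'][logical]          # small files: direct pointers
    logical -= NUM_DIRECT
    return read_block(inode['indirect'])[logical]  # large files: indirect block
```

A second (double-indirect) level would extend the same idea for files too large for one indirect block.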
NFS in a cluster
- The front-end node hosts the file server.

NFS in a cluster II
- All file operations are remote operations: the file server (NFS server) becomes a bottleneck.
- Extensive use of file locking is required to implement the sequential consistency of UNIX I/O.
- Communication between client and server typically uses the slow communication channel of a cluster.
- Do we use several disks at all?
- Some inefficiencies in the specification: e.g. a read operation involves two RPC operations, a lookup (file handle) and the read request.

Parallel I/O
- Basic idea: disk striping. Stripe factor: the number of disks. Stripe depth: the size of each block.

Disk striping
- Requirements for improving disk performance: multiple physical disks, separate I/O channels to each disk, data transfer to all disks simultaneously.
- Problems of simple disk striping: a minimum stripe depth (sector size) is required for optimal disk performance; since the file size is limited, the number of disks that can be used in parallel is limited as well. The loss of a single disk makes the entire file useless, and the risk of losing a disk is proportional to the number of disks used -> RAID (Redundant Arrays of Independent Disks, see lecture 2).

Parallel File Systems
- Goals: several processes should be able to access the same file concurrently, and efficiently.
- Problems: UNIX sequential consistency semantics, handling of file pointers, caching and buffering.

Concurrent file access: logical view
- The numbers of compute and I/O nodes need not match; blocks from the compute nodes form the logical view of a shared file on the I/O nodes.

Concurrent file access: opening a file
- Each I/O node has a subset of the blocks. The file system needs to look up where the file resides: either each I/O node maintains its own directory information, or a centralized name service is used.
- The file system needs to look up the striping factor (often fixed).
- When creating a new file, the file system has to choose different I/O nodes for holding the first block, to avoid contention.

Concurrent write operations
- How to ensure sequential consistency? File locking prevents parallelism even if processes write to different locations in the same file (false sharing). Better: locking of individual blocks.
- Parallel file systems often offer two consistency models: sequential consistency, and a relaxed consistency model in which the application is responsible for preventing overlapping write operations.

File pointers
- In UNIX, every process has a separate file pointer (individual file pointers).
- Shared file pointers are often useful, e.g. for reading the next piece of work or writing a parallel logfile. On distributed-memory machines they are slow, since somebody has to coordinate the file pointer; they can be fast on shared-memory machines. General problems: file-pointer atomicity, non-blocking I/O.
- Explicit file-offset operations: each process tells the file system where to read/write in the file; no update to file pointers.

Buffering and caching
- Client buffering (buffering at the compute nodes): consistency problems, e.g. one node writes while another tries to read the same data.
- Server buffering (buffering only at the I/O nodes): prevents concatenating several small requests into a single large one -> produces lots of traffic.

Redundant arrays of independent disks (RAID)
- Central idea: replicate data over several disks such that no data is lost if a disk fails. Several RAID levels are defined.
- RAID 0: disk striping without redundant storage (JBOD, "just a bunch of disks"). No fault tolerance; good for high transfer rates and high request rates.
- RAID 1: mirroring. All data is replicated on two or more disks. Does not improve write performance, and improves read performance only moderately.

RAID level 2
- Hamming codes: each group of data bits has several check bits appended to it, forming Hamming code words; each bit of a code word is stored on a separate disk.
- Very high additional cost (e.g. up to 50% additional capacity required). Hardly used today, since parity-based codes are faster and easier.

RAID level 3
- Parity-based protection, based on the exclusive OR (XOR), which is reversible. Example:
    01101010  data byte 1
XOR 11001001  data byte 2
  = 10100011  parity byte
  Recovery:
    11001001  data byte 2
XOR 10100011  parity byte
  = 01101010  recovered data byte 1
- Data is divided evenly into N sub-blocks (N = number of disks, typically 4 or 5); computing the parity bytes generates an additional sub-block; the sub-blocks are written in parallel on N+1 disks. For best performance the data should be of size N * sector size.
- Problems with RAID level 3: all disks participate in every operation -> contention for applications with high access rates; if the data size is less than N * sector size, the system has to read the old sub-blocks to calculate the parity bytes. RAID level 3 is good for high transfer rates.

RAID level 4
- Parity bytes for N disks are calculated and stored on a separate parity disk; files are not necessarily distributed over N disks.
- Read operations: determine the disks holding the requested blocks, and read the data from those disks.
- Write operations: retrieve the old data from the sector being overwritten; retrieve the parity block from the parity disk; extract the old data from the parity block using XOR; add the new data to the parity block using XOR; store the new data; store the new parity block.
- Bottleneck: the parity disk is involved in every operation.

RAID level 5
- Same as RAID 4, but the parity blocks are distributed over the different disks. [Figure: block layout with the parity blocks rotated across the disks.]
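The XOR parity scheme used by RAID levels 3-5 can be sketched directly from the byte example above, generalized to whole blocks. A minimal sketch, not production RAID code:

```python
# XOR all data blocks byte-wise to produce the parity block; because XOR is
# reversible, XOR-ing the survivors with the parity rebuilds a lost block.
def parity(blocks):
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

def recover(surviving_blocks, parity_block):
    """Rebuild the single lost block from the survivors plus the parity."""
    return parity(list(surviving_blocks) + [parity_block])
```

With the slide's bytes, parity([0b01101010, 0b11001001]) yields 0b10100011, and XOR-ing data byte 2 with the parity recovers data byte 1.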
RAID level 6
- Tolerates the loss of more than one disk; a collection of several techniques. E.g. P+Q parity: store parity bytes computed using two different algorithms, with the two parity blocks on different disks. E.g. two-dimensional parity: parity disks along the rows and columns of a disk array.

RAID level 10
- RAID level 1 + RAID level 0: RAID 1 (mirroring) over RAID 0 (striping). Also available: RAID 53 = RAID 0 + RAID 3.

Comparing RAID levels

  Level | Protection | Space usage        | Good at           | Poor at
  0     | none       | N                  | performance       | data protection
  1     | mirroring  | 2N                 | data protection   | space efficiency
  3     | parity     | N+1                | transfer rate     | request rate
  4     | parity     | N+1                | read request rate | write performance
  5     | parity     | N+1                | request rate      | transfer rate
  6     | P+Q or 2D  | N+2 or M*N+M+N     | data protection   | write performance
  10    | mirroring  | 2N                 | performance       | space efficiency
  53    | parity     | N + striping factor| performance       | space efficiency

COSC 6385 Computer Architecture: Tomasulo's Algorithm II. Edgar Gabriel, Fall 2006. Some slides are based on a lecture by David A. Patterson, University of California, Berkeley.

Detailed steps
- Let's look at the details for an operation OP rd, rs, rt (e.g. ADDD F6, F2, F0). Assume the operation has been assigned to reservation station r.
- RS[r] is the data structure holding all the fields of reservation station r, as described earlier; RegisterStat[rs] is the data structure holding the status of register rs, e.g. whether a reservation station will write the register.

Detailed steps II (FP operation)
- Issue: wait until station r is empty. Action/bookkeeping:
  if RegisterStat[rs].Qi != 0 then RS[r].Qj <- RegisterStat[rs].Qi, else RS[r].Vj <- Regs[rs] and RS[r].Qj <- 0;
  if RegisterStat[rt].Qi != 0 then RS[r].Qk <- RegisterStat[rt].Qi, else RS[r].Vk <- Regs[rt] and RS[r].Qk <- 0;
  RS[r].Busy <- yes; RegisterStat[rd].Qi <- r.

Detailed steps III (FP operation, continued)
- Execute: wait until RS[r].Qj = 0 and RS[r].Qk = 0 (both operands available); compute the result using Vj and Vk.
- Write result: wait until execution at r is complete and the CDB is available. Action/bookkeeping:
  for all x: if RS[x].Qj = r then RS[x].Vj <- result and RS[x].Qj <- 0;
  for all x: if RS[x].Qk = r then RS[x].Vk <- result and RS[x].Qk <- 0;
  for all x: if RegisterStat[x].Qi = r then Regs[x] <- result and RegisterStat[x].Qi <- 0;
  RS[r].Busy <- no.

Detailed steps IV (load, e.g. LD rt, imm(rs))
- Issue: wait until buffer r is empty. Action/bookkeeping:
  if RegisterStat[rs].Qi != 0 then RS[r].Qj <- RegisterStat[rs].Qi, else RS[r].Vj <- Regs[rs] and RS[r].Qj <- 0;
  RS[r].A <- imm; RS[r].Busy <- yes; RegisterStat[rt].Qi <- r.

Detailed steps V (load, continued)
- Load step 1: wait until RS[r].Qj = 0 and r is the head of the load queue; compute the effective address: RS[r].A <- RS[r].Vj + RS[r].A.
- Load step 2: wait until load step 1 is complete; read the result from Mem[RS[r].A].
- Write result: wait until load step 2 is complete and the CDB is available; for all x: if RS[x].Qj = r then RS[x].Vj <- result and RS[x].Qj <- 0; if RS[x].Qk = r then RS[x].Vk <- result and RS[x].Qk <- 0; if RegisterStat[x].Qi = r then Regs[x] <- result and RegisterStat[x].Qi <- 0; RS[r].Busy <- no.

A loop-based example

  Loop: LD    F0, 0(R1)
        MULTD F4, F0, F2
        SD    F4, 0(R1)
        SUBI  R1, R1, #8
        BNEZ  R1, Loop

- This time assume: multiply takes 4 clocks; the 1st load takes 8 clocks in total (1 for the effective address, 7 for the memory access: an L1 cache miss); the 2nd load takes 1 clock (hit).
- To be clear, clocks for SUBI and BNEZ are not shown; in reality the integer instructions run ahead of the floating-point instructions. Two iterations are shown.

[Cycle-by-cycle slides, condensed:]
- Cycle 1: issue the first load. Cycle 2: the first load computes its effective address; issue MULTD.
- Cycle 3: first load memory access (1/7); mult stalled waiting for F0; issue SD. Cycle 4: load access 2/7; the store computes its effective address (SUBI not shown).
- Cycles 5-6: load access 3/7 and 4/7; mult and store stalled (BNEZ not shown); issue the second LD.
- Cycles 7-8: load access 5/7 and 6/7; the second load computes its effective address; issue the second MULTD and the second SD; the second mult stalled.
- Cycle 10: the first load writes its result on the CDB; mult1 can start; the second load finishes; both stores stalled.
- Cycle 11: load2 writes its result; mult1 in cycle 1/4; mult2 and both stores stalled.
- Cycle 14: mult1 in cycle 4/4, mult2 in 3/4; stores still stalled.
- Cycle 15: mult1 writes its result; mult2 in 4/4; store1 executes.
- Cycle 16: store1 and store2 execute.

Tomasulo: please note
- F0 never sees the data from the first load: the register file is completely detached from the computation.
- The first and second iterations overlap completely.
- Even assuming two mult units, we could not have issued a third mult operation for the next iteration of the loop -> no third store iteration could be issued.
- In-order issue, out-of-order execution, out-of-order completion.

Why can Tomasulo overlap iterations of loops?
- Register renaming: multiple iterations use different physical destinations for the registers (dynamic loop unrolling).
- Reservation stations permit instruction issue to advance past integer control-flow operations; they also buffer the old values of registers, totally avoiding the WAR stall that we saw in the scoreboard.
- Another perspective: Tomasulo builds the data-flow dependency graph on the fly.

Dynamic branch prediction
- Up to now we used four techniques to avoid branch hazards: stall, predict not taken, predict taken, delayed branch.
- All of these methods are static: they do not take the previous behavior of branches into account.

Dynamic branch prediction II
- Seven techniques for dynamic branch prediction: 1-bit branch prediction buffer; 2-bit branch prediction buffer; correlating branch prediction buffer; tournament branch predictor; branch target buffer; integrated instruction fetch units; return address predictors.
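The first two of these techniques can be sketched as a table of saturating counters indexed by the low-order bits of the branch address. This is an illustrative sketch; the 1024-entry size is a made-up example, and n_bits=1 or 2 selects between the two buffer variants discussed below.

```python
# n-bit branch prediction buffer: each entry is a saturating counter; the
# upper half of the counter range predicts "taken". With n_bits=2 a
# prediction must miss twice before it flips, as described in the notes.
class BranchPredictionBuffer:
    def __init__(self, entries=1024, n_bits=2):
        self.mask = entries - 1              # entries must be a power of two
        self.max = (1 << n_bits) - 1
        self.counters = [0] * entries        # all start as "strongly not taken"

    def predict(self, pc):
        return self.counters[pc & self.mask] > self.max // 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(self.counters[i] + 1, self.max)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)
```

Note that two branch addresses that agree in their low-order bits share an entry, exactly the aliasing issue mentioned for the real buffer.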
1-bit Branch Prediction Buffer I
- A branch prediction buffer is a small memory area indexed by the lower portion of the address of the branch instruction. It records whether the branch was taken the last time or not; 1 bit is sufficient.
- Please note: several branches might share the same entry, since we do not use the full branch-instruction address for accessing the branch prediction buffer.

1-bit Branch Prediction Buffer II
- Limitation: even for a regular loop embedded in another large loop, the 1-bit buffer will mispredict at least the first and the last iteration. 1st iteration: the bit has been set to "not taken" by the last iteration of the same loop, but the branch will be taken. Last iteration: the bit says "taken", but the branch won't be taken.

2-bit Branch Prediction Buffer
- A prediction must miss twice before the prediction is changed; the bits represent the history of the branch. Can be extended to n bits.
- States: predict taken (11), predict taken (10), predict not taken (01), predict not taken (00); a taken branch moves the state toward 11, a not-taken branch toward 00.

Correlated branches
- Example:

  BNEZ   R1, L1        ; branch b1
  DADDIU R1, R0, #1
  L1: DADDIU R3, R1, #-1
  BNEZ   R3, L2        ; branch b2
  L2: ...

- Branches b1 and b2 are correlated: for alternating values of d (the value tested in R1), a 1-bit predictor gets all predictions wrong.
- Idea: generate a prediction pair: one prediction in case the last branch in the program was taken, one in case it was not taken. Each branch has its own prediction pair; which of the two possibilities is chosen depends on the previous branch in the program.
- The example above uses a (1,1) correlating predictor: the behavior of the last 1 branch chooses from a pair of 1-bit branch predictors.
- General case: (m,n) predictor: the last m branches are used to choose from 2^m branch predictors, with n-bit predictors used in each case.

Correlating Branches
- Idea: taken/not taken of recently executed branches is related to the behavior of the next branch, as well as to the history of that branch itself.
- The behavior of recent branches selects between, say, 4 predictions of the next branch, updating just that prediction.
- Example: a (2,2) predictor: the branch address plus a 2-bit global branch history index into a table of 2-bit per-branch predictors.

Accuracy of Different Schemes
- [Figure: frequency of mispredictions on SPEC89 for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT; the (2,2) predictor achieves the lowest misprediction frequency.]

Tournament Predictors
- Motivation for correlating branch predictors: the 2-bit predictor failed on important branches; by adding global information, performance improved.
- Tournament predictors use 2 predictors, one based on global information and one based on local information, combined with a selector. The hope is to select the right predictor for the right branch.

Need the Address at the Same Time as the Prediction: Branch Target Buffer (BTB)
- The address of the branch indexes the buffer to get the prediction AND the branch target address (if taken).
- Note: we must check for a branch match, since we cannot use the address of the wrong branch (Figure 3.19, p. 262).
- If the entry matches: the instruction is a branch; use the predicted PC as the next PC if the branch is predicted taken. If there is no match: the instruction is not a predicted branch; proceed normally (next PC = PC + 4).

(Slide based on a lecture by David A. Patterson, University of California, Berkeley.)
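The BTB lookup just described can be sketched as follows. This is an illustrative sketch with a made-up 256-entry size; the key point is the stored branch address acting as a tag, so that the target of a different branch mapping to the same slot is never used.

```python
# Branch target buffer: indexed by the fetch-stage PC; each entry stores
# (branch_pc, predicted_target). A tag match means "predicted-taken branch,
# redirect fetch to the target"; otherwise fetch proceeds with PC + 4.
class BranchTargetBuffer:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = [None] * entries

    def lookup(self, pc):
        entry = self.table[pc % self.entries]
        if entry is not None and entry[0] == pc:   # the required branch match
            return entry[1]                        # predicted next PC
        return None                                # not a predicted-taken branch

    def update(self, pc, target):                  # on a resolved taken branch
        self.table[pc % self.entries] = (pc, target)
```

Without the tag comparison, two branches 256 words apart would silently steal each other's targets.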
Special Case: Return Addresses
- Register-indirect branches are hard to predict. SPEC89: 85% of such branches are procedure returns.
- Since procedure calls and returns follow a stack discipline, save the return address in a small buffer that acts like a stack; 8 to 16 entries give a small miss rate.

COSC 6385 Computer Architecture: Multi-Processors II - The IBM Cell, Intel Larrabee and Nvidia G80 processors. Edgar Gabriel, Fall 2008.

References

Intel Larrabee:
[1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, P. Hanrahan: "Larrabee: a many-core x86 architecture for visual computing", ACM Trans. Graph., vol. 27, no. 3, August 2008, pp. 1-15.
IBM Cell processor:
[2] C. R. Johns, D. A. Brokenshire: "Introduction to the Cell Broadband Engine Architecture", IBM Journal of Research and Development, vol. 51, no. 5, pp. 503-519.
[3] M. Kistler, M. Perrone, F. Petrini: "Cell Multiprocessor Communication Network: Built for Speed", IEEE Micro, vol. 26, no. 3, pp. 10-23. http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf
Nvidia G80:
[4] Scott Wasson: "Nvidia's GeForce 8800 graphics processor", techreport.com.

Larrabee Motivation
- Comparison of two architectures with the same number of transistors:

                      | 2 out-of-order cores | 10 in-order cores
  Instruction issue   | 4                    | 2
  VPU per core        | 4-wide SSE           | 16-wide
  L2 cache size       | 4 MB                 | 4 MB
  Single stream       | 4 per clock          | 2 per clock
  Vector throughput   | 8 per clock          | 160 per clock

- The simplified cores give half the performance of a single stream, but a 40x increase for multi-stream executions.

Larrabee Overview
- Many-core visual computing architecture based on x86 CPU cores, with an extended version of the regular x86 instruction set; supports subroutines and page faulting.
- The number of x86 cores can vary depending on the implementation and processor version.
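The return-address predictor described above can be sketched as a small circular stack, pushed on a call and popped on a return. An illustrative sketch; the 8-entry size follows the slide's "8 to 16 entries" remark.

```python
# Return-address stack: predicts the target of procedure returns, which
# ordinary branch predictors handle poorly because the target changes
# with every call site.
class ReturnAddressStack:
    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def push(self, return_pc):          # on a predicted call
        if len(self.stack) == self.entries:
            self.stack.pop(0)           # deep recursion overwrites the oldest
        self.stack.append(return_pc)

    def predict_return(self):           # on a predicted return
        return self.stack.pop() if self.stack else None
```

Deeply nested calls beyond the buffer size lose the oldest entries, which is the source of the small residual miss rate.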
Overview of a Larrabee Core
- x86 core derived from the Pentium processor; no out-of-order execution.
- Standard Pentium instruction set with the addition of 64-bit instructions, plus instructions for prefetching data into the L1 and L2 caches.
- Support for 4 simultaneous threads, with separate registers for each thread.
- Each core is augmented with a wide vector processing unit (VPU).
- 32 KB L1 instruction cache, 32 KB L1 data cache, 256 KB local subset of the L2 cache; the L2 cache is coherent across all cores.

Vector Processing Unit in Larrabee
- 16-wide VPU executing integer, single-precision and double-precision floating-point operations.
- The VPU supports gather/scatter operations: the 16 elements can be loaded from, or stored to, up to 16 different addresses.
- Support for predicated instructions using a mask control register (if/then/else statements).

Inter-Processor Ring Network
- Bidirectional ring network, 512 bits wide per direction; routing decisions are made before injecting a message into the network.

Larrabee Programming Models
- Most applications can be executed without modification, due to the full support of the x86 instruction set.
- Support for POSIX threads, with the API extended by thread-affinity parameters.
- Recompiling code with Larrabee's native compiler automatically generates the code to use the VPUs.
- Alternative parallel approaches: Intel Threading Building Blocks, Larrabee-specific OpenMP directives.

Larrabee Performance: [figure omitted]

IBM Cell Overview
- The Cell Broadband Engine Architecture (CBEA) was defined by a consortium of IBM, Sony and Toshiba; originally targeting the multimedia industry (e.g. the PlayStation 3), but also available as regular compute blades from IBM (e.g. the QS20/QS21/QS22).
- Heterogeneous multicore processor consisting of one or more general-purpose Power Processor Elements (PPEs) and one or more Synergistic Processor Elements (SPEs).
- Two generations so far; both have 1 PPE and 8 SPEs. The second generation (PowerXCell 8i) substantially improves the double-precision peak performance over the original, single-precision-oriented Cell.

General Purpose Processor (PPE)
- Based on the IBM PowerPC processor.
- Supports multiple simultaneous operating environments (virtualization): e.g. it can execute an instance of a real-time operating system and an instance of a non-real-time operating system.
- Performs management and application-control functions.

Synergistic Processor Element (SPE)
- SIMD processor used for offloading compute-intensive, data-parallel operations from the PPE.
- Each SPE has its own local storage and can access data only from the local store; current versions of the Cell processor have 256 KB of local storage.
- The local store is connected to main memory through a Memory Flow Controller (MFC); the MFC moves data from main memory to the local store, or between two SPEs.

Synergistic Processor Element (SPE) II
- Each SPE has 128 registers, each 128 bits wide. A register can hold sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers or single-precision floating-point numbers, or two 64-bit integers or double-precision floating-point numbers.
- Most instructions supported by the synergistic processor unit operate on all elements of a register simultaneously.

Element Interconnect Bus
- The PPE and SPEs communicate through the Element Interconnect Bus (EIB), which contains a shared command bus and four 16-byte-wide data rings: two transfer data clockwise, the other two counter-clockwise. Each ring transfers 128-byte packets (one cache-block size) at a time.
- The communication cost between two SPEs depends on their distance on the ring (number of hops); the overall bandwidth is high.

Comparison: IBM Cell and Intel Larrabee
- Both use a large number of small and simple cores, and both use a high-bandwidth ring bus to communicate between the cores.
- Intel Larrabee is homogeneous, while the IBM Cell is a heterogeneous processor (difference between the PPE and the SPEs).
- The IBM Cell requires data to be moved explicitly to the local store, while Larrabee can address any memory area; programs for the Cell have to be written taking the limited amount of memory available to an SPE into account.

Nvidia G80
- Each stream processor (SP) is a generalized processor running at 1.35 GHz, operating on a single (scalar) element.
- 16 stream processors are grouped and connected by an L1 cache; the G80 has 8 such groups, 128 SPs in total.
- The groups are connected by a crossbar-style switch to six ROPs; each ROP has its own L2 cache and an interface to graphics memory (frame buffer) with 64-bit width: 6 x 64 bits = a 384-bit path to memory.

Performance comparison
- [Figure: ray-tracing application (Stanford Bunny, 1024x1024, frames/sec) on Cell and G80; source: gametomorrow.com blog.]

COSC 6385 Computer Architecture: Memory Hierarchy Design III. Edgar Gabriel, Fall 2006.

Reducing the cache miss penalty: five techniques
- Multilevel caches; critical word first and early restart; giving priority to read misses over writes; merging write buffers; victim caches.

Reducing the miss rate: five techniques
- Larger cache block size; larger caches; higher associativity; way prediction and pseudo-associative caches; compiler optimizations.

Reducing the cache miss penalty/rate via parallelism
- Non-blocking caches; hardware prefetching of instructions and data; compiler-controlled prefetching.

Non-blocking caches
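The predicated (masked) execution described for the Larrabee VPU can be sketched in plain Python. This is an illustrative sketch, not actual VPU intrinsics: a mask register decides, per lane, whether the result of an operation is written back, which is how if/then/else statements are vectorized.

```python
# Masked vector add: dest[i] = a[i] + b[i] only where mask[i] is set;
# lanes with a clear mask bit keep their old value.
VECTOR_WIDTH = 16   # the Larrabee VPU is 16-wide; any width works here

def masked_add(dest, a, b, mask):
    return [x + y if m else d
            for d, x, y, m in zip(dest, a, b, mask)]
```

To vectorize "if (cond) r = x + y else r = x - y", the compiler would execute both arms with complementary masks into the same destination.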
mare synergstn prunessur elements PES In Wmew Twu generaupnsa 7 Ce as 2m 5 snows 14 s snows 2m 5 snows dauhl e PMrXCeH av LGnm m2 mm vs a Bmhhave1PPE andB svg vavlable 5 far vrgle Dremxmn peak perfarmance e p man peak Derfwmanue vrgle Dremxmn peak De armance auhle Dramer peak Derfwmanue General Purpose Processor PPE Based on the IBM PowerPC processor Supports multiple simultaneous operating environments virtualization Eg can execute an instance of a realtime operating system and an instance of a nonrealtime operating system Performs management and application control functions cosc Es compuherArchltecture 7 Edgar Gabriel PSTL Synergistic Processor Element SPE SIMD processor used for offloading computeintensive data parallel operations from the PPE Each SPE has its own local storage and can access data only from the local storage Current versions of the Cell processors 256k local storage The local storage is connected to the main memory through a Memory Flow Controller MFC MFC moves data from main memory to local storage or between two SPEs cost as compuherArchltecture Edgar Gabriel In PSTL I Synergistic Processor Element SPE II Each SPE ha 128 regmer Each reguterv 12B bu AAde vach can be med to hold 7 vateen 87va mteger Dr r 9ng 1mm Integer m r Fuur 327m Integer m mgle precmun uaungrpumt mum er 7 Twu 54m Integer m duume precmun uaungpumt number Mon mnrucuon upported bythe ynergvmc proceor umt utmze an element m a reguter rgt 5m a unpumumnnmm t a l PSTL I Simplified representation of a curren Cell processor H t quot quot m gt Hg Element Interconnect Bus I WE and PES communicate lhruugv the Element interconnect Bus 7 Camainsa aredcammandhu 522 p 2 Section leed m coherence Protocol 7 Pmmrta39vmnt em interconnect Faur1639l7yterwide me two uxedfar clockwixe em tranxfers Wm M cauntenclackvme em tramfer Each Ngtranxferiza byte packets l cache black me m an we Cammunicatmn cast between two SPE cam3W hetweenl nap and 5 ha In 7 Overallhandwidth maeag i PSTL Comparison IBM 
Cell and Intel Larrabee Both use a large number of small and simple cores Both use highbandwidth ring bus to communicate between the cores Intel Larrabee is homogeneous while IBM Cell is a heterogeneous process difference between PPE and IBM Cell requires data to be moved explicitly to the local store while Larrabee can address any memory area Programm for the Cell have to be written taking the limited amount of memory available for a SPE into account 0050 as compuherArchltecture Edgar Gabriel III PSTL Nvidia G80 Parallel Stream Processor Each green block is a stream processor 16 stream processors are grouped and connected by a L1 cache Each 680 has 8 groups with 16 SPs 128 SPs total Each SP is a generalized processors running at 135 GHZ Each SP operates on a single element scalar groups are connected by a crossbar style switch and that connects them to six ROP Each ROP has its own L2 cache and an interface to graphics memory frame buffer with 64 bits width 6 64bits 384 bits path to memory cost as compuherArchltecture Edgar Gabriel E g 4 gt H III EMBED ttrr HT cascaa rcumpma mhma ure Edgar Emma r 5 Performance comparison t6 IBM Ray Tracing Application Stanford Bunny 1024x1024 FransSec Omemn Gan oeumspacamsm uszn Procssm Sour httpgametomorrowcombloymda wZOW0905ceHrvrgB0 Ma mm casts Edgar m Earn COSC 6385 Computer Architecture Memory Hierarchy Design III Edgar Gabriel Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word rst and earty restart Giving priority to read misses over writes Merging write buffer Victim caches 05 m5 Computer Armitemure my Gabriel Reducing miss rate Five techniques to reduce the miss rate Larger cache block size Larger caches Higher associativity Way prediction and pseudoassociative caches Compiler optimization cos m5 Computer Armitemure my Gabriel Reducing Cache Miss penaltyrate via Parallelism Nonblocking caches Hardware prefetch of Instructions and Data Compiler controlled prefetching Nonblocking caches 
- Allow the data cache to continue to supply cache hits during a cache miss
- Reduces the effective miss penalty
- Further optimization: allow overlap of multiple misses
  - Only useful if the memory system can service multiple misses
- Increases the complexity of the cache controller

Hardware prefetching
- Prefetch items before they are requested by the processor
- E.g. the processor prefetches two blocks on a miss, assuming that the block following the missing data item will also be requested
  - The requested block is placed in the instruction cache
  - The prefetched block is placed in an instruction stream buffer
- In case of a request to the prefetched block, the original cache request is cancelled by the hardware and the block is read from the stream buffer
- Downside: wasting memory bandwidth if the block is not required

Compiler-controlled prefetch
- Register prefetch: load value into a register
- Cache prefetch: load data into the cache
- Faulting and non-faulting prefetches: a non-faulting prefetch turns into a no-op in case of a virtual address fault/protection violation

Compiler-controlled prefetch II
- Example:

  for (i = 0; i < 3; i++)
    for (j = 0; j < 100; j++)
      a[i][j] = b[j][0] * b[j+1][0];

- Assumptions:
  - Each element of a, b is an 8-byte double-precision fp number
  - 8 KB direct-mapped cache with 16-byte blocks
  - Each block can hold 2 consecutive values of either a or b

Compiler-controlled prefetching III
- No. of cache misses for a: since a block contains a[i][j] and a[i][j+1], every even value of j will lead to a cache miss and every odd value of j will be a cache hit
  -> 3 x 100/2 = 150 cache misses because of a
- No. of cache misses because of b: b does not benefit from spatial locality; each element of b can be used 3 times (for i = 0, 1, 2)
- Ignoring cache conflicts, this leads to a cache miss for b[0][0] for i = 0 and a cache miss for every b[j+1][0] when i = 0
  -> 1 + 100 = 101 cache misses

Compiler-controlled prefetching IV
- Assuming that a cache miss takes 7 clock cycles
- Assuming that we are dealing with non-faulting prefetches
- Does not prevent cache misses for the first 7 iterations:

  for (j = 0; j < 100; j++) {
    prefetch(b[j+7][0]);
    prefetch(a[0][j+7]);
    a[0][j] = b[j][0] * b[j+1][0];
  }
  for (i = 1; i < 3; i++)
    for (j = 0; j < 100; j++) {
      prefetch(a[i][j+7]);
      a[i][j] = b[j][0] * b[j+1][0];
    }

Compiler-controlled prefetching V
- New number of cache misses:
  - 7 misses for b[0][0] ... b[6][0] in the first loop
  - 7/2 ~ 4 misses for elements of a[0][ ] in the first loop
  - 7/2 ~ 4 misses for elements of a[1][ ] in the second loop
  - 7/2 ~ 4 misses for elements of a[2][ ] in the second loop
  -> total number of misses: 19
  -> reducing the number of cache misses by 232
- A write hint could tell the cache: "I need the elements of a in the cache for speeding up the writes. But since I do not read the elements, don't bother to really load the data into the cache; it will be overwritten anyway."

Reducing hit time
- Small and simple caches
- Avoiding address translation
- Pipelined cache access
- Trace caches

Small and simple caches
- Cache hit time depends on
  - reading the tag and comparing it to the address
  - the no. of comparisons having to be executed
  - the clock cycle of the cache (on-chip vs. off-chip caches)
- Smaller and simpler caches are faster by nature

Avoiding address translation
- Translation of a virtual address to a physical address is required to access memory/caches
- Can a cache use virtual addresses? In practice, no:
  - Page protection has to be enforced even if the cache were to use virtual addresses
  - The cache has to be flushed for every process switch: two processes might use the same virtual address without meaning the same physical address
    - Adding a process identifier tag (PID) to the cache address tag would solve parts of this problem
  - Operating systems might use different virtual addresses for the same physical address (aliases), which could lead to duplicate copies of the same data in the cache

Pipelined cache access and trace caches
- Pipelined cache access: reduce the latency of first-level cache access by pipelining
- Trace caches: find a dynamic sequence of instructions to load into an instruction cache block
  - The cache gets dynamic traces of the executed instructions of the CPU and is not bound to spatial locality in memory
  - Works well on taken branches
  - Used e.g. by the Pentium 4

Memory organization
- How can you improve memory bandwidth?
  - Wider main memory bus
  - Interleaved main memory
    - Try to take advantage of the fact that reading/writing multiple words uses multiple chips (banks)
    - Banks are typically one word wide
    - Memory interleaving tries to utilize all chips in the system
  - Independent memory banks
    - Allow independent access to different memory chips/banks
    - Multiple memory controllers allow banks to operate independently
    - Might require separate data buses and address lines
    - Usually required by non-blocking caches

COSC 6385 Computer Architecture
Memory Hierarchy Design II
Edgar Gabriel, Fall 2006

Cache Performance
- Avg. memory access time = Hit time + Miss rate x Miss penalty, with
  - Hit time: time to access a data item which is available in the cache
  - Miss rate: ratio of the no. of memory accesses leading to a cache miss to the total number of memory accesses
  - Miss penalty: time/cycles required for making a data item available in the cache

Split vs. unified cache
- Assume two machines:
  - Machine 1: 16 KB instruction cache + 16 KB data cache
  - Machine 2: 32 KB unified cache
- Assume for both machines:
  - 36% of instructions are memory references (data transfers)
  - 74% of memory references are instruction references
  - Misses per 1000 instructions: 16 KB instruction cache: 3.82; 16 KB data cache: 40.9; 32 KB unified cache: 43.3
  - Hit time: 1 clock cycle for machine 1; 1 additional clock cycle for machine 2 (structural hazard)
  - Miss penalty: 100 clock cycles

Split vs. unified cache II
- Questions:
  1. Which architecture has a lower miss rate?
  2. What is the average memory access time for both machines?
- The miss rate per instruction can be calculated as:

  Miss rate = (Misses per 1000 instructions / 1000) / (Memory accesses per instruction)

Split vs. unified cache III
- Machine 1: since every instruction access requires exactly one memory access:
  Miss rate (16 KB instruction) = (3.82/1000) / 1.0 = 0.00382 ~ 0.004
- Since 36% of the instructions are data transfers:
  Miss rate (16 KB data) = (40.9/1000) / 0.36 = 0.114
- Overall miss rate, since 74% of memory accesses are instruction references:
  Miss rate (split cache) = 0.74 x 0.004 + 0.26 x 0.114 = 0.0324

Split vs. unified cache IV
- Machine 2: the unified cache needs to account for both the instruction fetch and the data access:
  Miss rate (32 KB unified) = (43.3/1000) / (1 + 0.36) = 0.0318
- Answer to question 1: the 2nd architecture has a lower miss rate

Split vs. unified cache V
- Average memory access time (AMAT):
  AMAT = %instructions x (Hit time + Instruction miss rate x Miss penalty)
       + %data x (Hit time + Data miss rate x Miss penalty)
- Machine 1:
  AMAT1 = 0.74 x (1 + 0.004 x 100) + 0.26 x (1 + 0.114 x 100) = 4.24
- Machine 2:
  AMAT2 = 0.74 x (1 + 0.0318 x 100) + 0.26 x (1 + 1 + 0.0318 x 100) = 4.44
- Answer to question 2: the 1st machine has a lower average memory access time

Direct mapped vs. set associative
- Assumptions:
  - CPI without cache misses (perfect cache): 2.0
  - No. of memory references per instruction: 1.5
  - Cache size: 64 KB
  - Machine 1: direct-mapped cache, clock cycle time 1 ns, miss rate 1.4%
  - Machine 2: 2-way set associative cache, clock cycle time 1.25 ns, miss rate 1.0%
  - Cache miss penalty: 75 ns; hit time: 1 clock cycle

Direct mapped vs. set associative II
- Average memory access time (AMAT):
  AMAT = Hit time + Miss rate x Miss penalty
  AMAT1 = 1.0 + 0.014 x 75 = 2.05 ns
  AMAT2 = 1.25 + 0.010 x 75 = 2.0 ns
  -> avg. memory access time is better for the 2-way set associative cache

Direct mapped vs. set associative III
- CPU performance:
  CPU time = IC x (CPI_exec + Misses/instruction x Miss penalty) x Clock cycle time
           = IC x (CPI_exec x Clock cycle time
                   + Miss rate x Memory accesses/instruction x Miss penalty x Clock cycle time)
  CPU time1 = IC x (2 x 1.0 + 1.5 x 0.014 x 75) = 3.58 IC
  CPU time2 = IC x (2 x 1.25 + 1.5 x 0.01 x 75) = 3.63 IC
  -> the direct-mapped cache leads to better CPU time

Processor Performance
- CPU equation:
  CPU time = (Clock cycles CPU execution + Clock cycles memory stall) x Clock cycle time
- Can avg. memory access time really be mapped to CPU time?
  - Not all memory stall cycles are due to cache misses (we ignore that on the following slides)
  - Depends on the processor architecture: in-order vs. out-of-order execution
    - For out-of-order processors one needs the visible portion of the miss penalty:
      Memory stall cycles/instruction = Misses/instruction x (Total miss latency - Overlapped miss latency)

Reducing cache miss penalty
- Five techniques:
  - Multilevel caches
  - Critical word first and early restart
  - Giving priority to read misses over writes
  - Merging write buffer
  - Victim caches

Multilevel caches I
- Dilemma: should the cache be fast or should it be large?
- Compromise: multilevel caches
  - 1st level: small, but at the speed of the CPU
  - 2nd level: larger, but slower
- Avg. memory access time = Hit time (L1) + Miss rate (L1) x Miss penalty (L1), and
  Miss penalty (L1) = Hit time (L2) + Miss rate (L2) x Miss penalty (L2)

Multilevel caches II
- Local miss rate: ratio of the number of misses in a cache to the total number of accesses to that cache
- Global miss rate: ratio of the number of misses in a cache to the number of memory accesses generated by the CPU
  - 1st level cache: global miss rate = local miss rate
  - 2nd level cache: global miss rate = Miss rate (L1) x Miss rate (L2)
- Design decisions for the 2nd level cache:
  1. Direct mapped or n-way set associative?
  2. Size of the 2nd level cache?
- Assumptions in order to decide question 1:
  - Hit time of the L2 cache: direct-mapped cache: 10 clock cycles; 2-way set associative cache: 10.1 clock cycles
  - Local miss rate of L2: direct-mapped cache: 25%; 2-way set associative: 20%
  - Miss penalty of the L2 cache: 100 clock cycles
  Miss penalty (direct-mapped L2) = 10 + 0.25 x 100 = 35 clock cycles
  Miss penalty (2-way assoc. L2) = 10.1 + 0.2 x 100 = 30.1 clock cycles

Multilevel caches III
- Multilevel inclusion: the 2nd level cache includes all data items which are in the 1st level cache
  - Applied if the size of the 2nd level cache >> the size of the 1st level cache
- Multilevel exclusion: data of the L1 cache is never in the L2 cache
  - Applied if the size of the 2nd level cache is only slightly bigger than the size of the 1st level cache
  - A cache miss in L1 often leads to a swap of an L1 block with an L2 block

Critical word first and early restart
- In case of a cache miss, an entire cache block has to be loaded from memory
- Idea: don't wait until the entire cache block has been loaded; focus on the required data item
- Critical word first:
  - Ask for the required data item first
  - Forward the data item to the processor
  - Fill up the rest of the cache block afterwards
- Early restart:
  - Fetch the words of a cache block in normal order
  - Forward the requested data item to the processor as soon as it is available
  - Fill up the rest of the cache block afterwards

Giving priority to read misses over writes
- Write-through caches use a write buffer to speed up write operations
- The write buffer might contain a value required by a subsequent load operation
- Two possibilities for ensuring consistency:
  - A read resulting in a cache miss has to wait until the write buffer is empty
  - Check the contents of the write buffer and take the data item from the write buffer if it is available
- A similar technique is used in case of a cache-line replacement for n-way set associative caches

Merging write buffers
- Check in the write buffer whether multiple entries can be merged into a single one:

  Before merging (one valid word per entry):
    Address 100  V=1  Mem[100]
    Address 108  V=1  Mem[108]
    Address 116  V=1  Mem[116]
    Address 124  V=1  Mem[124]

  After merging (four valid words in one entry):
    Address 100  V=1  Mem[100]  Mem[108]  Mem[116]  Mem[124]

Victim caches
- Question: how often is a cache block which has just been replaced by another cache block required again soon after that?
- Victim cache: a fully associative cache between the real cache and the memory, keeping blocks that have been discarded from the cache
- Typically very small

Reducing miss rate
- Three categories of cache misses:
  - Compulsory misses: the first access to a block cannot be in the cache (cold-start misses)
  - Capacity misses: the cache cannot contain all blocks required for the execution -> increase the cache size
  - Conflict misses: a cache block has to be discarded because of the block replacement strategy -> increase the cache size and/or associativity

Reducing miss rate II
- Five techniques to reduce the miss rate:
  - Larger cache block size
  - Larger caches
  - Higher associativity
  - Way prediction and pseudo-associative caches
  - Compiler optimization