This 123-page set of class notes was uploaded by Kathleen Cartwright on Monday, September 28, 2015. The notes belong to CIS371 at the University of Pennsylvania, taught by A. Roth in Fall. Since its upload, it has received 37 views. For similar materials see /class/215367/cis371-university-of-pennsylvania in Computer Information Technology at the University of Pennsylvania.
Date Created: 09/28/15
Instruction Set Architectures

CIS 371: Computer Organization and Design
Unit 1: Instruction Set Architectures
CIS 371 (Roth/Martin): Instruction Set Architectures

[Diagram: App | App | App over System software over Mem | CPU | I/O]

This unit:
- What is an ISA?
- And what is a good ISA?
- Aspects of ISAs
- RISC vs. CISC
- Compatibility is a powerful force
- Tricks: binary translation, µISAs

Readings
- Introduction: P&H Chapter 1
- ISAs: P&H Chapter 2
- "Instruction" -> "insn": "instruction" is too long to write on slides

What Is An ISA?
- ISA (instruction set architecture)
  - A well-defined hardware/software interface
  - The "contract" between software and hardware
    - Functional definition of the operations, modes, and storage locations supported by the hardware
    - Precise description of how to invoke and access them
- Not in the contract
  - How operations are implemented
  - Which operations are fast and which are slow, and when
  - Which operations take more power and which take less

A Language Analogy for ISAs
- Communication: person-to-person <-> software-to-hardware
- Similar structure
  - Narrative <-> program
  - Sentence <-> insn
  - Verb <-> operation (add, multiply, load, branch)
  - Noun <-> data item (immediate, register value, memory value)
  - Adjective <-> addressing mode
- Many different languages, many different ISAs
  - Similar basic structure; details differ, sometimes greatly
- Key differences between languages and ISAs
  - Languages evolve organically: many ambiguities and inconsistencies
  - ISAs are explicitly engineered and extended: unambiguous

The Sequential Model
- Implicit model of all modern ISAs
  - Often called "von Neumann", but it was in ENIAC before
- Basic feature: the program counter (PC)
  - Defines a total order on dynamic instructions
    - Next PC is PC++ unless the insn says otherwise
  - Order and named storage define the computation
    - A value flows from insn X to insn Y via storage A iff X names A as an output, Y names A as an input, and Y comes after X in the total order
- The processor logically executes the loop: fetch, decode, read inputs, execute, write output, next PC
  - Instruction execution is assumed atomic: insn X finishes before insn X+1 starts
  - More parallel alternatives have been proposed

ISA Design Goals: What Is A Good ISA?
- Lends itself to high-performance implementations
  - Every ISA can be implemented; not every ISA can be implemented well
- Background: the CPU performance equation
  - Execution time = seconds/program
  - Convenient to factor into three pieces:
    - (insns/program) * (cycles/insn) * (seconds/cycle)
  - Insns/program: dynamic insns executed
  - Seconds/cycle: clock period
  - Cycles/insn: CPI (hmmm...)
- For high performance, all three factors should be low

Insns/Program: Compiler Optimizations
- Compilers do two things
  - Translate high-level languages to assembly (functionality)
    - Deterministic, with fast compile times (gcc -O0)
    - Canonical; not an active research area (CIS 341)
  - Optimize the generated assembly code
    - "Optimize"? Hard to prove optimality in a complex system
    - In systems, "optimize" means "improve (hopefully)"
    - Involved, with relatively slow compile times (gcc -O4)
    - Some aspects reverse-engineer programmer intention
    - Not canonical; being actively researched (CIS 570)

Seconds/Cycle and Cycles/Insn: Hmmm...
- For a single-cycle datapath
  - Cycles/insn is 1, by definition
  - Seconds/cycle is proportional to the complexity of the datapath
  - An ISA can make seconds/cycle high by requiring a complex datapath

Compiler Optimizations
- Primarily reduce insn count
  - Eliminate redundant computation; keep more things in registers
    - Registers are faster: fewer loads/stores
    - An ISA can make this difficult by having too few registers
- But also
  - Reduce branches and jumps (later)
  - Reduce cache misses (later)
  - Reduce dependences between nearby insns (later)
    - An ISA can make this difficult by having implicit dependences
- How effective are these?
  - Can give 4X performance over unoptimized code
  - Collective wisdom of 40 years ("Proebsting's Law"): 4% per year
  - Funny, but you shouldn't leave 4X performance on the table

Foreshadowing: Pipelining
- Sequential model: insn X finishes before insn X+1 starts
  - An illusion designed to keep programmers sane
- Pipelining: an important performance technique
  - Hardware overlaps the processing stages of successive insns
  - Variable insn length/format makes pipelining difficult
  - Complex datapaths also make pipelining difficult (or the clock slow)
- More about this later
  [Figure: pipeline diagram, insns 0-5 overlapping fetch/decode/read-inputs/execute/write-output stages]

Instruction Granularity: RISC vs. CISC
- RISC (Reduced Instruction Set Computer) ISAs
  - Minimalist approach to an ISA: simple insns only
    + Low cycles/insn and seconds/cycle
    - Higher insns/program, but hopefully not by as much
  - Rely on compiler optimizations
- CISC (Complex Instruction Set Computing) ISAs
  - A more heavyweight approach: both simple and complex insns
    + Low insns/program
    - Higher cycles/insn and seconds/cycle
  - "We have the technology" to get around this problem
- More on this later; but first, ISA basics

Length and Format
- Length
  - Fixed length
    - Most common is 32 bits
    + Simple implementation: next PC is often just PC+4
    - Code density: 32 bits to increment a register by 1
  - Variable length
    + Code density: x86 can do an increment in one 8-bit instruction
    - Complex fetch: where does the next instruction begin?
  - Compromise: two lengths
    - E.g., MIPS16 or ARM's Thumb
- Encoding
  - A few simple encodings simplify the decoder
  - The x86 decoder is one of the nastiest pieces of logic around

Aspects of ISAs
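The decode advantage of a fixed 32-bit length can be made concrete: every field sits at a fixed bit position, so decoding is a handful of shifts and masks. A minimal Python sketch using the MIPS R-type field widths (the hand-built instruction word below and the helper name are only illustrative):

```python
def decode_rtype(insn):
    """Split a 32-bit MIPS R-type word into its six fixed-position fields."""
    return {
        "op":    (insn >> 26) & 0x3F,  # bits 31-26
        "rs":    (insn >> 21) & 0x1F,  # bits 25-21
        "rt":    (insn >> 16) & 0x1F,  # bits 20-16
        "rd":    (insn >> 11) & 0x1F,  # bits 15-11
        "shamt": (insn >> 6)  & 0x1F,  # bits 10-6
        "func":  insn & 0x3F,          # bits 5-0
    }

# add $8, $9, $10 encodes as op=0, rs=9, rt=10, rd=8, shamt=0, func=0x20
word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 0x20
fields = decode_rtype(word)
assert fields["rs"] == 9 and fields["rt"] == 10 and fields["rd"] == 8
```

A variable-length ISA like x86 cannot do this: the decoder must examine prefixes and opcode bytes before it even knows where the instruction ends.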
LC-3/MIPS/x86: Length and Format
- LC-3: 2-byte insns, 3 formats (LC4 similar)
  [Field diagrams garbled in extraction]
- MIPS: 4-byte insns, 3 formats
  - R-type: Op(6) Rs(5) Rt(5) Rd(5) Sh(5) Func(6)
  - I-type: Op(6) Rs(5) Rt(5) Imm(16)
  - J-type: Op(6) Target(26)
- x86: 1-16 byte insns
  - Prefix*(1-4) Op OpExt* ModRM* SIB* Disp*(1-4) Imm*(1-4)

Operations and Datatypes
- Datatypes
  - Software: an attribute of data
  - Hardware: an attribute of the operation; the data is just 0s and 1s
- All processors support
  - 2C (two's complement) integer arithmetic/logic (8/16/32/64-bit)
  - IEEE 754 floating-point arithmetic (32/64-bit)
    - Intel has 80-bit floating-point
- More recently, most processors support
  - Packed-integer insns, e.g., MMX
  - Packed-floating-point insns, e.g., SSE/SSE2
  - For multimedia; more about these later
- Processors no longer support
  - Decimal and other fixed-point arithmetic

LC4/MIPS/x86: Operations and Datatypes
- LC4
  - 16-bit integer: add, and, not, sub, mul, div, or, xor, shifts
  - No floating-point
- MIPS
  - 32/64-bit integer: add, sub, mul, div, shift, rotate, and, or, not, xor
  - 32/64-bit floating-point: add, sub, mul, div
- x86
  - 32/64-bit integer: add, sub, mul, div, shift, rotate, and, or, not, xor
  - 80-bit floating-point: add, sub, mul, div, sqrt
  - 64-bit packed integer (MMX): padd, pmul
  - 64/128-bit packed floating-point (SSE/SSE2): padd, pmul

Where Does Data Live?
- Memory
  - The fundamental storage space
- Registers
  - Faster than memory; quite handy
  - Most processors have these too
- Immediates
  - Values spelled out as bits in the instructions themselves
  - Input only

How Many Registers?
- Registers are faster than memory, so have as many as possible? No
  - One reason registers are faster: there are fewer of them
    - "Small is fast" (hardware truism)
  - Another: they are directly addressed (no address calculation)
    - More of them means larger specifiers
    - Fewer registers per instruction, or indirect addressing
  - Not everything can be put in registers
    - Structures, arrays, anything "pointed to"
    - Although compilers are getting better at putting more things in registers
  - More registers also means more saving/restoring
- Trend: more registers: 8 (x86) -> 32 (MIPS) -> 128 (IA-64)
  - 64-bit x86 has 16 64-bit integer and 16 128-bit FP registers

LC4/MIPS/x86: Registers
- LC4
  - 8 16-bit integer registers
  - No floating-point registers
- MIPS
  - 32 32-bit integer registers ($0 hardwired to 0)
  - 32 32-bit floating-point registers (or 16 64-bit registers)
- x86
  - 8 8/16/32-bit integer registers (not general purpose)
  - No floating-point registers (a stack instead)
- 64-bit x86
  - 16 64-bit integer registers
  - 16 128-bit floating-point registers

LC4/MIPS/x86: Memory Size
- LC4
  - 16-bit: 2^16 16-bit words, x2 (split data and instruction memories)
- MIPS
  - 32-bit or 64-bit
- x86
  - 8086: 16-bit; 80286: 24-bit; 80386: 32-bit
  - AMD Opteron/Athlon64, Intel's newer Pentium4 and Core 2: 64-bit

How Much Memory? Address Size
- What does the "64-bit" in a 64-bit ISA mean?
  - Support for a memory size of 2^64
  - Alternative (wrong) definition: the width of calculation operations
- Virtual address size
  - Determines the size of addressable (usable) memory
  - Current: 32-bit or 64-bit address spaces
- The most critical, inescapable ISA design decision
  - Too small? It will limit the lifetime of the ISA
  - May require nasty hacks to overcome, e.g., x86 segments
- x86 evolution: 4-bit (4004), 8-bit (8008), 16-bit (8086), 24-bit (80286), 32-bit with protected memory (80386), 64-bit (AMD's Opteron and Intel's EM64T Pentium4)
- All ISAs are moving to 64 bits, if not already there

How Are Memory Locations Specified?
- Registers are specified directly
  - Register names are short and can be encoded in instructions
  - Some instructions implicitly read/write certain registers
- How are addresses specified?
  - Addresses are as big as or bigger than insns
  - Addressing mode: how are insn bits converted to addresses?
  - Think about what high-level idiom each addressing mode captures

Memory Addressing
- Addressing mode: a way of specifying an address
  - Used in memory-memory instructions, or in load/store instructions in a register ISA
- Examples
  - Register-indirect: R1 = mem[R2]
  - Displacement: R1 = mem[R2 + immed]
  - Index-base: R1 = mem[R2 + R3]
  - Memory-indirect: R1 = mem[mem[R2]]
  - Auto-increment: R1 = mem[R2]; R2 = R2 + 1
  - Auto-indexing: R1 = mem[R2 + immed]; R2 = R2 + immed
  - Scaled: R1 = mem[R2 + R3*immed1 + immed2]
  - PC-relative: R1 = mem[PC + imm]
- What high-level program idioms are these used for?
- What is the implementation impact? What is the impact on insn count?

LC4/MIPS/x86: Addressing Modes
- MIPS
  - Displacement only: R1 + offset (16-bit)
  - Experiments showed this covered 80% of accesses on VAX
- LC4
  - Displacement: R1 + offset (6-bit)
  - LC-3 had two more modes
    - PC-displacement: PC + offset (9-bit)
    - Memory-indirect PC-displacement: mem[PC + offset (9-bit)]
- x86 (MOV instructions)
  - Absolute: zero + offset (8/16/32-bit)
  - Register-indirect: R1
  - Indexed: R1 + R2
  - Displacement: R1 + offset (8/16/32-bit)
  - Scaled: R1 + R2*Scale + offset (8/16/32-bit), Scale = 1, 2, 4, 8

MIPS Addressing Modes
- MIPS implements only displacement
  - Why? An experiment on the VAX ISA, which had every mode, found this distribution:
    - Displacement 61%, register-indirect 19%, scaled 11%, memory-indirect 5%, other 4%
    - 80% used a small displacement or register-indirect (displacement of 0)
- I-type instructions: 16-bit displacement (Op(6) Rs(5) Rt(5) Imm(16))
  - Is 16 bits enough? Yes: the VAX experiment showed only 1% of accesses use a displacement > 16 bits
- SPARC adds a Reg+Reg mode
  - Why? What is the impact on implementation and on insn count?

Two More Addressing Issues
- Access alignment: is the address a multiple of the access size?
  - Aligned: load-word from XXXX00, load-half from XXXXX0
  - Unaligned: load-word from XXXX10, load-half from XXXXX1
- Question: what to do with unaligned accesses (the uncommon case)?
  - Support them in hardware? Makes all accesses slow
  - Trap to a software routine? A possibility
    - Use regular instructions: load, shift, load, shift, and
  - MIPS ISA support: unaligned access using two instructions (lwl @XXXX10, lwr @XXXX10)
- Endianness: the arrangement of bytes within a word
  - Big-endian: "sensible" order, e.g., MIPS, PowerPC
    - The 4-byte integer "00000000 00000000 00000010 00000011" is 515
  - Little-endian: reverse order, e.g., x86
    - The 4-byte integer "00000011 00000010 00000000 00000000" is 515
  - Why little-endian? To be different? To be annoying? Nobody knows

How Many Explicit Operands? (ALU Insns)
- Operand model: how many explicit operands does an ALU insn have?
  - 3: general-purpose: add R1,R2,R3 means R1 = R2 + R3 (MIPS uses this)
  - 2: multiple explicit accumulators: add R1,R2 means R1 = R1 + R2 (x86 uses this)
  - 1: one implicit accumulator: add R1 means ACC = ACC + R1
  - 0: hardware stack: add means STK[TOS] = STK[TOS] + STK[TOS-1]
  - 4+: useful only in special situations
- The examples show register operands
  - But operands can be memory addresses, or mixed register/memory
  - ISAs with register-only ALU insns are called "load-store"

Operand Model: Pros and Cons
- Metric I: static code size
  - Want many implicit operands (stack), high-level insns
- Metric II: data memory traffic
  - Want as many long-lived operands as possible in on-chip storage (load-store)
- Metric III: CPI
  - Want short latencies with little variability (load-store)
- CPI and data memory traffic matter more these days
  - In most niches
- Trend: most new ISAs are load-store or hybrids

How Do Values Get From/To Memory?
- How do values move between memory (primary storage) and registers/accumulators/stack?
  - Assume displacement addressing for these examples
- Registers: load/store
  - load r1,8(r2) means R1 = mem[R2+8]
  - store r1,8(r2) means mem[R2+8] = R1
- Accumulator: load/store
  - load 8(r2) means ACC = mem[R2+8]
  - store 8(r2) means mem[R2+8] = ACC
- Stack: push/pop
  - push 8(r2) means STK[++TOS] = mem[R2+8]
  - pop 8(r2) means mem[R2+8] = STK[TOS--]
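The 515 endianness example above is easy to reproduce with Python's struct module, which can pack the same integer in either byte order (a sketch for illustration; not part of the original slides):

```python
import struct

# 515 = 0x00000203: the same 4-byte integer in both byte orders
big    = struct.pack(">i", 515)   # big-endian (MIPS, PowerPC)
little = struct.pack("<i", 515)   # little-endian (x86)

assert big    == b"\x00\x00\x02\x03"   # most significant byte first
assert little == b"\x03\x02\x00\x00"   # least significant byte first
assert big == little[::-1]             # each order is the reverse of the other
```

Both byte strings decode back to 515 when unpacked with the matching format character, which is exactly why mixed-endian data exchange needs an agreed-upon order.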
LC4/MIPS/x86: Operand Models
- LC4
  - Integer: 8 general-purpose registers, load-store
  - Floating-point: none
- MIPS
  - Integer/floating-point: 32 general-purpose registers, load-store
- x86
  - Integer: 8 registers; reg-reg, reg-mem, mem-reg, but no mem-mem
  - Floating-point: a stack (why x86 floating-point lagged for years)
  - Note: integer push and pop for managing the software stack
  - Note also: reg-mem and mem-mem string functions in hardware
- x86-64
  - Integer/floating-point: 16 registers

Control Transfers
- Default: next PC is PC + sizeof(current insn)
- Branches and jumps can change that
  - Otherwise the dynamic program equals the static program: not useful
- Computing targets: where to jump to
  - For all branches and jumps
  - Absolute, PC-relative, indirect
- Testing conditions: whether to jump at all
  - For conditional branches only
  - Compare-and-branch, condition codes, condition registers

Control Transfers I: Computing Targets
- The issues
  - How far (statically) do you need to jump?
    - Not far within a procedure; further from one procedure to another
  - Do you need to jump to a different place each time?
- PC-relative
  - Position-independent within a procedure
  - Used for branches and jumps within a procedure
- Absolute
  - Position-independent outside a procedure
  - Used for procedure calls
- Indirect (target found in a register)
  - Needed for jumping to dynamic targets
  - Used for returns, dynamic procedure calls, switch statements

Control Transfers II: Testing Conditions
- Compare-and-branch insns: branch-less-than R1,10,target
  + Simple
  - Two ALUs: one for the condition, one for the target address
  - Extra latency
- Implicit condition codes (x86, LC4): subtract R2,R1,10 sets a "negative" CC; branch-neg target
  + Condition codes are set "for free"
  - The implicit dependence is tricky
- Conditions in registers, separate branch (MIPS): set-less-than R2,R1,10; branch-not-equal-zero R2,target
  - Additional insns
  + One ALU per insn
  + Explicit dependence

LC4/MIPS/x86: Control Transfers
- LC4
  - 9-bit offset PC-relative branches/jumps; uses condition codes
  - 11-bit offset PC-relative calls, and indirect calls
- MIPS
  - 16-bit offset PC-relative conditional branches; uses a register for the condition
    + Simple branches: compare two registers (beq, bne) or compare a register to zero (bgtz, bgez, bltz, blez)
    + Don't need an adder for these, and they cover ~80% of cases
  - Explicit "set condition into register" insns: slt, sltu, slti, sltiu, etc.
  - 26-bit target absolute jumps and function calls
- x86
  - 8-bit offset PC-relative branches; uses condition codes
  - Explicit compare instructions set the condition codes
  - 8/16-bit target absolute jumps and function calls, "within segment"
  - "Far" jumps and calls change the code segment, for longer jumps

Later ISAs Include Support For...
- Operating systems and memory protection
  - Privileged mode
  - System call (TRAP)
  - Exceptions and interrupts
  - Interacting with I/O devices
- Multiprocessors
  - "Atomic" operations for synchronization
- Data-level parallelism
  - Pack many values into a wide register
    - Intel's SSE2: four 32-bit floating-point values in a 128-bit register
  - Define parallel operations: four adds in one cycle

RISC and CISC
- RISC: reduced-instruction-set computer
  - Coined by Patterson in the early 80's
  - Berkeley RISC-I (Patterson), Stanford MIPS (Hennessy), IBM 801 (Cocke)
  - Examples: PowerPC, ARM, SPARC, Alpha, PA-RISC
- CISC: complex-instruction-set computer
  - The term didn't exist before "RISC"
  - Examples: x86, VAX, Motorola 68000, etc.
- A philosophical war (one of several) started in the mid 1980's
  - RISC won the technology battles
  - CISC won the high-end commercial war (1990s to today)
    - Compatibility proved a stronger force than anyone but Intel thought
  - RISC won the embedded computing war

The RISC vs. CISC Debate
The Setup
- Pre-1980
  - Bad compilers, so assembly was written by hand
  - Complex, high-level ISAs made assembly easier to write
  - Slow, multi-chip, microprogrammed implementations
  - A vicious feedback loop
- Around 1982
  - Moore's Law makes a single-chip microprocessor possible...
  - ...but only for a small, simple ISA
  - The performance advantage of this integration was compelling
  - Compilers had to get involved in a big way
- The RISC manifesto: create ISAs that...
  - Simplify single-chip implementation
  - Facilitate optimizing compilation

The RISC Tenets
- Single-cycle execution (CISC: many multi-cycle operations)
- Hardwired control (CISC: microcoded multi-cycle operations)
- Load/store architecture (CISC: register-memory and memory-memory)
- Few memory addressing modes (CISC: many modes)
- Fixed instruction format (CISC: many formats and lengths)
- Reliance on compiler optimizations (CISC: hand-assemble to get good performance)
- Many registers, because compilers are better at using them (CISC: few registers)

The Debate
- RISC argument
  - CISC is fundamentally handicapped
  - For a given technology, the RISC implementation will be better (faster)
    - Current technology enables single-chip RISC
    - When it enables single-chip CISC, RISC will be pipelined
    - When it enables pipelined CISC, RISC will have caches
    - When it enables CISC with caches, RISC will have the next thing...
- CISC rebuttal
  - CISC flaws are not fundamental; they can be fixed with more transistors
  - Moore's Law will narrow the RISC/CISC gap (true)
    - A good pipeline: RISC = 100K transistors, CISC = 300K
    - By 1995, 2M transistors had evened the playing field
  - Software costs dominate; compatibility is paramount

CISCs and RISCs
- The CISCs: x86, VAX (Virtual Address eXtension to PDP-11)
  - Variable-length instructions: 1-321 bytes!
  - 14 GPRs + PC + stack pointer + condition codes
  - Data sizes: 8, 16, 32, 64, 128 bit, decimal, string
  - Memory-memory instructions for all data sizes
  - Special insns: crc, insque, polyf, and a cast of hundreds
  - x86: "difficult to explain and impossible to love"
- The RISCs: MIPS, PA-RISC, SPARC, PowerPC, Alpha
  - 32-bit instructions
  - 32 integer registers, 32 floating-point registers; load-store
  - 64-bit virtual address space
  - Few addressing modes: Alpha has one, SPARC/PowerPC have a few more
  - Why so many basically similar ISAs? Everyone wanted their own

Compatibility
- No one buys new hardware if it requires new software
  - Intel greatly benefited from this (IBM too)
  - The ISA must remain compatible, no matter what
    - x86 is one of the worst-designed ISAs EVER, but it survives
    - As does IBM's 360/370, the first ISA family
- Backward compatibility
  - New processors must support old programs; you can't drop features
  - Very important
- Forward (upward) compatibility
  - Old processors must support new programs, with software help
  - New processors redefine only previously-illegal opcodes
  - Allow software to detect support for specific new instructions
  - Old processors emulate new instructions in low-level software

Intel's Compatibility Trick: RISC Inside
- 1993: Intel wanted out-of-order execution in the Pentium Pro
  - Hard to do with a coarse-grain ISA like x86
- Solution: translate x86 into RISC micro-ops (µops) in hardware
  - push eax becomes (we think; µops are proprietary):
    store eax, -4(esp)
    addi esp, esp, -4
  - The processor maintains the x86 ISA externally, for compatibility
  - But executes a RISC µISA internally, for implementability
- Given the translator, x86 is almost as easy to implement as RISC
  - Intel implemented out-of-order before any RISC company
  - Out-of-order also benefits x86 more, because the ISA limits the compiler
- The idea was co-opted by other x86 companies: AMD and Transmeta

More About Micro-ops
- Even better: two forms of hardware translation
  - Hardcoded logic: fast, but complex
  - Table: slow, but "off to the side"; doesn't complicate the rest of the machine
- x86: an average of 1.6 µops per x86 insn
  - Logic for the common insns that translate into 1-4 µops
  - Table for the rare insns that translate into 5+ µops
- x86-64: an average of 1.1 µops per x86 insn
  - More registers (parameters can be passed in them too): fewer pushes/pops
  - Core 2: logic for 1-2 µops, table for 3+ µops
- More recent: "macro-op fusion" and "micro-op fusion"
  - Intel's recent processors fuse certain instruction pairs

Translation and Virtual ISAs
- New compatibility interface: ISA plus translation software
  - Binary translation: transform the static image, run native
  - Emulation: unmodified image, interpret each dynamic insn
    - Typically optimized with just-in-time (JIT) compilation
  - Examples: FX!32 (x86 on Alpha), Rosetta (PowerPC on x86)
  - Performance overheads are reasonable; many recent advances
- Transmeta's "code morphing": a translation layer
  - Performed by a software layer below the OS
  - Looks like x86 to the OS and applications; a different ISA underneath
- Virtual ISAs: designed for translation, not direct execution
  - Target for the high-level compiler (one per language)
  - Source for the low-level translator (one per ISA)
  - Goals: portability (abstract away hardware nastiness), flexibility over time
  - Examples: Java bytecodes; C# CLR (Common Language Runtime)

Ultimate Compatibility Trick
- Support an old ISA by having a simple processor for that ISA somewhere in the system
  - How the first Itanium supported x86 code: an x86 processor (comparable to a Pentium) on chip
  - How the PlayStation 2 supported PlayStation games: used the PlayStation processor as its I/O chip and for emulation

Current Winner (Revenue): CISC
- x86 was the first 16-bit chip, by about 2 years
  - IBM put it into its PCs because there was no competing choice
  - The rest is historical inertia and financial feedback
- x86 is the most difficult ISA to implement fast, but...
  - Because Intel sells the most non-embedded processors
  - It has the most money
  - Which it uses to hire more and better engineers
  - Which it uses to maintain competitive performance
  - And given competitive performance, compatibility wins
  - So Intel sells the most non-embedded processors...
- AMD as a competitor keeps pressure on x86 performance
- Moore's Law has helped Intel in a big way
  - Most engineering problems can be solved with more transistors

Current Winner (Volume): RISC
- ARM (Acorn RISC Machine -> Advanced RISC Machine)
  - First ARM chip in the mid-1980s, from Acorn Computer Ltd.
  - 1.2 billion units sold in 2004: more than 50% of all 32/64-bit CPUs
  - Low-power and embedded devices: the iPod, for example
  - Significance of embedded: ISA compatibility is a less powerful force
- 32-bit RISC ISA
  - 16 registers; the PC is one of them
  - Many addressing modes, e.g., auto-increment
  - Condition codes; each instruction can be conditional
- Multiple implementations
  - XScale (the design was DEC's, bought by Intel, sold to Marvell)
  - Others: Freescale (was Motorola), Texas Instruments, STMicroelectronics, Samsung, Sharp, Philips, etc.

Aside: Post-RISC: VLIW and EPIC
- ISAs explicitly targeted at multiple-issue (superscalar) cores
- VLIW: Very Long Insn Word
  - Later rebranded as EPIC: Explicitly Parallel Insn Computing
- Intel/HP IA-64 (Itanium, 2000)
  - EPIC: 128-bit, 3-operation bundles
  - 128 64-bit registers
  - Some neat features: full predication, explicit cache control
    - Predication: every instruction is conditional, to avoid branches
  - But lots of difficult-to-use baggage as well: software speculation
    - Every new ISA feature suggested in the last two decades
  - Relies on younger, less mature compiler technology
  - Not doing well commercially

Redux: Are ISAs Important?
- Does the quality of an ISA actually matter?
  - Not for performance, mostly
  - Mostly it comes down to design complexity
    - Insns/program: everything is compiled, and compilers are good
    - Cycles/insn and seconds/cycle: µISAs and many other tricks
- What about power efficiency? Maybe
  - ARMs are the most power-efficient today...
  - ...but Intel is moving x86 that way, e.g., Intel's Atom
- Does the
nastiness of ISA matter 0 Mostly no only compiler writers and hardware designers see it 0 Even compatibility is not what it used to be 0 Software emulation C15 371 RothMartin Instruction Set Architectures Summary App App App 0 What is an ISA System software 0 A functional contract 0 All ISAs are basically the same 0 But many design choices in details 0 Two philosophies CISCRISC 0 Good ISA enables highperformance 0 At least doesn t get in the way 0 Compatibility is a powerful force 0 Tricks binary translation pilSAs Mem CPU HO 0 Next singlecycle datapathcontrol C15 371 RomMartin Instruction Set Architectures 53 C15 371 Computer Organization and Design Unit 2 SingleCycle Datapath and Control Part 1 of 2 Digitial Logic Review 215371 RomMartin Datapath and Control 1 Readings 0 PH Chapter5 215371 RomMartin Datapath and Control 3 This Unit SingleCycle Datapaths 0 Digital logic basics System SO Ware Focus on useful components 0 Mapping an ISA to a datapath MIPS example 0 Singlecycle control 0 Implementing exceptions using control Mem CPU lO 215371 RomMartin Datapath and Control 2 So You Have an ISA o not useful without a piece of hardware to execute it 215371 RomMartin Datapath and Control 4 Implementing an ISA Datapath performs computation registers ALUs etc 0 ISA specific can implement every insn singlecycle in one pass 0 Control determines which computation is performed Routes data through datapath which regs which ALU Op 0 Fetch get insn translate opcode into control 0 Fetch gt Decode gt Execute cycle 215371 RomMartin Datapalh and Control 5 Building Blocks Logic Gates 0 Logic gates implement Boolean functions 0 Basic gates NOT NAND NOR Underlying CMOS transistors are naturally inverting Q NOT NOT INV NAND NOR A po A I D AB 23 May NAND NOR are Boolean complete BUF AND OR A A AB us Bap As 23 AND3 ANDNOT XOR A A A AB AB A B B B B Ana 7 C 215371 RomMartin Datapalh and Control Two Types of Components datapath o Purely combinational stateless computation ALUs muxes control 
Arbitrary Boolean functions 0 CombinationaIsequential storage 0 PC insndata memories register file Internally contain some combinational components 215371 RomMartin Datapalh and Control 6 Boolean Functions and Truth Tables 0 Any Boolean function can be represented as a truth table 0 Truth table pointwise input gt output mapping 0 Function is dig unction of all rows in which 0 is 1 gt gt gt gt oooo y lllllllll i gt gt oooooo Example above 0 AB C ABC ABC 215371 RomMartin Datapalh and Control 8 Truth Tables and PLAs 0 Implement Boolean function by implementing its truth table Takes two levels of logic Assumes inputs and inverses of inputs are available usually are 0 First level ANDs product terms 0 Second level ORs sums of product terms 0 PLA programmable logic array 0 Flexible circuit for doing this C15371 RomMartin Datapalh and Conlrol Boolean Algebra 0 Boolean Algebra rules for rewriting Boolean functions 0 Useful for simplifying Boolean functions Simplifying reducing gate count reducing gate levels 0 Rules similar to logic 01 FT 0 Identity A1 A A0 A 0 01 A0 0A1 1 Inverses A A Idempotency AA A AA A Tautology AA 0 AA 1 Commutativity AB BA AB BA Associativity ABC ABC ABC ABC Distributivity ABC ABAC ABC ABAC DeMorgan s AB A B AB A B C15371 RomMartin Datapalh and Conlrol PLA Example 0 PLA with 3 inputs 2 outputs and 4 product terms 00 AB C ABC ABC Permanent connections Programmable connections unconnected C15371 RomMartin Datapalh and Conlrol Logic Minimization 0 Logic minimization Iterative application of rules to reduce function to simplest form 0 There are tools for automatically doing this 0 Example below function from slide 8 0 AB C ABC ABC 0 AB C BC BC distributivity 0 AB C BC BC associativity 0 AB C BC C distributivity on B 0 AB C B1 tautology 0 AB C B 01 0 AB BCB distributivity on B 0 A1BC tautology 0 ABC 01 C15371 RomMartin Datapalh and Conlrol NonArbitrary Boolean Functions Decoder o PLAs implement Boolean functions pointwise Eg represent fX X5 as 0amp5 166 
267 368 Mainly useful for arbitrarY functions no compact representation 0 Many useful Boolean functions are not arbitrary Have a compact representation Eg represent fX X5 as X5 Examples Decoder Multiplexer Adder eg X5 or more generally XY 215371 RomMartin Datapaih and Conirol 13 Multiplexer Mux Decoder converts binary integer to 1hot representation Binary representation of 02N 1 N bits 0 1 hot representation of 02N 1 2N bits 0 J represented as Jth bit 1 all other bits 0 0 Example below 2to4 decoder BIO 1H 0 Bm gt 1H1 3 1H 1H2 1H3 215371 RomMartin Datapaih and Conirol 14 Adder o Multiplexer mux selects output from N inputs 0 Example 1bit 4to1 mux Not shown Nbit 4to1 mux N 1bit 4to1 muxes 1 decoder S binary s 1hot S binary A B C D 215371 RomMartin Datapaih and Conirol 15 o Adder addssubtracts two 2C binary integers Half adder adds two 1bit integers no carryin Full adder adds three 1bit integers includes carryin Ripplecarry adder N chained full adders add 2 Nbit integers To subtract negate B input set bit 0 carryin to 1 215371 RomMartin Datapaih and Conirol 16 Full Adder o What is the logic for a full adder Lookattruth table Cl CIABacos oooaoo 001a01 A T 7 010601 39T 011610 n 1006011 101610 lloalo 111a11 co 0 SC39A39BC39AB39CA B39CABCquotAquotB 0 C0C39ABCA39BCAB39CABCACBAB C15371 RothMartin Datapath and Control 17 Datapath Storage Elements Nbit AdderSubtracter datapath 0 Three main types of storage elements Singleton registers PC 0 Register files ISA registers Memories insndata memory C15371 RothMartin Datapath and Control 19 o o 03gt 03gt BN4 C15371 RothMartin Datapath and Control CrossCoupled Inverters CCIs o Crosscoupled inverters CCIs Simplest storage element 0 Most storage arrays regfile caches implemented this way 0 Where is the input and where is the output 0 Forget about this for a while 215371 RothMartin Datapath and Control 0 SR Latch SR setreset latch Crosscoupled NOR gates 3 Distinct inputsoutputs JO Q SR gtQ s 006 oldQ 016 0 106 1 11 a o S 50 R07 circuit 
Timing Diagrams
• Voltage (0/1) diagrams for different nodes in the system
  • Digitally stylized: changes are vertical lines, i.e., instantaneous
  • Reality is analog: changes are continuous and smooth
• Timing diagram for a D latch: waveforms for E, D, and Q

D Latch
• D latch: SR latch + control that makes S=R=1 impossible
  • E D = 00 → old Q; 01 → old Q; 10 → 0; 11 → 1
  • In other words: 0X → old Q, 1X → D
• In words
  • When E is 1, Q gets D
  • When E is 0, Q retains its old value

Triggering: Level vs. Edge
• The D latch is level-triggered
  • The latch is open for writing as long as E is 1
  • If D changes continuously, so does Q: may not be the functionality we want
• Often easier to reason about an edge-triggered latch
  • The latch is open for writing only on an E transition (0→1 or 1→0)
  • Don't need to worry about fluctuations in the value of D

D Flip-Flop
• D flip-flop (also called master-slave flip-flop): sequential D latches enabled by inverse signals
  • First latch open when E = 0, second latch open when E = 1
  • Overall effect: the DFF latches D on the 0→1 transition
• How about a DFF that latches on the 1→0 transition?

FFWE: FF with Separate Write Enable
• FFWE: FF with a separate write enable
  • Data input is a mux of D and Q; WE selects
• Alternative: FF whose enable input is the AND of CLK and WE
  + Fewer gates
  - Creates timing problems: do not try to do logic on CLK (in Verilog: no, really)

Synchronous Systems
• Processors are complex FSMs: combinational compute blocks separated by FFs
• Synchronous systems
  • Clock: a global signal that acts as the write enable for all FFs
  • Typically marked as a triangle on FFs
  • All FFs write together; values move forward in lockstep
  + Simplifies design: design combinational blocks independently
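The level-vs-edge distinction above can be made concrete with a small simulation (illustrative, not course code): the latch output follows D whenever E is high, while the flip-flop samples D only on the rising edge.

```python
def d_latch(trace):
    """Level-triggered: Q follows D while E is 1. trace is [(E, D), ...]."""
    q = 0
    out = []
    for e, d in trace:
        if e:               # latch is transparent while enable is high
            q = d
        out.append(q)
    return out

def d_flip_flop(trace):
    """Edge-triggered: Q samples D only on the 0->1 transition of E."""
    q = 0
    prev_e = 0
    out = []
    for e, d in trace:
        if e and not prev_e:   # rising edge detected
            q = d
        prev_e = e
        out.append(q)
    return out

# D wiggles while the enable/clock stays high:
trace = [(0, 0), (1, 1), (1, 0), (1, 1), (0, 1)]
print(d_latch(trace))      # [0, 1, 0, 1, 1]  -- Q tracks every wiggle of D
print(d_flip_flop(trace))  # [0, 1, 1, 1, 1]  -- Q holds the edge-sampled value
```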
Aside asynchronous systems 0 Same thing but no clock 0 Values move forward using explicit handshaking i Have many advantages but difficult to design 215371 RomMartin Datapaih and Control Singleton Register 0 Register one Nbit storage word WE Nonmultiplexed inputoutput data buses writeread same word 0 Implementation FFWE array with shared writeenable WE FFs written on CLK edge if WE is 1 or if there is no WE Continuous read value available as soon as it is written 215371 RomMartin Datapaih and Control Register File 0 Register file M Nbit storage words Multiplexed inputoutput data buses writeread random word 0 Port set of buses for accessing a random word in array 0 Data bus Nbits address bus logZMbits optional WE bit 0 P ports P parallel and independent accesses o MIPS integer register file 0 32 32bit words two read ports one write port why C15371 RothMartin Datapath and Control 29 Add a Read Port RS1 VAL gt 0 Output of each register into 4t01 mux R51VAL R51 is select input of R51VAL mux C15371 RothMartin Datapath and Control 31 Register File Port Implementation Hl 0 Register file with four registers C15371 RothMartin Datapath and Control 30 Add Another Read Port RS1 VAL 0 Output of each register into another 4t01 mux RSZVAL R52 is select input of RSZVAL mux 215371 RothMartin Datapath and Control 32 Add a Write Port R DVAL RS1 VAL 0 Input RDVAL into each register 0 Enable only one register s WE Decoded RD amp WE o What if we needed two write ports CIS371 RomMartin Datapalh and Conh39ol 33 Uni ed Memory Architecture datapath 0 Harvard architecture 0 More common today why later 0 Unified architecture unified insndata memory 0 Build in 372 more conducive for P37X CIS371 RomMartin Datapalh and Conh39ol 35 Another Useful Component Memory DATAI N DATAOUT ADDRES WE 0 Register file M Nbit storage words 0 Few words lt 256 many ports dedicated read and write ports Logically static contents Synchronous 0 Memory M Nbit storage words yet not a register le 0 Many words gt 1024 
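As an aside, the register-file port structure described above (each read port a mux over all registers, the write port a decoder gating each register's WE) can be sketched in software. Names below are assumptions, not course code:

```python
class RegisterFile:
    """MIPS-style register file: two read ports, one write port."""

    def __init__(self, num_regs=32):
        self.regs = [0] * num_regs

    def read(self, rs1, rs2):
        # Two independent read ports: two muxes selected by rs1 and rs2
        return self.regs[rs1], self.regs[rs2]

    def write(self, rd, value, we=True):
        # A decoder asserts exactly one register's WE; $0 stays hardwired to 0
        if we and rd != 0:
            self.regs[rd] = value

rf = RegisterFile()
rf.write(5, 42)
rf.write(0, 99)          # ignored: register 0 is hardwired to zero
print(rf.read(5, 0))     # (42, 0)
```

A second write port would mean a second decoder plus a mux in front of each register's data input, which is why write ports are comparatively expensive.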
few ports 1 2 shared readwrite ports Logically dynamic contents Leads to different implementation choices CIS371 RomMartin Datapalh and Conh39ol 34 Register File Memory Implementations 0 Real register files and memories not implemented with FFs Much too inefficient 0 Actual implementations use grids of crosscoupled inverters CCI and circuit magic 0 A bit more on this when we talk about caches and main memory CIS371 RomMartin Datapalh and Conh39ol 36 C15 371 Computer Organization and Design Unit 2 SingleCycle Datapath and Control Part 2 of 2 MIPS Datapath amp Control 215371 RomMartin Datapath and Control 37 Start With Fetch Datapath for MIPS ISA PC and instruction memory Harvard architecture 0 A 4 incrementer computes default next instruction PC 215371 RomMartin Datapath and Control 39 0 Consider only the following instructions add 1 2 3 addi 1 2 3 1W 1 4 3 sw 14 3 beq 1 2 PCrelativetarget j absolutetarget syscall mch 0 Why only these 0 Most other instructions are the same from datapath viewpoint The one s that aren t are left for you to figure out 215371 RomMartin Datapath and Control 38 First Instruction add i 39 in liviirg39n i gt2 Add register file and ALU 215371 RomMartin Datapath and Control 40 Second Instruction addi Third Instruction Iw A 116 Rs5 Rt5 mmed16 Ema Rs5 Rt5 mmed16 Destination register can now be either Rd or Rt 0 Add data memory address is ALU output 0 Add sign extension unit and mux into second ALU input 0 Add register write data mux to select memory output or ALU output C15371 RomMartin Datapalh and Conh39ol 41 C15371 RomMartin Datapalh and Conh39ol 42 Fourth Instruction sw Fifth Instruction beq m1 RS5 Rti5i Add left shift unit and adder to compute PCrelative branch target 0 Add PC input mux to select PC4 or branch target A 116 Rs5 Rt5 mmed16 Add path from second input register to data memory data input mmed16 C15371 RomMartin Datapalh and Conh39ol 43 C15371 RomMartin Datapalh and Conh39ol 44 Sixth Instruction j is We 0 Add shifter to compute 
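The instruction subset listed above can be modeled end-to-end with a toy interpreter that mimics the single-cycle loop: fetch at the PC, decode, execute, and compute the next PC. The tuple encoding and names here are mine for illustration, not real MIPS machine code:

```python
def run(program, regs=None, mem=None):
    """Execute a list of (op, args...) tuples until the PC runs off the end."""
    regs = regs if regs is not None else [0] * 32
    mem = mem if mem is not None else {}
    pc = 0
    while 0 <= pc < len(program):
        op, *args = program[pc]            # "fetch" and "decode"
        next_pc = pc + 1                   # default next PC
        if op == "add":
            rd, rs, rt = args; regs[rd] = regs[rs] + regs[rt]
        elif op == "addi":
            rt, rs, imm = args; regs[rt] = regs[rs] + imm
        elif op == "lw":                   # rt = mem[rs + imm]
            rt, imm, rs = args; regs[rt] = mem.get(regs[rs] + imm, 0)
        elif op == "sw":                   # mem[rs + imm] = rt
            rt, imm, rs = args; mem[regs[rs] + imm] = regs[rt]
        elif op == "beq":                  # PC-relative branch
            rs, rt, offset = args
            if regs[rs] == regs[rt]:
                next_pc = pc + 1 + offset
        elif op == "j":                    # absolute jump
            next_pc = args[0]
        pc = next_pc
    return regs, mem

regs, mem = run([("addi", 1, 0, 5),
                 ("addi", 2, 0, 7),
                 ("add", 3, 1, 2),
                 ("sw", 3, 0, 0)])
print(regs[3], mem[0])   # 12 12
```

Each branch of the `if` corresponds to one setting of the datapath's control signals, which is the view the next few slides take.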
left shift of 26bit immediate Add additional PC input mux for jump target C15371 RothMartin Datapath and Control Edge Read Datapath Timing Inverters delay global clock and create multiple negative edges 0 All writes occur on global positive edge C15371 RothMartin Datapath and Control Continuous Read Datapath Timing Read IMem Read ngisters Read EMEM Write DM EM Write Registers Write PC 0 Works because writes PC Reg File DMem are independent 0 And because no read logically follows any write C15371 RothMartin Datapath and Control 46 What Is Control o ALUop DMwe o 9 signals control flow of data through this datapath MUX selectors or registermemory write enable signals 0 A real datapath has 300500 control signals 215371 RothMartin Datapath and Control 48 Example Control for add Example Control for sw ALUOP0DMweZ1 ALUinB1 0 Difference between sw and add is 5 signals 0 3 if you don t count the X don t care signals 215371 RomMartin Datapalh and Conh39ol 215371 RomMartin Datapalh and Conh39ol Example Control for beq How Is Control Implemented 0 Difference between sw and beq is only 4 signals Control 215371 RomMartin Datapalh and Conh39ol 215371 RomMartin Datapalh and Conh39ol Implementing Control 0 Each insn has a unique set of control signals 0 Most are function of opcode Some may be encoded in the instruction itself Eg the ALUop signal is some portion of the MIPS Func field Simplifies controller implementation Requires careful ISA design 215371 RomMartin Datapath and Control Control Implementation Random Logic 0 Real machines have 100 insns 300 control signals 30000 control bits 4KB Not huge but hard to make faster than datapath important 0 Alternative random logic random nonrepeating Exploits the observation many signals have few ls or few Os 0 Example random logic control for 6insn MIPS datapath Control Implementation ROM 0 ROM read only memory like a RAM but unwritable Bits in data words are control signals 0 Lines indexed by opcode Example ROM control for 6insn MIPS 
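One way to think about ROM-based control in software: the ROM is just a table indexed by opcode whose data bits are the control signals. This sketch (not course code) transcribes the 6-insn MIPS control values from this unit, with `None` standing in for X (don't care):

```python
CONTROL_ROM = {
    #        BR JP ALUinB ALUop DMwe Rwe  Rdst  Rwd
    "add":  (0, 0, 0,     0,    0,   1,   0,    0),
    "addi": (0, 0, 1,     0,    0,   1,   1,    0),
    "lw":   (0, 0, 1,     0,    0,   1,   1,    1),
    "sw":   (0, 0, 1,     0,    1,   0,   None, None),
    "beq":  (1, 0, 0,     1,    0,   0,   None, None),
    "j":    (0, 1, 0,     0,    0,   0,   None, None),
}

SIGNALS = ("BR", "JP", "ALUinB", "ALUop", "DMwe", "Rwe", "Rdst", "Rwd")

def control(opcode):
    """'Read the ROM line' for an opcode: return its control-signal values."""
    return dict(zip(SIGNALS, CONTROL_ROM[opcode]))

print(control("sw")["DMwe"])   # 1: only the store writes data memory
print(control("beq")["BR"])    # 1: only the branch redirects the PC
```

Random-logic control exploits exactly what this table shows: most signal columns have very few 1s (or very few 0s).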
datapath (X = don't care):

         BR  JP  ALUinB  ALUop  DMwe  Rwe  Rdst  Rwd
  add     0   0    0       0      0    1    0     0
  addi    0   0    1       0      0    1    1     0
  lw      0   0    1       0      0    1    1     1
  sw      0   0    1       0      1    0    X     X
  beq     1   0    0       1      0    0    X     X
  j       0   1    0       0      0    0    X     X

Datapath and Control Timing
• Control comes from the ROM/random logic
• Within a cycle: read IMem → read registers / read control ROM → read DMem → write DMem → write registers → write PC

Operating System Features
• Operating system (OS): a super-application that manages hardware for user apps
  • Isolates user apps from hardware nastiness and from each other
• Most ISAs provide support for operating systems (OSes)
  • Privileged mode: the OS is privileged, user apps are not
  • Privileged insns/data: only the OS can execute/read/write
  • Traps/syscalls: jump to a preset address in the OS, upgrade privilege
    • A user app invokes one when it wants something privileged done
    • Return from trap: return to the user app, downgrade privilege
  • Exceptions: jump to a preset address in the OS, upgrade privilege
    • Happen automatically when a user app does something illegal: executes a privileged insn, writes a privileged address, divides by 0, overflows
  • Interrupts: jump to a preset address in the OS, upgrade privilege
    • Happen automatically on some external event, e.g., a timer

Single-Cycle Datapath Performance
• One instruction per cycle: 1 IPC, or 1 CPI
• Clock cycle time proportional to the worst-case logic delay
  • In this datapath: insn fetch + decode + register read + ALU + data memory access + register write
• Can we do better?

Foreshadowing: Pipelined Datapath
• Split the datapath into multiple stages
  • Assembly-line analogy
  • 5 stages yields up to a 5x clock & performance improvement

Summary
• Digital logic review
• Single-cycle datapath and control
• Next up: performance & metrics

CIS 371: Computer Organization and Design
Unit 6: Superscalar Pipelines

CIS 371 (Roth/Martin): Superscalar
Pipelines This Unit InOrder Superscalar Pipelines APP APP APP System software 0 Superscalar hardware issues 0 Bypassing and register file M CPU quot0 o Stall logic em 0 Fetch and branch prediction Multipleissue designs 0 Superscalar o VLIWEPIC C15 371 RomMarlin Superscalar Pipelines A Key Theme of C15 371 Parallelism 0 Last unit pipelinelevel parallelism 0 Work on execute of one instruction in parallel with decode of next 0 Next instructionlevel parallelism ILP o Execute multiple independent instructions fully in parallel 0 Today multiple issue 0 Later 0 Static amp dynamic scheduling 0 Extract much more ILP o Datalevel parallelism DLP o Singleinstruction multiple data one insn four 64bit adds 0 Threadlevel parallelism CI39LP 0 Multiple software threads running on multiple cores C15 371 RomMarlin Superscalar Pipelines Readings 0 PampH 0 Chapter 410 215 371 RomMarlin Superscalar Pipelines Scalar Pipeline and the Flynn Bottleneck MultipleIssue Pipeline reg le 4 reg le I o Overcome this limit using multiple issue 0 So far we have looked at scalar plpellnes Also called superscaiar 39 one lnStrUCth Per Stage 0 Two instructions per stage at once or three or four or eight With control speculation bypassing etc InstructionLevel Parallelism ILP Fisher IEEE TC 81 Performance limit aka Flynn Bottleneck is CPI IPC 1 Today typically 4Widen Intel Core 2 AMD Opteron L39m39t 39s fever even aCh39e Td haz rds n 0 Some more PowerS is Sissue Itanium is 6issue Diminishing returns from superpipelining hazards overhead Some less dualissue is common for simple cores C15 371 RomMarlin Superscalar Pipelines 5 C15 371 RomMarlin Superscalar Pipelines Scalar Pipelines Superscalar Pipeline Diagrams Ideal 39 scalar 1 2 3 4 5 6 7 8 9 10 11 12 gt59 p p gtltgtgt lt 1w 0r1r2 F D X M W gt4 p p p 1w 4r1r3 F D X M W gt I 1w 8r1r4 F D X M W intR F y add r14r15r5 F D X M W bpb gt my r quot gt gt DM gt add r12r13r7 F D X M W add r17r1sra F D X M W 1w 0r18r9 F D X M W 0 So far we have looked at scalar 
pipelines: one instruction per stage, with control speculation and with bypassing (not shown). The point of the 2-way diagram below: two instructions occupy each stage in each cycle.

2-way superscalar (ideal):
                  1  2  3  4  5  6  7  8
  lw 0(r1),r2     F  D  X  M  W
  lw 4(r1),r3     F  D  X  M  W
  lw 8(r1),r4        F  D  X  M  W
  add r14,r15,r6     F  D  X  M  W
  add r12,r13,r7        F  D  X  M  W
  add r17,r16,r8        F  D  X  M  W
  lw 0(r18),r9             F  D  X  M  W

Superscalar Pipeline Diagrams: Realistic
[Pipeline diagrams: the same scalar and 2-way schedules for dependent code (lw 8(r1),r4 feeding add r4,r5,r6; add r2,r3,r7 and add r7,r6,r8 feeding lw 0(r8),r9); the load-use and insn-to-insn dependences insert stall cycles, so the 2-way pipeline falls well short of 2 IPC]

How Much ILP Is There?
• The compiler tries to schedule code to avoid stalls
  • Even for scalar machines: to fill the load-use delay slot
  • Even harder to schedule for multiple-issue (superscalar)
• How much ILP is common? It greatly depends on the application
  • Consider memory copy: unroll the loop, lots of independent operations
  • Other programs: less so
• Even given unbounded ILP, superscalar has limits
  • IPC (or CPI) vs. clock frequency trade-off

Superscalar CPI Calculations
• Base CPI for a scalar pipeline is 1
• Base CPI for an N-way superscalar pipeline is 1/N
  - Amplifies stall penalties
  • Assumes no data stalls (an overly optimistic assumption)
• Example: branch penalty calculation
  • 20% branches, 75% taken, no explicit branch prediction (2-cycle penalty)
• Scalar pipeline
  • 1 + 0.20×0.75×2 = 1.3 → 1.3/1 = 1.3 → 30% slowdown
• 2-way superscalar pipeline
  • 0.5 + 0.20×0.75×2 = 0.8 → 0.8/0.5 = 1.6 → 60% slowdown
• 4-way superscalar
  • 0.25 + 0.20×0.75×2 = 0.55 → 0.55/0.25 = 2.2 → 120% slowdown

A Typical Dual-Issue Pipeline
• Fetch an entire 16B or 32B cache block
  • 4 to 8 instructions (assuming 4-byte fixed-length instructions)
  • Predict a single branch per cycle
• Parallel decode
  • Need to check for conflicting
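As an aside, the branch-penalty arithmetic in the CPI example above can be checked numerically (a sketch using the example's parameters):

```python
def slowdown(width, branch_frac=0.20, taken_frac=0.75, penalty=2):
    """CPI and slowdown for an N-wide pipeline with a fixed branch penalty."""
    base = 1.0 / width                            # ideal CPI of an N-wide pipe
    cpi = base + branch_frac * taken_frac * penalty
    return cpi, cpi / base                        # slowdown relative to ideal

for width in (1, 2, 4):
    cpi, slow = slowdown(width)
    print(f"{width}-way: CPI = {cpi:.2f}, slowdown = {slow:.2f}x")
```

This reproduces the slide's numbers (1.3/0.8/0.55 CPI; 30%/60%/120% slowdown): the absolute stall contribution is constant, so it costs a wider machine relatively more.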
instructions 0 Output of 11 is an input to 12 o Other stalls too for example loaduse delay C15 371 RomMarlin Superscalar Pipelines 12 A Typical DualIssue Pipeline reg le I lt o Multiported register file 0 Larger area latency power cost complexity 0 Multiple execution units 0 Simple adders are easy but bypass paths are expensive 0 Memory unit 0 Single load per cycle stall at decode probably okay for dual issue 0 Alternative add a read port to data cache 0 Larger area latency power cost complexity C15 371 RomMarlin Superscalar Pipelines 13 Superscalar Challenges Front End 0 Wide instruction fetch o Modest need multiple instructions per cycle 0 Aggressive predict multiple branches 0 Wide instruction decode o Replicate decoders 0 Wide instruction issue 0 Determine when instructions can proceed in parallel 0 Not all combinations possible 0 More complex stall logic order szor Nwide machine 0 Wide register read 0 One port for each register read 0 Each port needs its own set of address and data wires 0 Example 4wide superscalar 8 read ports C15 371 RomMarlin Superscalar Pipelines 15 Superscalar Execution 0 Common design functional unit mix cx insn type mix 0 Integer apps 20 30 loads 10 15 stores 15 20 branches 0 Floating point apps 30 FP 20 loads 10 stores 5 branches 0 Rest 40 50 are nonbranch integer ALU operations 0 Intel Pentium 2way superscalar 1 any 1 integer ALU 0 Alpha 21164 2 integer incl 2 loads or 1 store 2 floating point 0 Execution units 0 Simple ALUs are cheap have N of these for Nwide processor 0 Complex ALUs are less cheap have fewer of these 0 Data memory bandwidth expensive 0 Multiport replicate or quotbankquot more later in memory unit 215 371 RomMarlin Superscalar Pipelines 14 Superscalar Challenges Back End 0 Wide instruction execution o Replicate arithmetic units 0 Perhaps multiple cache ports 0 Wide instruction register writeback 0 One write port per instruction that writes a register 0 Example 4wide superscalar 4 write ports 0 Wide bypass paths 0 
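The O(N²) stall logic mentioned above can be illustrated with a toy model (field names `op`/`rd`/`rs1`/`rs2` are mine, not course code): every decode-stage insn must be compared against every execute-stage insn, so the term count grows as 2N².

```python
def load_use_stall(decode_insns, execute_insns):
    """Return (stall?, number of comparison terms evaluated)."""
    terms = 0
    for d in decode_insns:                  # N insns in decode
        for x in execute_insns:             # N insns in execute
            terms += 2                      # one compare per source register
            if x["op"] == "LOAD" and x["rd"] in (d["rs1"], d["rs2"]):
                return True, terms          # load-use hazard: must stall
    return False, terms

load = {"op": "LOAD", "rd": 3, "rs1": 1, "rs2": 0}
use  = {"op": "ADD",  "rd": 4, "rs1": 3, "rs2": 2}
ok   = {"op": "ADD",  "rd": 5, "rs1": 1, "rs2": 2}

print(load_use_stall([use], [load]))            # (True, 2): scalar case
print(load_use_stall([ok, ok], [load, load]))   # (False, 8): 2-wide, 2N^2 terms
```

The slides add one more wrinkle this model omits: a same-group cross-check, since a later insn in the decode group may read an earlier group member's destination.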
More possible sources for data values
  • Order N²×P for an N-wide machine with execute-pipeline depth P
• Fundamental challenge
  • The amount of ILP (instruction-level parallelism) in the program
  • The compiler must schedule code and extract parallelism

Superscalar Register File
• An N-way superscalar register file needs 2N read and N write ports
  • In practice < N write ports: stores and branches (~35% of insns) don't write registers
  • In practice < 2N read ports: many inputs come from immediates/bypasses
• Latency and area ∝ (#ports)² ∝ (3N)²: slow for large N

N² Dependence Cross-Check
• Stall logic for a 1-wide pipeline with full bypassing
  • Full bypassing → load-use stalls only:
    (X/M.op == LOAD) && ((D/X.rs1 == X/M.rd) || (D/X.rs2 == X/M.rd))
  • Two comparison terms: ∝ 2N
• Now the same logic for a 2-wide pipeline:
    (X/M1.op == LOAD) && ((D/X1.rs1 == X/M1.rd) || (D/X1.rs2 == X/M1.rd)) ||
    (X/M1.op == LOAD) && ((D/X2.rs1 == X/M1.rd) || (D/X2.rs2 == X/M1.rd)) ||
    (X/M2.op == LOAD) && ((D/X1.rs1 == X/M2.rd) || (D/X1.rs2 == X/M2.rd)) ||
    (X/M2.op == LOAD) && ((D/X2.rs1 == X/M2.rd) || (D/X2.rs2 == X/M2.rd))
  • Eight terms: ∝ 2N²: the N² dependence cross-check
• Not quite done; also need
    (D/X2.rs1 == D/X1.rd) || (D/X2.rs2 == D/X1.rd)

Superscalar Bypass
• Consider the W→X bypass for the 1st input of each insn
  • 2 non-regfile inputs to each bypass mux: N in general
  • 4 point-to-point connections: N² in general
  • Bypass wires are long, slow, and difficult to route
• And this is just one bypass stage and one input per insn: N² bypass

Not All N² Problems Are Created Equal
• N² bypass vs. N² stall logic & dependence cross-check
• Which is the bigger problem? N² bypass, by a lot
  • 32- or 64-bit data quantities vs. 5-bit register specifiers
  • Multiple levels (MX, WX) of bypass vs. 1 level of stall logic
  • Must fit in one clock period with the ALU (vs. not)
• The dependence cross-check is not even the 2nd-biggest N² problem
  • The regfile is also an N² problem (think latency, where N is #ports)
  • And also more serious than the cross-check

Avoid N²
BypassRegFile Clustering O clusterO imRFO 39 D Dr C O I gt a gt F cluster1 IntRF1 gt gt QM F P P F o Clustering group ALUs into K clusters 0 Full bypassing within cluster limited or no bypassing between them 0 Get values from regfile with 1 or 2 cycle delay NK nonregfile inputs at each mugtlt NZK pointtopoint paths 0 Key to performance steer dependent insns to same cluster 0 Hurts IPC but helps clock frequency or wider issue at same clock 0 Typically used with replicated regfile replica per cluster 0 Alpha 21264 4way superscalar 2 clusters static steering C15 371 RomMarlin Superscalar Pipelines 21 Wide NonSequential Fetch Two related questions 0 How many branches predicted per cycle 0 Can we fetch across the branch if it is predicted taken Simplest most common organization 1 and No 0 One prediction discard postbranch insns if prediction is taken Lowers effective fetch width and IPC 0 Average number of instructions per taken branch 0 Assume 20 branches 50 taken a 10 instructions 0 Consider a 10instruction loop body with an 8 issue processor 0 Without smarter fetch ILP is limited to 5 not 8 o Compiler can help 0 Reduce taken branch frequency eg unroll loops C15 371 RomMarlin Superscalar Pipelines Superscalar Fetch Decode BP gt gtltgtlt gt gtRc o What is involved in fetching N insns per cycle 0 Mostly wider instruction memory ports 0 Read N instructions in parallel 0 Most tricky aspects involve branch prediction 0 What about Decode o Easier with fixedwidth instructions MIPS Alpha PowerPC ARM 0 Harder with variablelength instructions X86 C15 371 RomMarlin Superscalar Pipelines 22 Aside VLIWEPIC o VLIW Very Long Insn Word 0 Intel EPIC Explicit Parallel Instruction Computing 0 Effectively a 1wide pipeline but unit is an Ninsn group 0 Group travels down pipeline as a unit 0 Compiler guarantees insns within a VLIW group are independent 0 If no independent insns slots filled with nops 0 Typically slotted lst insn must be ALU 2nd mem etc o Eg Itanium two 3wide bundles per 
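The taken-branch fetch-limit arithmetic above can be put in a tiny model (a sketch, not course code): with one branch prediction per cycle and no fetching past a taken branch, effective fetch width is capped by the run length between taken branches.

```python
import math

def insns_per_taken_branch(branch_frac, taken_frac):
    """Average run of instructions between taken branches."""
    return 1.0 / (branch_frac * taken_frac)

def effective_fetch(issue_width, run_length):
    """Insns/cycle when fetch must stop at each taken branch."""
    cycles = math.ceil(run_length / issue_width)   # cycles to cover one run
    return run_length / cycles

print(round(insns_per_taken_branch(0.20, 0.50)))   # 10 insns per taken branch
print(effective_fetch(8, 10))                      # 5.0, not 8: the loop example
```

This reproduces the slide's point: a 10-instruction loop body on an 8-issue machine needs two fetch cycles per iteration, limiting ILP to 5.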
cycle: 6-way issue
+ Simplifies fetch and branch prediction
+ Simplifies pipeline control (no rigid-vs-fluid business)
- Doesn't help bypasses or the regfile, which are the bigger problems
  • Can expose these issues to software, too (yuck)
- Not really compatible across machines of different widths
  • How does Itanium deal with non-compatibility? Transmeta?

Predication
• Branch mispredictions hurt more on superscalar
• Replace difficult branches with something else
  • Convert control flow into data flow & dependences
• Predication: conditionally executed insns, unconditionally fetched
• Full predication (ARM, Intel Itanium)
  • Can tag every insn with a predicate, but costs extra bits per instruction
• Conditional moves (Alpha, x86)
  • Construct the appearance of full predication from one primitive:
    cmoveq r1,r2,r3  (if r1 == 0, then r3 = r2)
  - May require some code duplication to achieve the desired effect
  + The only good way of adding predication to an existing ISA
• If-conversion: replacing control with predication

Multiple Issue Summary
• Superscalar hardware issues
  • Bypassing and the register file
  • Stall logic
  • Fetch and branch prediction
• Multiple-issue designs
  • Superscalar
  • VLIW
• Next up: midterm

Trends in Single-Processor Multiple Issue
           486   Pentium  PentiumII  Pentium4  Itanium  ItaniumII  Core2
  Year    1989    1993      1998       2001     2002      2004      2006
  Width     1       2         3          3        3         6         4
• Issue width has saturated at 4-6 for high-performance cores
  • The canceled Alpha 21464 was 8-way issue
  • No justification for going wider
  • Hardware or compiler scheduling is needed to exploit 4-6 effectively
• For high-performance-per-watt cores, issue width is ~2
  • Advanced scheduling techniques not needed
  • Multithreading (a little later) helps cope with cache misses

Instruction Set Architecture (ISA)
• What is an ISA?
• And what is a good ISA?
• Aspects of ISAs
WIth examples LC3 MIPS x86 Computer Organization and DeSIgn RISC vs else 0 Compatibility is a powerful force Tricks binary translation uISAs Unit 1 Instruction Set Architectures C15 371 RothMartin Instruction Set Nchitecmres 1 C15 371 RothMartin Instruction Set Nchitecmres Readings What Is An ISA o PH 0 ISA instruction set architecture 0 Chapter 2 Awelldefined hardwaresoftware interface 0 The contract between software and hardware 0 Functional definition of operations modes and storage locations supported by hardware Precise description of how to invoke and access them 0 Not in the contract 0 How operations are implemented 0 Which operations are fast and which are slow and when 0 Which operations take more power and which take less 0 Instruction gt Insn Instruction is too long to write in slides C15 371 RothMartin Instruction Set Nchitecmres 3 C15 371 RothMartin Instruction Set Nchitecmres A Language Analogy for ISAs 0 Communication Persontoperson a softwaretohardware 0 Similar structure Narrative a program Sentence a insn Verb a operation add multiply load branch Noun a data item immediate register value memory value Adjective a addressing mode 0 Many different languages many different ISAs Similar basic structure details differ sometimes greatly 0 Key differences between languages and ISAs Languages evolve organically many ambiguities inconsistencies ISAs are explicitly engineered and extended unambiguous C15 371 RothMartin Instruction Set Nchitzectures 5 LC3 LC3 highlights 0 1 datatype 16bit 2C integer Addressible of memory locations 16 bits Instructions are 16 bits 3 arithmetic operations add and not 0 Build everything else from these 8 registers loadstore model three addressing modes Condition codes for branches Support for traps and interrupts Why is LC3 this way and not some other way What are some other options C15 371 RothMartin Instruction Set Nchitzectures 7 The Sequential Model 0 Basic structure of all modern ISAs 0 Processor logically executes loop at 
left Atomically insn X finishes before insn X1 starts 0 Program order total order on dynamic insns Order and named storage define computation Value flows from insn X to Y via storage A iff AX s output XY s input Y after X in program order 0 No interceding insn Z where AZ s output Convenient feature program counter PC Insn itself at memoryPC Next PC is PC unless insn says otherwise C15 371 RothMartin Instruction Set Nchitzectures 6 P37X Similar to LC3 in some ways but better Similarities 0 16bit data types 0 16bit instructions fourbit opcode o SimilarTRAPs and devices 0 Differences o More ALU ops Add Sub Mul Or Not And Xor Shift LeftRight o No LDI STI indirect loadstores o No condition codes 0 Designed for CIS372 o PennSim suppers this with a commandline mode switch C15 371 RothMartin Instruction Set Nchitzectures 8 Some Other ISAs LC3 amp P37X has the basic features of a realworld ISA t Lacks a good bit of realism 0 Only 16bit Not byte addressable Fewer arithmetic insns more for LC3 than P37X Little support for system software none for multiprocessing Two real world ISAs 0 Intel X86 IA32 a CISC ISA o MIPS a real world RISC ISA also used in book P37X ISA used in 372 o A more RISC39y LC3 0 What is this RISCCISC thing C15 371 RothMartin Instruction Set Nchitecmres 9 InsnsProgram Compiler Optimizations What Is A Good ISA o Compilers do two things 0 Translate HLLs to assembly functionally Deterministic and fast compile time gas 00 Canonical not an active research area 0 C15 341 o Optimize generated assembly code Optimize Hard to prove optimality in a complex system 0 In systems optimize means improve hopefully Involved and relatively slow compile time gcc o4 Some aspects reverseengineer programmer intention Not canonical being actively researched C15 570 C15 371 RothMartin Instruction Set Nchitecmres 11 o Lends itself to highperformance implementations Every ISA can be implemented 0 Not every ISA can be implemented well 0 Background CPU performance equation Execution 
time secondsprogram Convenient to factor into three pieces insnsprogram cyclesinsn secondscycle Insnsprogram dynamic insns executed Secondscycle clock period Cyclesinsn CPI hmmm o For high performance all three factors should be low C15 371 RothMartin Instruction Set Nchitecmres 10 Compiler Optimizations o Primarily reduce insn count 0 Eliminate redundant computation keep more things in registers Registers are faster fewer loadsstores An ISA can make this difficult by having too few registers 0 But also 0 Reduce branches and jumps later 0 Reduce cache misses later 0 Reduce dependences between nearby insns later An ISA can make this difficult by having implicit dependences o How effective are these Can give 4X performance over unoptimized code Collective wisdom of 40 years Proebsting s LaW 4 per year 0 Funny but shouldn t leave 4X performance on the table C15 371 RothMartin Instruction Set Nchitecmres 12 SecondsCycle and CycleInsn Hmmm o For singlecycle datapath o Sequential model insn X nishes before insn X1 starts Cycleinsn 1 by definition 0 An illusion designed to keep programmers sane Foreshadowing Pipelining Secondscycle proportional to complexity of datapath 0 ISA can make secondscycle high by requiring a complex datapath P39pelmmg Important performance teChmque Hardware overlaps processing iterations for insns Variable insn lengthformat makes pipelining difficult Complex datapaths also make pipelining difficult or clock slow 0 More about this later InsnO Insn1 Inan Insn3 Insn4 InsnS time Fetch 14 CIS 371 RothMartin Instruction Set Nchitect39ures CIS 371 RothMartin Instruction Set Nchitect39ures RI SC CI SC 0 RISC Reduced Instruction Set Computer ISAs Minimalist approach to an ISA simple insns only ISA Basics Aspects of ISAs o VonNeumann model Low cyclesinsn and secondscycle Data types and operations n o Operand model Higher Insnprogram but hopefully not as much Control Rely on compiler optimizations Encodin 0 Operating system support 0 CISC Complex 
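The three-factor performance equation above can be exercised numerically. All values below are made up for illustration; the point is the trade-off among the factors:

```python
def exec_time(insns, cpi, cycle_time_ns):
    """Execution time (s) = insns/program x cycles/insn x seconds/cycle."""
    return insns * cpi * cycle_time_ns * 1e-9

# A RISC-flavored design point (more insns, CPI ~1, fast cycle) vs a
# CISC-flavored one (fewer insns, higher CPI, slower cycle):
risc = exec_time(insns=1_200_000, cpi=1.0, cycle_time_ns=0.5)
cisc = exec_time(insns=800_000, cpi=2.0, cycle_time_ns=0.6)
print(f"{risc * 1e3:.3f} ms vs {cisc * 1e3:.3f} ms")  # 0.600 ms vs 0.960 ms
```

Lowering one factor at the expense of another can go either way, which is why all three must be considered together.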
Instruction Set Computing ISAs o Multiprocessing support 0 A more heavyweight approach both simple and complex insns Low insnsprog ram 0 Examples Higher cyclesinsn and secondscycle LC3 P37X We have the technology to get around this problem Mg 0 x o More detail and context later CIS 371 RothMartin Instruction Set Nchitect39ures CIS 371 RothMartin Instruction Set Nchitect39ures Operations and Datatypes o Datatypes Software attribute of data 0 Hardware attribute of operation data isjust 01 s o All processors support 0 2C integer arithmeticlogic 8163264 bit e IEEE754 floatingpoint arithmetic 3264 bit 0 Intel has 80bit floatingpoint More recently most processors support Packedinteger insns eg MMX Packedfp insns eg SSESSE2 For multimedia more about these later Processor no longer support Decimal other fixedpoint arithmetic Binarycoded decimal BCD C15 371 RothMartin Instruction Set Nchitecmres 17 Where Does Data Live 0 Memory Fundamental storage space 0 Processor wo memory table wo chairs o Registers Faster than memory quite handy Most processors have these too o Immediates Values spelled out as bits in insns Input only C15 371 RothMartin Instruction Set Nchitecmres 19 LC3MIPSx86 Operations and Datatypes 0 LC3 16bit integer add and not P37X also has sub mul or xor shifts No floatingpoint 0 MIPS 3264 bit integer add sub mul div shift rotate and or not xor 3264 bit floatingpoint add sub mul div 0 X86 0 3264 bit integer add sub mul div shift rotate and or not xor 80bit floatingpoint add sub mul div sqrt 64 bit packed integer MMX padd pmul 64128 bit packed floatingpoint SSE2 padd pmul BCD not really used obviously C15 371 RothMartin Instruction Set Nchitecmres 18 How Many Registers o Registers faster than memory have as many as possible 0 No 0 One reason registers are faster there are fewer of them 0 Small is fast hardware truism 0 Another they are directly addressed no address calc More of them means larger specifiers Fewer registers per insn or indirect addressing 0 Not 
everything can be put in registers Structures arrays anything pointedto Compilers are getting better at putting more things in More registers means more saving restoring C15 371 RothMartin Instruction Set Nchitecmres 20 LC3MIPSx86 Registers 0 LC3P37X 8 16bit integer registers No floatingpoint registers 0 MIPS 32 32bit integer registers 0 hardwired to 0 o 32 32bit floatingpoint registers or 16 64bit registers o X86 0 8 81632bit integer registers not general purpose 0 No floatingpoint registers 0 64bit X86 EM64T 16 64bit integer registers 16 128bit floatingpoint registers CIS 371 RothMartin Instruction Set Nchitzectures 21 LC3MIPSx86 Memory Size 0 LC3P37X 16bit 216 16bit words 0 MIPS 0 32bit 0 64bit o X86 0 8086 16bit 80286 24 bit 80386 32bit AMD OpteronAthlon64 Intel s newer Pentium4 Core 2 64bit CIS 371 RothMartin Instruction Set Nchitzectures 23 How Much Memory 0 What does 64bit in 64bit ISA mean 0 Each program can address ie use 264 bytes 0 64 is the virtual address VA size 0 Alternative wrong definition width of arithmetic operations 0 Most critical inescapable ISA design decision 0 Too small 0 Limits the lifetime of ISA May require nasty hacks to overcome eg x86 segments 0 All ISAs moving to 64 bits if not already there CIS 371 RothMartin Instruction Set Nchitzectures 22 How Are Memory Locations Specified 0 Registers are specified directly 0 Register names are short can be encoded in insn Some insns implicitly readwrite certain registers o How are addresses specified Addresses are as big or bigger than insns Addressing mode how are insn bits converted to addresses 0 Think about what highlevel idiom addressing mode captures CIS 371 RothMartin Instruction Set Nchitzectures 24 LC3MIPSx86 Addressing Modes 0 LC3 Displacement R1offset 6bit PCdisplacement PCoffset 9bit MemoryindirectPCdisplacement memPCoffset9bit Nasty requires accessing memory twice P37X doesn t have this 0 MIPS Displacement R1offset 16bit Experiments showed this covered 80 of accesses on VAX o X86 
MOV instructions 0 Absolute zero offset 81632bit Register indirect R1 0 Indexed R1R2 Displacement R1offset 81632bit Scaled R1 R2Scale offset81632bit Scale 1 2 4 8 C15 371 RothMartin Instruction Set Nchitectures 25 How Do Values Get FromTo Memory 0 How do values move fromto memory primary storage 0 tofrom registersaccumulatorstack Assume displacement addressing for these examples 0 Registers loadstore load r1 8r2 means R1 memR2 8 store r1 8 r2 means memR2 8 R1 0 Accumulator loadstore load 8 r2 means ACC memR2 8 store 8 r2 means memR2 8 ACC o Stack pushpop push 8 r2 means STKfl39OS memR2 8 pop 8 r2 means memR2 8 STKfl39OS C15 371 RothMartin Instruction Set Nchitectures 27 How Many Explicit Operands ALU Insn o Operand model how many explicit operands ALU insn 3 generalpurpose add R1R2R3 means R1 R2 R3 MIPS uses this 0 2 multiple explicit accumulators output doubles as input add R1R2 means R1 R1 R2 x86 uses this 0 1 one implicit accumulator add R1 means ACC ACC R1 0 0 hardware stack like Java bytecodes add means STK39OS STKTOS STKTOS 4 useful only in special situations 0 Examples show register operands But operands can be memory addresses or mixed registermemory ISAs with registeronly ALU insns are loadstorequot C15 371 RothMartin Instruction Set Nchitectures 26 Operand Model Pros and Cons 0 Metric I static code size 0 Want many Implicit operands stack high level insns 0 Metric II data memory traffic 0 Want as many longlived operands in onchip storage loadstore 0 Metric III CPI Want short latencies little variability loadstore o CPI and data memory traffic more important these days 0 In most niches C15 371 RothMartin Instruction Set Nchitectures 28 LC3MIPSX86 Operand Models 0 LC3 Integer 8 accumulator registers Floatingpoint none 0 MIPS Integerfloatingpoint 32 generalpurpose registers loadstore o X86 Integer 8 registers regreg regmem memreg but no mem mem Floating point stack why x86 floatingpoint sucked for years 0 Note integer push pop for managing software stack 0 
- Note also: reg-mem and mem-mem string functions in hardware
- x86-64: integer/floating-point — 16 registers

Control Transfers
- Default next-PC is PC + sizeof(current insn); branches and jumps can change that — otherwise dynamic program = static program (not useful)
- Computing targets: where to jump to (for all branches and jumps)
- Testing conditions: whether to jump at all (for conditional branches only)

Control Transfers I: Computing Targets
- The issues: how far (statically) do you need to jump? Not far within a procedure, further from one procedure to another. Do you need to jump to a different place each time?
- PC-relative: position-independent within a procedure; used for branches and jumps within a procedure
- Absolute: position-independent outside a procedure; used for procedure calls
- Indirect (target found in a register): needed for jumping to dynamic targets; used for returns, dynamic procedure calls, switch statements

Control Transfers II: Testing Conditions
- Compare-and-branch insns: branch-less-than R1,10,target — + simple; − two ALUs (one for the condition, one for the target address), extra latency
- Implicit condition codes (x86, LC3): subtract R2,R1,10 (sets negative CC); branch-neg target — + condition codes set "for free"; − implicit dependence is tricky
- Conditions in regs, separate branch (MIPS, P37X): set-less-than R2,R1,10; branch-not-equal-zero R2,target — − additional insns; + one ALU per insn, explicit dependence

LC3/MIPS/x86: Control Transfers
- LC3: 9-bit offset PC-relative branches/jumps (uses condition codes); 11-bit offset PC-relative calls, and indirect calls
- P37X: 6-bit offset PC-relative simple branches (uses a register for the condition); 12-bit offset on calls and unconditional branches
- MIPS: 16-bit offset PC-relative conditional branches (uses a register for the condition); compare 2 regs (beq, bne) or a reg to 0 (bgtz, bgez, bltz,
blez) — don't need an adder for these, and they cover 80% of cases; explicit "set condition into register" insns (slt, sltu, slti, sltiu, etc.); 26-bit target absolute jumps and function calls
- x86: 8-bit offset PC-relative branches (uses condition codes); 8/16-bit target absolute jumps and function calls within a segment; "far" jumps and calls change the code segment, for longer jumps

Length and Format
- Fixed length: most common is 32 bits; + simple implementation (next PC = PC+4), longer reach for branch/jump targets; − code density (32 bits to increment a register by 1?)
- Variable length: − complex implementation; + code density
- Compromise: two lengths (MIPS16, or ARM's Thumb)
- Encoding: a few simple encodings simplify the decoder

LC3/MIPS/x86: Length and Format
- LC3: 2-byte insns, 3 formats (0-reg, 1-reg + 9-bit offset, 2-reg, 3-reg); P37X is similar
- MIPS: 4-byte insns, 3 formats (R-type, I-type, J-type)
- x86: 1–16 byte insns: Prefix(1–4), Op, OpExt, "ModRM", "SIB", Disp(1–4), Imm(1–4)

Operating System Support
- ISA support is required to implement an operating system
- At least two privilege modes: user (low), kernel (high); some operations/storage locations are accessible in all modes, others only in high-privilege mode
- Deal with I/O, exceptions, virtual memory, privilege itself — anything that allows one process to interfere with another
- Support for safely upgrading and downgrading privilege: programmatically (system calls), transparently (interrupts)

Traps and System Calls
- What if a user process wanted to access an I/O device? It can't actually call kernel procedures: the kernel is shared by all user applications — a separate "process" — and user code should not be allowed to call or jump into arbitrary kernel code, nor to upgrade privilege outside the kernel
- How does this work, then? The kernel publishes a set of service codes (not function
addresses)
- User processes use a special insn to invoke the desired service: TRAP/INTERRUPT/SYSCALL — a process-changing call only; specifies a function code rather than an address; upgrades privilege (the only way to do this)
- Return-from-interrupt — a process-changing return only; downgrades privilege

LC3/MIPS/x86: OS Support
- LC3: trap, return-from-interrupt; interrupts supported but not used in CIS 240
- MIPS: trap, return-from-trap, exception coprocessor, interrupts
- x86: trap, return-from-trap, exception flags, multi-level interrupts

Multiprocessing Support
- ISA support is also required for multiprocessing: memory model, atomic (conditional) reg-mem swap insns, fence insns
- LC3: no multiprocessing support; MIPS/x86: yes, please
- More about this later

RISC and CISC
- RISC: reduced-instruction-set computer; coined by Patterson in the early 80's (Berkeley RISC-I, Stanford MIPS, IBM 801); examples: PowerPC, ARM, SPARC, Alpha, PA-RISC
- CISC: complex-instruction-set computer; the term didn't exist before RISC; examples: x86, Motorola 68000, VAX (makes x86 look like LC3), etc.
- A religious war started in the mid-1980's: RISC won the technology battle, CISC won the commercial war — compatibility was a stronger force than anyone but Intel thought; Intel & AMD beat RISC at its own game

Pre-1980: The Setup
- Vicious feedback pendulum: bad compilers → complex ISAs → slow multi-chip designs → …
- Assembly was commonly written by hand

Complex ISAs → Slow Implementations
- Complex ISAs have nasty datapaths; nasty datapaths are difficult to pipeline, and pipelining doesn't help them that much
- If you aren't going to pipeline, you want a high-level ISA, to amortize fetch/decode: high-level ISA + single-cycle vs. high-level ISA + pipelined vs. low-level ISA + single-cycle execute
Complex ISAs → Bad Compilers
- Who is generating assembly code?
- Humans like high-level (CISC) ISAs, close to HLLs: can concretize ("drill down", move down a layer), can abstract (see patterns, move up a layer), can deal with only a few things at a time → like things at a high level
- Computers (compilers) like low-level (RISC) ISAs: can deal with many things at a time → can do things at any level; can concretize (1-to-many lookup: functions, databases); have difficulties with abstraction (many-to-1 lookup: functions? AI?)
- Translation should move strictly down levels
- Stranger than fiction: people once thought computers would execute HLLs directly

Early 1980s: The Tipping Point
- Moore's Law makes a single-chip microprocessor possible, but only for small, simple ISAs; the performance advantage of integration was compelling
- RISC manifesto: create ISAs that simplify implementation and facilitate optimizing compilation
- Some guiding principles (tenets): single-cycle execution / hardwired control; fixed instruction length and format; lots of registers; load/store architecture; few addressing modes
- There was no equivalent CISC manifesto

The Debate / Compatibility
- RISC argument: CISC is fundamentally handicapped; for a given technology, a RISC implementation will be better/faster; current technology enables single-chip RISC; when it enables single-chip CISC, RISC will be pipelined; when it enables pipelined CISC, RISC will have caches; when it enables CISC with caches, RISC will have the next thing
- CISC rebuttal: CISC flaws are not fundamental and can be fixed with more transistors; Moore's Law will narrow the RISC/CISC gap (true — a good pipeline took RISC ~100K transistors vs. CISC ~300K, and by 1995 ~2M transistors had evened the playing field); software costs dominate, so compatibility is paramount

Intel's Trick: RISC Inside
- No one buys new hardware if it requires new software; Intel greatly benefited from this (IBM too); the ISA must remain compatible no matter what; x86 is
one of the worst-designed ISAs EVER, but it survives (as does IBM's 360/370, the first ISA family)
- Backward compatibility: new processors must support old programs (can't drop features) — very important
- Forward (upward) compatibility: old processors must support new programs, with software help; new processors redefine only previously-illegal opcodes; allow software to detect support for specific new instructions; old processors emulate new instructions in low-level software

More About Uops
- 1993: Intel wanted out-of-order execution in the Pentium Pro; OoO was very hard to do with a coarse-grain ISA like x86
- Solution: translate x86 to RISC uops in hardware — push eax becomes (we think — uops are proprietary): store eax, [esp-4]; addi esp, esp, -4
- The processor maintains the x86 ISA externally for compatibility but executes a RISC uISA internally for the datapath implementation
- Given the translator, x86 is almost as easy to implement as RISC; Intel implemented out-of-order before any RISC company
- The idea was co-opted by other x86 companies (AMD and Transmeta); the one company that resisted (Cyrix) couldn't keep up
- Even better: two forms of hardware translation — optimized logic for common insns that translate into 1–4 uops (+ fast, − complex), and a table for rare or nasty insns that translate into 5+ uops (− slow, + off to the side, doesn't complicate the rest of the machine)
- x86: average ~1.6 uops per x86 insn; x86-64: average ~1.1 uops per x86 insn (more registers, and parameters can be passed in them → fewer pushes/pops)
- Speculation about Core 2: PLA for 1–2 uops, table for 3+ uops

Transmeta's Take: Code Morphing
- Code morphing: x86 translation in software; Crusoe/Astro are x86 emulators — there is no actual x86 hardware anywhere; only the code-morphing translation software is written in the native ISA
- The native ISA is invisible to applications, the OS, even the BIOS; different Crusoe versions have slightly different ISAs (you can't tell)
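Both the hardware uop cracking and Transmeta's software translation boil down to mapping each x86 insn onto a short sequence of simpler native operations. A toy table-driven sketch — the uop names here are invented, since real uop encodings are proprietary:

```python
# Toy x86 -> uop cracking table. The uop mnemonics are invented for
# illustration; real Intel/Transmeta internal ops are proprietary.
UOP_TABLE = {
    # nasty insn: a store plus a stack-pointer adjust (2 uops)
    "push eax":     ["store eax -> [esp-4]", "addi esp, esp, -4"],
    # common insn: one simple RISC-style op (1 uop)
    "add eax, ebx": ["add eax, eax, ebx"],
}

def crack(x86_insn):
    """Return the internal uop sequence for one external x86 insn."""
    return UOP_TABLE[x86_insn]

assert len(crack("push eax")) == 2
assert len(crack("add eax, ebx")) == 1
```

The real split described above is between fast decode logic for the common 1–4-uop cases and a slow side table for the rare 5+-uop cases; a lookup dict stands in for both here.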
- How was it done? The code-morphing software resides in boot ROM; on startup, the boot ROM hijacks 16MB of main memory; the translator is loaded into 512KB and the rest is a translation cache
- Software starts running in interpreter mode; the interpreter profiles to find hot regions (procedures, loops); a hot region is compiled to native code, optimized, and cached; gradually more and more of the application runs native

Actually, the Volume Winner Is RISC: ARM
- ARM: Acorn RISC Machine → Advanced RISC Machine; the first ARM chip appeared in the mid-1980s, from Acorn Computer Ltd.
- 1.2 billion units sold in 2004 — >50% of all 32/64-bit CPUs; low-power and embedded devices (the iPod, for example)
- 32-bit RISC ISA: 16 registers, and the PC is one of them (to branch, just write to the PC); many addressing modes (e.g., auto-increment); predication via condition codes — each instruction can be conditional
- Multiple compatible implementations: Intel's XScale (the original design was DEC's, bought by Intel); others from Freescale (was Motorola), IBM, Texas Instruments, Nintendo, STMicroelectronics, Samsung, Sharp, Philips, etc.

How x86 Won the Commercial War
- x86 was the first 16-bit chip, by ~2 years; IBM put it into its PCs (there was no competing choice); the rest is Moore's Law, inertia, and financial feedback
- x86 is the most difficult ISA to implement fast, but because Intel sells the most non-embedded processors, it has the most money, which it uses to hire more and better engineers, which it uses to maintain competitive performance
- And given equal performance, compatibility wins — so Intel sells the most non-embedded processors; AMD keeps pressure on x86 performance

Redux: Are ISAs Important?
- Does the quality of an ISA actually matter for performance? Mostly not — it shows up mostly as a design-complexity issue; insns/program: everything is compiled, and compilers are good; cycles/insn and seconds/cycle: uISA and many other tricks
- Does the nastiness of an ISA matter? No — only compiler writers and hardware designers see it
- Even compatibility is
not what it used to be: software emulation

Compatibility's Trap Door
- Trap(door): add some ISA feature for a 5% gain; you must support that feature forever, even if the gain turns to loss — classic example: SPARC's register windows (hardware activation records)
- The trap insn (a low-level function call to an OS handler) is compatibility's friend
- Backward compatibility: rid yourself of some ISA mistakes — a mistake feature's opcodes trap and are emulated in software; performance of that feature suffers (a legal ISA says nothing about performance); actually good — the feature will be used less (a deprecation cycle)
- Forward compatibility: reserve a set of trap opcodes and don't define their uses; add ISA functionality by overloading traps; release a firmware patch to add it to old implementations

Translation and Virtual ISAs
- A new compatibility interface: ISA translation software
- Binary translation: transform the static image, run native; emulation: unmodified image, interpret each dynamic insn, typically optimized with just-in-time (JIT) compilation
- Examples: FX!32 (x86 on Alpha), Rosetta (PowerPC on x86); performance overheads are not that high
- Virtual ISAs: designed for translation, not direct execution; target for a high-level compiler (one per language), source for a low-level translator (one per ISA); goals: portability (abstract away hardware nastiness), flexibility over time; example: Java bytecodes

Summary
- What is an ISA? A functional contract; all ISAs are basically the same, but there are many design choices in the details; two philosophies: CISC/RISC
- A good ISA enables high performance — or at least doesn't get in the way
- Compatibility is a powerful force; tricks: binary translation, uISAs
- Next: single-cycle datapath/control

CIS 371: Computer Organization and Design
Unit 6: Pipelining

Readings
- P&H: Chapter 6 (6.1–6.8)
This Unit: Scalar In-Order Pipelining
- Basic pipelining; pipeline control
- Data hazards: software interlocks and scheduling; hardware interlocks and stalling; bypassing
- Control hazards: branch prediction
- Multi-cycle operations
- Exceptions

Single-Cycle Datapath Performance
- Goes against the "make the common case fast" (MCCF) principle
- + Low CPI: 1
- − Long clock period: must accommodate the slowest instruction

Alternative: Multi-Cycle Datapath
- The multi-cycle datapath attacks the high clock period: cut the datapath into multiple stages (5 here), isolated by FFs; FSM control "walks" each insn through the stages by staging its control signals
- Opposite performance split of the single-cycle datapath: + short clock period, − high CPI
- Insns can bypass stages and exit early (memory ops vs. ALU ops)

Clock Period and CPI
- Single-cycle datapath: low CPI (1), long clock period (accommodates the slowest insn)
- Multi-cycle datapath: short clock period, high CPI
- Can we have both low CPI and a short clock period? There is no good way to make a single insn go faster — but insn latency doesn't matter anyway; insn throughput matters
- Key: exploit inter-insn parallelism

Pipelining
- Pipelining: an important performance technique; improves insn throughput rather than insn latency; exploits parallelism at the insn-stage level to do so
- Begin with the multi-cycle design; when an insn advances from stage 1 to stage 2, the next insn enters stage 1:
  insn0: fetch dec exec
  insn1:       fetch dec exec
- Individual insns take the same number of stages, but insns enter and leave at a much faster rate
- Breaks the fetch/execute "Von Neumann (VN) loop" but maintains the illusion
- Automotive assembly line analogy
5-Stage Multi-Cycle Datapath / 5-Stage Pipelined Datapath
- Temporary values (PC, IR, A, B, O, D) are re-latched every stage
- Why? 5 insns may be in the pipeline at once, and they can't share a single PC
- Notice: the PC is not latched after the ALU stage (why not?)

Pipeline Terminology
- Stages: Fetch, Decode, eXecute, Memory, Writeback
- Latches (pipeline registers): PC, F/D, D/X, X/M, M/W

Some More Terminology
- Scalar pipeline: one insn per stage per cycle (alternative: superscalar — later)
- In-order pipeline: insns enter the execute stage in VN order (alternative: out-of-order — maybe later)
- Pipeline depth: number of pipeline stages; nothing magical about five; the trend has been toward deeper pipelines

Pipeline Example: Cycles 1–7
- 3 instructions flow through the pipeline, one stage apart: add $3,$2,$1; lw $4,0($5); sw $6,4($7)

Pipeline Diagram
- A pipeline diagram is shorthand for what we just saw: cycles across, insns down
- Convention: "X" in cycle 4 means lw $4,0($5) finishes the execute stage and writes into the X/M latch at the end of cycle 4

              1  2  3  4  5  6  7  8  9
add $3,$2,$1  F  D  X  M  W
lw  $4,0($5)     F  D  X  M  W
sw  $6,4($7)        F  D  X  M  W

What About Pipelined Control?
- Should it be like single-cycle control? But individual insn signals must be staged
- Should it be like multi-cycle control? But all stages are
simultaneously active
- How many different controllers are we going to need — one for each insn in the pipeline?
- Solution: use simple single-cycle control, but pipeline it — a single controller whose signals travel down the pipeline with the insn

Example Pipeline Perf Calculation
- Single-cycle: clock period = 50ns, CPI = 1 → performance = 50ns/insn
- Multi-cycle: branch 20% (3 cycles), load 20% (5 cycles), ALU 60% (4 cycles); clock period = 11ns; CPI = (0.2*3) + (0.2*5) + (0.6*4) = 4 → performance = 44ns/insn (why is the clock period 11ns and not 10ns?)
- Pipelined: clock period = 12ns; CPI = 1.5 (an insn completes every 1.5 cycles, on average) → performance = 18ns/insn

Q1: Why Is the Pipeline Clock Period…
- > (delay through the datapath) / (number of pipeline stages)?
- Latches (FFs) add delay; pipeline stages have different delays, and the clock period is the max stage delay
- Both factors have implications for the ideal number of pipeline stages

Q2: Why Is Pipeline CPI…
- > 1?
- CPI for a scalar in-order pipeline is 1 + stall penalties; stalls are used to resolve hazards (not a bad thing — programs would be boring without them)
- Stalls happen naturally in single-/multi-cycle designs, but not in a pipeline
- Calculating pipeline CPI: frequency of each stall * its stall cycles; penalties add — stalls generally don't overlap in in-order pipelines: CPI = 1 + (stall-freq1 * stall-cyc1) + (stall-freq2 * stall-cyc2) + …

Dependences and Hazards
- Dependence: a relationship between two insns — data (two insns use the same storage location) or control (one insn affects whether another executes at all); enforced by making the older insn go before the younger one
- Hazard: a condition that jeopardizes the VN illusion — a dependence plus the possibility of wrong insn order; the effects of wrong insn order cannot be externally visible
- Stall: an artificial pipeline delay introduced to restore the VN illusion; stall for order by keeping the younger insn in the same stage
- Correctness/performance: MCCF — hazards are a bad thing and stalls reduce performance
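The performance numbers above can be checked directly; this snippet just redoes the slide's arithmetic with the given instruction mix and latencies:

```python
# Back-of-the-envelope performance, per the mix given above.
single_cycle_ns = 50 * 1.0                  # 50ns clock, CPI = 1

# Multi-cycle: 11ns clock; branch 20% @ 3 cyc, load 20% @ 5 cyc, ALU 60% @ 4 cyc
multi_cpi = 0.2 * 3 + 0.2 * 5 + 0.6 * 4
multi_cycle_ns = 11 * multi_cpi

pipelined_ns = 12 * 1.5                     # 12ns clock, CPI = 1.5

assert round(multi_cpi, 9) == 4.0
assert round(multi_cycle_ns, 9) == 44.0
assert (single_cycle_ns, pipelined_ns) == (50.0, 18.0)
```

Note the ordering this produces: pipelined (18ns/insn) beats single-cycle (50ns/insn) even though its CPI is worse than 1, because throughput — not per-insn latency — is what the clock period buys.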
- Long penalties are OK if they happen rarely, e.g., CPI = 1 + (0.01 * 10) = 1.1
- Stalls also have implications for the ideal number of pipeline stages

Why Does Every Insn Take 5 Cycles? / Structural Hazards
- Could/should we allow an add to skip M and go straight to W? No: it wouldn't help (peak fetch is still only 1 insn per cycle) and it creates structural hazards (imagine an add following a lw)
- Structural hazard: two insns trying to use the same circuit at the same time — e.g., a structural hazard on the regfile write port
- To fix structural hazards: proper ISA/pipeline design — each insn uses every structure exactly once, for at most one cycle, always at the same stage relative to F (fetch)
- Or tolerate structural hazards: add stall logic to stall the pipeline when they occur

Data Hazards
- Let's forget about branches and their control for a while
- The three-insn sequence we saw earlier executed fine, but it wasn't a real program: real programs have data dependences — they pass values via registers and memory
- Would this program execute correctly on this pipeline? Which insns would execute with correct inputs?
  add $3,$2,$1; lw $4,0($3); addi $6,$3,1; sw $3,0($7)
- add is writing its result into $3 in the current cycle; lw read $3 two cycles ago → got the wrong value; addi read $3 one cycle ago → got the wrong value; sw reads $3 this cycle → OK (regfile timing: write in the first half of the cycle, read in the second half)

Dependent Operations
- Independent operations execute correctly: add $3,$2,$1; add $6,$5,$4
- Would this program execute correctly on a pipeline? add $3,$2,$1; add $6,$5,$3
- What about this one? add $3,$2,$1; …

Memory Data Hazards
- What about data hazards through memory? No problem: a lw following a sw to the same address in the next cycle (e.g., sw $5,0($1); lw $4,0($1)) gets the right value, because DMem read and write take place in the same stage
- Data hazards through registers? Yes (previous slide): they occur because the register write is 3 stages after the register read — you can only read a register value 3 cycles after writing it

Fixing Register Data Hazards
- Option 1: make sure programs don't do it — the compiler puts two independent insns between each write/read insn pair (if they aren't there already)
- "Independent" means: do not interfere with the register in question — do not write it (otherwise the meaning of the program changes) and do not read it (otherwise you create a new data hazard)
- Code scheduling: the compiler moves existing insns around to do this; if none can be found, it must use nops (no-operations)
- This is called software interlocks — MIPS: "Microprocessor w/out Interlocking Pipeline Stages"

Software Interlock Example
  add  $3,$2,$1
  nop
  nop
  lw   $4,0($3)
  sw   $7,0($3)
  addi $3,$5,4
- Can any of the last three insns be scheduled between the first two? sw $7,0($3): no, it creates a hazard with the add; addi $3,$5,4: no, the lw would read $3 from it; an independent insn (e.g., add $6,$2,$8) is OK — but we still need one more insn, so use a nop

Software Interlock Performance
- Same deal: branch 20%, load 20%, store 10%, other 50%
- Software interlocks: 20% of insns require insertion of 1 nop, 5% require insertion of 2 nops
- CPI is still 1 (technically), but now there are more insns: #insns = 1 + (0.2*1) + (0.05*2) = 1.3 — 30% more insns → 30% slowdown due to data hazards

Hardware Interlocks
- Problem with software interlocks: not compatible — where does the "3" in "read a register 3 cycles after writing it" come from? From the structure (depth) of the pipeline
- What if the next MIPS version uses a 7-stage pipeline? Programs compiled assuming the 5-stage pipeline will break
- A better, more compatible way: hardware interlocks — the processor detects data hazards and fixes them
- Two aspects to this: detecting hazards and fixing hazards
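The compiler-side fix above can be sketched as a pass that pads with nops until each register writer is far enough ahead of its first reader. This is a toy model of the 5-stage rule (two intervening insns); the insn encoding is invented:

```python
# Toy software-interlock pass: guarantee >= `distance` insns between a
# register writer and any dependent reader. Insns are (dest, src1, src2)
# tuples (invented format); None marks a nop.
def insert_nops(prog, distance=2):
    out = []
    for insn in prog:
        srcs = insn[1:]
        # scan the most recent `distance` scheduled slots for a producer
        for back, older in enumerate(reversed(out[-distance:]), start=1):
            if older is not None and older[0] in srcs:
                out.extend([None] * (distance - back + 1))  # pad with nops
                break
        out.append(insn)
    return out

prog = [("r3", "r2", "r1"),   # add r3, r2, r1
        ("r4", "r3", "r3")]   # dependent use of r3 right behind it
assert insert_nops(prog) == [("r3", "r2", "r1"), None, None,
                             ("r4", "r3", "r3")]
```

As the slide notes, a real scheduler first tries to fill those two slots with independent insns and only falls back to nops; this sketch shows only the fallback, which is what makes the code 30% bigger in the example calculation.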
Detecting Data Hazards
- Compare the F/D insn's input register names with the output register names of the older insns in the pipeline:
  Hazard = (FD.IR.RS1 == DX.IR.RD) || (FD.IR.RS2 == DX.IR.RD)
        || (FD.IR.RS1 == XM.IR.RD) || (FD.IR.RS2 == XM.IR.RD)

Fixing Data Hazards
- Prevent the F/D insn from reading (advancing) this cycle: write a nop into DX.IR — effectively insert a nop in hardware — and reset (clear) the datapath control signals
- Disable the F/D latch and PC write enables (why?)
- Re-evaluate the situation next cycle

Aside: Insert NOP / Reset Register
- Earlier, registers supported a separate clocked write enable — useful for writes into the register file, and also for implementing stalls
- Registers can also support synchronous reset (clear) — also useful for implementing stalls; implement as an additional hardwired 0 input to the FF data mux
- Resetting a pipeline register is equivalent to inserting a NOP if NOP is all zeros, i.e., if zero means "don't write" for all write-enable control signals — design the ISA/control signals to make sure this is the case

Hardware Interlock Example (lw $4,0($3) behind add $3,$2,$1)
- Cycle 1: lw in F/D, add in D/X → (FD.IR.RS1 == DX.IR.RD) = 1 → stall
- Cycle 2: lw still in F/D, add now in X/M → (FD.IR.RS1 == XM.IR.RD) = 1 → stall
- Cycle 3: add now in M/W → the hazard expression = 0 → lw advances

Pipeline Control Terminology
- The hardware interlock maneuver is called a stall or bubble; the mechanism is called stall logic
- It is part of the more general pipeline control mechanism, which controls the advancement of insns through the pipeline
- Distinguish this from pipelined datapath control, which controls the datapath at each stage; pipeline control controls the advancement of the datapath control
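The hazard comparisons above can be written out directly. A behavioral sketch in Python — latch contents are modeled as dicts, with field names following the slide's FD.IR.RS1 notation:

```python
# Stall detection for the interlocked 5-stage pipeline (toy model).
# Each latch dict holds an insn's register fields; None means a bubble/nop.
def detect_hazard(fd, dx, xm):
    def writes(latch, reg):
        return latch is not None and reg is not None and latch["rd"] == reg
    # F/D insn's sources vs. the destinations of the insns in D/X and X/M
    return any(writes(older, fd[src])
               for older in (dx, xm)
               for src in ("rs1", "rs2"))

fd = {"rs1": 3, "rs2": 5, "rd": 4}   # lw $4,0($3) in F/D
dx = {"rs1": 2, "rs2": 1, "rd": 3}   # add $3,$2,$1 in D/X
assert detect_hazard(fd, dx, None) == True    # cycle 1: stall
assert detect_hazard(fd, None, dx) == True    # cycle 2: add now in X/M
assert detect_hazard(fd, None, None) == False # cycle 3: lw advances
```

Note what is not compared: the M/W latch, because the regfile's write-first-half/read-second-half timing makes a 3-insn-old producer safe, exactly as in the sw case earlier.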
Pipeline Diagram with Data Hazards
- A data hazard stall is indicated with "d*"; the stall propagates to younger insns:

              1  2  3  4  5  6  7  8  9
add $3,$2,$1  F  D  X  M  W
lw  $4,0($3)     F  d* d* D  X  M  W
sw  $6,4($7)        F        D  X  M  W

- This is not good — why? (The sw must hold in fetch with the stalled lw; it cannot slip past it.)

Hardware Interlock Performance
- Same deal: branch 20%, load 20%, store 10%, other 50%
- Hardware interlocks, same as software interlocks: 20% of insns require a 1-cycle stall (i.e., insertion of 1 nop), 5% require a 2-cycle stall
- CPI = 1 + (0.2*1) + (0.05*2) = 1.3
- So either CPI stays at 1 and #insns increases 30% (software), or #insns stays the same and CPI increases 30% (hardware) — same difference
- Anyway, we can do better

Observe
- Technically this situation is broken: lw $4,0($3) has already read $3 from the regfile, and add $3,$2,$1 hasn't yet written $3 to the regfile
- But fundamentally everything is OK: lw $4,0($3) hasn't actually used $3 yet, and add $3,$2,$1 has already computed $3

Bypassing
- Bypassing: reading a value from an intermediate (microarchitectural) source rather than waiting until it is available from the primary source; here we are bypassing the register file; also called forwarding
- M→X bypass: route the X/M latch output back to the ALU input
- W→X bypass: add another bypass path and mux input, for a producer two insns ahead
- Can also bypass to ALU input B (e.g., add $4,$2,$3 right behind add $3,$2,$1)
- W→M bypass (e.g., lw $3,0($2) then sw $3,0($4)): makes sense to the store-data input, yes; to the address input, no (why not?)

Bypass and Stall Logic
- Each mux has its own logic; here it is for the ALU input A mux:
  (DX.IR.RS1 == XM.IR.RD) → select 0 (M→X bypass)
  (DX.IR.RS1 == MW.IR.RD) → select 1 (W→X bypass)
  else → select 2 (regfile value)
- Two separate things: stall logic controls the pipeline registers; bypass logic controls the muxes — but they are complementary: for a given data hazard, if you can't bypass, you must stall
- With full bypassing (all bypasses possible), is stall logic still necessary?

Yes: Load Output to ALU Input
- A load's output is not available until M, so a dependent insn immediately behind it (e.g., lw $3,4($2); add $4,$2,$3) must still stall one cycle:
  Stall = (DX.IR.OP == LOAD) &&
          ((FD.IR.RS1 == DX.IR.RD) ||
           ((FD.IR.RS2 == DX.IR.RD) && (FD.IR.OP != STORE)))

Pipeline Diagram With Bypassing
- Use compiler scheduling to reduce the load-use stall frequency — like software interlocks, but for performance rather than correctness

Pipelining and Multi-Cycle Operations
- What if you wanted to add a multi-cycle operation, e.g., a 4-cycle multiply?
- P/W: a separate output latch connects to the W stage, controlled by the pipeline control and the multiplier FSM

A Pipelined Multiplier
- The multiplier itself is often pipelined — what does this mean? The product/multiplicand registers, ALUs, and latches are replicated (P0/P1, P1/P2, P2/P3), so different multiply operations can start in consecutive cycles

What About Stall Logic?
  Stall = OldStallLogic
       || (FD.IR.RS1 == P0P1.IR.RD) || (FD.IR.RS2 == P0P1.IR.RD)
       || (FD.IR.RS1 == P1P2.IR.RD) || (FD.IR.RS2 == P1P2.IR.RD)
       || (FD.IR.RS1 == P2P3.IR.RD) || (FD.IR.RS2 == P2P3.IR.RD)

Actually, We're Not Done — What Does This Do?
  Stall = OldStallLogic
       || ((FD.IR.RD != 0) && (FD.IR.OP != MULT) && (P0P1.IR.RD != 0))

Actually, It's Somewhat Nastier — And What About This?
  Stall = OldStallLogic
       || ((FD.IR.RD == DX.IR.RD) && (DX.IR.OP == MULT))

Pipeline Diagram with Multiplier

                 1  2  3  4  5  6  7  8  9
mul  $4,$3,$5    F  D  P0 P1 P2 P3 W
addi $6,$4,1        F  d* d* d* D  X  M  W
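The bypass-vs-stall decision above can be sketched behaviorally. This is a toy model (latch encodings and field names invented, following the slide's notation): bypass from X/M or M/W when the value already exists, and stall only for the load-use case where it does not:

```python
# Toy bypass/stall decision for a fully bypassed 5-stage pipeline.
# Latches are dicts with op and register fields; None = bubble.
def alu_in_a_select(dx, xm, mw):
    """Which mux input feeds ALU input A: 'M->X', 'W->X', or 'regfile'."""
    if xm is not None and dx["rs1"] == xm["rd"]:
        return "M->X"                     # value sitting in the X/M latch
    if mw is not None and dx["rs1"] == mw["rd"]:
        return "W->X"                     # value sitting in the M/W latch
    return "regfile"

def load_use_stall(fd, dx):
    """A load's value isn't ready until M: the insn right behind stalls."""
    if dx is None or dx["op"] != "load":
        return False
    return (fd["rs1"] == dx["rd"] or
            (fd["rs2"] == dx["rd"] and fd["op"] != "store"))

dx = {"op": "add", "rs1": 3, "rs2": 1, "rd": 4}
xm = {"op": "add", "rs1": 2, "rs2": 1, "rd": 3}
assert alu_in_a_select(dx, xm, None) == "M->X"
```

The store exclusion in `load_use_stall` mirrors the slide's `FD.IR.OP != STORE` term: a store's data operand can wait for the W→M bypass, so only its address register forces a stall.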
- This is the situation that the "what does this do?" logic tries to avoid: two instructions trying to write the regfile in the same cycle

                 1  2  3  4  5  6  7  8  9
mul  $4,$3,$5    F  D  P0 P1 P2 P3 W
addi $6,$5,1        F  D  X  M  W
add  $5,$5,$10         F  D  X  M  W

More Multiplier Nasties
- And this is the situation the "somewhat nastier" logic tries to avoid: mis-ordered writes to the same register — software thinks add gets $4 from the addi, but it actually gets it from the mul

                 1  2  3  4  5  6  7  8  9
mul  $4,$3,$5    F  D  P0 P1 P2 P3 W
addi $4,$5,1        F  D  X  M  W
add  $10,$4,$5         F  D  X  M  W

- Common? Not for a 4-cycle multiply with a 5-stage pipeline, but more common with deeper pipelines; in any case, it must be correct

Corrected Pipeline Diagram
- With the correct stall logic, mis-ordered writes to the same register are prevented — why two cycles of delay?

                 1  2  3  4  5  6  7  8  9
mul  $4,$3,$5    F  D  P0 P1 P2 P3 W
addi $4,$5,1        F  d* d* D  X  M  W
add  $10,$4,$5            F  D  X  M  W

- Multi-cycle operations complicate pipeline logic

What About Branches?
- Control hazards: we fetch past branch insns before the branch outcome is known
- Default: assume not-taken at fetch (we can't even tell it's a branch yet)

Branch Recovery
- Branch recovery: what to do when the branch is actually taken
- The insns that are about to be written into F/D and D/X are wrong: flush them, i.e., replace them with nops — they haven't written permanent state yet (regfile, DMem)

Branch Recovery Pipeline Diagram
- Convention: don't fill in flushed insns; the taken-branch penalty is 2 cycles

                     1  2  3  4  5  6  7  8  9
addi $3,$0,1         F  D  X  M  W
bnez $3,targ            F  D  X  M  W
(wrong path)               F  D  (flushed)
(wrong path)                  F  (flushed)
targ: addi $8,$7,1               F  D  X  M  W

One MIPS-Specific Way: Fast Branch
- Fast branch: decide at D, not X; the test must be a comparison to zero or an equality test (no time for an ALU)
- + New taken-branch penalty is 1; − additional insns for more complex tests; − must bypass to the D stage too

Branch Performance
- Back-of-the-envelope calculation: branch 20%, load 20%, store 10%, other 50%; 75% of branches are
taken
- Why not 50/50? Loop back edges
- CPI = 1 + (0.2 * 0.75 * 2) = 1.3 → branches cause a 30% slowdown, and it's even worse with deeper pipelines
- How do we reduce this penalty?

Fast Branch Performance
- Assume branch 20%, 75% of branches taken: CPI = 1 + (0.2 * 0.75 * 1) = 1.15 — a 15% slowdown, better than the 30% from before
- But wait: fast branches assume only simple comparisons — fine for P37X & MIPS, but not for ISAs with "branch if $1 > $2" operations; in such cases, say 25% of branches require an extra insn: CPI = 1 + (0.2 * 0.75 * 1) + (0.2 * 0.25 * 1 extra insn) = 1.2
- An example of ISA and microarchitecture interaction: the type of branch instructions; what about condition codes?

More Generally: Speculative Execution
- Speculation: "risky transactions on chance of profit"
- Speculative execution: execute before all parameters are known with certainty; correct speculation avoids a stall, improving performance; incorrect speculation (mis-speculation) must abort/flush/squash the incorrect insns and undo incorrect changes (recover the pre-speculation state)
- The game: (%correct * gain) − ((1 − %correct) * penalty)
- Control speculation: speculation aimed at control hazards; the unknown parameter: are these the correct insns to execute next?

Control Speculation Mechanics
- Guess the branch target and start fetching at the guessed position; doing nothing is implicitly guessing the target is PC+4; can actively guess other targets: dynamic branch prediction
- Execute the branch to verify (check) the guess: correct speculation → keep going; mis-speculation → flush the mis-speculated insns (hopefully they haven't modified permanent state: regfile, DMem) — this happens naturally in an in-order 5-stage pipeline
- The game for the in-order 5-stage pipeline: gain = 2 cycles, penalty = 0 cycles → mis-speculation is no worse than stalling

Dynamic Branch Prediction
- Dynamic branch prediction: guess the outcome and start fetching from the guessed address; flush on mis-prediction (notice: a new recovery circuit)
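The branch-penalty arithmetic above is easy to check; this snippet just redoes the three CPI calculations with the given frequencies:

```python
# Branch 20% of insns, 75% of branches taken (per the slide's mix).
branch_freq, taken_frac = 0.2, 0.75

cpi_plain = 1 + branch_freq * taken_frac * 2          # 2-cycle taken penalty
cpi_fast  = 1 + branch_freq * taken_frac * 1          # fast branch: 1 cycle
# fast branch when 25% of branches need an extra compare insn:
cpi_fast_cmp = cpi_fast + branch_freq * 0.25 * 1

assert round(cpi_plain, 9) == 1.3
assert round(cpi_fast, 9) == 1.15
assert round(cpi_fast_cmp, 9) == 1.2
```

The same frequency-times-penalty form is the general stall-CPI formula from earlier in the unit, which is why reducing either the taken penalty (fast branch) or the effective taken frequency (prediction, next) attacks the same 30% slowdown.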
Simple Branch Target Buffer (BTB)
- Big idea: learn from the past, predict the future
  - Record the past in a hardware structure
- Branch target buffer (BTB): simplest branch predictor
  - Guess the future PC based on past behavior
  - "Last time the instruction at address X was followed by address Y,
    so in the future, if address X is fetched, fetch address Y next"
- Operation
  - A small RAM, like a regfile: address = PC, data = target-PC
  - Accessed at Fetch, in parallel with instruction memory
    - predicted-target = BTB[PC]
  - Updated at X whenever target != predicted-target
    - BTB[PC] = target

Branch Target Buffer (continued)
- At Fetch, how does an insn know it's a branch and should read the BTB?
  - Answer: it doesn't have to
  - All insns read the BTB
  - If the insn isn't a branch, the entry is just PC+4
- The BTB can't hold all PCs; what if 2 PCs alias (map to the same slot)?
  - Answer: it doesn't matter
  - Why? BTB contents are only used as a guess, so they can be wrong
- Which PC bits should be used to index the BTB?

More Efficient BTB
- The naive BTB is space-inefficient
  - Many useless entries for non-branches
- More efficient BTB
  - Only explicitly represent taken branches
  - Implement by tagging each entry with its PC
  - Update at X: if target != predicted-target:
    BTB[PC].tag = PC, BTB[PC].target = target
  - All insns access at Fetch, in parallel with IMem:
    predicted-target = (BTB[PC].tag == PC) ? BTB[PC].target : PC+4

Why Does a BTB Work?
- Because most control insns use direct targets
  - Target encoded in the insn itself → same target every time
- What about indirect targets?
  - Target held in a register → can be different each time
  - Indirect conditional jumps are not widely supported
  - Two indirect call idioms
    + Dynamically linked functions (DLLs): target always the same
    - Dynamically dispatched (virtual) functions: hard, but uncommon
  - Also two indirect unconditional jump idioms
    - Switches: hard, but uncommon
    - Function returns: hard and common... but

Return Address Stack (RAS)
- Return address stack (RAS)
  - Call: RAS[TOS++] = PC+4
  - Return: predicted-target = RAS[--TOS]
- Q: how can you tell an insn is a call/return before decoding it?
  - Accessing the RAS on every insn (BTB-style) doesn't work
  - Option 1: just wait a cycle
  - Option 2: pre-decode bits in IMem, written when the insn is first executed

Branch Direction Prediction
- The BTB makes an implicit branch direction prediction
  - BTB[PC].tag == PC & BTB[PC].target != PC+4 → taken (T)
  - BTB[PC].tag != PC or BTB[PC].target == PC+4 → not-taken (N)
  - Implied policy: predict last taken/not-taken
    + Surprisingly effective, captures the loop idiom: ~75%
    - Pathological in several ways; can do much better: ~95%
- Branch history table (BHT): explicit direction predictor
  - RAM: address = PC, data = N/T (0/1), typically untagged
  - Many more entries than the BTB
  - Individual conditional branches are often unbiased or weakly biased
    - 90%+ one way or the other is considered "biased"
  - Advanced algorithms use inter-branch correlation, tournaments, etc.
  - Still actively researched

Branch History Table (BHT)
- Branch history table (BHT): simplest direction predictor
  - PC indexes a table of bits (N = 0, T = 1), no tags
  - Essentially: the branch will go the same way it went last time
- Problem: consider the inner-loop branch below

    for (i = 0; i < 100; i++)
        for (j = 0; j < 3; j++)
            // whatever

  State/prediction: N T T T | N T T T | N T T T
  Outcome:          T T T N | T T T N | T T T N

  - Two built-in mis-predictions per inner-loop iteration
  - The branch predictor changes its mind too quickly

Two-Bit Saturating Counters (2bc)
- Two-bit saturating counters (2bc) [Smith]
  - Replace each single-bit prediction with a 2-bit counter
  - States 0,1,2,3 = N,n,t,T
  - Adds hysteresis
  - Forces the DIRP to mis-predict twice before changing its mind

  State/prediction: N n t T | t T T T | t T T T
  Outcome:          T T T N | T T T N | T T T N

  - One mis-predict per loop execution, rather than two
  + Fixes this pathology (which is not contrived, by the way)
- Can we do even better?
Correlated Predictor
- Correlated (two-level) predictor [Patt]
  - Exploits the observation that branch outcomes are correlated
  - Maintains a separate prediction per PC+BHR pair
  - Branch history register (BHR): recent branch outcomes
- Simple working example: assume the program has one branch
  - BHT: one 1-bit DIRP entry
  - BHT + 2-bit BHR: 2^2 = 4 1-bit DIRP entries
  - [prediction/outcome trace garbled in the original]
  - We didn't make anything better; what's the problem?

Correlated Predictor (continued)
- What happened?
  - The BHR wasn't long enough to capture the pattern
  - Try again: BHT + 3-bit BHR: 2^3 = 8 1-bit DIRP entries
  + No mis-predictions after the predictor learns all the relevant patterns

Correlated Predictor (continued)
- Design choice I: one global BHR, or one per PC (local)?
  - Each one captures different kinds of patterns
  - Global is better; it also captures local patterns for tight-loop branches
- Design choice II: how many history bits (BHR size)?
  - Tricky: given unlimited resources, longer BHRs are better, but...
  - BHT utilization decreases
    - Many history patterns are never seen
    - Many branches are history-independent (don't care)
    - PC xor BHR allows multiple PCs to dynamically share the BHT
    - BHR length < log2(BHT size)
  - The predictor takes longer to train
  - Typical length: 8-12

Hybrid Predictor
- Hybrid (tournament) predictor [McFarling]
  - Attacks the correlated predictor's BHT-utilization problem
- Idea: combine two predictors
  - Simple BHT predicts history-independent branches
  - Correlated predictor predicts only the branches that need history
  - Chooser assigns branches to one predictor or the other
  - Branches start in the simple BHT, move on a mis-prediction threshold
  + Correlated predictor can be made smaller, handles fewer branches
  + 90-95% accuracy
- Today's processors usually use some hybrid
When to Perform Branch Prediction?
- During Fetch
  - Access the BHT and BTB in parallel with instruction memory
  - Use the BHT result to set the next PC to either PC+4 or BTB[PC]
- During Decode
  - Look at the instruction opcode to determine branch instructions
  - Can calculate the next PC from the instruction, for PC-relative branches
  - One-cycle mis-fetch penalty, even if the branch predictor is correct
- Quick prediction at fetch, better prediction during decode

Branch Prediction Performance
- Dynamic branch prediction
  - Simple BTB at fetch: branches predicted with 75% accuracy
    - CPI = 1 + 20%*25%*2 = 1.1
  - More advanced BTB+BHT predictor at fetch: 95% accuracy
    - CPI = 1 + 20%*5%*2 = 1.02
  - BTB during Fetch, BHT during Decode:
    75% accuracy at fetch, 95% accuracy at decode
    - CPI = 1 + 20%*25%*1 + 20%*5%*1 = 1.06
- Branch mis-predictions are still a big problem, though
  - Pipelines are long: typical mis-prediction penalty is 10+ cycles
  - Pipelines have full bypassing; the compiler schedules the rest (just FYI)
  - Pipelines are superscalar (later)

Pipelining And Exceptions
- Pipelining makes exceptions nasty
  - 5 insns in the pipeline at once
  - An exception happens: how do you know which insn caused it?
    - Exceptions propagate along the pipeline in latches
  - Two exceptions happen: how do you know which one to take first?
    - The one belonging to the oldest insn
  - When handling an exception, you have to flush younger insns
    - Piggyback on the branch mis-prediction machinery to do this
  - What about multi-cycle operations?

Summary
- Basics of pipelining
  - Pipeline diagrams
- Data hazards
  - Software interlocks / code scheduling
  - Hardware interlocks / stalling
  - Bypassing
- Multi-cycle operations

Pipeline Depth
- No magic about 5 stages; the trend had been toward deeper pipelines
  - 486: 5 stages (~50 gate delays / clock)
  - Pentium: 7 stages
  - Pentium II/III: 12 stages
  - Pentium 4: 22 stages (~10 gate delays / clock), "super-pipelining"
  - Core 1/2: 14 stages
- Increasing pipeline depth
  + Increases clock frequency (reduces period)
  - But decreases IPC (increases CPI)
    - Branch mis-prediction penalty becomes longer
    - Non-bypassed data-hazard stalls become longer
  - At some point, CPI losses offset clock gains; the question is when
  - The 1 GHz Pentium 4
was slower than the 800 MHz Pentium III
  - What was the point? Customers buy frequency, not IPC*frequency

Summary (continued)
- Control hazards
  - Branch prediction

CIS 371: Computer Organization and Design
Unit 10: Virtual Memory & I/O

CIS 371 (Roth/Martin): Virtual Memory & I/O

This Unit: Virtualization & I/O
- The operating system (OS)
  - A "super-application"
  - Hardware support for an OS
- Virtual memory
  - Page tables and address translation
  - TLBs and memory hierarchy issues
- I/O
  - Magnetic storage (disks)
  - Solid-state storage (flash memory)

Rest of the Semester
- Topics yet to cover
  - Virtual memory
  - Disks & I/O
  - Power & energy
  - Reliability
  - Data-level parallelism: SIMD, vectors, and GPUs
  - XBOX 360 case study
- Perennial problem: only three lectures remaining
  - By design, we're actually on schedule
  - Will try to hit the highlights, give some context
  - Try not to be overwhelmed

Readings
- P&H
  - Virtual memory: 5.4
  - I/O & disks: 6.1-6.5, 6.8

A Computer System: Hardware
- CPUs and memories
  - Connected by the memory bus
- I/O peripherals: storage, input, display, network
  - With separate or built-in DMA
  - Connected by the system (I/O) bus, which is connected to the memory bus

A Computer System: OS
- Operating System (OS): virtualizes hardware for apps
  - Abstraction: provides services (e.g., threads, files, etc.)
    + Simplifies the app programming model ("raw" hardware is nasty)
  - Isolation: gives each app the illusion of private CPU, memory, I/O
    + Simplifies the app programming model
    + Increases hardware resource utilization

A Computer System: App Software
- Application software: the computer must do something

Operating System (OS) and User Apps
- Sane system development
requires a split
  - Hardware itself facilitates/enforces this split
- Operating System (OS): a super-privileged process
  - Manages hardware resource allocation/revocation for all processes
  - Has direct access to resource allocation features
  - Aware of many nasty hardware details
  - Aware of other processes
  - Talks directly to input/output devices (device driver software)
- User-level apps: ignorance is bliss
  - Unaware of most nasty hardware details
  - Unaware of other apps (and the OS)
  - Explicitly denied access to resource allocation features

System Calls
- Controlled transfers to/from the OS
- System call: a user-level app "function call" to the OS
  - Leave a description of what you want done in registers
  - SYSCALL instruction (also called TRAP or INT)
  - Can't allow user-level apps to invoke arbitrary OS code
    - Restricted set of legal OS addresses to jump to (trap vector)
  - Processor jumps to the OS using the trap vector
    - Sets privileged mode
  - OS performs the operation
  - OS does a "return from system call"
    - Unsets privileged mode

Typical I/O Device Interface
- The operating system talks to the I/O device
  - Send commands, query status, etc.
  - Software uses special uncached load/store operations
  - Hardware sends these reads/writes across the I/O bus to the device
- Direct Memory Access (DMA)
  - For big transfers, the I/O device accesses memory directly
  - Example: DMA used to transfer an entire block to/from disk
- Interrupt-driven I/O
  - The I/O device tells the software its transfer is complete
  - Tells the hardware to raise an interrupt ("doorbell")
  - Processor jumps into the OS
  - Inefficient alternative: polling

Interrupts
- Exceptions: synchronous, generated by the running app
  - E.g., illegal insn, divide by zero, etc.
- Interrupts: asynchronous events generated externally
  - E.g., timer, I/O request/reply, etc.
- Interrupt handling: same mechanism for both
  - Interrupts are on-chip signals/bits
    - Either internal (e.g., timer, exceptions) or connected to
pins
  - The processor continuously monitors interrupt status, acts when one is high
  - Hardware jumps to some preset address in OS code (interrupt vector)
  - Like an asynchronous, non-programmatic SYSCALL
- Timer: programmable on-chip interrupt
  - Initialize with some number of microseconds
  - Timer counts down and interrupts when it reaches 0

Virtualizing Processors
- How do multiple apps (and the OS) share the processors?
- Goal: applications think there is an infinite number of processors
- Solution: time-share the resource
  - Trigger a context switch at a regular interval (~1ms)
    - Preemptive: the app doesn't yield the CPU, the OS forcibly takes it
    + Stops greedy apps from starving the others
  - Architected state: PC, registers
    - Save and restore them on context switches
    - Memory state?
  - Non-architected state: caches, branch predictor tables, etc.
    - Ignore or flush
- The OS is responsible for handling context switching
  - Hardware support is just the timer interrupt

Virtualizing Main Memory
- How do multiple apps (and the OS) share main memory?
- Goal: each application thinks it has infinite memory
  - One app may want more memory than is in the system
    - An app's insn/data footprint may be larger than main memory
  - Requires main memory to act like a cache
    - With disk as the next level in the memory hierarchy (slow)
    - Write-back, write-allocate, large blocks or "pages"
    - No notion of a program not "fitting" in registers or caches (why?)
- Solution
  - Part 1: treat memory as a cache
    - Store the overflowed blocks in "swap space" on disk
  - Part 2: add a level of indirection (address translation)

Virtual Memory (VM)
- Virtual Memory (VM)
  - A level of indirection, like register renaming
  - Application-generated addresses are virtual addresses (VAs)
    - Each process thinks it has its own 2^N bytes of address space
  - Memory is accessed using physical addresses (PAs)
  - VAs are translated to PAs at some coarse granularity
  - The OS controls the VA-to-PA mapping for itself and all other processes
  - Logically:
translation is performed before every insn fetch, load, and store
  - Physically: hardware acceleration removes the translation overhead
  - OS-controlled VA→PA mappings onto physical memory

Virtual Memory (VM): The Program
- Programs use virtual addresses (VAs): code, heap, stack
  - 0 ... 2^N - 1
  - VA size (N) also referred to as "machine size"
  - E.g., 32-bit (embedded) or 64-bit (server)
- Memory uses physical addresses (PAs)
  - 0 ... 2^M - 1, typically M < N, especially if N = 64
  - 2^M is the most physical memory the machine supports
- VA→PA at page granularity (VP→PP)
  - By system
  - The mapping need not preserve contiguity
  - A VP need not be mapped to any PP
  - Unmapped VPs live on disk (swap)

VM is an Old Idea: Older than Caches
- Original motivation: single-program compatibility
  - IBM System 370: a family of computers with one software suite
    + The same program could run on machines with different memory sizes
    - Prior, programmers explicitly accounted for memory size
- But also full associativity + software replacement
  - Memory t_miss is high: extremely important to reduce the miss rate
- [Table comparing cache (I$/D$, L2) and main-memory parameters garbled
  in the original: latencies (~2ns / ~10ns / ~30ns), capacities
  (8-64KB / 128KB-2MB), block sizes (16-32B / 32-256B), NMRU replacement]

Uses of Virtual Memory
- More recently: isolation and multi-programming
  - Each app thinks it has 2^N B of memory; its stack starts at 0xFFFFFFFF
  - Apps are prevented from reading/writing each other's memory
    - Can't even address the other program's memory
- Protection
  - Each page has a read/write/execute permission set by the OS
  - Enforced by hardware
- Inter-process communication
  - Map the same physical pages into multiple virtual address spaces
  - Or share files via the UNIX mmap() call

Address Translation

  virtual address[31:0]  = [ VPN [31:16] | POFS [15:0] ]
                              translate    don't touch
  physical address[27:0] = [ PPN [27:16] | POFS [15:0] ]

- The VA→PA mapping is called address translation
- Split the VA into virtual page number (VPN) & page offset (POFS)
- Translate the VPN
into the physical page number (PPN)
  - The POFS is not translated
  - VA→PA: [VPN, POFS] → [PPN, POFS]
- Example above
  - 64KB pages → 16-bit POFS
  - 32-bit machine → 32-bit VA → 16-bit VPN
  - Maximum 256MB memory → 28-bit PA → 12-bit PPN

Virtual Memory: The Basics
- Programs use virtual addresses (VAs)
  - VA size (N), aka "machine size" (e.g., Core 2 Duo: 48-bit)
- Memory uses physical addresses (PAs)
  - PA size (M), typically M < N, especially if N = 64
  - 2^M is the most physical memory the machine supports
- VA→PA at page granularity (VP→PP)
  - The mapping need not preserve contiguity
  - A VP need not be mapped to any PP
  - Unmapped VPs live on disk (swap), or nowhere if not yet touched

Address Translation Mechanics I
- How are addresses translated?
  - In software (for now), but with hardware acceleration (a little later)
- Each process is allocated a page table (PT)
  - A software data structure constructed by the OS
  - Maps VPs to PPs, or to disk (swap) addresses
  - VP entries are empty if the page has never been referenced
  - Translation is a table lookup

    struct PTE { int ppn; int is_valid, is_dirty, is_swapped; };
    struct PTE page_table[NUM_VIRTUAL_PAGES];

    int translate(int vpn) {
        if (page_table[vpn].is_valid)
            return page_table[vpn].ppn;
    }

Page Table Size
- How big is the page table on the following machine?
  - 32-bit machine
  - 4B page table entries (PTEs)
  - 4KB pages
  - 32-bit machine → 32-bit VA → 4GB virtual memory
  - 4GB virtual memory / 4KB page size → 1M VPs
  - 1M VPs * 4B/PTE → 4MB
- How big would the page table be with 64KB pages?
- How big would it be for a 64-bit machine?
- Page tables can get big
  - There are ways of making them smaller
Multi-Level Page Table (PT)
- One way: multi-level page tables
  - A tree of page tables
  - Lowest-level tables hold PTEs
  - Upper-level tables hold pointers to lower-level tables
  - Different parts of the VPN index different levels
- 20-bit VPN (machine from the last slide)
  - Upper 10 bits index the 1st-level table
  - Lower 10 bits index the 2nd-level table

    struct PTE { int ppn; int is_valid, is_dirty, is_swapped; };
    struct L2PT { struct PTE ptes[1024]; };
    struct L2PT *page_table[1024];        // pt root

    int translate(int vpn) {
        int index1 = vpn >> 10;           // upper 10 bits
        int index2 = vpn & 0x3ff;         // lower 10 bits
        struct L2PT *l2pt = page_table[index1];
        if (l2pt != NULL && l2pt->ptes[index2].is_valid)
            return l2pt->ptes[index2].ppn;
    }

- Example: two-level page table for the machine on the last slide
  - Compute the number of pages needed for the lowest-level PTEs
    - 4KB pages / 4B PTEs → 1K PTEs/page
    - 1M PTEs / 1K PTEs/page → 1K pages
  - Compute the number of pages needed for the upper-level pointers
    - 1K lowest-level pages → 1K pointers
    - 1K pointers * 4B (32-bit VA) → 4KB → 1 upper-level page

Multi-Level Page Table (PT): Space Savings
- Have we saved any space?
  - Isn't the total size of the 2nd-level tables the same as a
    single-level table (i.e., 4MB)?
  - Yes, but...
- Large virtual-address regions are unused
  - The corresponding 2nd-level tables need not exist
  - The corresponding 1st-level pointers are null
- Example: 2MB code, 64KB stack, 16MB heap
  - Each 2nd-level table maps 4MB of virtual addresses
  - 1 for code, 1 for stack, 4 for heap, +1 1st-level
  - 7 total pages = 28KB; much less than 4MB

A Computer System: I/O Subsystem
- I/O subsystem: kind of boring, kind of important
- I/O devices: storage, input, display, network
- I/O bus
- Software
  - Virtualized by the OS
  - Device drivers
  - Present a synchronous interface for asynchronous devices

One Instance of I/O: Swap Space
- Have briefly seen one instance of I/O
  - Disk: the bottom of the memory hierarchy
  - Used as swap space
  - Exploits the large capacity of disks
  - Cheaper per bit than DRAM
- Other use of disk
  - Non-volatile storage: retains state when the power is off

I/O Devices
- Primary characteristic: data rate (bandwidth)
  - Latency is really only an issue for disk and network
- Contributing factors: input/output/both? partner?
- Interesting devices have high data rates
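The flat and two-level page-table sizing arithmetic from the slides above can be captured in two small helper functions. This is a sketch; the function names and the x100-free integer style are our own, and the two-level helper hard-codes the slides' 1K-PTE x 4KB-page geometry (4MB of virtual addresses per 2nd-level table):

```c
#include <assert.h>
#include <stdint.h>

/* Size of a single-level (flat) page table:
 * #VPs = 2^va_bits / page size; table = #VPs * PTE size. */
static uint64_t flat_pt_bytes(unsigned va_bits, uint64_t page_bytes,
                              uint64_t pte_bytes) {
    uint64_t num_vps = ((uint64_t)1 << va_bits) / page_bytes;
    return num_vps * pte_bytes;               /* one PTE per VP */
}

/* Number of 2nd-level tables a virtual-address region needs, when each
 * 2nd-level table maps 4MB (1024 PTEs x 4KB pages). Rounds up. */
static uint64_t l2_tables_needed(uint64_t region_bytes) {
    uint64_t per_table = 1024u * 4096u;       /* 4MB of VA per table */
    return (region_bytes + per_table - 1) / per_table;
}
```

With the slides' numbers: flat_pt_bytes(32, 4096, 4) gives the 4MB flat table, and the 2MB code + 64KB stack + 16MB heap example needs 1 + 1 + 4 second-level tables (plus one 1st-level page, for 7 pages = 28KB).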
Anatomy of a Disk Drive
- Disk rotates, like a CD/DVD player
  - Except magnetic, not optical
- Collection of platters
  - Each with a read/write head
- Platters are divided into concentric tracks
  - The head "seeks" to a track
  - All heads move in unison
- Each track is divided into sectors
  - More sectors on outer tracks
  - Sectors rotate under the head
- Controller
  - Seeks heads, waits for sectors
  - Turns heads on and off
  - May have its own cache (a few MBs)
    - Exploits spatial locality

Disk Latency
- Disk read/write latency has four components
- Seek delay (t_seek): the head seeks to the right track
  - Average of 5ms-15ms
  - Less in practice because of shorter seeks
- Rotational delay (t_rotation): the right sector rotates under the head
  - On average: the time to go halfway around the disk
  - Based on rotation speed (RPM): 10,000 to 15,000 RPMs
  - ~3ms
- Transfer time (t_transfer): data actually being transferred
  - Fast for small blocks
- Controller delay (t_controller): controller overhead on either side
  - Fast (no moving parts)

  t_disk = t_seek + t_rotation + t_transfer + t_controller

Disk Bandwidth: Sequential vs Random
- Disk is bandwidth-inefficient for page-sized transfers
  - Sequential vs random accesses
- Random accesses
  - One read per disk access latency (10ms)
  - Randomly reading 4KB pages
    - 10ms is 0.01 seconds → 100 accesses per second
    - 4KB * 100 accesses/sec → 400KB/second bandwidth
- Sequential accesses
  - Stream data from disk (no seeks)
  - 128 sectors/track, 512 B/sector, 6000 RPM
  - 64KB per rotation, 100 rotations per sec
  - 6400KB/sec → 6.4MB/sec
- Sequential access gets 10x or more bandwidth than random
  - Still nowhere near the 1GB/sec to 10GB/sec of memory

Disk Latency Example
- Example: time to read a 4KB chunk, assuming...
  - 128 sectors/track, 512 B/sector, 6000 RPM, 10ms t_seek, 1ms t_controller
  - 6000 RPM → 100 R/s → 10 ms/R → t_rotation = 10 ms / 2 = 5 ms
  - 4KB page → 8 sectors → t_transfer = 10 ms * 8/128 = 0.6 ms
  - t_disk = t_seek + t_rotation + t_transfer + t_controller
  - t_disk = 10 + 5 + 0.6 + 1 = 16.6 ms
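The disk-latency example above is just the four-term sum evaluated with the slide's parameters. A minimal sketch in C, working in microseconds to stay in integer arithmetic (the function name and parameter choices are ours):

```c
#include <assert.h>

/* t_disk = t_seek + t_rotation + t_transfer + t_controller, in us.
 * t_rotation is half a revolution on average; t_transfer is the
 * fraction of a revolution covered by the requested sectors. */
static long disk_latency_us(long seek_us, long ctrl_us, long rpm,
                            long sectors_per_track, long xfer_sectors) {
    long rev_us = 60L * 1000000L / rpm;          /* one revolution */
    long t_rotation = rev_us / 2;                /* half turn, average */
    long t_transfer = rev_us * xfer_sectors / sectors_per_track;
    return seek_us + t_rotation + t_transfer + ctrl_us;
}
```

Plugging in the slide's numbers (10ms seek, 1ms controller, 6000 RPM, 128 sectors/track, 8 sectors for a 4KB page) reproduces the 16.6 ms result.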
Some Old Example Disks
- [Table of example disk drives (Hitachi and others) garbled in the
  original: diameter, cache size, RPM (3600-15000), seek time
  (4.5-12 ms), sustained data rate (10-100 MB/s), cost]
- Flash: non-volatile CMOS storage
  - The "new disk": replacing disk in many applications

CIS 371: Computer Organization and Design
Unit 8: Static and Dynamic Scheduling

CIS 371 (Roth/Martin): Scheduling

This Unit: Code Scheduling
- Pipelining and superscalar review
- Code scheduling
  - To reduce pipeline stalls
  - To increase ILP (insn-level parallelism)
- Two approaches
  - Static scheduling by the compiler
  - Dynamic scheduling by the hardware

Readings
- P&H: Chapter 6.9 (again)

Pipelining Review
- Increases clock frequency by staging instruction execution
  - Scalar pipelines have a best-case CPI of 1
- Challenges: data and control dependencies further worsen CPI
  - Data: with full bypassing, load-to-use stalls
  - Control: use branch prediction to mitigate the penalty
- Big win: done by all processors today
- How many stages (depth)?
  - Five stages is a pretty good minimum
  - Intel Pentium II/III: 12 stages
  - Intel Pentium 4: 22 stages
  - Intel Core 2: 14 stages

Pipeline Diagram: Compiler Scheduling
- Use compiler scheduling to reduce load-use stall frequency
  - Like software interlocks, but for performance, not correctness
- Sketch (the register operands are garbled in the original): move an
  independent insn (the addi) into the load's delay slot so the
  dependent sub no longer stalls

  Before:                           After:
       1 2 3 4 5 6 7 8 9                 1 2 3 4 5 6 7 8
  add  F D X M W                    add  F D X M W
  lw     F D X M W                  lw     F D X M W
  sub      F d* D X M W             addi     F D X M W
  addi         F D X M W            sub        F D X M W

Superscalar Pipeline Diagrams: Ideal
- Scalar (one insn per cycle):

                 1 2 3 4 5 6 7 8 9 10 11
  lw 0(r1),r2    F D X M W
  lw 4(r1),r3      F D X M W
  lw 8(r1),r4        F D X M W
  add r4,r5,r6         F D X M W
  add r2,r3,r7           F D X M W
  add r7,r6,r8             F D X  M  W
  lw 0(r8),r9                F D  X  M  W

- 2-way superscalar (up to two insns per cycle):

                 1 2 3 4 5 6 7 8
  lw 0(r1),r2    F D X M W
  lw 4(r1),r3    F D X M W
  lw 8(r1),r4      F D X M W
  add r4,r5,r6     F D X M W
  add r2,r3,r7       F D X M W
  add r7,r6,r8       F D X M W
  lw 0(r8),r9          F D X M W
prediction harder misprediction more costly wide decode stall logic wide execute more ALUs wide bypassing more possibly bypassing paths 0 Finding enough independent instructions and fill delay slots How many instructions per cycle max width 0 Really simple lowpower cores are still scalar single issue 0 Even lowpower cores a dualissue Intel Atom aka Silverthorne 0 Most desktoplaptop chips threeissue or fourissue o A few 5 or 6issue chips have been built IBM Power4 Itanium 11 C15 371 RomMarlin Scheduling 6 Superscalar Pipeline Diagrams Realistic scalar 1 2 1w 0r1r2 F D 1w 4r1r3 F 1w 3r1r4 add r4r5rs add r2r3r7 add r7rsra 1w 0rar9 6789101112 39nUXw 39nUXZ b xz m w M D F 39nUX 39nUXZ oxz 2waysuperscalar 1 6 7 8 9 1011 12 1w 0r1r2 F 2 D 1w 4r1r3 F D F F X334 1w 8r1r4 add r4r5r6 add r2r3r7 add r7r6r8 1w 0rar9 C15 371 RomMarlin Superscalar Pipelines 8 39nDg UXXw 060 mmUUz m ogoxx 0X33 xz 3 Code Scheduling 0 Scheduling act of finding independent instructions 0 Static done at compile time by the compiler software 0 Dynamic done at runtime by the processor hardware 0 Why schedule code 0 Scalar pipelines fill in loadtouse delay slots to improve CPI o Superscalar place independent instructions together 0 As above loadtouse delay slots 0 Allow multipleissue decode logic to let them execute at the same time 215 371 RomMarlin Scheduling Compiler Scheduling Requires 0 Large scheduling scope 0 Independent instruction to put between loaduse pairs Original example large scope two independent computations This example small scope one computation Before After ld r24sp ld r38sp add r3r2rl stall st rl0sp ld r24sp ld r38sp add r3r2rl stall st rl0sp 0 One way to create larger scheduling scopes 215371 RomMartin Pipelining 215371 RomMartin Pipelining Compiler Scheduling o Compiler can schedule move instructions to reduce stalls 0 Basic pipeline scheduling eliminate backto back loaduse pairs 0 Example code sequence a b c d f e Sp stack pointer sp0 is a sp4 is b etc Before After ld r24sp ld 
  Before:                          After:
  ld r2,4(sp)                      ld r2,4(sp)
  ld r3,8(sp)                      ld r3,8(sp)
  add r3,r2,r1   // stall          ld r5,16(sp)
  st r1,0(sp)                      add r3,r2,r1   // no stall
  ld r5,16(sp)                     ld r6,20(sp)
  ld r6,20(sp)                     st r1,0(sp)
  sub r5,r6,r4   // stall          sub r5,r6,r4   // no stall
  st r4,12(sp)                     st r4,12(sp)

Compiler Scheduling Requires
- Enough registers
  - To hold the additional live values
  - The example code contains 7 different values (including sp)
  - Before: max 3 values live at any time → 3 registers enough
  - After: max 4 values live → 3 registers not enough
  - [The original/"wrong" code columns are garbled in the original:
    with only 3 registers, the scheduled version hoists a load whose
    destination clobbers a still-live register, computing the wrong
    result]

Compiler Scheduling Requires
- Alias analysis
  - The ability to tell whether a load/store reference the same memory
    locations
  - Effectively, whether a load/store pair can be rearranged
  - Example code: easy, loads/stores use the same base register (sp)
  - New example: can the compiler tell that r8 != sp? Must be conservative

  Before:                          Wrong(?):
  ld r2,4(sp)                      ld r2,4(sp)
  ld r3,8(sp)                      ld r3,8(sp)
  add r3,r2,r1   // stall          ld r5,0(r8)    // does r8 == sp?
  st r1,0(sp)                      add r3,r2,r1
  ld r5,0(r8)                      ld r6,4(r8)    // does r8+4 == sp?
  ld r6,4(r8)                      st r1,0(sp)
  sub r5,r6,r4   // stall          sub r5,r6,r4
  st r4,8(r8)                      st r4,8(r8)

New Metric: Utilization
- Utilization: actual performance / peak performance
  - An important metric for performance/cost
  - No point in paying for hardware you will rarely use
- Adding hardware usually improves performance & reduces utilization
  - Additional hardware can only be exploited some of the time
  - Diminishing marginal returns
- The compiler can help make better use of existing hardware
  - Important for superscalar

Code Example: SAXPY
- SAXPY (Single-precision A times X Plus Y)
  - A linear algebra routine used in solving systems of equations
  - Part of the early "Livermore Loops" benchmark suite

    for (i = 0; i < N; i++)
        Z[i] = A*X[i] + Y[i];

    loop:
    0: ldf  X(r1),f1
    1: mulf f0,f1,f2     // A in f0
    2: ldf  Y(r1),f3     // X, Y, Z are constant addresses
    3: addf f2,f3,f4
    4: stf  f4,Z(r1)
    5: addi r1,4,r1      // i (*4) in r1
    6: blt  r1,r2,0      // N*4 in r2

SAXPY
Performance and Utilization
- Scalar pipeline
  - Full bypassing, 5-cycle E* (mulf), 2-cycle E+ (addf), branches
    predicted taken
  - [per-cycle pipeline diagram garbled in the original]
  - Single iteration (7 insns) latency: 16 - 5 = 11 cycles
  - Performance: 7 insns / 11 cycles = 0.64 IPC
  - Utilization: 0.64 actual IPC / 1 peak IPC = 64%

SAXPY Performance and Utilization (2-way superscalar)
- Any two insns per cycle, split integer and floating-point pipelines
  - [per-cycle pipeline diagram garbled in the original]
  - Performance: 7 insns / 10 cycles = 0.70 IPC
  - Utilization: 0.70 actual IPC / 2 peak IPC = 35%
  - More hazards → more stalls
  - Each stall is more expensive

Compiler Instruction Scheduling
- Idea: place independent insns between slow ops and their uses
  - Otherwise the pipeline stalls while waiting for RAW hazards to resolve
  - Have already seen this: pipeline scheduling
- To schedule well, you need independent insns
  - Scheduling scope: the code region we are scheduling
  - The bigger the better (more independent insns to choose from)
  - Once the scope is defined, the schedule is pretty obvious
  - The trick is creating a large scope (must schedule across branches)
- Compiler scheduling: really scope-enlarging techniques
  - Loop unrolling (for loops)

Loop Unrolling SAXPY
- Goal: separate dependent insns from one another
- SAXPY problem: not enough flexibility within one iteration
  - The longest chain of insns is 9 cycles
    - Load (1)
    - Forward to multiply (5)
    - Forward to add (2)
    - Forward to store (1)
  - Can't hide a 9-cycle chain using only 7 insns
  - But how about two 9-cycle chains using 14 insns?
- Loop unrolling: schedule two or more iterations together
  - Fuse iterations
  - Schedule to reduce stalls
  - Scheduling introduces ordering problems; rename registers to fix

Unrolling SAXPY I: Fuse Iterations
- Combine two (in general K) iterations of the loop
  - Fuse loop control: induction variable (i) increment + branch
  - Adjust implicit induction uses: constants → constants + 4

  One iteration (x2):          Fused:
  ldf  X(r1),f1                ldf  X(r1),f1
  mulf f0,f1,f2                mulf f0,f1,f2
  ldf  Y(r1),f3                ldf  Y(r1),f3
  addf f2,f3,f4                addf f2,f3,f4
  stf  f4,Z(r1)                stf  f4,Z(r1)
  addi r1,4,r1                 ldf  X+4(r1),f1
  blt  r1,r2,0                 mulf f0,f1,f2
  ldf  X(r1),f1                ldf  Y+4(r1),f3
  mulf f0,f1,f2                addf f2,f3,f4
  ldf  Y(r1),f3                stf  f4,Z+4(r1)
  addf f2,f3,f4                addi r1,8,r1
  stf  f4,Z(r1)                blt  r1,r2,0
  addi r1,4,r1
  blt  r1,r2,0

Unrolling SAXPY II: Pipeline Schedule
- Pipeline schedule to reduce stalls
  - Have already seen this: pipeline scheduling

  Fused:                       Scheduled:
  ldf  X(r1),f1                ldf  X(r1),f1
  mulf f0,f1,f2                ldf  X+4(r1),f1
  ldf  Y(r1),f3                mulf f0,f1,f2
  addf f2,f3,f4                mulf f0,f1,f2
  stf  f4,Z(r1)                ldf  Y(r1),f3
  ldf  X+4(r1),f1              ldf  Y+4(r1),f3
  mulf f0,f1,f2                addf f2,f3,f4
  ldf  Y+4(r1),f3              addf f2,f3,f4
  addf f2,f3,f4                stf  f4,Z(r1)
  stf  f4,Z+4(r1)              stf  f4,Z+4(r1)
  addi r1,8,r1                 addi r1,8,r1
  blt  r1,r2,0                 blt  r1,r2,0
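The fuse-and-schedule transform above is easier to see at the source level. A C-level sketch of the same 2x unrolling the slides do in assembly (the slides work on the assembly; the cleanup loop for odd N is our addition):

```c
#include <assert.h>

/* Plain SAXPY: Z[i] = A*X[i] + Y[i], one element per iteration. */
static void saxpy(float a, const float *x, const float *y, float *z, int n) {
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}

/* 2x-unrolled SAXPY: two fused iterations per trip, so the compiler
 * (or hardware) has two independent mul/add chains to interleave. */
static void saxpy_unrolled2(float a, const float *x, const float *y,
                            float *z, int n) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {   /* fused pair of iterations */
        z[i]     = a * x[i]     + y[i];
        z[i + 1] = a * x[i + 1] + y[i + 1];
    }
    for (; i < n; i++)                 /* cleanup when n is odd */
        z[i] = a * x[i] + y[i];
}
```

Both versions compute identical results; only the schedule of independent work per iteration changes.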
large scope must schedule across branches 0 Compiler scheduling really scope enlarging techniques 0 Loop unrolling for loops CIS 371 RomMarlin Scheduling 18 Unrolling SAXPY I Fuse Iterations 0 Combine two in general K iterations of loop 0 Fuse loop control induction variable i increment branch 0 Adjust implicit induction uses constants a constants 4 ldf Xrl fl ldf Xrl fl mulf f0f1f2 mulf f0f1f2 ldf Yrl f3 ldf Yrl f3 addf f2f3f4 addf f2f3f4 stf f4Zr1 stf f4Zr1 add r14r1 blt r1r20 gt ldf Xrl fl ldf X4 r1 fl mulf f0f1f2 mulf f0f1f2 ldf Yrl f3 ldf Y4 r1 f3 addf f2f3f4 addf f2f3f4 stf f4Zr1 stf f4Z4rl addi r14r1 addi r18r1 blt r1r20 blt r1r20 CIS 371 RomMarlin Scheduling 20 Unrolling SAXPY II Pipeline Schedule 0 Pipeline schedule to reduce stalls 0 Have already seen this pipeline scheduling ldf Xr1 fl ldf Xr1 fl mulf f0f1f2 ldf X4rl fl ldf Yr1 f3 mulf f0f1f2 addf 2153154 mulf f0f1f2 stf f4Zr1 ldf Yr1f3 ldf x4 r1 fl ldf Y4r1 f3 m f f2 p addf 2153154 ldf Y4r1 f3 add 2153154 addf 2153154 stf f4Zrl stf f4Z4r1 stf f4Z4r1 addi r1 8r1 addi r18r1 blt r1r20 blt r1r20 CIS 371 RomMarlin Scheduling Unrolled SAXPY PerformanceUtilization 39 1234567891011121314151617181920 1df xr1 1 F D X M W 1df X4r1f5 F D X M W mulf 0 1 2 F D E E E E E W mulf 0 5 5 F D E E E E E W No propagation 1 15 erll9f3 F D X M W Different pipelines 1df Y4r1f7 F D X M s s addf 2 3 4 F D dE E s W addf f6f7f8 F p D E p EW stf f4Zr1 F D X M W stf f8Z4r1 F D X M W with r198r F D X M W blt r1r20 F D X M W F D X M W 1df xr1 1 Performance 12 insn 13 cycles 0921PC Utilization 092 actual IPC 1 peak IPC 92 Speedup 2 11 cycles 13 cycles 169 CIS 371 RomMarlin Scheduling Unrolling SAXPY III Rename Registers o Pipeline scheduling causes reordering violations 0 Rename registers to correct ldf Xr1f1 ldf Xr1 fl ldf X4r1 fl ldf X4rl f5 mulf f0f1f2 mulf f0f1f2 mulf f0f1f2 mulf f0f5f6 ldf Yr1 f3 ldf Yr1 f3 ldf Y 4 r1 f3 1df Y4r1 f7 addf f2f3 f4 gt addf 2153154 addf 2153154 addf f6f7f8 stf f4 Zr1 stf f4Zrl stf f4Z4r1 stf f8Z4r1 
Loop Unrolling Shortcomings
- Static code growth → more I$ misses (limits the degree of unrolling)
- Needs more registers to hold values
- Doesn't handle non-loops
- Doesn't handle recurrences (inter-iteration dependences)

    for (i = 0; i < N; i++)
        X[i] = A*X[i-1];

  One iteration:               Unrolled x2:
  ldf  X-4(r1),f1              ldf  X-4(r1),f1
  mulf f0,f1,f2                mulf f0,f1,f2
  stf  f2,X(r1)                stf  f2,X(r1)
  addi r1,4,r1                 mulf f0,f2,f3
  blt  r1,r2,0                 stf  f3,X+4(r1)
                               addi r1,8,r1
                               blt  r1,r2,0

  - The two mulf's are not parallel (f2 feeds the second one)
- Other, more advanced techniques help

Anything The Compiler Can Do...
- Dynamically-scheduled processors
  - Hardware reschedules insns...
  - ...within a sliding window of Von Neumann insns
  - Does loop unrolling transparently
  - Does the equivalent of loop unrolling on non-loop code
  - Uses branch prediction to "unroll" branches
- Pentium Pro/II/III: 3-wide; Core 2: 4-wide; Alpha 21264: 4-wide;
  MIPS R10000: 4-wide; Power5: 5-wide
- Quick overview of the approach here
  - Lots more information in CIS 501 (graduate-level architecture)

Code Example
- Raw insns:

    add r2,r3,r1
    sub r2,r1,r3
    mul r2,r3,r3
    div r1,4,r1

- The divide insn is independent of the subtract and multiply insns
  - It can execute in parallel with the subtract
- Many registers are reused
  - Just as in static scheduling, the register names get in the way
  - How does the hardware get around this?
- Approach: step 1, rename registers; step 2, schedule

CIS 501 (Martin/Roth): Dynamic Scheduling I

The Problem With In-Order Pipelines

                  1 2 3  4  5  6 7 8 9 10 11
  addf f0,f1,f2   F D E  E  E  W
  mulf f2,f3,f2     F D  d* d* E E E E  E  W
  subf f0,f1,f4       F  p* p* D E E E  W

- What's happening in cycle 4?
  - mulf stalls due to a data dependence
    - OK, this is a fundamental problem
  - subf stalls due to a pipeline hazard
    - Why? subf can't proceed into D because addf is there
    - That is the only reason, and it isn't a fundamental one
    - Maintaining in-order writes to the register file
- Why can't subf go into D in cycle 4 and E in cycle 5?
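The "step 1: rename registers" approach just mentioned can be made concrete with a toy decode-time renamer for the four raw insns above. This is a sketch: the names (rename_insn, map_table, free_list) and the fixed table sizes are ours, and it mirrors the example's initial mapping r1→p1, r2→p2, r3→p3 with p4-p7 free (slot 0 is unused so indices match the register names):

```c
#include <assert.h>

#define NUM_ARCH 8
#define NUM_PHYS 16

/* Initial state for the example: r1->p1, r2->p2, r3->p3. */
static int map_table[NUM_ARCH] = {0, 1, 2, 3};
static int free_list[NUM_PHYS] = {4, 5, 6, 7};   /* p4..p7 available */
static int free_head = 0;

/* Renames one insn at decode: look up both inputs in the map table,
 * pull a fresh physical register for the output, and update the map.
 * Returns the new physical output register. */
static int rename_insn(int arch_in1, int arch_in2, int arch_out,
                       int *phys_in1, int *phys_in2) {
    *phys_in1 = map_table[arch_in1];
    *phys_in2 = map_table[arch_in2];
    int new_reg = free_list[free_head++];        /* get_free_phys_reg() */
    map_table[arch_out] = new_reg;
    return new_reg;
}
```

Running it on the example sequence reproduces the renaming shown on the slides: add r2,r3→r1 becomes add p2,p3→p4; sub r2,r1→r3 becomes sub p2,p4→p5; mul r2,r3→r3 becomes mul p2,p5→p6; div r1,4→r1 becomes div p4,4→p7 — false dependences on r1/r3 disappear while true ones remain.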
Step 1: Register Renaming
- Goal: eliminate register conflicts/hazards.
- Architected vs. physical registers: names (r1, r2, r3) vs. locations (p1, p2, ..., p7).
- Original mapping: r1 -> p1, r2 -> p2, r3 -> p3; p4-p7 are available.

    MapTable (r1 r2 r3)   FreeList      Raw insns       Renamed insns
    p1 p2 p3              p4,p5,p6,p7   add r2,r3,r1    add p2,p3,p4
    p4 p2 p3              p5,p6,p7      sub r2,r1,r3    sub p2,p4,p5
    p4 p2 p5              p6,p7         mul r2,r3,r3    mul p2,p5,p6
    p4 p2 p6              p7            div r1,4,r1     div p4,4,p7

- Renaming conceptually writes each register once. It removes false dependences and leaves true dependences intact.
- When to reuse a physical register? After the overwriting insn is done.

Register Renaming Algorithm
- Data structures: maptable[architectural reg] -> physical reg; a free list (get/put a free register).
- Algorithm, at decode, for each instruction:

    insn.phys_input1  = maptable[insn.arch_input1]
    insn.phys_input2  = maptable[insn.arch_input2]
    insn.phys_to_free = maptable[insn.arch_output]
    new_reg = get_free_phys_reg()
    insn.phys_output = new_reg
    maptable[insn.arch_output] = new_reg

- At commit: once all older instructions have committed, free the overwritten register:

    put_free_phys_reg(insn.phys_to_free)

Dynamic Scheduling Algorithm
- Data structure: ready_table[phys reg] -> yes/no.
- Algorithm, at the schedule stage (prior to register read):

    for each instruction:
        if ready_table[insn.phys_input1] and ready_table[insn.phys_input2]:
            mark insn "ready"
    select the oldest "ready" instruction
    ready_table[insn.phys_output] = ready

Step 2: Dynamic Scheduling
- Instructions are fetched/decoded/renamed into an instruction buffer, also called the instruction window or instruction scheduler; they conceptually check their ready bits every cycle.
- With the renamed insns above (p1-p3 initially ready): add p2,p3,p4 issues first and marks p4 ready; that wakes up both sub p2,p4,p5 and div p4,4,p7, which issue in parallel; mul p2,p5,p6 must wait for p5.

Dynamic Scheduling: Out-of-Order Execution
- Dynamic scheduling is done totally in the hardware; it is also called out-of-order (OoO) execution.
- Fetch many instructions into the instruction window, using branch prediction to speculate past multiple branches; flush the pipeline on a branch misprediction.
- Rename to avoid false dependences.
- Execute instructions as soon as possible: register dependences are known; handling memory dependences is more tricky.
- Commit instructions in order: if anything strange happens before commit, just flush the pipeline.
- Current machines: roughly 100-instruction scheduling windows.

Dynamically Scheduling Memory Ops
- Compilers must schedule memory ops conservatively: e.g., a load cannot be hoisted above a store unless the compiler can prove the addresses differ (does r8 == sp?).
- Options for hardware:
  - Conservative: don't execute any load until all prior stores execute.
  - Aggressive: execute loads as soon as possible and detect violations. When a store executes, it checks whether any later load to the same address executed too early; if so, flush the pipeline.
  - Predictive: learn violations over time and selectively reorder.

Static vs. Dynamic Scheduling
- If we can do this in software, why build complex, slow-clock, high-power hardware?
  - Performance portability: we don't want to recompile for new machines.
  - More information is available dynamically: memory addresses, branch directions.
  - More (physical) registers are available: the compiler may not have enough to schedule well.
  - Speculative memory-op reordering: the compiler must be conservative; hardware can speculate.
- But the compiler has a larger scope. In practice: the compiler does as much as it can (not much), and the hardware does the rest.

This Unit: Code Scheduling
- Pipelining and superscalar review.
- Code scheduling: to reduce pipeline stalls; to increase ILP (insn-level parallelism).
- Two
approaches: static scheduling by the compiler and dynamic scheduling by the hardware.
- Up next: the memory system & caches.

CIS 371: Computer Organization and Design, Unit 3: Performance (Roth/Martin)

Readings
- P&H: Chapter 4.

This Unit
- Performance: latency and throughput.
- Benchmarking.
- The CPU performance equation.

240 -> 371
- CSE 240: build something that works. CSE 371: build something that works well. "Well" means high-performance, but also cheap, low-power, etc. (mostly high-performance here).
- So what is the performance of a given machine? And what is "performance"?

Performance: Latency vs. Throughput
- Latency (execution time): time to finish a fixed task.
- Throughput (bandwidth): number of tasks in a fixed time.
- These are different: exploit parallelism for throughput, not latency (e.g., baking bread). They are often contradictory; we will see many examples of this.
- Choose the definition of performance that matches your goals: scientific program, latency; web server, throughput.
- Example: move people 10 miles. Car: capacity 5, speed 60 miles/hour. Bus: capacity 60, speed 20 miles/hour. Latency: car 10 min, bus 30 min. Throughput (counting the return trip): car 15 people/hour, bus 60 people/hour.

Processor Performance and Workloads
- Q: what does latency(Pentium) or throughput(Pentium) mean? A: nothing; there must be some associated workload.
- Workload: the set of tasks someone you care about runs. Benchmarks: standard workloads, used to compare performance across machines; they either are, or are highly representative of, actual programs people run.
- Microbenchmarks: non-standard non-workloads; tiny programs used to isolate certain aspects of performance, not representative of the complex behavior of real applications. Examples: towers-of-hanoi, 8-queens, etc.

Comparing Performance
- A is X times faster than B if Latency(B)/Latency(A) = X, or Throughput(A)/Throughput(B) = X.
- A is X% faster than B if Latency(B)/Latency(A) = 1 + X/100, or Throughput(A)/Throughput(B) = 1 + X/100.
- Car/bus example: latency: the car is 3 times (200%) faster than the bus; throughput: the bus is 4 times (300%) faster than the car.

SPEC Benchmarks
- SPEC (Standard Performance Evaluation Corporation): a consortium that collects, standardizes, and distributes benchmarks. Suites for CPU, Java, I/O, Web, Mail, OpenMP (multithreaded), etc.; updated every few years so companies don't target the benchmarks.
- Companies post SPECmark results for different processors: one number that represents performance for the entire suite.
- CPU 2006: 29 CPU-intensive C/C++/Fortran programs. Integer: bzip2, gcc, perl, hmmer (genomics), h264, etc. Floating-point: wrf (weather), povray, sphynx3 (speech), etc.
- TPC (Transaction Processing Council): like SPEC but for web/database server workloads; much heavier on memory, I/O, and network than on CPU; gives only a description, not source code.

SPECmark
- Reference machine: Sun SPARC 10.
- Latency SPECmark: for each benchmark, take an odd number of samples on both machines and choose the median; take the latency ratio (Sun SPARC 10 / your machine); take the GMEAN of the ratios over all benchmarks.
- Throughput SPECmark: run multiple benchmarks in parallel on a multiple-processor system.
- Recent SPECmark latency leaders: SPECint, Intel 2.3 GHz Core 2 Extreme (3119); SPECfp, IBM 2.1 GHz Power5 (4051).

CPU Performance Equation
- Multiple aspects to performance; it helps to isolate them:

    Latency = seconds/program = (insns/program) x (cycles/insn) x (seconds/cycle)

- Insns/program: dynamic insn count = f(program, compiler, ISA).
- Cycles/insn: CPI = f(program, compiler, ISA, microarchitecture).
- Seconds/cycle: clock period = f(microarchitecture, technology).
- For low latency (better performance), minimize all three. Difficult: they often pull against one another. An example we have seen, RISC vs. CISC ISAs: RISC has low CPI/clock period but high insn count; CISC has low insn count but high CPI/clock period.

Mean (Average) Performance Numbers
- Arithmetic mean: (1/N) Sum[P=1..N] Latency(P), for units
that are proportional to time (e.g., latency).
  - You can add latencies, but not throughputs: Latency(P1+P2, A) = Latency(P1, A) + Latency(P2, A), but Throughput(P1+P2, A) != Throughput(P1, A) + Throughput(P2, A). Drive 1 mile at 30 miles/hour, then 1 mile at 90 miles/hour: the average is not 60 miles/hour (it is 45).
- Harmonic mean: N / Sum[P=1..N] (1/Throughput(P)), for units that are inversely proportional to time (e.g., throughput).
- Geometric mean: (Prod[P=1..N] Speedup(P))^(1/N), for unitless quantities (e.g., speedups).

MIPS (the performance metric, not the ISA)
- Factor out dynamic insn count; the CPU equation becomes:

    Latency = seconds/insn = (cycles/insn) x (seconds/cycle)
    Throughput = insns/second = (insns/cycle) x (cycles/second)

- MIPS (millions of insns per second) = (insns/second) / 10^6; cycles/second = clock frequency (in MHz).
- Example: CPI = 2, clock = 500 MHz. What is MIPS? 0.5 IPC x 500 MHz = 250 MIPS.
- MIPS is OK for microarchitects, who typically work with one ISA and one compiler and treat insn count as fixed. It is not OK for the general public: processors with different ISAs/compilers have incomparable MIPS. Wait, it gets worse...

MHz (MegaHertz) and GHz (GigaHertz)
- 1 Hertz = 1 cycle per second; 1 GHz is 1 cycle per nanosecond; 1 GHz = 1000 MHz.
- Microarchitects often ignore instruction count, but the general public mostly also ignores CPI and equates clock frequency with performance.
- Which processor would you buy? Processor A: CPI = 2, clock = 5 GHz. Processor B: CPI = 1, clock = 3 GHz. Probably A, but B is faster (assuming the same ISA/compiler).
- Classic example: the 800 MHz Pentium III was faster than the 1 GHz Pentium 4 (same ISA and compiler).
- Meta-point: the danger of partial performance metrics.

Amdahl's Law
- Literally: total speedup is limited by the non-accelerated piece.
- Example: we can optimize 50% of program A. Even a magic optimization that makes this 50% disappear only yields a 2x speedup.
- For consumers: buy a balanced system. For microarchitects: build a balanced system. MCCF: Make the Common Case Fast; focus your efforts on things that matter.

Non-CPU Performance Equation
- "Clock frequency" implies the CPU clock; other system components have their own clocks (or none). E.g., increasing the processor clock doesn't accelerate memory.
- Example: Processor A: CPI_CPU = 1, CPI_MEM = 1, clock = 500 MHz. What is the speedup if we double the clock frequency?
  - Base: CPI = 2 -> IPC = 0.5 -> 250 MIPS.
  - New: clock x2 -> CPI_MEM = 2 -> CPI = 3 -> IPC = 0.33 -> 333 MIPS.
  - Speedup = 333/250 = 1.33 << 2.
- What about an infinite clock frequency? Only a 2x (factor of 2) speedup: an example of Amdahl's Law.

How Can We Make the Common Case Fast (if we don't know what the common case is)?
- How is CPI actually measured? Execution time via Unix time (wall clock, CPU, system); CPI = CPU time x clock frequency / dynamic insn count.
- How is dynamic insn count measured? Hardware event counters (e.g., the LC3/P37X insn counter).
- More useful is a CPI breakdown (CPI_CPU, CPI_MEM, etc.), so we know what the performance problems are and what to fix:
  - Hardware event counters (e.g., LC3/P37X branch/load stall counters): accurate, but can't measure everything or evaluate modifications.
  - Cycle-level microarchitecture simulation (e.g., SimpleScalar): measure exactly what you want and evaluate potential fixes; the burden of accuracy is on the simulator writer.

Latency vs. Throughput Revisited
- Latency and throughput are two views of performance at the program level, not at the insn level.
- Single-insn latency: nobody cares; programs comprise billions of insns, and it is difficult to reduce anyway.
- Since the number of dynamic instructions is large, insn throughput determines program latency (or throughput), and it can be reduced using inter-insn parallelism. The most important example: pipelining.

Pipelining: Clock Frequency vs. IPC
- Increasing the number of pipeline stages increases clock frequency (decreases clock period) but decreases IPC (increases CPI). At some point it actually causes performance to decrease. The optimal pipeline depth is program- and technology-specific.
- Remember the example: Pentium III, 12-stage pipeline, 800 MHz; Pentium 4, 22-stage pipeline, 1 GHz, actually slower because of lower IPC; Core 2, a 15
-stage pipeline (Intel learned its lesson).

Inter-Insn Parallelism: Pipelining
- Pipelining: cut the datapath into N stages (here 5: insn memory, register file, ALU, data memory, register write), with one insn in each stage in each cycle.
- Clock period = MAX(T_insnmem, T_regfile, T_ALU, T_datamem) instead of T_singlecycle.
- Base CPI = 1: an insn enters and leaves every cycle. Actual CPI > 1: the pipeline must often stall.
- Individual insn latency increases (pipeline overhead), but that is not the point.

What Determines Clock Frequency?
- Technology (transistors & wires) and microarchitecture (ALUs, pipeline); clock period = MAX over the stage delays, as above.
- But which of these is largest? In a simple model where each gate has a delay of 1 and wires have no delay, the ALU is largest: an N-bit ripple-carry adder has 2N gate delays.
- Reality: ALUs don't use ripple-carry, wire delay dominates, and storage access matters.

Summary
- Performance: latency and throughput metrics; the CPU performance equation.
- Next: fast integer arithmetic.

CIS 371: Computer Organization and Design, Unit 7: Superscalar Pipelines (Roth/Martin)

Readings
- P&H: Chapter 6.9.

This Unit: In-Order Superscalar Pipelines
- Superscalar hardware issues: bypassing and the register file; stall logic; fetch and branch prediction.
- Multiple-issue designs: superscalar, VLIW.

Scalar Pipelines
- So far we have looked at scalar pipelines: one instruction per stage, with control speculation, with bypassing (not shown), with floating point.

Floating Point Pipelines
- Floating-point (FP) insns typically use a separate pipeline, which splits off at the decode stage (at fetch you don't know it's an FP insn).
- Most FP insns are multi-cycle: here, a 3-cycle FP adder. There is a separate FP register file. FP loads and stores execute on the integer pipeline (the address is an integer). All of this creates hazards.

The Flynn Bottleneck
- The performance limit of a scalar pipeline is CPI = IPC = 1, and hazards mean that limit isn't even achieved. Latch overhead brings diminishing returns on superpipelining.
- Overcome the IPC limit with a superscalar pipeline: two insns per stage (or three, or four, or six, or eight). Also called multiple issue. Exploits instruction-level parallelism (ILP).

Superscalar Pipeline Diagrams
- (Pipeline diagrams omitted.) Ideally, a scalar pipeline moves one insn per cycle through F, D, X, M, W, while a 2-way superscalar moves pairs of insns through each stage together, finishing the same 7-insn sequence in roughly half the cycles. Realistically, dependences between insns (e.g., add r7,r6,r8 after add r2,r3,r7) force stalls in both pipelines, and the 2-way advantage shrinks.

How Much ILP Is There?
- The compiler tries to schedule code to avoid stalls, even for scalar machines (to fill the load-use delay slot); scheduling for a multiple-issue superscalar is even harder.
- How much ILP is common? It greatly depends on the application. Consider memory copy: unroll the loop and there are lots of independent operations. Other programs, less so.
- Even given unbounded ILP, superscalar has limits: the IPC (or CPI) vs. clock frequency tradeoff.

Superscalar CPI Calculations
- Base CPI for a scalar pipeline is 1; base CPI for an N-way superscalar pipeline is 1/N, which amplifies stall penalties. (This assumes no data stalls, an overly optimistic assumption.)
- Example, a branch penalty calculation: 20% branches, 75%
taken, no explicit branch prediction (a 2-cycle penalty).
  - Scalar pipeline: 1 + 0.2 x 0.75 x 2 = 1.3 -> 1.3/1 -> 30% slowdown.
  - 2-way superscalar: 0.5 + 0.2 x 0.75 x 2 = 0.8 -> 0.8/0.5 = 1.6 -> 60% slowdown.
  - 4-way superscalar: 0.25 + 0.2 x 0.75 x 2 = 0.55 -> 0.55/0.25 = 2.2 -> 120% slowdown.

Challenges for Superscalar Pipelines
- So you want to build an N-way superscalar. Hardware challenges: stall logic (N^2 terms), bypasses (2N^2 paths), register file (3N ports), IMem/DMem (how many ports?), and more.
- Software challenges: does the program inherently have ILP of N? Even if it does, the compiler must schedule the code to expose it.
- Given these challenges, what is a reasonable N? The current answer is 3 or 4.

Superscalar Execution
- Common design: the functional unit mix tracks the insn type mix. Integer apps: 20-30% loads, 10-15% stores, 15-20% branches. FP apps: 30% FP, 20% loads, 10% stores, 5% branches. The rest (40-50%) are non-branch integer ALU operations.
- Intel Pentium (2-way superscalar): 1 "any" + 1 integer ALU. Alpha 21164: 2 integer (including 2 loads or 1 store) + 2 FP.
- Execution units: simple ALUs are cheap; have N of these for an N-wide processor. Complex ALUs are less cheap; have fewer of those. Does N-way superscalar mean N of every kind of functional unit? N ALUs: OK, ALUs are small and integer insns are common. N FP dividers: no, FP dividers are huge and fdiv is uncommon. Data memory bandwidth is expensive: multi-port, replicate, or bank (more later). How many loads/stores per cycle? How many branches?

Superscalar Register File
- Except for DMem, the execution units themselves are easy; getting values to and from them is the problem.
- An N-way superscalar register file needs 2N read and N write ports. In practice: fewer than N write ports (stores and branches, about 35% of insns, don't write registers) and fewer than 2N read ports (many inputs come from immediates or bypasses).
- Latency and area grow as (ports)^2, i.e., as (3N)^2: slow for large N.

Superscalar Bypass
- Consider the W->X bypass for the 1st input of each insn: 2 non-regfile inputs to each bypass mux (in general, N), and 4 point-to-point connections (in general, N^2).
- Bypass wires are long, slow, and difficult to route. And this is just one bypass stage and one input per insn: the N^2 bypass problem.

Superscalar Stall Logic
- Full bypassing means load-use stalls only (ignore the 2nd register input here).
- Stall logic for a scalar pipeline: X/M.op == LOAD && D/X.rs1 == X/M.rd.
- Stall logic for a 2-way superscalar pipeline:
  - For the older insn in the pair (stalling it also stalls the younger insn):

      (X/M1.op == LOAD && D/X1.rs1 == X/M1.rd) ||
      (X/M2.op == LOAD && D/X1.rs1 == X/M2.rd)

  - For the younger insn in the pair (stalling it doesn't stall the older insn):

      (X/M1.op == LOAD && D/X2.rs1 == X/M1.rd) ||
      (X/M2.op == LOAD && D/X2.rs1 == X/M2.rd) ||
      (D/X2.rs1 == D/X1.rd)

- 5 terms for 2 insns: an N^2 dependence cross-check (actually N^2 + N - 1).

Not All N^2 Problems Are Created Equal
- N^2 bypass vs. N^2 dependence cross-check: which is the bigger problem? The N^2 bypass, by a lot: 32- or 64-bit quantities vs. 5-bit register names; multiple levels (M->X, W->X) of bypass vs. 1 level of stall logic; must fit in one clock period with the ALU vs. not.
- The dependence cross-check isn't even the 2nd-biggest N^2 problem: the regfile is also an N^2 problem (think latency, where N is ports), and it is more serious than the cross-check.

Superscalar Pipeline Stalls
- If the older insn in a pair stalls, younger insns must stall too. But what if the younger insn stalls: can the older insn from the next group move up?
- Fluid: yes. Helps CPI a little, hurts clock a little.
- Rigid: no. Hurts CPI a little, but doesn't impact clock.

Avoid N^2 Bypass/RegFile: Clustering
- Clustering: group the ALUs into K clusters, with full bypassing within a cluster and limited (or no) bypassing between them; get values from the regfile with a 1- or 2-cycle delay. This leaves N/K
non-regfile inputs at each mux and N^2/K point-to-point paths.
- The key to performance: steer dependent insns to the same cluster. Clustering hurts IPC but helps clock frequency (or allows wider issue at the same clock).
- Typically used with a replicated regfile: one replica per cluster. Alpha 21264: 4-way superscalar, 2 clusters, static steering.

Superscalar Fetch and Decode
- What is involved in fetching N insns per cycle? Mostly a wider instruction memory data bus; the trickiest aspects involve branch prediction.
- What about decode? Easier with fixed-width instructions (MIPS, Alpha, PowerPC, ARM); harder with variable-length instructions (x86), but it can be pipelined.

Pipelined Branch Prediction + Fetch
- To fetch across a taken branch, you must fetch from two separate IMem addresses in the same cycle: split the IMem into even/odd banks to provide the bandwidth for this.
- Pipeline branch prediction and fetch, with branch prediction first: branch prediction sends two PCs to fetch, the fall-through PC and the target PC (if any). This elongates the pipeline and increases the branch penalty. The Pentium II & III do something like this.

Superscalar Fetch with Branches
- Three related questions: How many branches are predicted per cycle? If multiple insns are fetched, which is assumed to be the branch? Can we fetch across the branch if it is predicted taken?
- Simplest common design: one / it doesn't matter / no. Make one prediction and discard the post-branch insns if the prediction is taken. (It doesn't matter which insn is assumed to be the branch: associating the prediction with a non-branch has the same effect.)
- This lowers effective fetch bandwidth (width) and IPC: the average number of insns per taken branch is about 8-10 in integer code.
- The compiler can help: reduce taken-branch frequency, e.g., by unrolling loops.

Aside: VLIW
- VLIW (Very Long Insn Word): effectively a 1-wide pipeline whose unit is an N-insn group; the group travels down the pipeline as a unit.
- The compiler guarantees that insns within a VLIW group are independent; if there aren't enough independent insns, slots are filled with nops. Typically slotted: the 1st insn must be ALU, the 2nd memory, etc.
- E.g., Itanium: two 3-wide bundles per cycle, 6-way issue.
- Simplifies fetch and branch prediction; simplifies pipeline control (no rigid vs. fluid business). Doesn't help bypasses or the regfile, which are the bigger problems, though these issues can be exposed to software too (yuck).
- Not really compatible across machines of different widths. How does Itanium deal with non-compatibility? (See also Transmeta.)

Predication
- Branch mispredictions hurt more on superscalar. Replace difficult branches with something else: convert control flow into data flow & dependences.
- Predication: conditionally executed insns, unconditionally fetched.
- Full predication (ARM, Intel Itanium): can tag every insn with a predicate, but costs extra bits in the instruction.
- Conditional moves (Alpha, x86): construct the appearance of full predication from one primitive, e.g., cmoveq r1,r2,r3 (if r1 == 0, r3 = r2). May require some code duplication to achieve the desired effect, but it is the only good way of adding predication to an existing ISA.
- If-conversion: replacing control with predication.

CIS 372: Computer Organization and Design Lab (Prof. Amir Roth), Unit 0: Introduction

Actual Hardware
- We'll use FPGAs (Field-Programmable Gate Arrays), also called Programmable Logic Devices (PLDs).
- An FPGA is a special type of programmable chip: conceptually it contains a grid of gates, and the wiring connecting them can be reconfigured electrically (using more transistors as switches). Once configured, the FPGA can emulate any digital logic design; a synthesis tool converts a gate-level design into a configuration.
- Uses: hardware prototyping (what we are doing), low-volume special-purpose hardware, computational offload.

Computer Organization and Design
- CIS 240, computer organization: how a computer works. Hardware: instruction sets, datapath & control, integer arithmetic. Software:
assembly language (LC3, LC4, C).
- CIS 371, "and design": how a computer works fast, and using little power. Focus on hardware: pipelining, caches, speculation, wide issue, etc.
- CIS 372, the lab: build an LC4 processor and run programs (Maelstrom) on it.

In Our Lab: Digilent XUP-V2P Boards
- Program the FPGA to run P37X. The project: hook up a keyboard.
- The boards have many features. We use some for debugging (LEDs, switches) and others not at all (Ethernet, flash reader, 256 MB SDRAM, audio in/out; the boards can even boot Linux).

Hardware Design Methodologies
- How do we get there from here: how is hardware designed? There are many design-methodology choices:
  - Level: transistors vs. gates (no other choice with FPGAs).
  - Layout: manual vs. synthesized.
  - Tool: schematic vs. hardware description language (HDL).

Schematics
- Draw pictures: use a schematic entry program to draw wires, logic blocks, and gates; supports hierarchical design (arbitrary nesting).
- A good match for hardware, which is inherently spatial; purty. But time-consuming and non-scalable: large designs are unreadable. Rarely used in practice; real-world designs are big.

Hardware Description Languages (HDLs)
- Write code to describe hardware: specify wires, gates, and modules (also hierarchical).
- Easier to create, edit, and modify; scales well. The disconnect: you must still think visually (it gets easier with practice).

    module mux2to1 (S, A, B, Out);
      input  S, A, B;
      output Out;
      wire   Sn, AnS, BnS;
      not (Sn, S);        // S'
      and (AnS, A, Sn);   // A & S'
      and (BnS, B, S);    // B & S
      or  (Out, AnS, BnS);
    endmodule

Hierarchical HDL Example
- Build up more complex modules using simpler modules: e.g., a 4-bit-wide mux from four 1-bit muxes.

    module mux2to1_4 (S, A, B, Out);
      input        S;
      input  [3:0] A, B;
      output [3:0] Out;
      mux2to1 mux0 (S, A[0], B[0], Out[0]);
      mux2to1 mux1 (S, A[1], B[1], Out[1]);
      mux2to1 mux2 (S, A[2], B[2], Out[2]);
      mux2to1 mux3 (S, A[3], B[3], Out[3]);
    endmodule

Verilog HDL
- Verilog is the HDL we will be using. It is syntactically similar to C by design, but the ease of syntax hides the fact that this isn't C (or any software design language): the declarations above aren't variables, and the module instantiations aren't function calls. We will use a few lectures to learn Verilog.

Job Posting Example
- An actual job posting: Stone Ridge Technology (an early-stage high-performance-computing startup) seeking an Electrical or Computer Engineer with FPGA design experience: Verilog or VHDL development for complex designs, plus experience with synthesis tools (e.g., Synplify) and place & route tools (Xilinx, Quartus).

Why Study Digital Design?
- Learn by doing: if you can build it, you understand it.
- Exposure to design issues whose lessons apply in many domains: testing, design partitioning, bottom-up vs. top-down, etc.
- Get a job: can't live off your parents forever, and project experience looks good.
- Pretty cool: "Look ma, I built my own processor." "Hey, what's your name? Wanna see
my processor?"

Bye-bye, 372
- This is the last instance of 372: starting in 2010, 371 and 372 combine into a single 1.0-credit course, with the workload reduced proportionately. Changes are coming in 380/381 also. Why? More flexibility.
- This is the transition year, with a somewhat reduced workload: 371 has 4 fewer lectures; 372 has 3 labs instead of 4, and no individual labs. Still 1.5 credits.

The Labs
- Lab 0 (due Thurs 2/5): getting started; tools intro, timer device.
- Lab 1 (due Thurs 2/26, before the midterm): a single-cycle LC4 processor; register file, datapath, controller.
- Lab 2 (due Thurs 4/23, the last week of class): (A) a pipelined LC4 processor, (B) bypassing, (C) branch prediction.
- Tiered grading: Labs 0 and 1 are mandatory for a B; Labs 2A/2B/2C are optional, for a B+/A-/A.
- All labs are done in groups of 2. Boards and lockers are handed out after class next Wednesday (1/28). Make sure your partner wants to go as far as you do. (Note: the final exam will decouple your grade somewhat.)

Lab Logistics
- K-Lab (Moore 204): home of the boards, the computers, and (later in the semester) you. 24-hour access, with a keycode for the door lock; TA office hours and project demos happen here too.
- Tools: Digilent XUP-V2P boards; Xilinx ISE v8.2i. Warning: all such tools are notorious for being buggy and fragile.
- Working from home: a student version of Xilinx ISE is available; Icarus 0.8.7 is fast but simulation-only and not fully featured. All projects must run on the boards in the lab.

372 Lectures
- 5 lectures over the course of the semester, stolen from 371 slots (we will try not to use the Friday slot): (1) Verilog, Lab 0 prep; (2) VPI and Verilog simulation; (3) FPGAs and synthesis, Lab 1 prep; (4) design and testing, Lab 2 prep; (5) the 372 final exam (40 minutes).
- Not covered: transistor-level design, VLSI layout, circuit analysis; synthesis algorithms, place & route; embedded systems, hardware/software co-design, ASICs/ASIPs.
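Before writing the mux in Verilog it can help to check the gate-level logic in plain code. A small Python model of the 2-to-1 mux from the HDL examples above (a simulation sketch of the same gates; the wire names follow the mux2to1 module):

```python
def mux2to1(s, a, b):
    """Gate-level model of the Verilog mux: Out = (A & ~S) | (B & S)."""
    sn = 1 - s           # not (Sn, S)
    ans = a & sn         # and (AnS, A, Sn)
    bns = b & s          # and (BnS, B, S)
    return ans | bns     # or  (Out, AnS, BnS)

def mux2to1_4(s, a4, b4):
    """Hierarchical 4-bit mux: four 1-bit muxes, one per bit position."""
    return [mux2to1(s, a, b) for a, b in zip(a4, b4)]
```

With s = 0 the output follows A; with s = 1 it follows B, exactly the truth table the gate network implements.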
Resources
- Instructor: Amir Roth. Office: Levine 603; hours TR 1:30-3.
- TAs (all undergrads from 372 last year): Nate (conradj), Michajlo (michajlo), Andres (andres). Office: K-Lab; hours MRF 3-6:30.
- Web: class web page; Google group.
- Academic conduct: anything with your name on it must be your own; Penn's code applies.

CIS 371: Computer Organization and Design, Unit 10: Shared Memory Multiprocessors (Martin/Roth)

This Unit
- Thread-level parallelism (TLP).
- The shared memory model: multiplexed uniprocessor, hardware multithreading, multiprocessing.
- Synchronization: lock implementation, locking gotchas.
- Cache coherence: bus-based protocols, directory protocols.
- Memory consistency models.

Multiplying Performance
- A single processor can only be so fast: limited clock frequency, limited instruction-level parallelism, limited cache hierarchy.
- What if we need even more computing power? Use multiple processors. But how?
- High-end example: Sun Ultra Enterprise 25k, with 72 UltraSPARC IV processors at 1.5 GHz and 1024 GB of memory; a niche machine for large database servers.

Multicore: Mainstream Multiprocessors
- IBM Power5: two 2 GHz PowerPC cores; shared 1.5 MB L2, L3 tags.
- AMD quad-core Phenom: four 2.5 GHz cores; per-core 512 KB L2 cache; shared 2 MB L3 cache.
- Intel Core 2 Quad: four cores; two shared 4 MB L2 caches (one per pair of cores).
- Sun Niagara: 8 cores, each 4-way threaded; shared 2 MB L2; shared FP unit; for servers, not desktops.
- Why multicore? What else would you do with 500 million transistors?

Application Domains for Multiprocessors
- Scientific computing/supercomputing: weather simulation, aerodynamics, protein folding; large grids, integrating changes over time, with each processor computing for a part of the grid.
- Server workloads: e.g., an airline reservation database; many concurrent updates, searches, lookups, and queries, with processors handling different requests.
- Media workloads: processors compress/decompress different parts of an image or different frames.
- Desktop workloads; gaming workloads.
- But the software must be written to expose the parallelism.

But First: Uniprocessor Concurrency
- A software thread is an independent flow of execution; its context state is a PC and registers. Threads generally share the same memory space. A process is like a thread, but with a different (its own) memory space.
- Java has thread support built in; C/C++ supports the Pthreads library.
- Generally, system software (the OS) manages threads: thread scheduling and context switching. All threads share the one processor: a hardware timer interrupt occasionally triggers the OS, and quickly swapping threads gives the illusion of concurrent execution. (Much more in CIS 380.)

Multithreaded Programming Model
- The programmer explicitly creates multiple threads. All loads & stores go to a single shared memory space; each thread has a private stack frame for local variables. A thread switch can occur at any time (preemptive multithreading by the OS).
- Common uses: handling user interaction (GUI programming); handling I/O latency (send a network message, wait for the response); expressing parallel work via thread-level parallelism (TLP).

Hardware Multithreading
- Hardware multithreading (MT): multiple threads dynamically share a single pipeline and caches; replicate only the thread contexts (PC and register file).
- Coarse-grain MT: switch threads on L2 misses. (Why?) Simultaneous MT: no explicit switching, fine-grain interleaving. The Pentium 4 is 2-way hyperthreaded, leveraging its out-of-order core.
- MT improves utilization and throughput: single programs utilize less than 50% of the pipeline. MT does not improve single-thread performance: individual threads run as fast, or even slower.

Simplest Multiprocessor / Shared Memory Implementations
- Multiplexed
Simplest Multiprocessor
- Replicate entire processor pipeline
  - Instead of replicating just register file & PC
  - Exception: share caches (we'll address this bottleneck later)
- Same shared memory or multithreaded model
  - Loads and stores from two processors are interleaved
- Advantages/disadvantages over hardware multithreading?

Shared Memory Implementations
- Multiplexed uniprocessor
  - Runtime system and/or OS occasionally preempt & swap threads
  - Interleaved, but no parallelism
- Hardware multithreading
  - Tolerate pipeline latencies, higher efficiency
  - Same interleaved shared-memory model
- Multiprocessing
  - Multiply execution resources, higher peak performance
  - Same interleaved shared-memory model
  - Foreshadowing: allow private caches, further disentangle cores
- All have the same shared memory programming model

Thread-Level Parallelism
- Thread-level parallelism (TLP)
  - Collection of asynchronous tasks: not started and stopped together
  - Data shared "loosely" (sometimes yes, mostly no), dynamically
- Example: database/web server (each query is a thread)
- Running example: two $100 withdrawals from account 241 at two ATMs
  - accts is shared, can't register-allocate (even if it were scalar)
  - id and amt are private variables, register-allocated to r1, r2
  - Each transaction maps to a thread on a different processor
  - Track accts[241].bal (address is in r3)

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  give_cash();
}

Assembly (each thread runs this):
0: addi r1,&accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call give_cash

An Example Execution
- Thread 0 runs insns 0-5 to completion (bal: 500 → 400), then Thread 1 runs 0-5 (bal: 400 → 300)
- Both withdrawals debited: correct

A Problem Execution
- Thread 0 runs 0 (addi), 1 (ld: r4 = 500), 2 (blt), 3 (sub: r4 = 400), then <<< Interrupt >>>
- Thread 1 runs 0-5 to completion (bal: 500 → 400)
- Thread 0 resumes: 4 (st: bal = 400!), 5 (call give_cash)
- Problem: wrong account balance (two $100 withdrawals, but the balance only dropped by $100). Why?
- Solution: synchronize access to the account balance

A Synchronized Execution
- Thread 0: call acquire(lock) [bal: 500]; runs 0-3; <<< Interrupt >>>
- Thread 1: call acquire(lock) — spins; <<< Interrupt >>>
- Thread 0: 4 (st: bal = 400); call release(lock); 5 (call give_cash)
- Thread 1: still in acquire; lock now free, acquire succeeds; runs 0-5 [bal: 300]
- Fixed! But how do we implement acquire & release?

Synchronization
- Synchronization: a key issue for shared memory
  - Regulate access to shared data (mutual exclusion)
  - Software constructs: semaphore, monitor, mutex
  - Low-level primitive: lock
    - Operations: acquire(lock) and release(lock)
    - Region between acquire and release is a critical section
    - Must interleave acquire and release
    - Interfering acquire will block

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
shared int lock;
int id, amt;
acquire(lock);
if (accts[id].bal >= amt) {   // critical section
  accts[id].bal -= amt;
  give_cash();
}
release(lock);

Strawman Lock (Incorrect)
- Spin lock: software lock implementation
- acquire(lock): while (lock != 0) { } lock = 1;
  - Spin while lock is 1, wait for it to turn 0
    A0: ld 0(&lock),r6
    A1: bnez r6,A0
    A2: addi r6,1,r6
    A3: st r6,0(&lock)
- release(lock): lock = 0;
    R0: st r0,0(&lock)   // r0 holds 0

Strawman Lock (Incorrect), continued
- Two threads can interleave their acquire sequences:
  - Thread 0: A0 (ld r6 = 0), A1; Thread 1: A0 (ld r6 = 0), A1
  - Thread 0: A2, A3 (lock = 1) → critical section
  - Thread 1: A2, A3 (lock = 1) → critical section, too!
- Spin lock makes intuitive sense, but doesn't actually work
  - Loads/stores of two acquire sequences can be interleaved
  - Lock acquire sequence is also not atomic
  - Same problem as before!
- Note: release is trivially atomic
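Until a working low-level lock exists (the notes go on to build one from atomic instructions), the acquire/release critical-section pattern can be sketched with POSIX mutexes standing in for acquire(lock)/release(lock). The withdraw/demo structure is illustrative, not from the slides:

```c
#include <pthread.h>

#define MAX_ACCT 256

/* Mirrors the slides' acct_t, with a pthread mutex as the lock. */
struct acct_t { int bal; pthread_mutex_t lock; };
static struct acct_t accts[MAX_ACCT];

static int withdraw(int id, int amt) {
    int ok = 0;
    pthread_mutex_lock(&accts[id].lock);     /* acquire: enter critical section */
    if (accts[id].bal >= amt) {
        accts[id].bal -= amt;                /* read-modify-write is now atomic */
        ok = 1;                              /* would call give_cash() here */
    }
    pthread_mutex_unlock(&accts[id].lock);   /* release: exit critical section */
    return ok;
}

struct req { int id, amt; };
static void *atm(void *arg) {
    struct req *r = arg;
    withdraw(r->id, r->amt);
    return NULL;
}

/* The running example: two $100 withdrawals from account 241. */
int demo(void) {
    accts[241].bal = 500;
    pthread_mutex_init(&accts[241].lock, NULL);
    struct req r = { 241, 100 };
    pthread_t t0, t1;
    pthread_create(&t0, NULL, atm, &r);
    pthread_create(&t1, NULL, atm, &r);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return accts[241].bal;                   /* 500 - 100 - 100 = 300 */
}
```

However the two ATM threads interleave, the mutex serializes the load-test-store sequence, so the final balance is always 300 — unlike the problem execution above.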
Better Spin Lock: Use Atomic Swap
- ISA provides an atomic lock-acquisition instruction
  - Example: atomic swap: swap r1,0(&lock)
  - Atomically executes (with r1 holding 1):
      mov r1 → r2
      ld r1,0(&lock)
      st r2,0(&lock)
  - New acquire sequence:
      A0: swap r1,0(&lock)
      A1: bnez r1,A0
  - If lock was initially busy (1), doesn't change it, keep looping
  - If lock was initially free (0), acquires it (sets it to 1), break loop
  - Ensures lock is held by at most one thread
- Other variants: exchange, compare-and-swap, test-and-set, fetch-and-add

A Correct Implementation: SYSCALL Lock
ACQUIRE_LOCK:
A1: disable_interrupts   // atomic
A2: ld r6,0(&lock)
A3: bnez r6,A0
A4: addi r6,1,r6
A5: st r6,0(&lock)
A6: enable_interrupts
A7: return
- Implement the lock in a SYSCALL
  - Only the kernel can control interleaving by disabling interrupts
+ Works
- Large system call overhead
- But doesn't work in a hardware-multithreaded or multiprocessor system

Atomic Update/Swap Implementation
- How is atomic swap implemented?
  - Need to ensure no intervening memory operations
  - Requires blocking access by other threads temporarily (yuck)
- How to pipeline it?
  - Both a load and a store (yuck)
  - Not very RISC-like
- Some ISAs provide a load-link / store-conditional insn pair instead

Lock Correctness
- Thread 0: A0 (swap: r1 = 0, lock = 1), A1 → critical section
- Thread 1: A0 (swap: r1 = 1), A1, A0, A1, ... (spins)
+ Test-and-set lock actually works
- ...but Thread 1 keeps spinning

Coarse-Grain Locks: Correct but Slow
- Coarse-grain locks: e.g., one lock for the entire database
+ Easy to make correct: no chance for unintended interference
- No "P" in TLP: no two critical sections can proceed in parallel

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
shared int lock;
int id, amt;
acquire(lock);
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  give_cash();
}
release(lock);
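C11's atomic_exchange is exactly the "swap" primitive of the A0/A1 acquire sequence above. A spin lock built from it, exercised by two threads incrementing a shared counter (a sketch; the acquire/release names mirror the slides, the counter workload is illustrative):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int lock_word = 0;

static void acquire(void) {
    /* A0: atomically swap 1 into the lock, getting the old value;
       A1: loop while the old value was 1 (lock was busy). */
    while (atomic_exchange(&lock_word, 1) != 0)
        ;  /* spin */
}

static void release(void) {
    atomic_store(&lock_word, 0);   /* release is a plain atomic store */
}

static int counter = 0;

static void *incr(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        acquire();
        counter++;                 /* critical section */
        release();
    }
    return NULL;
}

int run(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, incr, NULL);
    pthread_create(&t1, NULL, incr, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return counter;                /* 20000 with the lock; racy without it */
}
```

Like the strawman, this spins; unlike the strawman, the read and write of lock_word cannot be interleaved with another thread's, so at most one thread ever breaks out of the loop.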
Programming With Locks Is Difficult
- Multicore processors are the way of the foreseeable future
  - TLP anointed as parallelism model of choice
- Just one problem: writing lock-based multithreaded programs is difficult
- More precisely:
  - Writing programs that are correct is "easy" (not really)
  - Writing programs that are highly parallel is "easy" (not really)
  - Writing programs that are both correct and parallel is difficult (very difficult, true)
  - Unfortunate, but that's the whole point, after all
- Locking granularity issues

Fine-Grain Locks: Parallel But Difficult
- Fine-grain locks: e.g., multiple locks, one per record
+ Fast: critical sections to different records can proceed in parallel
- Difficult to make correct: easy to make mistakes
  - This particular example is easy: it requires only one lock per critical section
- Consider a critical section that requires two locks...

struct acct_t { int bal, lock; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
acquire(accts[id].lock);
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  give_cash();
}
release(accts[id].lock);

Multiple Locks
- Multiple locks: e.g., acct-to-acct transfer
  - Must acquire both id_from and id_to locks
  - Running example with accts 241 and 37
  - Simultaneous transfers 241 → 37 and 37 → 241
  - Contrived, but even contrived examples must work correctly too

struct acct_t { int bal, lock; };
shared struct acct_t accts[MAX_ACCT];
int id_from, id_to, amt;
acquire(accts[id_from].lock);
acquire(accts[id_to].lock);
if (accts[id_from].bal >= amt) {
  accts[id_from].bal -= amt;
  accts[id_to].bal += amt;
}
release(accts[id_to].lock);
release(accts[id_from].lock);
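A two-lock transfer like the one above deadlocks if two threads acquire the locks in opposite orders; acquiring them in a fixed global order (e.g., lower account index first) avoids this. A runnable sketch of the ordered-acquisition transfer, using pthread mutexes for the per-account locks (names are illustrative):

```c
#include <pthread.h>

#define MAX_ACCT 256
struct acct_t { int bal; pthread_mutex_t lock; };
static struct acct_t accts[MAX_ACCT];

static int min(int a, int b) { return a < b ? a : b; }
static int max(int a, int b) { return a > b ? a : b; }

/* Always lock the lower-numbered account first, so the simultaneous
   241->37 and 37->241 transfers cannot circularly wait. */
static void transfer(int id_from, int id_to, int amt) {
    int id_first  = min(id_from, id_to);
    int id_second = max(id_from, id_to);
    pthread_mutex_lock(&accts[id_first].lock);
    pthread_mutex_lock(&accts[id_second].lock);
    if (accts[id_from].bal >= amt) {
        accts[id_from].bal -= amt;
        accts[id_to].bal   += amt;
    }
    pthread_mutex_unlock(&accts[id_second].lock);
    pthread_mutex_unlock(&accts[id_first].lock);
}

struct req { int from, to, amt; };
static void *xfer_thread(void *arg) {
    struct req *r = arg;
    transfer(r->from, r->to, r->amt);
    return NULL;
}

/* The running example: simultaneous 241->37 and 37->241 transfers. */
int demo(void) {
    pthread_mutex_init(&accts[241].lock, NULL);
    pthread_mutex_init(&accts[37].lock, NULL);
    accts[241].bal = 500;
    accts[37].bal  = 500;
    struct req a = { 241, 37, 100 }, b = { 37, 241, 100 };
    pthread_t t0, t1;
    pthread_create(&t0, NULL, xfer_thread, &a);
    pthread_create(&t1, NULL, xfer_thread, &b);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return accts[241].bal + accts[37].bal;   /* total conserved: 1000 */
}
```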
Multiple Locks And Deadlock
- Thread 0: id_from = 241, id_to = 37; acquires accts[241].lock, waits to acquire accts[37].lock... waiting... still waiting...
- Thread 1: id_from = 37, id_to = 241; acquires accts[37].lock, waits to acquire accts[241].lock... waiting...
- Deadlock: circular wait for shared resources
  - Thread 0 has lock 241, waits for lock 37
  - Thread 1 has lock 37, waits for lock 241
- Obviously, this is a problem
- The solution is...

Correct Multiple Lock Program
- Always acquire multiple locks in the same order
  - Just another thing to keep in mind when programming
  - Ho hum...

struct acct_t { int bal, lock; };
shared struct acct_t accts[MAX_ACCT];
int id_from, id_to, amt;
int id_first = min(id_from, id_to);
int id_second = max(id_from, id_to);
acquire(accts[id_first].lock);
acquire(accts[id_second].lock);
if (accts[id_from].bal >= amt) {
  accts[id_from].bal -= amt;
  accts[id_to].bal += amt;
}
release(accts[id_second].lock);
release(accts[id_first].lock);

Correct Multiple Lock Execution
- Thread 0: id_from = 241, id_to = 37; id_first = min(241,37) = 37, id_second = max(241,37) = 241; acquires accts[37].lock, then accts[241].lock; does stuff; releases accts[241].lock, then accts[37].lock
- Thread 1: id_from = 37, id_to = 241; id_first = 37, id_second = 241; waits to acquire accts[37].lock... waiting... then acquires accts[37].lock and proceeds
- Great, are we done? No!

More Lock Madness
- What if...
  - Some actions (e.g., deposits, transfers) require 1 or 2 locks...
  - ...and others (e.g., prepare statements) require all of them?
  - Can these proceed in parallel?
- What if...
  - There are locks for global variables (e.g., an operation-id counter)?
  - When should operations grab this lock?
- What if? What if? What if?
- So lock-based programming is difficult...
  - ...wait, it gets worse

Research: Transactional Memory (TM)
- Transactional Memory
  + Programming simplicity of coarse-grain locks
  + Higher concurrency (parallelism) of fine-grain locks
  - Critical sections only serialized if data is actually shared
  + No lock acquisition overhead
- Hottest thing since sliced bread
  - No fewer than 9 research projects: Brown, Stanford, MIT, Intel...
  - ...Penn, too

And To Make It Worse
- Acquiring locks is expensive
  - By definition, requires slow atomic instructions
  - Specifically, acquiring write permissions to the lock
  - Ordering constraints (see soon) make it even slower
- ...and 99% of the time unnecessary
  - Most concurrent
actions don't actually share data
  - You're paying to acquire the locks for no reason
- Fixing these problems is an area of active research
  - One proposed solution: Transactional Memory

Transactional Memory: The Big Idea
- Big idea I: no locks, just shared data
  - Look ma, no locks
- Big idea II: optimistic (speculative) concurrency
  - Execute critical section speculatively, abort on conflicts
  - "Better to beg for forgiveness than to ask for permission"

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id_from, id_to, amt;
begin_transaction();
if (accts[id_from].bal >= amt) {
  accts[id_from].bal -= amt;
  accts[id_to].bal += amt;
}
end_transaction();

Transactional Memory: Read/Write Sets
- Read set: the set of shared addresses the critical section reads
  - Example: accts[37].bal, accts[241].bal
- Write set: the set of shared addresses the critical section writes
  - Example: accts[37].bal, accts[241].bal
  (same code as above)

Transactional Memory: Begin
- begin_transaction
  - Take a local register checkpoint
  - Begin locally tracking the read set (remember addresses you read)
    - See if anyone else is trying to write it
  - Locally buffer all of your writes (invisible to other processors)
+ Local actions only: no lock acquire

Transactional Memory: End
- end_transaction
  - Check the read set: is all data you read still valid (i.e., no writes to any of it)?
  - Yes? Commit the transaction: commit writes
  - No? Abort the transaction: restore the checkpoint
Transactional Memory Implementation
- How are the read-set/write-set implemented?
  - Track locations accessed using bits in the cache
- Read-set: additional "transactional read" bit per block
  - Set on reads between begin_transaction and end_transaction
  - Any other processor's write to a block with a set bit triggers abort
  - Flash-cleared on transaction abort or commit
- Write-set: additional "transactional write" bit per block
  - Set on writes between begin_transaction and end_transaction
  - Flash-cleared on transaction commit
  - On transaction abort, blocks with a set bit are invalidated

Transactional Execution
- Thread 0: id_from = 241, id_to = 37; Thread 1: id_from = 37, id_to = 241
- Both begin transactions and test their balances (accts[241].bal > 100, accts[37].bal > 100)
- Thread 1 writes accts[37].bal and accts[241].bal; at end_transaction it finds no writes to its read set → commit
- Thread 0 sees the write to accts[241].bal (in its read set) → abort

So, Let's Just Do Transactions?
- What if...
  - The read-set or write-set is bigger than the cache?
  - The transaction gets swapped out in the middle?
  - The transaction wants to do I/O or a SYSCALL (not abortable)?
- How do we "transactify" existing lock-based programs?
  - Replacing acquire with begin_trans does not always work
- Several different kinds of transaction semantics
  - Which one do we want?
- That's what these research groups are looking at
- Industry adoption
  - Sun's "Rock" processor has best-effort hardware TM
  - Speculative locking: Azul Systems, and Intel (rumor)

Transactional Execution II (More Likely)
- Thread 0: id_from = 241, id_to = 37; Thread 1: id_from = 450, id_to = 118
- Disjoint read/write sets: at end_transaction, no writes to either thread's read set → both commit
- Critical sections execute in parallel
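Hardware TM's cache-tracked read/write sets aren't expressible in portable C, but the optimistic "speculate, check, commit-or-retry" flavor can be sketched for a single location with a compare-and-swap retry loop. This is an analogy, not the slides' mechanism; all names are illustrative:

```c
#include <stdatomic.h>

/* A single-word optimistic update: read speculatively, compute a new
   value privately, then commit only if no one else wrote in the
   meantime; otherwise retry (the analog of abort + re-execute). */
static atomic_int bal;

static int try_withdraw(int amt) {
    for (;;) {
        int old = atomic_load(&bal);        /* "read set": remember what we read */
        if (old < amt)
            return 0;                       /* insufficient funds: no write */
        int desired = old - amt;            /* buffered, private new value */
        /* "end_transaction": commit only if bal still equals old */
        if (atomic_compare_exchange_weak(&bal, &old, desired))
            return 1;                       /* commit succeeded */
        /* else: conflict detected, loop = abort and retry */
    }
}

int demo(void) {
    atomic_store(&bal, 500);
    try_withdraw(100);
    try_withdraw(100);
    return atomic_load(&bal);               /* 500 - 100 - 100 = 300 */
}
```

As with TM, uncontended updates pay no lock-acquisition cost, and a conflict is detected at commit time rather than prevented up front.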
Roadmap Checkpoint
- Cache coherence
  - Bus-based protocols
  - Directory protocols
- Memory consistency models

No-Cache, No-Problem
- Scenario I: processors have no caches
- Two withdrawals: P0 runs 0-5 (bal: 500 → 400), then P1 runs 0-5 (bal: 400 → 300)
- No problem: memory always holds the current value
- (Assembly in these slides: addi r3,r1,&accts; lw r4,0(r3); blt r4,r2,6; sub r4,r4,r2; sw r4,0(r3); jal dispense_cash)

Cache Incoherence
- Scenario IIa: processors have write-back caches
- Potentially 3 copies of accts[241].bal: memory, P0's cache, P1's cache
- Can get incoherent (inconsistent)

Write-Through Doesn't Fix It
- Scenario IIb: processors have write-through caches
- This time, only 2 different copies of accts[241].bal
- No problem? What if another withdrawal happens on processor 0?

What To Do?
- No caches? Slow
- Make shared data uncachable? Faster, but still too slow
  - The entire accts database is technically "shared"
- Definition of loosely shared
  - Data is only really shared if two ATMs access the same acct at once
- Flush all other caches on writes to shared data?
  - May as well not have caches
- Hardware cache coherence
  - Rough goal: all caches have the same data at all times
  + Minimal flushing, maximum caching → best performance

VI → MSI
- VI protocol is inefficient: only one
cached copy allowed in the entire system
  - Multiple copies can't exist, even if read-only
    - Not a problem in the example
    - Big problem in reality
- MSI (modified-shared-invalid)
  - Fixes the problem: splits the "V" state into two states
    - M (modified): local dirty copy
    - S (shared): local clean copy
  - Allows either
    - Multiple read-only copies (S-state), OR
    - A single read/write copy (M-state)

MSI Protocol (Write-Back Cache)
- lw by processor 1 generates a BR (bus read)
  - Processor 0 responds by sending its dirty copy, transitioning to S
- sw by processor 1 generates a BW (bus write)
  - Processor 0 responds by transitioning to I

Exclusive Clean Protocol Optimization
- Most modern protocols also include an E (exclusive) state
- Interpretation: "I have the only cached copy, and it's a clean copy"
  - The subsequent sw is then a hit: no miss
- Why would this state be useful?

Cache Coherence and Cache Misses
- A coherence protocol can affect a cache's miss rate (%miss)
  - Requests from other processors can invalidate (evict) local blocks
  - 4C miss model: compulsory, capacity, conflict, coherence
  - Coherence miss: miss to a block evicted by a bus event (as opposed to a processor event)
- Cache parameters interact with coherence misses
  - Larger capacity: more coherence misses (but offset by a reduction in capacity misses)
  - Increased block size: more coherence misses
    - False sharing: "sharing" a cache line without sharing data
    - Creates pathological "ping-pong" behavior
    - Careful data placement may help, but is difficult
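The MSI transitions described above form a small per-block state machine: local reads/writes move a block toward S or M, while observed bus events from other processors downgrade or invalidate it. A sketch (a simplified model, not a full protocol; event names are illustrative):

```c
/* One cache's MSI state for a single block, driven by local accesses
   and by bus events snooped from other processors. */
typedef enum { I, S, M } msi_t;
typedef enum { LOCAL_RD, LOCAL_WR, BUS_RD, BUS_WR } event_t;

msi_t msi_next(msi_t cur, event_t ev) {
    switch (cur) {
    case I:
        if (ev == LOCAL_RD) return S;   /* read miss: fetch a clean copy */
        if (ev == LOCAL_WR) return M;   /* write miss: fetch exclusive copy */
        return I;                       /* bus events: nothing cached, ignore */
    case S:
        if (ev == LOCAL_WR) return M;   /* upgrade: issue BW for write permission */
        if (ev == BUS_WR)   return I;   /* another writer: invalidate */
        return S;                       /* local/bus reads: stay shared */
    case M:
        if (ev == BUS_RD)   return S;   /* send dirty copy (WB), downgrade to S */
        if (ev == BUS_WR)   return I;   /* another writer: WB data, invalidate */
        return M;                       /* local accesses hit */
    }
    return I;                           /* unreachable */
}
```

The two example transitions from the slides fall out directly: P1's lw (our BUS_RD, seen at P0) takes P0's block from M to S, and P1's sw (our BUS_WR) takes it to I.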
Cache Coherence and Cache Misses (continued)
- Example: direct-mapped cache, four sets, 4-bit addresses
  Cache contents (Set00, Set01, Set10, Set11) → event → outcome:
  - S:0000, M:0001, S:0010, S:0011 → Wr:0011 → upgrade miss
  - S:0000, M:0001, S:0010, M:0011 → BusRd:0000 → nothing
  - S:0000, M:0001, S:0010, M:0011 → BusWr:0010 → S→I invalidation
  - S:0000, M:0001, I:0010, M:0011 → Rd:1011 → compulsory miss
  - S:0000, M:0001, I:0010, S:1011 → Rd:0010 → coherence miss
  - S:0000, M:0001, S:0010, S:1011

Snooping Bandwidth Requirements
- Coherence events are generated on L2 misses (and write-backs)
- Some parameters
  - 2 GHz CPUs, 2 IPC, 33% memory operations, 2% of which miss in the L2, 64 B blocks, 50% dirty
  - (0.33 × 0.02 × 1.5) = 0.01 events/insn
  - 0.01 events/insn × 2 insn/cycle × 2 cycles/ns = 0.04 events/ns
  - Address request: 0.04 events/ns × 4 B/event = 0.16 GB/s
  - Data response: 0.04 events/ns × 64 B/event = 2.56 GB/s
- That's 2.5+ GB/s... per processor
  - With 16 processors, that's 40 GB/s!
  - With 128 processors, that's 320 GB/s!!
  - You can use multiple buses, but that hinders global ordering

More Snooping Bandwidth Problems
- Bus bandwidth is not the only problem
- Also processor snooping bandwidth
  - 0.01 events/insn × 2 insn/cycle = 0.02 events/cycle per processor
  - 16 processors: 0.32 bus-side tag lookups per cycle
    - Add 1 port to the cache tags? Sure
    - Invalidate over upgrade: tags smaller than data, ports less expensive
  - 128 processors: 2.56 bus-side tag lookups per cycle
    - Add 3 ports to the cache tags? Oy vey!
- Implementing inclusion (L1 is a strict subset of L2) helps a little
  - 2 additional ports on L2 tags only
  - Processor doesn't use the existing tag port most of the time
  - If L2 doesn't care (99% of the time), no need to bother L1
  - Still kind of bad, though
- Upshot: bus-based coherence doesn't scale well
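The back-of-envelope bandwidth arithmetic above can be reproduced directly (the slide rounds 0.0099 events/insn up to 0.01, hence its 2.56 GB/s figure; the exact product is slightly lower):

```c
/* Slide parameters: 33% memory ops, 2% L2 miss rate, a 1.5x factor to
   account for 50%-dirty write-backs, 2 IPC at 2 GHz (2 cycles/ns),
   64 B data blocks. */
double events_per_insn(void) {
    return 0.33 * 0.02 * 1.5;                 /* ~= 0.0099, rounded to 0.01 */
}

double data_gb_per_s(void) {
    double events_per_ns = events_per_insn()
                         * 2.0   /* insn/cycle */
                         * 2.0;  /* cycles/ns  */
    return events_per_ns * 64.0; /* B/event -> GB/s, ~2.5 */
}
```

Scaling that ~2.5 GB/s per processor by 16 or 128 processors gives the 40 GB/s and 320 GB/s totals quoted in the notes.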
Scalable Cache Coherence
- Part I: bus bandwidth
  - Replace the non-scalable bandwidth substrate (bus)...
  - ...with a scalable one (point-to-point network, e.g., mesh)
- Part II: processor snooping bandwidth
  - Observation: most snoops result in no action
  - Replace the non-scalable broadcast protocol (spam everyone)...
  - ...with a scalable directory protocol (only notify processors that care)

Scalable Cache Coherence (continued)
- Point-to-point interconnects
  + Glueless MP: no need for additional "glue" chips
  + Can be arbitrarily large: 1000's of processors
    - Massively parallel processors (MPPs)
    - Only government (DoD) has MPPs...
    - ...companies have much smaller systems: 32-64 processors
  - "Scalable" multiprocessors
  - AMD Opteron/Phenom: point-to-point glueless MP, uses broadcast

Directory Coherence Protocols
- Observe: the address space is statically partitioned
  + Can easily determine which memory module holds a given line
    - That memory module is sometimes called the "home"
  - Can't easily determine which processors have the line in their caches
  - Bus-based protocols: broadcast events to all processors/caches
    + Simple and fast, but non-scalable
- Directories: a non-broadcast coherence protocol
  - Extend memory to track caching information
  - For each physical cache line whose home this is, track:
    - Owner: which processor has a dirty copy (i.e., M state)
    - Sharers: which processors have clean copies (i.e., S state)
  - A processor sends its coherence event to the home directory
    - The home directory only sends events to processors that care

MSI Directory Protocol
- Processor side
  - The directory follows its own protocol (obvious in principle)
- Similar to bus-based MSI
  - Same three states
  - Same five actions (keep the BR/BW names)
  - Minus grayed-out arcs/actions: bus events that would not trigger action anyway
  + The directory won't bother you unless you need to act
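The per-line owner/sharers bookkeeping a home directory keeps can be sketched as a tiny data structure — a sharer bit-vector plus an owner field. This is a toy model of the idea only (field and function names are illustrative, and real directories also order and acknowledge requests):

```c
#include <stdint.h>

#define NO_OWNER -1

/* Directory entry at a block's home: which processors cache it, and how. */
typedef struct {
    uint64_t sharers;   /* bit i set => processor i has a clean (S) copy */
    int      owner;     /* processor with the dirty (M) copy, or NO_OWNER */
} dir_entry_t;

/* Read request: if a dirty owner exists, it supplies the data and
   downgrades to S (data written back to memory); the requester is
   added to the sharer set. */
void dir_read(dir_entry_t *e, int proc) {
    if (e->owner != NO_OWNER) {
        e->sharers |= 1ull << e->owner;  /* old owner keeps an S copy */
        e->owner = NO_OWNER;
    }
    e->sharers |= 1ull << proc;
}

/* Write request: the directory notifies only the processors whose
   sharer bits are set (invalidations), then records the new owner. */
void dir_write(dir_entry_t *e, int proc) {
    e->sharers = 0;
    e->owner = proc;
}
```

The point of the structure is the bullet above: on a write, invalidations go only to the set bits, not to every processor on a bus.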
Directory MSI Protocol
- Example trace (P0 and P1 run the withdrawal; the directory tracks the block's state and owner/sharers):
  - P0 runs insns 0-5: the directory entry goes S (P0 sharer, bal 500), then M (P0 owner, bal 400)
  - ld by P1 sends a BR to the directory
    - The directory forwards the BR to P0; P0 sends P1 the data, does a WB, goes to S
  - st by P1 sends a BW to the directory
    - The directory forwards the BW to P0; P0 goes to I

Directory Flip Side: Latency
- Directory protocols
  + Lower bandwidth consumption → more scalable
  - Longer latencies: 2-hop miss vs. 3-hop miss
- Two read-miss situations:
  - Unshared block: get data from memory
    - Snooping: 2 hops (P0 → memory → P0)
    - Directory: 2 hops (P0 → memory → P0)
  - Shared or exclusive block: get data from another processor (P1)
    - Assume the cache-to-cache transfer optimization
    - Snooping: 2 hops (P0 → P1 → P0)
    - Directory: 3 hops (P0 → memory → P1 → P0)
    - Common with many processors: high probability someone has it

Coherence on Real Machines
- Many uniprocessors designed with on-chip snooping logic
  - Can be easily combined to form multiprocessors
  - E.g., Intel Pentium 4 Xeon
  - Multicore
- Larger-scale directory systems built from smaller MPs
  - E.g., Sun Wildfire, NUMA-Q, IBM Summit
- Some shared memory machines are not cache coherent
  - E.g., CRAY T3D/E
  - Shared data is uncachable
  - If you want to cache shared data, copy it to a private data section
  - Basically, cache coherence implemented in software
  - You have to really know what you are doing as a programmer

Directory Flip Side: Complexity
- Latency is not the only issue for directories
  - Subtle correctness issues as well
  - They stem from the unordered nature of the underlying interconnect
- Individual requests to a single cache line must be ordered
  - Bus-based snooping: all processors see all requests in the same order
    - Ordering automatic
  - Point-to-point network: requests may arrive in different orders
    - The directory has to enforce ordering
explicitly:
    - Cannot initiate actions on request B until all relevant processors have completed actions on request A
    - Requires the directory to collect acks, queue requests, etc.
- Directory protocols
  - Obvious in principle
  - Complicated in practice

Best of Both Worlds?
- Ignore processor snooping bandwidth for a minute
- Can we combine the best features of snooping and directories?
  - From snooping: fast two-hop cache-to-cache transfers
  - From directories: scalable point-to-point networks
- In other words...
  - Can we use broadcast on an unordered network?
  - Yes, and most of the time everything is fine
  - But sometimes it isn't... protocol race
- Research proposal: Token Coherence (TC)
  - An unordered broadcast snooping protocol... without data races

Roadmap Checkpoint
- Thread-level parallelism (TLP)
- Shared memory model
  - Multiplexed uniprocessor
  - Hardware multithreading
  - Multiprocessing
- Synchronization
  - Lock implementation
  - Locking gotchas
- Cache coherence
  - Bus-based protocols
  - Directory protocols
- Memory consistency models

Hiding Store Miss Latency
- Recall (back from the caching unit)
  - Hiding store miss latency
  - How? Write buffer
- Said it would complicate multiprocessors
  - Yes, it does

Recall: Write Misses and Write Buffers
- Read miss: the load can't go on without the data, it must stall
- Write miss: technically, no instruction is waiting for the data, so why stall?
- Write buffer: a small buffer
  - Stores put address/value into the write buffer, keep going
  - The write buffer writes stores to the D$ in the background
  - Loads must search the write buffer in addition to the D$
  + Eliminates stalls on write misses (mostly)
- Write buffer vs. write-back buffer
  - Write buffer: in front of the D$, for hiding store misses
  - Write-back buffer: behind the D$, for hiding write-backs (to the next-level cache)
Memory Consistency
- Memory coherence
  - Creates a globally uniform (consistent) view...
  - ...of a single memory location (in other words, cache line)
  - Not enough
    - Cache lines A and B can be individually consistent...
    - ...but inconsistent with respect to each other
- Memory consistency
  - Creates a globally uniform (consistent) view...
  - ...of all memory locations relative to each other
- Who cares? Programmers
  - Globally inconsistent memory creates mystifying behavior

Coherence vs. Consistency
A = flag = 0;
Processor 0: A = 1; flag = 1;
Processor 1: while (!flag); // spin
             print A;
- Intuition says: P1 prints A = 1
- Coherence says: absolutely nothing
  - P1 can see P0's write of flag before its write of A! How?
  - Maybe the coherence event of A is delayed somewhere in the network
  - Or P0 has a coalescing write buffer that reorders writes
- Imagine trying to figure out why this code sometimes "works" and sometimes doesn't
- Real systems act in this strange manner

SC Doesn't Happen (Naturally)
- What is consistency concerned with?
  - P1 doesn't actually view P0's committed loads and stores
  - It views their coherence events instead
  - Consistency model: how the observed order of coherence events relates to the order of committed insns
- What does SC say?
  - Coherence event order must match committed insn order
  - And be identical for all processors
- Let's see what that implies

Sequential Consistency (SC)
(same code as above)
- Sequential consistency (SC)
  - Formal definition of the memory view programmers expect
  - Processors see their own loads and stores in program order
    + Provided naturally, even with out-of-order execution
  - But also: processors see others' loads and stores in program order
  - And finally: all processors see the same global load/store ordering
    - The last two conditions are not naturally enforced by coherence
- Lamport definition: multiprocessor ordering corresponds to some sequential
interleaving of uniprocessor orders
  - I.e., indistinguishable from a multiprogrammed uniprocessor

SC + Write Buffers
- Store misses are slow
  - Global acquisition of M state (write permission)
  - Multiprocessors have more store misses than uniprocessors
    - Upgrade miss: I have the block in S, require a global upgrade to M
- Apparent solution: write buffer
  - Commit the store to the write buffer, let it absorb the store miss latency
- But a write buffer means...
  - I see my own stores commit before everyone else sees them
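The flag idiom from the consistency slides can be written with C11 sequentially-consistent atomics, which restore the intuitive outcome: P1 is guaranteed to see A == 1 once it sees flag == 1. With plain variables, a reordering write buffer or a delayed coherence event could legally break this. A sketch (thread structure and names are illustrative):

```c
#include <stdatomic.h>
#include <pthread.h>

static int A = 0;                 /* ordinary data */
static atomic_int flag = 0;       /* synchronization variable */

static void *p0(void *arg) {
    (void)arg;
    A = 1;
    atomic_store(&flag, 1);       /* seq_cst: A's write ordered before flag's */
    return NULL;
}

static int seen_A;
static void *p1(void *arg) {
    (void)arg;
    while (atomic_load(&flag) == 0)
        ;                         /* spin until P0's flag write is visible */
    seen_A = A;                   /* guaranteed to observe A == 1 */
    return NULL;
}

int demo(void) {
    pthread_t t0, t1;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t0, NULL, p0, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return seen_A;                /* 1 */
}
```

In effect, the seq_cst store/load pair forbids exactly the write-buffer reordering the "SC + Write Buffers" slide warns about; weaker memory orders would reopen that door.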