COMP ARCHITECTURE CS 538
This 53-page set of class notes was uploaded by Orrin Rutherford on Wednesday, September 2, 2015. The notes belong to CS 538 at Portland State University, taught by Staff in Fall.
Introduction to CS 538, Unit 1

Credit where credit's due:
- Some materials adapted, with permission, from Patterson, D., slides for CS 252 Graduate Computer Architecture, 2001, UCB
- Some materials adapted, with permission, from Koppelman, D., slides for EE 4720 Computer Architecture, 2003, LSU
- Some materials adapted from Hennessy, J. L., and Patterson, D. A., Computer Architecture: A Quantitative Approach, 3rd ed.
- All other material is (c) 2007 Dave Archer, Portland State University

"On two occasions I have been asked by members of Parliament, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question." - Charles Babbage

Computer Architecture Is:
- Applying principles and exploiting technology trends
  - To meet goals of throughput, cost/performance, power
- Presenting a manageable set of abstractions
  - To guide compiler engineers, OS engineers, IC designers
- Developing standards
  - To promote synergy, solidify market control

Course Topics:
- Basic principles, technology trends
- Abstractions: the interface to the machine
  - Data types and addressing modes
  - Instruction set classes, encoding
- Single-processor throughput
  - Pipelining: hazards and static scheduling
  - Traps, faults, and interrupts
  - Dynamic pipeline scheduling
  - Speculative execution
  - Superscalar design
  - Memory hierarchy: caches
  - More memory hierarchy: virtual memory
- Multiprocessor implications for throughput
  - Multiprocessors: introduction
  - Multiprocessors: memory coherence, synchronization

Three Principles:
- Make the common case fast
  - Amdahl's Law
  - Law of Diminishing Returns
    - After a while, making the common case fast doesn't help
- Locality of reference
- Golden handcuffs
  - Architectures, like diamonds, are forever

Amdahl's Law:
- Characterizes overall system speedup when some fraction of a system is improved
- Simple form, for a fraction F sped up by a factor S:

    Speedup_overall = 1 / ((1 - F) + F/S)

  Note that as S -> infinity, Speedup_overall -> 1 / (1 - F): make the common case fast.
- Less simple form (the law of diminishing returns), for n fractions F_k, each sped up by a factor S_k (the unimproved fraction has S = 1):

    Speedup_overall = 1 / (sum over k = 0..n of F_k / S_k)

Example:
- Suppose 40% of instructions in a program can be made to run twice as fast
- How much faster is the whole program?

    Speedup_overall = 1 / (0.6 + 0.4/2) = 1 / 0.8 = 1.25 times as fast as before

A More Complex Example:
- Suppose 10% of instructions run 3x faster, 30% of instructions run 5x faster, 40% of instructions run 8x faster, and 20% of instructions do not speed up
- How much faster is the whole program?

    Speedup_overall = 1 / (0.1/3 + 0.3/5 + 0.4/8 + 0.2/1) = 1 / 0.3433 = 2.91

Locality of Reference:
- Most programs exhibit two characteristics:
  - Data items tend to be accessed repeatedly in a short period
    - Examples: summation variable in a loop, loop counters
    - This is temporal locality
  - Data items close together in memory tend to be referenced close together in time
    - Examples: program instructions, elements of an array
    - This is spatial locality
- All computer system designs rely heavily on these
  - Programs that do not exhibit them tend to perform poorly
  - Locality can bite you too, e.g. false sharing

Golden Handcuffs:
- Example: x86
  - 1978: Intel introduces the 8086 / 8088
    - 16-bit addressing, 4.77 MHz
    - 16-bit (8086) or 8-bit (8088) external interface
  - 1985: debut of the 80386
    - 32-bit addressing
    - Up to 33 MHz clock speed
  - 1989: 80486 DX
    - Integrated FPU, on-board cache, 25-100 MHz
  - MMX, 64-bit extensions, and on, and on
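The two Amdahl's Law examples above can be checked numerically. Here is a minimal sketch; the helper name `amdahl` is mine, not from the course materials:

```python
def amdahl(parts):
    """Overall speedup from Amdahl's Law.

    parts: list of (F, S) pairs, where fraction F of execution time is
    sped up by factor S; any remaining fraction runs at original speed.
    """
    remaining = 1.0 - sum(f for f, _ in parts)
    return 1.0 / (remaining + sum(f / s for f, s in parts))

# First example: 40% of instructions run twice as fast.
print(round(amdahl([(0.4, 2)]), 2))  # 1.25

# Second example: 10% at 3x, 30% at 5x, 40% at 8x, 20% unchanged.
print(round(amdahl([(0.1, 3), (0.3, 5), (0.4, 8)]), 2))  # 2.91
```

Note how the 20% that does not speed up dominates the second result: even with large speedups elsewhere, the overall gain is capped below 5x.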
  - A 2007 example: quad-core Intel Xeon 5300-series processor
- ISA, memory model, and data types common across ~30 years
- Pain happens when people forget, e.g. Itanium

Technology Trends:
- Drill down on 4 technologies: disks, memory, network, processors
- Compare 1980 vs. 2000
- Compare bandwidth vs. latency improvements over time
  - Bandwidth: number of events per unit time
    - e.g. Mb/second over a network, MB/second from disk or memory
  - Latency: elapsed time for a single event
    - e.g. network delay in microseconds, average disk access time in milliseconds

Disks, Archaic vs. Modern:
- CDC Wren I, 1983
  - 3600 RPM, 0.03 GB capacity
  - Tracks/inch: 800; bits/inch: 9,550
  - Three 5.25" platters
  - Bandwidth: 0.6 MB/sec; latency: 48.3 ms; cache: none
- Seagate 373453, 2003
  - 15,000 RPM (4x), 73.4 GB (2,500x)
  - Tracks/inch: 64,000 (80x); bits/inch: 533,000 (60x)
  - Four 2.5" platters in a 3.5" form factor
  - Bandwidth: 86 MB/sec (140x); latency: 5.7 ms (8x); cache: 8 MB

[Chart: relative bandwidth improvement vs. relative latency improvement. Disk milestones: 3600, 5400, 7200, 10000, 15000 RPM; latency improved ~8x, bandwidth ~143x. Latency is for a simple operation without contention; bandwidth is best-case.]

Memory, Archaic vs. Modern:
- 1980: DRAM (asynchronous)
  - 0.06 Mbits/chip; 64,000 transistors, 35 mm2
  - 16-bit data bus per module, 16 pins/chip
  - 13 MB/sec; latency: 225 ns; no block transfer
- 2000: Double Data Rate synchronous (clocked) DRAM
  - 256 Mbits/chip (4,000x); 256,000,000 transistors, 204 mm2
  - 64-bit data bus per DIMM, 66 pins/chip (4x)
  - 1,600 MB/sec (120x); latency: 52 ns (4x); block transfers (page mode)

[Chart: relative bandwidth vs. latency improvement. Memory-module milestones: 16-bit plain DRAM, page-mode DRAM, 32b/64b SDRAM, DDR SDRAM; latency improved ~4x, bandwidth ~120x. Disk milestones as above.]
LANs, Archaic vs. Modern:
- Ethernet 802.3
  - Year of standard: 1978
  - 10 Mbits/s link speed
  - Latency: 3,000 usec
  - Shared media
  - Coaxial cable: plastic covering, braided outer conductor, insulator, copper core
- Ethernet 802.3ae
  - Year of standard: 2003
  - 10,000 Mbits/s link speed (1,000x)
  - Latency: 190 usec (15x)
  - Switched media
  - Category 5 copper wire: "Cat 5" is 4 twisted pairs in a bundle; 1 mm thick copper, twisted to avoid antenna effects

[Chart: relative bandwidth vs. latency improvement. Ethernet milestones: 10 Mb, 100 Mb, 1000 Mb, 10,000 Mb/s; latency improved ~16x, bandwidth ~1,000x. Memory and disk milestones as above.]

CPUs, Archaic vs. Modern:
- 1982: Intel 80286
  - 12.5 MHz, 2 MIPS peak
  - 134,000 transistors, 47 mm2
  - 16-bit data bus, 68 pins
  - Single ALU, FPU on a separate chip
  - Microcoded CISC; not pipelined; strictly in-order; no caches
- 2001: Intel P4
  - 1,500 MHz, 4,500 MIPS peak (2,250x)
  - 42,000,000 transistors, 217 mm2
  - 64-bit data bus, 423 pins
  - 3-way superscalar; dynamic translation to RISC
  - Superpipelined (22 stages); out-of-order execution
  - On-chip 8 KB data cache, 96 KB instruction trace cache, 256 KB L2 cache

[Chart: CPU improvement is high, memory improvement is low. Processor milestones: 286, 386, 486, Pentium, Pentium Pro, P4; Ethernet, memory, and disk milestones as above.]

- Increasingly, memory bandwidth lags CPU bandwidth
- Overall, bandwidth improves by roughly the square of the latency improvement

Moore's Law:
- Transistors per IC double every 12-24 months
- Recently:
  - Intel reaches the 90/65 nm crossover
  - Intel Core 2: 291M transistors; Intel Pentium D: 376M transistors
  - 65 nm stackup: 8 metal layers; gate length 35 nm; t_ox 1.2 nm
  - Intel 45 nm
    test vehicle: 153 Mb SRAM, ~10^9 transistors per chip

[Chart: Moore's Law, transistors per chip, 1970-2005. Source: http://www.intel.com/technology/mooreslaw/index.htm]

Power Trends

What's a Clock Cycle?
- Latch (or register) -> combinational logic -> register
- t_cycle = 1/f; e.g. 1 ns corresponds to 1 GHz
- Was dominated by gate delays
- Now dominated by interconnect and skew

Power in CMOS:
- P = I x V
- But I = dq/dt, so P = (dQ/dt) x V
- But Q = C x V, and C is constant, so P = C x (dV/dt) x V
- But 1/dt corresponds to f, so P = C x dV x f x V
- A node switches only every other cycle on average, so P = C x V x (f/2) x V
- Collecting terms: P = 1/2 x C x V^2 x f
- So designers often fight f (rising) with V (falling); note that C and f are driven by transistors and technology

Implications of P = 1/2 C V^2 f:
- For a fixed task, lowering f reduces power but not energy (important for batteries)
- Dropping voltage helps both; hence the move from 5 V to 1 V
- Dropping active C helps both, so most CPUs now turn off the clock and/or supply to inactive units
- However, as V falls and transistor lengths shrink, leakage current (current that flows even when a transistor is off) rises, so static power is now important too
- In 2006, the goal for leakage was ~25% of total power consumption, with high-performance designs at 40%

Power Keeps Trending Up:
[Chart: Intel CPU power dissipation (W) vs. frequency (MHz): Pentium, Pentium II, Pentium II (Deschutes), Pentium III, Pentium III (later design), Pentium 4, Core Solo, Core Duo, Itanium 2.]
- Process generations and new approaches to power management give the power trend an occasional reset
- Data centers: ~20-year lifetime, constant power footprint
- Laptops: power envelope set by battery energy and lap temperature

Summary: Technology Trends:
- Memory bandwidth lags CPU demand
  - Architectures must tolerate memory bandwidth limits to avoid being held back by the "memory wall"
- Bandwidth improves more quickly than latency
  - Architectures must hide more and more latency to exploit available bandwidth for performance
- Power envelope remains a key challenge
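The P = 1/2 C V^2 f relation above can be explored with a small sketch. The capacitance, voltage, and frequency values below are hypothetical, chosen only to show the scaling:

```python
def dynamic_power(c_switched, v_supply, freq):
    """Dynamic CMOS switching power: P = 1/2 * C * V^2 * f (watts)."""
    return 0.5 * c_switched * v_supply ** 2 * freq

# Hypothetical operating point: 1 nF switched capacitance, 1.2 V, 2 GHz.
base = dynamic_power(1e-9, 1.2, 2e9)

# Scale V and f together by 0.7 (a made-up DVFS step): power falls as 0.7^3.
scaled = dynamic_power(1e-9, 1.2 * 0.7, 2e9 * 0.7)
print(round(scaled / base, 3))  # 0.343
```

Because V appears squared, voltage is the big lever: the same 30% reduction applied to f alone would only cut power to 0.7 of the base, which is why designers "fight f with V".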
  - Architectures must exploit efficiency, not frequency, to meet performance goals without exceeding power budgets
- Transistor count doubles every 12-24 months
  - Transistors are "free": architectures should use them to address bandwidth, latency, and efficiency issues

Instruction Sets and Data Types

The ISA Abstraction:
- Software: applications; compilers; libraries; operating system; sometimes firmware (aka ucode)
- The ISA sits between them: execution model, I/O system
- Hardware: microarchitecture; logic design; circuit design; process technology
- Tooling at each level: compilation; interconnect and synthesis; custom layout, cell libraries, programmables

CS 538 Administrivia:
- Text: John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann, 2003
- Prerequisites:
  - Officially: CS 322 or 333, per department listings
  - Unofficially: a solid understanding of basic computer architecture, at least equivalent to ECE 341 Computer Hardware (we're not kidding)
- Class website: www.cs.pdx.edu/~... (the instruction page has a link to CS 538 Fall '07)
  - Please review the course website thoroughly

Grading:
- 30%: weekly homework
- 35%: exam over material in the first half
- 35%: exam over material in the second half

Fundamentals in Review:
- Instruction Set Architecture
- Execution Model and Pipelining
- Performance
- Memory Hierarchy

ISA: The Key Abstraction
(the same hardware/software layer diagram as above: the ISA is the contract between software and hardware)

Evolution of Instruction Sets:
- Single accumulator (EDSAC, 1950)
- Accumulator + index registers (Manchester Mark I, IBM 700 series, 1953)
- Separation of programming model from implementation
  - High-level-language based (B5000, 1963); concept of a family (IBM 360, 1964)
- General-purpose register machines
  - Complex instruction sets (VAX, Intel 432, 1977-80); load/store architecture (CDC 6600, Cray-1, 1963-76)
- CISC (Intel x86, 1980-199x); RISC (MIPS, SPARC, HP-PA, IBM RS6000, PowerPC, 1987)
- Mixed CISC & RISC (IA-64, 1999)

Instruction Set Architecture, Defined:
- "The attributes of a computing system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." - Amdahl, Blaauw, and Brooks, 1964
- What: data types; instruction formats and their representation
- Where: address space size; memory map
- How: instruction set; addressing modes; privilege modes
- When: condition codes; branch tests
- Whoa!: exception/interrupt model

Fundamentals in Review: Execution Model and Pipelining

The Execution Model: Sequential Instruction Execution
1. Instruction Fetch: send PC to memory; fetch the instruction
2. Instruction Decode: determine the required action
3. Operand Fetch: locate and obtain operands for the action
4. Execute: compute the result value, status, or memory address
5. Memory Access: write or read at the computed address in memory
6. Write Back: deposit results in storage for later use, or fetch desired data from memory

But:

    IF D E M W | IF D E M W | IF D E M W    (time ->)

- Must we do it all serially?
  - Utilization of most logic is wasted
  - Instruction retirement speed is limited to the cycle rate of the entire execution
  - Technology lets us clock far faster
- So how about a production-line approach?

Sequential Laundry:
[Chart: 4 loads, 6 PM to midnight; each load takes 30 min wash + 40 min dry + 20 min fold.]
- Sequential laundry takes 6 hours for 4 loads
- If we pipelined laundry, how long would it take?

Pipelined Laundry:
[Chart: 4 loads starting at 6 PM, stages overlapped across loads.]
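The laundry arithmetic can be reproduced in a few lines. The 30/40/20-minute stage times are the values from the slides; the helper names are mine:

```python
def sequential_time(stages, loads):
    """Each load runs every stage to completion before the next load starts."""
    return loads * sum(stages)

def pipelined_time(stages, loads):
    """Loads follow one another; the issue rate is set by the slowest stage."""
    return sum(stages) + (loads - 1) * max(stages)

stages = [30, 40, 20]  # wash, dry, fold, in minutes (from the slides)
print(sequential_time(stages, 4))  # 360 minutes = 6 hours
print(pipelined_time(stages, 4))   # 210 minutes = 3.5 hours
```

The pipelined formula assumes a load can always advance as soon as the slowest stage frees up, which holds for these stage times; it matches the slides' 3.5-hour result.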
- Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons:
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Time to fill the pipeline and time to drain it reduce speedup
- Pipelining must still preserve the sequential execution model

Fundamentals in Review: Performance

The Iron Triangle:

    CPU time = Seconds/Program = (Instructions/Program) x (Avg Cycles/Instruction) x (Seconds/Cycle)

- The three factors: instruction count, CPI, clock rate (seconds/cycle = 1/clock rate)
- What affects what:
  - Program: instruction count
  - Compiler: instruction count, CPI
  - Instruction set: instruction count, CPI
  - Microarchitecture: CPI, clock rate
  - Technology: clock rate
- CPI = mean cycles per instruction

Example: CPI Estimation (a typical mix of instruction types in a program)

    Op      Freq   Cycles   CPI_i   % Time
    ALU     50%    1        0.5     33%
    Load    20%    2        0.4     27%
    Store   10%    2        0.2     13%
    Branch  20%    2        0.4     27%
                   Total:   1.5

Relative Performance (Ideal):
- Key metric: speedup, the ratio of old to new performance
- Average instruction time T_i = CPI x T_c
  - CPI: cycles per instruction; T_c: clock cycle time = 1 / clock rate
  - CPU time = instruction count x CPI x T_c
- Let Speedup_pipelining = T_i(unpipelined) / T_i(pipelined); substituting, we have

    Speedup = (CPI_unpipelined x Tc_unpipelined) / (CPI_pipelined x Tc_pipelined)

- But the CPI of an unpipelined machine is exactly 1, and we assume the ideal CPI of our pipeline is 1 (else why do it)
- We also assume we get the full benefit of the pain of designing our pipeline, i.e. Tc_unpipelined / Tc_pipelined = depth of the pipeline
- So, ideally: Speedup = pipeline depth
- Engineers seldom get ideal situations

Fundamentals in Review: Memory Hierarchy

The Memory Abstraction:
- An association of <name, value> pairs
  - Typically named as byte addresses
  - Often values are aligned on multiples of their size
- A sequence of reads and writes
  - Write binds a value to an address
  - Read returns the value most recently bound to that address
- Interface: command (R/W), address (name), write data, read data, done
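Before turning to the memory hierarchy, the CPI-estimation table from the Iron Triangle material above can be checked in a few lines; the instruction mix is the one from the slide, while the code itself is a sketch of mine:

```python
# Instruction mix from the CPI-estimation slide: class -> (frequency, cycles).
mix = {"ALU": (0.50, 1), "Load": (0.20, 2), "Store": (0.10, 2), "Branch": (0.20, 2)}

# CPI is the frequency-weighted mean of per-class cycle counts.
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # 1.5

# Each class's share of execution time is F_i * C_i / CPI;
# this reproduces the 33/27/13/27 "% Time" column.
for op, (freq, cycles) in mix.items():
    print(op, round(freq * cycles / cpi, 2))

# Iron Triangle: CPU time = instruction count x CPI x cycle time.
# The workload below is made up for illustration: 1e9 instructions at 2 GHz.
print(round(1e9 * cpi / 2e9, 2), "seconds")  # 0.75 seconds
```

Note that although loads are only 20% of the instructions, they account for 27% of the time, which is why memory operations get so much architectural attention.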
The Idea: A Memory Hierarchy
- Upper levels are smaller, faster, and more expensive per byte; lower levels are larger, slower, and cheaper; data is staged between levels in progressively larger transfer units:
  - Registers: 100s of bytes, < 1 ns; staged by the program/compiler, 1-8 bytes at a time (instructions and operands)
  - Cache: K bytes to ~10 MB (may be multiple levels); managed by the cache controller in blocks of 8-128 bytes
  - Main memory: G bytes, ~$0.03/MB; managed by the OS in pages of 4K bytes, ~10 ms to fault to disk
  - Disk: 100s of G bytes, ~$0.10/GB; managed by the user/operator in files
  - Tape: "infinite" capacity, "forever" access time, way cheap

Why Does Hierarchy Work?
- Locality: programs access a relatively small portion of the address space at any instant of time
- Two different types:
  - Temporal locality: if an item is referenced, it will tend to be referenced again soon (e.g. loops, reuse)
  - Spatial locality: if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g. straight-line code, array access)

4 Questions for the Memory Hierarchy:
- Q1: Where can a block be placed in the cache? (block placement)
- Q2: How is a block found if it is in the cache? (block identification)
- Q3: Which block should be replaced on a miss? (block replacement)
- Q4: What happens on a write? (write strategy)
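As a toy illustration of Q1-Q3, here is a minimal direct-mapped cache model (all sizes are hypothetical): placement is forced by the index, identification is a tag compare, and replacement involves no choice of victim.

```python
# A toy direct-mapped cache: 8 sets, 16-byte blocks (sizes are hypothetical).
NUM_SETS, BLOCK_SIZE = 8, 16
cache = {}  # set index -> tag of the block currently resident there

def access(addr):
    """Return True on a hit; on a miss, install the block (Q3: no victim choice)."""
    block = addr // BLOCK_SIZE
    index = block % NUM_SETS   # Q1, placement: the one set this block may occupy
    tag = block // NUM_SETS    # Q2, identification: distinguishes blocks sharing a set
    hit = cache.get(index) == tag
    if not hit:
        cache[index] = tag
    return hit

# Addresses 0 and 4 share a block (spatial locality); 128 maps to the same
# set as 0, so it evicts it, and the final access to 0 is a conflict miss.
print([access(a) for a in (0, 4, 128, 0)])  # [False, True, False, False]
```

A set-associative cache would relax Q1 (a block may go in any way of its set), which is exactly when Q3 starts to need a real policy such as LRU.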