Advanced Computer Architecture (ECE 4100)
These class notes were uploaded by Cassidy Effertz on Monday, November 2, 2015. They belong to ECE 4100 at Georgia Institute of Technology (Main Campus), taught by Yalamanchili in Fall.
DRAM Architectures, Interfaces, and Systems: A Tutorial
DRAM Tutorial, ISCA 2002. Bruce Jacob and David Wang, Electrical & Computer Engineering Dept., University of Maryland at College Park. http://www.ece.umd.edu/blj

Outline: Basics; DRAM Evolution, Structural Path; Advanced Basics; DRAM Evolution, Interface Path; Future Interface Trends & Research Areas; Performance Modeling (Architectures, Systems, Embedded). Break at 10 am. Stop us or starve.

Basics: DRAM Organization. Figure: the memory array is addressed by word lines and read through bit lines; each cell is a switching element plus a storage capacitor, read out by the sense amps.

Basics: Bus Transmission. Figure: the CPU and memory controller sit on the bus; the DRAM chip contains a row decoder, the memory array, sense amps, a column decoder, and data in/out buffers.

Basics: Precharge and Row Access. Driving a row onto the sense amps is AKA opening a DRAM page/row, ACT (Activate), or RAS (Row Address Strobe).

Basics: Column Access. Selecting a column from the open row is the READ command, or CAS (Column Address Strobe).

Basics: Data Transfer. Data moves out through the sense amps and output buffers, with optional additional CAS commands; note that page mode enables overlap with the next CAS.

Basics: Bus Transmission. The data then travels back over the bus to the memory controller and CPU.
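Before the RAS/CAS sequence above, the controller has already split the address into row and column pieces. A minimal sketch; the field widths here (12-bit row, 1-bit bank, 10-bit column) are illustrative assumptions, not figures from the tutorial:

```python
# Split a flat physical address into (row, bank, column) fields, as a
# DRAM controller does before issuing RAS (row) and CAS (column).
# Field widths below are hypothetical.

ROW_BITS, BANK_BITS, COL_BITS = 12, 1, 10

def split_address(addr: int) -> tuple[int, int, int]:
    """Return (row, bank, column) for a flat physical address."""
    col = addr & ((1 << COL_BITS) - 1)
    addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    return row, bank, col
```

Under these assumed widths, address 0x2C07 maps to row 5, bank 1, column 7.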
Basics: Where the Latency Goes. From CPU through memory controller to DRAM and back:
A. Transaction request may be delayed in a queue.
B. Transaction request sent to the memory controller.
C. Transaction converted to command sequences (may be queued).
D. Commands sent to the DRAM.
E1. Requires only a CAS, or
E2. Requires RAS + CAS, or
E3. Requires PRE + RAS + CAS.
F. Transaction sent back to the CPU.
DRAM latency = A + B + C + D + E + F.

Basics: Physical Organization. x2, x4, and x8 DRAMs deliver 2, 4, or 8 bits per column access from the array through the sense amps and column decoder. This organization is per bank; typical DRAMs have 2 banks.

Basics: Read Timing for Conventional DRAM. Timing diagram: row access, then column access, then data transfer; the address lines carry a valid row address then column address, and DQ carries valid data out per column access.

DRAM Evolutionary Tree. From conventional DRAM: structural modifications targeting latency (e.g., FCRAM); mostly structural modifications targeting throughput (FPM, EDO, P/BEDO, SDRAM, ESDRAM, VCDRAM); interface modifications targeting throughput (Rambus, DDR/2); future trends.

DRAM Evolution, read-timing diagrams (each shades row access, column access, transfer overlap, and data transfer):
Read Timing for Conventional DRAM.
Read Timing for Fast Page Mode: the row stays open, so successive column accesses skip the row access.
Read Timing for Extended Data Out (EDO): output is latched, so data transfer overlaps the next column access.
Read Timing for Burst EDO.
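The E1/E2/E3 cases above determine how much command latency an access pays, depending on the state of the addressed bank. A sketch; the timing values are illustrative assumptions, not figures from the tutorial:

```python
# DRAM-command latency by bank state (E1/E2/E3). Timings in ns are
# assumed, equal values chosen only for readability.

T_PRE, T_RAS, T_CAS = 20, 20, 20

def dram_command_latency(open_row, requested_row):
    """Latency (ns) of the command sequence for one access."""
    if open_row == requested_row:      # E1: row already open -> CAS only
        return T_CAS
    if open_row is None:               # E2: bank precharged -> RAS + CAS
        return T_RAS + T_CAS
    return T_PRE + T_RAS + T_CAS       # E3: conflict -> PRE + RAS + CAS
```

The three cases are why open-page policies matter: a page hit pays one term, a page conflict pays all three.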
Read Timing for Pipeline Burst EDO.
Read Timing for Synchronous DRAM: valid data every clock.

DRAM Evolution: SDRAM vs. ESDRAM back-to-back read timing (CAS-2) to same and different bank rows.
DRAM Evolution: ESDRAM Write-Around.

DRAM Evolution: Internal Structure of Virtual Channel. 16 channels; banks A and B; input/output buffer; 2Kb segments with sense amps, row decoder, and segment select/decode. Operations: Activate, Prefetch, Read, Restore, Write. The segment cache is software-managed and reduces energy.

DRAM Evolution: Internal Structure of Fast Cycle RAM. SDRAM: an 8M array with tRCD = 15ns (two clocks). FCRAM: the 8M array is subdivided (8K x 1Kb), giving tRCD = 5ns (one clock). This reduces access time and energy per access.

DRAM Evolution: Comparison of Low-Latency DRAM Cores (RAS, CAS, read, and precharge timing).
DRAM Evolution: Internal Structure of MoSys 1T-SRAM (many small banks).
Outline: next is Advanced Basics (Memory System Details, lots of them).

What Does This All Mean?

Cost/Benefit Criteria: package cost; interconnect cost; bandwidth; logic overhead; test and implementation; power consumption.

Memory System Design: I/O technology, chip packaging, topology, and DRAM chip architecture together define the memory system.

DRAM Interfaces, the Digital Fantasy: RAS latency, CAS latency, pipelined access. Pretend that the world looks like this. But...

The Real World: VDDQ/VSSQ pads, DQS, DQ0-15, and the controller side are analog circuits.

Signal Propagation: an ideal transmission line propagates at 0.66c, about 20 cm/ns; PC-board traces, module connectors, and varying electrical loads make it a rather non-ideal transmission line.

Clocking Issues: Figure 1, sliding timing between the source and the 0th through Nth loads; Figure 2, H-tree clock distribution. What kind of clocking system? Write data and read data travel in opposite directions on the bus, so we need different clocks for reads and writes.

Path-Length Differential: bus signal 1 and bus signal 2 arrive skewed. High-frequency AND wide parallel buses are difficult to implement.

Subdividing Wide Buses: split the bus into narrow channels, each with source-synchronous local clock signals.
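The propagation rule of thumb above (0.66c, roughly 20 cm/ns) is easy to check, and it gives a feel for why trace length matters at these bus speeds:

```python
# Signal flight time on an ideal PCB trace at ~0.66c (~20 cm/ns).

C_CM_PER_NS = 29.98            # speed of light in cm per ns
V_TRACE = 0.66 * C_CM_PER_NS   # ~19.8 cm/ns, matching the slide

def flight_time_ns(trace_cm: float) -> float:
    """One-way propagation delay for a trace of the given length."""
    return trace_cm / V_TRACE
```

A 20 cm trace (roughly controller to a far DIMM slot) costs about a nanosecond each way, a sizable fraction of a 7.5 ns SDRAM clock cycle.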
Why Subdivision Helps: worst-case skew applies within each sub-channel (chan 1, chan 2) rather than across the whole bus; worst-case skew must still be considered in system timing.

Timing Variations: a command from the controller sees different timing into 1 load vs. 4 loads. How many DIMMs in the system? How many devices on each DIMM? Who built the memory module? Infinite variations in timing.

Loading Balance: duplicate signal lines; variable signal drive strength.

Topology: DRAM system topology determines electrical loading conditions and signal propagation lengths.

SDRAM Topology Example: single-channel SDRAM, 64-bit data bus; loading imbalance.

RDRAM Topology Example: packets travel down parallel paths and the clock turns around at the far chip; skew is minimal by design.

I/O Technology: for a fixed slew rate, a smaller voltage swing Δv means a smaller Δt between logic high and logic low: increase the rate of bits/s/pin.

I/O, Differential Pair: a differential-pair transmission line instead of a single-ended one increases bits/s/pin, at a cost in per-pin cost and pin count.

I/O, Multi-Level Logic: multiple voltage ranges per symbol (e.g., a logic "11" range) also increase bits/s/pin.

Packaging: DIP (the good old days); SOJ (Small Outline J-lead). Features and target specification: speed; Vdd/Vddq 2.5V/2.5V to
1.8V; TSOP (Thin Small Outline Package); LQFP (Low-Profile Quad Flat Package); FBGA (Fine Ball Grid Array); FBGA at 800 Mbps, LQFP at 550 Mbps; interface SSTL2; row cycle time (tRC) 35ns. Memory roadmap for Hynix NetDDR II.

Access Protocol: a single-cycle command (command and data in one cycle) vs. a multiple-cycle command.

Access Protocol, read/read: RAS latency, then CAS latency, then pipelined access. Consecutive cache-line read requests to the same DRAM row: Active (open page) once, then Read Column commands, with data returned chunk by chunk.

Access Protocol, read/write: one datapath, two commands. Solution: delay the data of the write command to match the read latency.

Access Protocol, pipelines: three back-to-back pipelined read commands. The same latency at 2x the pin frequency means a deeper pipeline. When pin frequency increases, chips must either reduce real latency, support longer bursts, or pipeline more commands.

Outline: DRAM Evolution, Interface Path: SDRAM; DDR SDRAM; RDRAM; memory system comparisons; processor/memory system trends; RLDRAM, FCRAM, and DDR II memory systems; summary.

SDRAM System in Detail: single-channel SDRAM; data bus; mesh topology; chip/DIMM select.

SDRAM Chip: 133 MHz (7.5ns cycle time); multiplexed command/address bus; programmable burst length (1, 2, 4, or 8); TSOP; quad banks internally; supply voltage 3.3V; low-latency CAS 2 or 3; LVTTL signaling (0.8V to 2.0V thresholds, 0 to 3.3V rail to rail); 15 address, 7 command, 1 clock,
1 NC pin. Current and power by condition:
Operating, active burst, continuous: 300mA, about 1W.
Operating, active burst of 2: 170mA, 560mW.
Standby, active (all banks active): 60mA, 200mW.
Standby, power-down (all banks inactive): 2mA, 6.6mW.

SDRAM Access Protocol, read/read: back-to-back memory read accesses to different chips. Data returns from chip 0, then chip 1, separated by setup and hold time; clock cycles are still long enough to allow pipelined back-to-back reads.

SDRAM Access Protocol, write/read: Figure 1, consecutive reads: worst case is dist(N) then dist(0) for the N+1th request. Figure 2, read after write: worst case is dist(N) then dist(N+1), plus bus turnaround. A read following a write command to the same SDRAM device pays the turnaround.

DDR SDRAM System: address & command bus; data bus; DQS (data strobe); chip/DIMM select. Same topology as SDRAM.

DDR SDRAM Chip: 133 MHz (7.5ns cycle time); multiplexed command/address bus; programmable burst lengths (2, 4, or 8); quad banks internally; supply voltage 2.5V; low-latency CAS 2, 2.5, or 3; SSTL2 signaling (Vref ± 0.15V, 0 to 2.5V rail to rail); 16 power/ground, 16 data, 15 address, 7 command, and 2 clock pins; CASL 2; DQS preamble and postamble.

DDR SDRAM Protocol, read/read: back-to-back memory read accesses to different chips in DDR SDRAM, each data burst framed by the DQS preamble and postamble.

RDRAM System: tCWD (write delay) and tCAC (CAS access delay); two write commands followed by a read command. Packet protocol: everything moves in packets of 8
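The SDRAM current/power table above is consistent with the chip's 3.3 V supply: each power figure is roughly supply current times Vdd. A quick check:

```python
# Power = supply current x supply voltage for the 3.3 V SDRAM above.

VDD = 3.3  # volts

def power_mw(current_ma: float) -> float:
    """Power in mW for a given supply current in mA."""
    return current_ma * VDD
```

300 mA gives about 990 mW (the table's ~1 W), and the 2 mA power-down current gives exactly the table's 6.6 mW.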
half-cycles.

Direct RDRAM Chip: 400 MHz (2.5ns cycle time); 256 Mb; separate row and column command buses; burst length 8; FBGA; 4/16/32 banks internally; supply voltage 2.5V; low-latency CAS of 4 to 6 full cycles; RSL signaling (Vref ± 0.2V, 800 mV rail to rail). All packets (address, command, and data) are 8 half-cycles in length; the protocol allows near-100% bandwidth utilization on all channels.

RDRAM Drawbacks: high-frequency I/O raises test and package cost; active decode logic; an open row buffer means high power in the quiet state; RSL needs a separate power plane; about 30% die cost for logic at the 64 Mbit node; a single chip provides all data bits for each packet (power). Significant cost delta for the first generation.

System Comparison, SDRAM / DDR / RDRAM:
Frequency (MHz): 133 / 133x2 / 400x2.
Pin count, data bus: 64 / 64 / 16.
Pin count, controller: 102 / 101 / 33.
Theoretical bandwidth (MB/s): 1064 / 2128 / 1600.
Theoretical efficiency (data bits/cycle/pin): 0.63 / 0.63 / 0.48.
Sustained bandwidth (MB/s): 655 / 986 / 1072.
Sustained efficiency (data bits/cycle/pin): 0.39 / 0.29 / 0.32.
RAS + CAS (tRAC, ns): 45-50 / 45-50 / 57-67.
CAS latency (ns): 22-30 / 22-30 / 40-50.
(Sustained figures measured on a 133 MHz P6 chipset; 80 ns Stream/Add load-to-use latency.)

Differences of Philosophy. SDRAM variants: complex interconnect, inexpensive and simple interface logic. RDRAM variants: simplified interconnect, expensive and complex interface logic; the complexity is moved into the DRAM.

Technology Roadmap (ITRS), for 2004 / 2007 / 2010 / 2013 / 2016:
Semiconductor generation (nm): 90 / 65 / 45 / 32 / 22.
CPU (MHz): 3990 / 6740 / 12000 / 19000 / 29000.
Logic transistors (millions per cm^2): 77.2 / 154.3 / 309 / 617 / 1235.
High-performance chip pin count: 2263 / 3012 / 4009 / 5335 / 7100.
High-performance chip cost (cents/pin): 1.88 / 1.61 / 1.68 / 1.44 / 1.22.
Memory pin cost (cents/pin): 0.34 / 0.27 / 0.22 / 0.19 /
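Two columns of the system-comparison table above can be recomputed directly: theoretical bandwidth is transfer rate times data-bus width, and the quoted efficiency is data pins per controller pin (one transfer per clock edge). A sketch:

```python
# Theoretical bandwidth and per-pin efficiency for the SDRAM/DDR/RDRAM
# comparison table.

def peak_mb_s(mhz: float, transfers_per_cycle: int, data_pins: int) -> float:
    """Peak bandwidth in MB/s: clock edges/sec x bits/edge / 8."""
    return mhz * transfers_per_cycle * data_pins / 8.0

def efficiency(data_pins: int, ctrl_pins: int) -> float:
    """Data bits moved per clock edge per controller pin."""
    return data_pins / ctrl_pins
```

peak_mb_s(133, 1, 64) reproduces SDRAM's 1064 MB/s, peak_mb_s(400, 2, 16) RDRAM's 1600 MB/s; efficiency(64, 102) rounds to 0.63 and efficiency(16, 33) to 0.48, matching the table.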
0.19.
Memory pin count: 48-160 / 48-160 / 62-208 / 81-270 / 105-351.
The trend: free transistors, costly interconnects.

Choices for the Future:
Direct-connect custom DRAM: highest bandwidth, low latency.
Direct-connect semi-commodity DRAM: high bandwidth, low-to-moderate latency.
Direct-connect commodity DRAM: low bandwidth, low latency.
Indirect connection through a controller to many commodity DRAMs: highest bandwidth, highest latency, inexpensive.

EV7 RDRAM (Compaq/HP): RDRAM memory; 2 controllers; direct connection to the processor; 75ns load-to-use latency; 12.8 GB/s peak bandwidth; 6 GB/s read or write bandwidth; 2048 open pages (2 x 32 x 32); each column read fetches 128 b data.

What if EV7 Used DDR? Matching the 12.8 GB/s peak bandwidth takes 6 channels of 133x2 MHz DDR SDRAM: either 6 controllers with 64-bit-wide channels, or 3 controllers with 128-bit-wide channels.
EV7 RDRAM / 6-controller DDR / 3-controller DDR:
Latency: 75 ns / 50 ns / 50 ns (page-hit CAS plus memory-controller latency).
Pin count: 265 + pwr/gnd / 600 + pwr/gnd / 600 + pwr/gnd (including all signals: address, command, data, clock; not including ECC or parity).
Controller count: 2 / 6 / 3.
Open pages: 2048 / 144 / 72.
The 3-controller design is less bandwidth-efficient.

What's Next? DDR II, FCRAM, RLDRAM, RDRAM (Yellowstone), etc.; Kentron QBM.

DDR II (DDR next generation): lower voltage (1.8V; SDRAM 3.3V, DDR 2.5V); the DRAM core operates at 1/4 of the data-bus frequency (SDRAM 1/1, DDR 1/2); backward-compatible common modules possible, 400 Mbps multidrop or 800 Mbps point-to-point; no more page transfer-until-interrupted commands (removes a speedpath); FBGA package; burst length of 4 only; 4 banks internally; write latency of CAS - 1 (rather than fixed as in SDRAM and DDR) for increased bus utilization.
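The EV7 what-if above is a channel-count calculation: how many DDR channels does it take to reach a 12.8 GB/s peak? A sketch (133x2 MHz is written here as a 400/3 MHz clock, double data rate):

```python
# DDR channels needed to match a target peak bandwidth.
import math

def ddr_channels_needed(target_gb_s, clock_mhz, width_bits):
    per_channel = clock_mhz * 2 * width_bits / 8 / 1000.0  # GB/s per channel
    # Subtract a tiny epsilon so exact matches don't round up on float dust.
    return math.ceil(target_gb_s / per_channel - 1e-9)
```

ddr_channels_needed(12.8, 400/3, 64) gives the slide's six 64-bit channels; doubling the channel width halves the count to the 3-controller option.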
DDR II Continued, Posted Commands. SDRAM & DDR: the memory controller must know tRCD and issue the CAS itself after tRCD to get the lowest latency. DDR II: a posted CAS can be issued right after the Active (RAS); an internal counter delays it, and the DRAM chip issues the real column command after tRCD for the lowest latency.

FCRAM (Fast Cycle RAM), aka Network DRAM. Features, DDR SDRAM vs. FCRAM/Network DRAM:
Vdd/Vddq: 2.5 ± 0.2V vs. 2.5 ± 0.15V.
Electrical interface: SSTL2 vs. SSTL2.
Clock frequency: 100-167 MHz vs. 154-200 MHz.
tRAC: 40ns vs. 22-26ns.
tRC: 60ns vs. 25-30ns.
Banks: 4 vs. 4.
Burst length: 2/4/8 vs. 2/4.
Write latency: 1 clock vs. CASL - 1.
FCRAM/Network DRAM looks like DDR.

FCRAM Continued: 167MHz DDR vs. 200MHz Network DRAM (DDR400, RDRAM 1066): data retires faster; the shorter tRC allows Samsung to claim higher bus efficiency. (Samsung Electronics, Denali MemCon 2002.)

RLDRAM. Peak bandwidth per chip (DRAM type: frequency, bus width per chip, peak bandwidth, random access time tRAC, row cycle time tRC):
PC133 SDRAM: 133 MHz, x16, 200 MB/s, 45 ns, 60 ns.
DDR 266: 133x2 MHz, x16, 532 MB/s, 45 ns, 60 ns.
PC800 RDRAM: 400x2 MHz, x16, 1.6 GB/s, 60 ns, 70 ns.
FCRAM: 200x2 MHz, x16, 0.8 GB/s, 25 ns, 25 ns.
RLDRAM: 300x2 MHz, x32, 2.4 GB/s, 25 ns, 25 ns.
RLDRAM is comparable to FCRAM in latency, with higher frequency, no connectors, and a non-multiplexed, SRAM-like address bus.

RLDRAM Continued: targets the high-end PC and server space with low power and low cost. "RLDRAM is a great replacement to SRAM in L3 cache applications because of its high density." (Infineon presentation, Denali MemCon 2002.)

RAMBUS Yellowstone: bidirectional differential signals; ultra-low 200mV peak-to-peak signal swings; 8 data bits transferred per clock; 400 MHz system clock, 3.2 GHz effective data frequency; cheap 4-layer PCB; commodity packaging.
Clock and data swing between 1.0V and 1.2V: Octal Data Rate (ODR) signaling.

Kentron QBM (Quad Band Memory): wrapper electronics around DDR memory; clock-driven switches at the output generate 4 data bits per pin per cycle instead of 2.

A Different Perspective: everything is bandwidth; clock, row command/address bandwidth, column command/address bandwidth, write data bandwidth, and read data bandwidth. Latency and bandwidth; pin bandwidth and pin-transition efficiency (bits/cycle/sec).

Research Areas, Topology: a unidirectional topology; write packets sent on the command bus; pins used interchangeably for command/address/data; a further increase of logic on DRAM chips.

Research Areas, Memory Commands: instead of A = 0, do a "write 0" in memory; why do A = B in the CPU? Move data inside a DRAM or between DRAMs. Why do STREAM-add (A[i] = B[i] + C[i]) in the CPU? Active Pages (Chong et al., ISCA '98).

Research Areas, Address Mapping and Scheduling: access distribution for temperature control; avoiding bank conflicts; access reordering for performance.

Example, Bank Conflicts. Multiple-bank memory reduces array-access conflicts:
Read 0x05AE5700: device 3, row 0x266, bank 0.
Read 0x02333880: device 3, row 0x13A, bank 0.
Read 0x05AE5780: device 3, row 0x266, bank 0.
Read 0x00C3A2C0: device 3, row 0x052, bank 1.
More banks per chip trades performance against logic overhead.

Example, Access Reordering:
Read 0x05AE5700: device 3, row 0x266, bank 0.
Read 0x02333880: device 3, row 0x13A, bank 0.
Read 0x05AE5780: device 3, row 0x266, bank 0.
Read 0x00C3A2C0: device 1, row 0x052, bank 1.
The memory accesses are reordered. Act = Activate page (data moved from the DRAM cells to the row buffer); Read = data moved from the row buffer to the memory controller; Prec = Precharge (close the page, evicting the data in the row buffer/sense amps).

Outline: Performance Modeling (Architectures, Systems, Embedded).

Simulator Overview:
CPU: SimpleScalar v3.0a, 8-way out-of-order.
L1 cache: split 64K/64K, lockup-free (x32).
L2 cache: unified 1MB, lockup-free (x1); L2 block size 128 bytes.
Main memory: eight 64Mb DRAMs on a 100MHz/128-bit memory bus; optimistic open-page policy.
Benchmarks: SPEC 95.

DRAM Configurations: FPM, EDO, SDRAM, ESDRAM, and DDR use x16 DRAMs on a DIMM, with the CPU and caches reaching the memory controller over a 128-bit 100MHz bus. Rambus, Direct Rambus, and SLDRAM use a fast, narrow channel. Note: the transfer width of a Direct Rambus channel equals that of the ganged FPM, EDO, etc., and is 2x that of Rambus and SLDRAM.

DRAM Configurations: Rambus and SLDRAM are also modeled dual-channel (fast narrow channel x2), plus a strawman with multiple parallel Rambus-style channels.

First, Refresh Matters (compress benchmark). Time per access (ns) is broken into bus wait time, refresh time, data transfer time, row access time, data transfer time overlap, column access time, and bus transmission time, across the FPM, EDO, SDRAM, ESDRAM, SLDRAM, RDRAM, and DRDRAM configurations.
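The access-reordering example above can be sketched as a stable sort that groups requests by device and bank, then by row, so the two reads to row 0x266 issue back to back. A real scheduler also respects read/write ordering constraints, which this sketch ignores:

```python
# Group (device, row, bank) requests so that same-row accesses within a
# bank become adjacent. Python's sort is stable, so the original order
# is preserved within each group.

def reorder(requests):
    return sorted(requests, key=lambda r: (r[0], r[2], r[1]))

# The trace from the slide: (device, row, bank).
trace = [(3, 0x266, 0), (3, 0x13A, 0), (3, 0x266, 0), (1, 0x052, 1)]
```

reorder(trace) places both row-0x266 reads together, turning one of the two row activations into a page hit.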
(DRAM configurations; each bank is assumed to be refreshed every 64ms.)

Total Execution Time in CPI (SDRAM overhead, memory vs. CPU). Clocks per instruction are split into stalls due to memory access time, overlap between execution & memory, and processor execution (which includes the caches), for compress, gcc, go, ijpeg, li, m88ksim, perl, and vortex, at variable processor & cache speeds.

Definitions (a variation on Burger et al.):
tPROC = processor with perfect memory.
tREAL = realistic configuration.
tBW = CPU with wide memory paths.
tDRAM = time seen by the DRAM system.
Stalls due to bandwidth = tREAL - tBW.
Stalls due to latency = tBW - tPROC.
CPU/memory overlap = tPROC - (tREAL - tDRAM).
CPU + L1 + L2 execution = tREAL - tDRAM.

Memory & CPU (perl), Bandwidth-Enhancing Techniques I: CPI split into stalls due to memory bandwidth, stalls due to memory latency, overlap between execution & memory, and processor execution, for FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, and DDR; the newer DRAMs fare better.

Memory & CPU (perl), Bandwidth-Enhancing Techniques II: the same breakdown for SDRAM & DDR, SLDRAM x1/x2, and RDRAM x1/x2 with 10GHz CPUs.

Average Latency of DRAMs: average time per access (ns) split into bus wait time, refresh time, data transfer time, data transfer time overlap, column access time, row access time, and bus transmission time, for FPM, EDO, SLDRAM, RDRAM, SDRAM, DRDRAM, ESDRAM, and DDR. Note: SLDRAM & RDRAM use 2x data transfers.
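The Burger-style decomposition above can be computed directly from the four measured times; the four components stack to tREAL, which is what the bar charts plot:

```python
# Execution-time decomposition from the definitions above.
# Arguments are total times for the same run under the four machine
# models (units are whatever the simulator reports, e.g. cycles).

def decompose(t_proc, t_bw, t_real, t_dram):
    return {
        "bandwidth_stalls": t_real - t_bw,
        "latency_stalls": t_bw - t_proc,
        "overlap": t_proc - (t_real - t_dram),
        "execution": t_real - t_dram,
    }
```

With example values t_proc=100, t_bw=120, t_real=150, t_dram=70 the parts are 30, 20, 20, and 80, which sum to t_real=150.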
DDR2 Study Results, Architectural Comparison: normalized execution time for pc100, ddr133, drd, rldram, ddr2, ddr2ems, and ddr2vc across cc1, compress, go, ijpeg, li, linear walk, mpeg2dec, mpeg2enc, perl, random walk, and stream benchmarks.

DDR2 Study Results, Perl Runtime: pc100, ddr133, drd, rldram, ddr2ems, and ddr2vc at 1 GHz, 5 GHz, and 10 GHz processor frequencies.

Row-Buffer Hit Rates: SPEC 95 benchmarks with no L2 cache, a 1MB L2 cache, and a 4MB L2 cache.

Row-Buffer Hit Rates, Hits vs. Depth in the Victim-Row FIFO Buffer: go, li, vortex, compress, and perl, plotted against inter-arrival time (CPU clocks).

Row Buffers as an L2 Cache: CPI split into stalls due to memory latency, overlap between execution & memory, and processor execution for compress, gcc, go, ijpeg, li, m88ksim, perl, and vortex.

Row Buffer Management: row access vs. column access.
Figure: row decoder, memory array, bit lines, sense amps, column decoder. RAS is like a cache access: why not speculate?

Cost-Performance. FPM, EDO, SDRAM, ESDRAM: lower latency via a wide, fast bus; increasing capacity decreases latency; low system cost. Rambus, Direct Rambus, SLDRAM: lower latency via multiple channels; increasing capacity increases capacity; high system cost. However, one DRDRAM = multiple SDRAMs.

Conclusions: the 100MHz/128-bit bus is the current bottleneck. Solution: fast bus(es) and the memory controller on the CPU (e.g., Alpha 21364, Emotion Engine). Current DRAMs are solving the bandwidth problem but not the latency problem. Solutions: new cores with on-chip SRAM (e.g., ESDRAM, VCDRAM) and new cores with smaller banks (e.g., MoSys 1T-SRAM, FCRAM). Direct Rambus seems to scale best for future high-speed CPUs.

Outline: Performance Modeling, continued.

Motivation: even when we restrict our focus, the system-level parameters are numerous: width of channels, channel bandwidth, turnaround time, request reordering, column-access (CAS-to-CAS) latency, L2 cache block size, bus protocol, number of channels, channel latency, banks per channel, request-queue size, row access, DRAM precharge, DRAM buffering, and number of MSHRs; they are fully, partially, or not at all independent (this study).

Motivation: the design space is highly nonlinear. Cycles per instruction for 32-, 64-, and 128-byte bursts, plotted against system bandwidth in GB/s (channels x width).
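The x-axis of the design-space plots above is just channels times channel width times the clock (800 MHz throughout this study). A one-line helper makes the axis values concrete:

```python
# System bandwidth for the design-space plots: channels x width x clock.

CLOCK_MHZ = 800  # channel clock used throughout the study

def system_gb_s(channels: int, width_bits: int) -> float:
    """Aggregate bandwidth in GB/s."""
    return channels * width_bits * CLOCK_MHZ / 8 / 1000.0
```

One 16-bit channel gives 1.6 GB/s; four 64-bit channels give 25.6 GB/s, spanning the range the plots sweep.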
(Channels run at 800 MHz.)

Motivation: and the cost of poor judgment is high. Cycles per instruction for the worst, average, and best organization on bzip, gcc, mcf, parser, perl, vpr, and their average (SPEC 2000 benchmarks).

System-Level Model, SDRAM Timing: clock and command, with row access, column access, transfer overlap, and data transfer.

System-Level Model: timing diagrams are at the DRAM level, not the system level. While valid data moves, the DRAM bank, the address bus (ABUS), and the data bus (DBUS) are each active for different windows; with split data buses, DBUS1 and DBUS2 are active separately.

Request Timing: frontside bus and backside bus; address and data buses at 800 MHz; control carries the row/column addresses. Read request timing: the address bus, then the DRAM bank (ROW, COL, PRE), then the data bus.

Read/Write Request Shapes: a read occupies the address bus, the DRAM bank, and the data bus in a roughly 70ns shape; a write has a different shape, holding its resources for under 40ns.

Pipelined / Split Transactions:
(a) Back-to-back reads are legal if the reads go to different banks.
(b) Nestling writes inside reads is legal if the read and write go to different banks; legal if the turnaround is under 10ns; legal if no turnaround is needed.
(c) A back-to-back read/write pair that cannot be nested.

[Configuration space:] one, two, or four independent channels, each with banking degrees of 1, 2, 4:
- 1, 2, 4 channels at 800 MHz
- 8, 16, 32, 64 data bits per channel
- 1, 2, 4, 8 independent banks per channel
- 32, 64, 128 bytes per burst

Burst Scheduling
Back-to-back read requests broken into 128-byte, 64-byte, or 32-byte bursts.
- Critical-burst-first: non-critical bursts are promoted
- Writes have lowest priority and tend to back up in the request queue
- Tension between large and small bursts: amortization vs. faster time to data

New Bar-Chart Definition
- tPROC: CPU with 1-cycle L2 miss
- tREAL: realistic CPU/DRAM configuration
- tSYS: CPU with 1-cycle DRAM latency
- tDRAM: time seen by the DRAM system
[Stacked bars decompose tREAL into: stalls due to DRAM latency, stalls due to system queue/bus overhead, CPU-memory overlap (tPROC - (tREAL - tDRAM)), and CPU + L1/L2 execution time.]

System Overhead
Bars: regular bus organization, 0-cycle bus turnaround, perfect. [Y-axis: cycles per instruction; x-axis: system bandwidth (GB/s) = channels x width x speed. Benchmark BZIP, SPEC 2000, 32-byte burst, 16-bit bus.]
System overhead is 10-100% over a perfect memory.

Concurrency Effects
Bars: 1, 2, 4, and 8 banks per channel. [Y-axis: cycles per instruction; x-axis: system bandwidth (GB/s) = channels x width x speed. Benchmark BZIP, SPEC 2000, 32-byte burst, 16-bit bus.]
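The burst-scheduling rules above (critical burst first, writes last) can be sketched as a toy priority queue. The class and tag names are made up for illustration; they are not from the tutorial.

```python
# Toy request queue for the scheduling rules above: critical bursts
# first, then remaining read bursts, writes last.
import heapq

CRITICAL, READ, WRITE = 0, 1, 2   # smaller value = higher priority

class BurstQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0             # FIFO order among equal priorities

    def add(self, priority, tag):
        heapq.heappush(self._heap, (priority, self._seq, tag))
        self._seq += 1

    def next_burst(self):
        return heapq.heappop(self._heap)[2]

q = BurstQueue()
q.add(WRITE, "w0")                # write arrives first...
q.add(READ, "r0-noncritical")
q.add(CRITICAL, "r0-critical")
order = [q.next_burst() for _ in range(3)]
# the critical burst is serviced first; the write backs up behind the reads
```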
Banks/channel is as significant as channel bandwidth.

Bandwidth vs. Burst Width
Bars: 32-byte, 64-byte, and 128-byte bursts. [Y-axis: cycles per instruction; x-axis: system bandwidth (GB/s) = channels x width at 800 MHz. Benchmark GCC, SPEC 2000, 2 banks/channel.]

Bandwidth vs. Burst Width
Wide channels (32/64-bit) want large bursts. [Same axes and configuration as above.]

Bandwidth vs. Burst Width
Narrow channels (8-bit). [Same axes; bandwidth points from 1.6 to 25.6 GB/s. Y-axis: cycles per instruction; x-axis: system bandwidth (GB/s) = channels x width at 800 MHz. Benchmark GCC, SPEC 2000, 2 banks/channel.]
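The wide-vs.-narrow channel observations above follow from simple data-bus occupancy. This back-of-envelope sketch assumes one transfer per clock at the 800 MHz channel rate used throughout these slides; real parts add command and turnaround overhead.

```python
# Rough data-bus occupancy of one burst for the channel widths and
# burst sizes modeled above, assuming one transfer per 800 MHz clock
# (1.25 ns per beat).

def burst_bus_time_ns(burst_bytes, channel_bits, clock_mhz=800):
    beats = burst_bytes // (channel_bits // 8)   # beats on the data bus
    return beats * 1000.0 / clock_mhz

# A 32-byte burst holds a 64-bit channel for 5 ns but an 8-bit channel
# for 40 ns, which is why wide channels want large bursts while narrow
# channels can live with small ones.
```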
Bandwidth vs. Burst Width
Medium channels (16-bit) want medium bursts. [Same axes; benchmark GCC, SPEC 2000, 2 banks/channel.]

Burst Width Scales with Bus
Range of burst widths modeled (address bus ~10 ns; DRAM bank; data bus; ~70 ns per access within a 160 ns window):
- 64-bit channel x 32-, 64-, and 128-byte bursts
- 32-bit channel x 32-, 64-, and 128-byte bursts
- 16-bit channel x 32-, 64-, and 128-byte bursts
- 8-bit channel x 32-, 64-, and 128-byte bursts
The useful burst width scales with the bus width.

[Chart: 32-, 64-, and 128-byte bursts; y-axis: cycles per instruction; x-axis: 3.2 GB/s system bandwidth (channels x width x speed).]

Focus on 3.2 GB/s: MCF
Bars: 1, 2, 4, and 8 banks per channel. [Y-axis: cycles per instruction; x-axis: burst size at 3.2 GB/s system bandwidth (channels x width x speed).]
- Banks are not particularly important given large burst sizes
- ... and even less so with multichannel systems
- Multichannel systems help, but not always
- Channel splits compared: 4 x 1-byte channels, 2 x 2-byte channels, 1 x 4-byte channel

Focus on 3.2 GB/s: BZIP
Bars: 1, 2, 4, and 8 banks per channel; 32-, 64-, and 128-byte bursts. [Y-axis: cycles per instruction; x-axis: 3.2 GB/s system bandwidth (channels x width x speed).]
The best configurations are at the smaller burst sizes.
Queue Size & Reordering
BZIP, 1.6 GB/s, 1 channel. Bars: infinite queue, 1-entry queue, 32-entry queue. [Y-axis: cycles per instruction; x-axis: 32-, 64-, and 128-byte bursts.]

Conclusions
- The design space is NONLINEAR; the cost of misjudging is HIGH
- Careful tuning yields a 30-40% gain
- More concurrency is better, but not at the expense of latency: via channels -> always safe; via banks -> doesn't pay off; via bursts -> not with large bursts; via MSHRs -> necessary
- Bursts amortize the cost of precharge; typical systems use 32 bytes (even DDR2) -> THIS IS NOT ENOUGH

Outline
Basics | DRAM Evolution: Structural Path | Advanced Basics | DRAM Evolution: Interface Path | Future Interface Trends & Research Areas | Performance Modeling: Architectures, Systems, Embedded

Embedded DRAM Primer
[Die photo: logic plus DRAM array; a "Not Embedded" part shown for contrast.]

Whither Embedded DRAM?
Microprocessor Report, August 1996: "Five Architects Look to Processors of Future."
- Two predict an imminent merger of CPU and DRAM
- Another states we cannot keep cramming more data over the pins at faster rates (implication: embedded DRAM)
- A fourth wants a gigantic on-chip L3 cache, perhaps a DRAM L3 implementation
SO WHAT HAPPENED?

Embedded DRAM for DSPs: Motivation
Tagless SRAM vs. traditional cache:
- Traditional cache: hardware-managed; the cache covers the entire address space, so any datum in the space may be cached; TRANSPARENT addressing; TRANSPARENTLY MANAGED contents
- Tagless SRAM: software manages the movement of data; the address space includes both memory space (cache and primary memory) and memory-mapped I/O, with the original NON-TRANSPARENT addressing; EXPLICITLY MANAGED contents
DSP compilers -> transparent cache model.

DSP Buffer Organization Used for Study
[Diagram: LdSt0, LdSt1 -> DSP with a fully associative 4-block cache.] Bandwidth vs. die-area trade-off for DSP performance.

EDRAM Performance
Embedded networking benchmark: Patricia. 200 MHz C6000 DSP; 50, 100, and 200 MHz memory. Cache line sizes: 32, 64, 128, 256, 512, 1024 bytes. [Three charts, one per memory speed; y-axis: CPI; x-axis: increasing bus width.]

Performance-Data Sources
- "A Performance Study of Contemporary DRAM Architectures," Proc. ISCA '99. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.
- "Organizational Design Trade-Offs at the DRAM, Memory Bus, and Memory Controller Level: Initial Results," University of Maryland Technical Report UMD-SCA-TR-1999-2. V. Cuppu and B. Jacob.
- "DDR2 and Low Latency Variants," Memory Wall Workshop 2000, in
conjunction with ISCA '00. B. Davis, T. Mudge, V. Cuppu, and B. Jacob.
- "Concurrency, Latency, or System Overhead: Which Has the Largest Impact on DRAM-System Performance?" Proc. ISCA '01. V. Cuppu and B. Jacob.
- "Transparent Data-Memory Organizations for Digital Signal Processors," Proc. CASES '01. S. Srinivasan, V. Cuppu, and B. Jacob.
- "High Performance DRAMs in Workstation Environments," IEEE Transactions on Computers, November 2001. V. Cuppu, B. Jacob, B. Davis, and T. Mudge.
- Recent experiments by Sadagopan Srinivasan, PhD student at University of Maryland.

Module: Superscalar Processors and Hardware Scheduling, Part 1
[garbled attribution line]

Reading for this Module
- Pipeline review and multi-function pipelines: Sections A.1, A.2, A.5
- The MIPS64 ISA: Section 2.12
- Dynamic scheduling in hardware: the scoreboard, Section A.8; Tomasulo's algorithm, Sections 3.1, 3.2, 3.3

Topical Map for this Module
- MIPS ISA; review of pipelining basics
- Basics of multiple instruction issue: from dependencies to hazards; taxonomy of multiple-issue designs
- Hardware out-of-order issue techniques: scoreboard; Tomasulo's algorithm

The MIPS64 Instruction Set Architecture
- Load/store architecture: all operations are on data stored in registers; memory operations only move data to/from registers
- Instruction formats: simple, fixed-field, small in number; these lead to simplifications in the handling of pipelined processors and uniformity of instruction decoding and control
[Format diagram: opcode plus register fields rs, rt, rd; an immediate field for immediate operations, memory operations, and unconditional branches.]

The Register Files
- General-purpose registers (GPRs): 64-bit registers; all integer operations are 64-bit; smaller data types are loaded and sign-extended
- Floating-point registers (FPRs): instructions move data between GPRs and FPRs; separate instructions convert between integer and floating point; two concurrent single-precision operations on the contents of an FP register

MIPS64 Registers and Data Types
- 32 64-bit general-purpose registers; R0 hardwired to 0
- 32 double-precision floating-point registers; one half of each register is unused for single precision
- 8-, 16-, 32-, and 64-bit data types: use the 64-bit registers with sign-extension and operate using 64-bit instructions
- Only immediate and displacement addressing modes, encoded in the instruction types [bit layout: MSB ... LSB, bits 0-63]

The Memory System
- Memory is byte-addressable with 64-bit addresses
- Programmable little- vs. big-endian mode
- All memory accesses must be aligned
- Branches and jumps are relative to the PC: 28-bit PC-relative jump addresses, 18-bit PC-relative branch addresses

Sample Instructions
- LD R1, 64(R2): displacement addressing mode; load/store instructions exist for multiple operand sizes
- DADD R4, R1, R2 and DMUL R4, R5, R6: register-to-register operations
- SW R6, 22(R2): indexed addressing mode
- BNE R1, R2, LOOP and SLT R1, R2, R3: general-purpose registers store the results of relational operations; comparison can be used to synthesize all manner of conditions

A Review of Pipelining
Stages of an instruction-processing pipeline:
- IF: instruction fetch
- ID: instruction decode and operand read
- EX: instruction execute
- MEM: memory read/write
- WB: result writeback

Without Pipelining
[Timing diagram: each instruction (e.g., DADD R1, R2, R3) runs IF-ID-EX-MEM-WB to completion before the next begins; program execution order flows down.]

With Pipelining
[Timing diagram: successive instructions enter the pipeline one cycle apart, overlapping their stages.]

Synchronous Pipelining
- High clock frequencies are achieved with deeper pipelines
- At higher frequencies, clock skew and latch delay become a bigger percentage of the clock cycle, so the percentage of the cycle used by logic is reduced
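The two timing diagrams above can be put in formula form for an idealized k-stage pipeline with no stalls:

```python
# Idealized cycle counts behind the two diagrams above: without
# pipelining, each instruction takes all k stages before the next
# starts; with pipelining, instructions enter one cycle apart, so
# n instructions finish in k + n - 1 cycles instead of k * n.

def unpipelined_cycles(n, k=5):
    return n * k

def pipelined_cycles(n, k=5):
    return k + n - 1 if n else 0

# Three instructions on the 5-stage IF/ID/EX/MEM/WB pipeline:
# 15 cycles unpipelined vs. 7 cycles pipelined.
```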
The Problem with Dependencies
[Pipeline diagram: an instruction writing R1 is followed immediately by a reader of R1; the reader reaches its operand-fetch stage before the correct result has been written back.]

Static Scheduling Solution
[The compiler inserts NOPs between the writer of R1 and its reader until the result is available.]

Hardware Solution
[Same sequence, but the pipeline is stalled: the dependent instruction repeats its stage, making no progress, until R1 is written.]
Interlock condition: if D.src1 or D.src2 equals EX.dest or MEM.dest, then disable the latches over the IF and ID stages.
[Datapath: Fetch -> DECODE -> EX -> MEM -> WB, with IF/ID, EX/MEM, and MEM/WB latches.]

An Important Trick
- Internal forwarding: move the result directly from the ALU into one of the two source ports of the ALU
- Significantly reduces the number of stall cycles
- Will not work for load instructions; hardware stalling is still needed for these

Internal Forwarding
[Timing diagram: the R1 result is forwarded from EX to the dependent instruction's EX stage with no stall.]

Implementation of Forwarding
- Provide data paths from all possible sources of data, and pipeline control to select the correct operands
- Note that an instruction immediately dependent on a LD instruction still introduces a stall cycle
[Datapath: forwarding paths from the EX/MEM and MEM/WB stages back into the EX-stage inputs, selected by DECODE/EX control.]

Performance
Goal: exploit instruction-level parallelism to execute multiple instructions per cycle.
[Spectrum: single-cycle datapath -> multi-cycle (CPI > 1) -> pipelined (CPI -> 1) -> superscalar/VLIW (CPI < 1), exploiting instruction-level parallelism.]
- Architectural and compiler techniques systematically increase the number of instructions executed per clock cycle
- Both hardware and compiler techniques are dependent on the instruction set architecture (ISA): the set of instructions and the resources they manipulate
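The interlock condition on the slide, plus the load-use exception under forwarding, can be written as one small predicate. The function name and argument shapes are mine, not from the slides:

```python
# The slide's interlock as a predicate: stall the instruction in ID
# when one of its sources matches a destination still in EX or MEM.
# With internal forwarding, only the load-use case still stalls,
# since a load's result is not available until after MEM.

def must_stall(id_srcs, ex_dest, mem_dest,
               forwarding=False, ex_is_load=False):
    hazard_ex = ex_dest in id_srcs
    hazard_mem = mem_dest in id_srcs
    if not forwarding:
        return hazard_ex or hazard_mem     # disable the IF and ID latches
    return hazard_ex and ex_is_load        # forwarding covers the rest

# DADD R1,R2,R3 in EX, DSUB R4,R1,R5 in ID:
# stalls without forwarding, proceeds with it.
```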
- The ISA is the interface between the hardware and the compiler

Multiple Issue Machines
- Pipelining allows multiple instructions at different stages of execution
- Advances in VLSI made it possible to put more functional units onto a single processor
- But programs are written for a von Neumann model: single PC, single instruction

Machine Model
[Diagram: integer and non-pipelined floating-point functional units (e.g., FP divide) fed by a multiple-issue front end.]
- Latency: cycles between an instruction that produces a result and one that consumes the result
- Initiation interval: rate at which instructions can be issued to a functional unit

This Module
[Diagram: from sequential code to dataflow execution; the work is split between compiler and hardware.]

Terminology
- In-order: events happen in the order dictated by the program's sequential semantics
- Instruction issue: movement of an instruction from the decode stage into the execute stage
- Instruction completion: committing of results to the physical destination register (non-standard term)

Flynn's Bottleneck
- Attributed to Michael Flynn of Stanford
- The throughput of a processor (the number of instructions executed per unit time) cannot exceed the rate of instruction issue

Instruction-Level Parallelism
- Parallelism at the level of machine instructions: the number of parallel machine instructions that can be issued in a single cycle in a processor
- ILP is limited, and ILP is not evenly distributed

Limitations to ILP
- Data dependences
- Control dependences
- Resource constraints
- Capability of the compiler to expose ILP and to effectively utilize CPU resources

Hazards
Dependencies and resource constraints manifest themselves as potential hazards during pipeline execution:
- Data hazards: an ordering constraint between a pair of instructions in the pipeline; three types: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW)
- Structural hazards: two instructions attempting to concurrently access the same hardware component (functional units, register-file ports, buses)

Read-After-Write (RAW) Hazard
Follows from data dependencies in the program:
  LD   R4, 32(R3)
  DADD R5, R4, R2    ; data dependency through R4
Data dependencies may flow through registers or through memory, e.g. a store whose address a later load also references:
  SD R4, 32(R3)
  LD R6, 44(R5)      ; dependent if the two effective addresses match

Write-After-Read (WAR) Hazard
Follows from name dependencies: two instructions share the same physical resource (i.e., name), for example a register, but do not have a data dependence. One type is the antidependence: instruction j tries to write an operand before instruction i reads it.
  i: DADD R1, R2, R3
  j: DSUB R2, R4, R3

Write-After-Write (WAW) Hazard
Output dependence: instruction j tries to write an operand before instruction i writes it.
  i: DADD R4, R2, R3
  j: DSUB R4, R1, R3

Control Dependencies
Control dependencies determine the execution ordering of program instructions, and thereby which data dependencies are to be honored at runtime:
  DADD R5, R6, R7
  BNE  R4, R2, CONTINUE
  DMUL R4, R2, R5
  DSUB R4, R9, R5

Resource Constraints
Number of registers; number of ports to registers and cache; memory bandwidth; number of parallel instruction decoders; number of functional units of various types; number of data paths between various CPU components: all of these limit realizable ILP.

A Taxonomy of ILP Processors
[Canonical model diagram.]

Superpipelined Processors
[Timing: successive instructions offset by a fraction of a base cycle.]

VLIW
[Timing: independent operations packed into one long instruction word; an independence architecture.]

Superscalar Processors
[Timing: multiple instructions issued per cycle from a sequential instruction stream; a sequential architecture.]

In-order Issue, In-order Completion
[Timing: successive instructions make synchronized progress through the pipeline.] Should there be a hazard (dependences or resource constraints), the entire decoding process stalls.
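The three data-hazard classes defined above can be detected mechanically between an older instruction i and a younger instruction j, each viewed as a set of registers read and a set written. The helper name and encoding are mine:

```python
# Detect the three data-hazard classes between an older and a younger
# instruction, each given as a (reads, writes) pair of register sets.

def hazards(older, younger):
    i_reads, i_writes = older
    j_reads, j_writes = younger
    found = set()
    if i_writes & j_reads:
        found.add("RAW")          # j reads what i writes (flow)
    if i_reads & j_writes:
        found.add("WAR")          # j writes what i reads (anti)
    if i_writes & j_writes:
        found.add("WAW")          # both write the same name (output)
    return found

# The WAR example above: DADD R1,R2,R3 followed by DSUB R2,R4,R3.
dadd = ({"R2", "R3"}, {"R1"})
dsub = ({"R4", "R3"}, {"R2"})
```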
- Otherwise, instructions would execute out of order, with no means of tracking output and antidependences
- Straightforward design

In-order Issue, Out-of-order Completion
[Timing diagram: completion order differs from issue order.]
- As soon as execution finishes, a result may be completed (written back) even if out of order
- Besides ensuring flow dependences, extra hardware is needed to take care of output dependences
- Two instructions writing to the same register may complete out of order, potentially violating output dependences if nothing is done about it

In-order Issue, Out-of-order Completion: Example
[Code sequence: instruction 1 is a long-latency DMUL; instruction 3 writes a register that instruction 1 also writes; a later DADD R16, R6, R20 follows. An output dependency links the two writers.]
- If instruction 1 completes after instruction 3, it will lead to incorrect results
- Output dependencies (WAW hazards) must be enforced to ensure correct execution
- WAR hazards cannot occur

A Big Problem: Precise Exceptions
An exception is precise if the following conditions are met:
- All the instructions preceding the instruction causing the exception have been executed and have modified the process state correctly
- All instructions following the instruction causing the exception have not yet been executed and have done no modification to the process state
- The instruction that caused the exception may or may not have been executed; this depends on the definition of the architecture and the cause of the exception

Machine Model
[Diagram: non-pipelined functional units; different instructions may be in various stages of execution.]

Precise Exceptions (cont.)
- An exception is imprecise if it is not precise
- A precise exception requires in-order execution semantics
- Certain important exceptions have to be precise
- We will return to support for precise exceptions in out-of-order completion machines later
Out-of-order Issue, Out-of-order Completion
[Timing diagram: successive instructions drawn from an instruction window.]
- In-order issue stops as soon as a conflict arises: no lookahead
- Out-of-order issue: as soon as its inputs are ready, an instruction may be issued; a finite amount of lookahead
- Combined with out-of-order completion, this gives the most ILP

The Instruction Window
- A pool of up to n decoded instructions
- Any ready instruction in this pool may be issued
- Must deal with antidependences in addition to flow and output dependences

Out-of-order Issue, Out-of-order Completion: Example
  1: DDIV R6, R8, R10
  2: DSUB R10, R6, ...    ; antidependency
  n: DADD ..., R6, R20
- If instruction n completes before instruction 2 reads its operands, it will lead to incorrect results
- Antidependencies (WAR hazards) must be enforced to ensure correct execution

Algorithms for Out-of-order Issue
- Scoreboarding
- Tomasulo's algorithm
- Others

Scoreboarding
- This idea was first introduced in the CDC 6600 (1964), which had 16 independent functional units: 4 for floating-point operations, 5 for memory access, and 7 for integer operations
- The scoreboard maintains information about the instructions, the functional units, and the results

Concept
[Diagram: a centralized scoreboard keeps the status of all instructions; instruction execution state and control.] Control out-of-order issue; permit out-of-order completion.

Key Ideas
- Decompose the decode stage into issue and read-operand (RO) steps; stall on WAW or structural hazards in the issue stage
- Allow bypassing in RO of independent (in terms of data flow) instructions; localize stalls -> stall only data-dependent instructions
- Enforce WAR during writeback; detect and enforce hazards as late as possible

The Scoreboard
Consists of:
- the instruction status table
- the functional unit status table
- the result table
Instruction Status Table
Keeps information about which phase of execution an instruction is currently in:
- issue: has the instruction issued?
- rdopd: has it completed reading its operands?
- exec: has it completed its execution?
- wrback: has it completed its writeback?

Functional Unit Status Table
[Example rows: Integer unit busy, source-ready flags Yes/No; Add unit busy with a Sub, dest F8, sources F6 and F2.] Has an entry for each functional unit, with 9 fields per entry:
- busy: indicates if the functional unit is busy
- op: the kind of operation being performed
- dest: the destination register
- src1, src2: the two source registers
- func1 (Qi), func2 (Qj): the functional units producing the results in the two source registers
- ready1, ready2: indicate if src1 and src2 are ready

Result Status Table
Maintains an entry per register, indicating the functional unit that will write the pending result into the register.

Operation Using a Scoreboard
[Pipeline diagram: the issue stage checks for WAW and structural hazards; the read-operands stage checks for RAW hazards; writeback checks for WAR hazards.]

Scoreboarding (1)
1. When an instruction is fetched, an entry is made in the instruction status table.
2. After the instruction is decoded, the corresponding issue entry in the instruction status table is marked.

Scoreboarding (2)
3. Select a functional unit by checking the busy flag of all the functional units that can execute the current instruction, and enter the relevant information into the corresponding functional-unit status table entry. func1 and func2 are obtained from the corresponding entries in the result table: for example, if one of the source registers is R1, then under the entry for R1 in the result table, locate the functional unit responsible for writing that register. If there is no entry, mark ready1 as ready; the same goes for src2. This takes care of flow dependency.
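The gating tests just described (stall issue on a structural or WAW hazard; read operands only once no functional unit is still producing them) reduce to two small predicates over the result table. This is a hypothetical sketch in my own names, not CDC 6600 logic:

```python
# Sketch of the scoreboard's two gating tests. `result` plays the role
# of the result status table: register -> functional unit that will
# write it (absent means no pending write).

def can_issue(dest, fu_busy, result):
    structural = fu_busy        # the needed functional unit is busy
    waw = dest in result        # a pending write already targets dest
    return not structural and not waw

def operands_ready(src1, src2, result):
    # A source is ready when no unit is listed as producing it
    # (the ready1/ready2 test, fed by the result table).
    return src1 not in result and src2 not in result

result = {"F0": "Mult1"}        # e.g., MUL.D F0,F2,F4 in flight
```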
Scoreboarding (3)
5. An instruction is ready for issue if both the ready1 and ready2 entries are marked ready and the corresponding entry in the result table for the destination register is empty. If the latter condition is not fulfilled, instruction issue is stalled; this avoids output dependency. The result-table entry is then overwritten with the number of the functional unit that will produce the result. [Result table: FU entries such as Mult1, Integer, Add, Divide.]
6. The instruction then proceeds to read its operands and executes, with the corresponding entry in the instruction status table updated accordingly.

Scoreboarding (4)
7. At the completion of the execution stage, the corresponding busy entry in the functional-unit status table is turned to "no."
8. Before writeback, it is necessary to check for antidependency. If there exists an antidependency, the current writeback must be stalled until the hazard is cleared.
9. During writeback, the result is written back to the register, the entry in the result table is turned to empty, and the instruction's entry in the instruction status table is deleted. At the same time, the functional-unit status table is scanned so that the ready entries can be updated to reflect the fact that the result in this register is now ready.

Example 1: Instruction Flow
  MULD F0, F2, F4
  SUBD F8, F6, F2
  ADDD F6, F8, F2
Functional-unit latencies: FP ADD 2 cycles, FP MULT 5 cycles, FP DIV 15 cycles, FP LOAD 2 cycles, integer 1 cycle. A register cannot be read and written in the same cycle. All units except FP DIV are pipelined.

Example 2: Instruction Flow
  Foo: LD   F0, ...
       DSUB R1, R1, 8
       BNEZ R1, Foo
       SD   ...
Same functional-unit latencies; one access per cycle to the register file; a register cannot be read and written in the same cycle. All units are pipelined.

[Diagram: division of scheduling work between compiler and hardware.]

Summary
- Out-of-order issue and out-of-order completion
- Performance limited by: the amount of ILP in the code segments; the number of entries in the scoreboard (i.e., the amount of lookahead); the number of functional units
- The complexity of the scoreboard is on the order of a functional unit

Pipelines
ECE 4100/6100, S. Yalamanchili, Fall 2003

Multi-Cycle Execution Units
[Datapath: integer unit, FP/integer multiply, FP adder (stages A1-A4), FP/integer divider. (c) 2003 Elsevier Science (USA), all rights reserved.]
Analysis:
- Latency vs. clock rate; presence of forwarding
- Issue order vs. completion order; latency and initiation interval
- Hazards: structural, data, and control

Handling Data Hazards
- WAR cannot occur (why not?); WAW can occur (how?)
- Solution: stall, or suppress writes
- RAW: check for pending writes and forward

Handling Structural Hazards
- Hazard resolution: structural hazards on a functional unit or the register-file write port
- Solution: stall in ID

Handling Exceptions
- Buffering state: history file and rollback; future file and commit
- Software support
- Issue restrictions to reduce overhead
- Or live with imprecise exceptions

Dynamic Scheduling
- Pure in-order issue will stall useful, independent instructions (example)
- Out-of-order issue leads to out-of-order completion, and hence to WAR and WAW hazards (example)

Solution 1: Centralized Scheduling
[Diagram: instruction execution state and control under a central scoreboard.]
- Reading: A.8
- Goal: further increase the issue rate to approach a CPI of 1
- What
do we need to do?
- Enforce data dependencies
- Prevent pipeline stalls: WAW and WAR hazards -> centralized control

Key Ideas
- Allow bypassing in ID of independent (in terms of dataflow) instructions; localize stalls -> stall only data-dependent instructions; other hazards cause stalls
- Break ID into issue and read-operand (RO) steps; permit independent instructions to bypass in RO; check for structural hazards in the issue stage
- Enforce WAR during writeback; detect and enforce hazards as late as possible
- High bandwidth to and from the register file
- No forwarding (will solve later); retain name dependencies and the resulting stalls (will solve later)

The Scoreboard
[Datapath: registers and data buses feeding FP mult, FP divide, and integer units, with scoreboard control/status lines. (c) 2003 Elsevier Science (USA), all rights reserved.]
- Issue if no structural or WAW hazards
- Stall in RO on a RAW hazard
- Stall in WB on a WAR hazard
- Functional-unit status keeps track of operands; register field status keeps track of data dependencies

Scoreboard Status Information
[Tables: FU entries (Mult1, Integer, Add, Divide); the functional unit that will deliver each value; whether the source registers have their values; e.g., Integer executing Load with dest F2, source R3; Add executing Sub with dest F8, sources F6 and F2.]
- These data structures keep global status that can be queried by the control logic
- The scoreboard implementation is as complex as one functional unit

Example: Instruction Flow
  LD   F6, 34(R2)
  LD   F2, 45(R3)
  MULD F0, F2, F4
  SUBD F8, F6, F2
  DIVD F10, F0, F6
  ADDD F6, F8, F2    ; WAR
Functional-unit latencies: FP ADD 2 cycles, FP MULT 5 cycles, FP DIV 15 cycles, FP LOAD 2 cycles, integer 1 cycle. A register cannot be read and written in the same cycle. All units except FP DIV are pipelined.

Solution 2: Tomasulo's Algorithm
- Reading: 3.2, 3.3
- Add hardware support for scheduling: react to runtime conditions
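Tomasulo's central trick, register renaming, can be illustrated in a few lines: every architectural destination gets a fresh tag, and later reads follow the newest mapping, which is exactly what removes WAR and WAW hazards. The function and the tag names (`T0`, `T1`, ...) are illustrative, not hardware:

```python
# Illustrative register renaming: map each architectural destination
# to a fresh tag; source operands read the newest mapping.

def rename(program):
    mapping, out = {}, []
    fresh = iter(f"T{i}" for i in range(100))
    for dest, srcs in program:              # instruction = (dest, [srcs])
        srcs = [mapping.get(s, s) for s in srcs]
        tag = next(fresh)
        mapping[dest] = tag
        out.append((tag, srcs))
    return out

prog = [("F0", ["F2", "F4"]),   # DIV.D F0,F2,F4
        ("F8", ["F0", "F6"]),   # ADD.D F8,F0,F6  (reads renamed F0)
        ("F0", ["F1", "F3"])]   # second write to F0: WAW removed
```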
runtime conditions Hardware support to prevent dependencies from becoming hazards 9 register renaming o Simplify compiler o Portability of binaries across pipelines 0 Register renaming to avoid WAR and WAW hazards ECE 41006100 S Yalamanchili Fall 2003 The Essence of Register Renaming DIVD F0 F2 F4 DIVD F0 F2 F4 ADDD 8 F0 F8 ADDD F0 so F6 0R1 WAW S39D 8 0R1 SUBD SUBD T F10 F14 MULD MULD F6 F10 T o Compilerbased renaming o Compiler analysis to provide analysis beyond code block 0 May extend capabilities beyond that of the compiler of reservation stations 0 Note that many forms of storage used in register re naming ECE 41006100 S Yalamanchili Fall 2003 Data Structures o LDSD buffers act as reservations stations for memory units 0 Instruction execution cannot start until all branches resolved Speculation a more complete framework Reservation stations Values A A r Qk Vj Vk A Busy Op Q Register value I Qi l ECE 41006100 S Yalamanchili Fall 2003 The Data Path and Functional Units Register renaming and distributed control A form of generalized forwarding Renaming registers distributed among the functional units Control logic for forwarding is distributed among the function units Serialization of broadcast of results enables correct operation ECE 41006100 S Yalamanchili From instruction unit instruction queue FP registers I Load store i operations 3 Floatingpoint operations Store buffers Operand buses Load buffers Operation bus Reservation ir stations Common data bus CDB 2003 Elsevier Science USA All rights reserved Fall 2003 Memory Disambiguation 0 Detection of RAW dependencies through memory SD F6 44R4 LD F8 32R8 RAW Dependency o Loads must be checked with preceding stores RAW 0 Stores must be checked with preceding Loads and Stores WAW and WAR 0 A simple scheme all effective address calculations are performed in program order Buffers A field stores effective address Can use forwarding directly tofrom loadstore buffers ECE 41006100 S Yalamanchili Fall 2003 What Next 
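The map-table view of renaming sketched above can be made concrete. Below is a minimal, hypothetical C model (the names, table size, and tag scheme are illustrative, not from the slides): each architectural register maps to the tag of its most recent writer, and every writer gets a fresh tag, which removes WAW and WAR name conflicts while preserving true (RAW) dependences through the tags.

```c
#include <assert.h>

/* Hypothetical sketch of the renaming idea behind Tomasulo's algorithm:
   each architectural register maps to the "tag" (reservation station /
   physical register) that will produce its next value. */

#define NUM_ARCH_REGS 32

static int map_table[NUM_ARCH_REGS]; /* arch reg -> current producer tag */
static int next_tag;                 /* fresh tags start past arch names */

void rename_init(void) {
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        map_table[r] = r;            /* initially, reg r holds its own value */
    next_tag = NUM_ARCH_REGS;
}

/* Rename one instruction "dst <- src1 op src2".
   Sources are read BEFORE the destination is remapped (handles dst == src). */
void rename_inst(int dst, int src1, int src2,
                 int *tag_dst, int *tag_src1, int *tag_src2) {
    *tag_src1 = map_table[src1];     /* pick up current producers: RAW kept */
    *tag_src2 = map_table[src2];
    *tag_dst  = next_tag++;          /* fresh tag: kills WAW/WAR on dst     */
    map_table[dst] = *tag_dst;
}
```

Renaming SUB.D F8,F10,F14 after ADD.D F6,F0,F8 gives the SUB a fresh tag for F8 that differs from the tag the ADD already read for its source, which is exactly why the WAR hazard disappears.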
Multiple Issue
- Superscalar: statically scheduled or dynamically scheduled
- VLIW/EPIC: statically scheduled
- Reading: 3.6
- Increase the issue rate: now issue multiple instructions per cycle
- Issue restrictions simplify control
- Increase in forwarding-logic complexity
- Importance of branch-prediction mechanisms
- Hardware for concurrent decoding and execution

A Taxonomy
- Superscalar (static): dynamic issue, hardware hazard detection, static scheduling; in-order execution; e.g., Sun UltraSPARC II/III
- Superscalar (dynamic): dynamic issue, hardware hazard detection, dynamic scheduling; some out-of-order execution; e.g., IBM Power2
- Superscalar (speculative): dynamic issue, hardware hazard detection, dynamic scheduling with speculation; out-of-order execution with speculation; e.g., P3, P4, MIPS R10K, Alpha 21264
- VLIW: static issue, software hazard detection, static scheduling; no hazards between issue packets; e.g., i860, Trimedia
- EPIC: mostly static issue, mostly software hazard detection, mostly static scheduling; explicit dependences marked by the compiler; e.g., Itanium

Design Issues
- Issue packet
- Issue restrictions: motivation (match the hardware; trade off complexity vs. performance), enforcement, impact on penalties
- Multiple issue: checking within and across packets; pipelining the issue logic
[Figure: instructions i, i+1, i+2, i+3 — issue and fetch multiple instructions per clock cycle.]

Dynamic Scheduling with Multiple Issue
- Widen the issue logic
- Boost instruction issue (remaining in order) by using reservation stations to move dependence handling to runtime
- The match between the available functional units, the distribution of dependences, and the amount of real work determines the achievable performance
- Examples follow

Example
Three iterations of the loop:
  LOOP: L.D     F0, 0(R1)
        ADD.D   F4, F0, F2
        S.D     F4, 0(R1)
        DADDIU  R1, R1, #-8
        BNE     R1, R2, LOOP
[Tables: issue / execute / memory / write-back cycle numbers for each instruction of the three iterations; the per-cycle entries are not fully legible in the source. The second table assumes a separate integer unit for effective-address calculation.]

ECE 4100/6100 Advanced Computer Architecture
Lecture 12: Multithreading
Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology
(With adaptations and extensions for ECE 4100/6100 Spring 2009, for the Xeon and Power5 processors, by S. Yalamanchili)

Reading: Sections 3.2, 3.3, 3.5; papers as listed on the web page

Limits of ILP: The Perfect Processor
- Register renaming: infinite virtual registers
- Branch prediction: perfect, no mispredictions
- Jump prediction: all jumps perfectly predicted
- Memory-address alias analysis: addresses known, and a load can be moved before a store provided the addresses are not equal
- Perfect caches: 1-cycle latency for all instructions

Available ILP
[Figure: instruction issues per cycle (0-160) for SPEC benchmarks (gcc, espresso, li, fpppp, doduc, tomcatv) on the perfect processor.]

Effects of Window Size
- Window size is limited by storage, the number of comparisons, and the issue rate
- Other structural limits typically reduce the issue rate below the window size
[Figure: issues per cycle vs. window size across the benchmarks.]

Effect of Available Registers
- Speculation requires a larger number of rename registers: a greater number of live values
[Figure: issues per cycle vs. number of rename registers across the benchmarks.]
- Limits on ILP come from program structure and from other structures (e.g., the branch predictor)

Opportunity
- We need to improve execution-unit utilization: 25% utilization is not uncommon (Power5 paper)
- We need to find more instructions to execute
- Look for higher levels of parallelism beyond the ILP of a serial instruction stream: thread-level (TLP) and task-level parallelism
- Investments in power and transistors have not been returning commensurate improvements in performance
- How can we increase utilization of key hardware components via TLP?

TLP
- ILP of a single program is hard; large ILP is far-flung
- We are human, after all: we program with a sequential mind
- Reality: running multiple threads or programs
- Thread-level parallelism: time multiplexing; throughput computing; multiple-program workloads; multiple concurrent threads; helper threads to improve single-program performance

Multi-Tasking Paradigm
- Virtual memory makes it easy
- A context switch could be expensive, or requires extra HW: VIVT caches, VIPT caches, TLBs
[Figure: functional units FU1-FU4 occupied cycle by cycle by one of threads 1-5 — conventional superscalar, single-threaded.]

Multithreading Paradigm
[Figure: FU occupancy under a conventional superscalar; fine-grained multithreading (cycle-by-cycle interleaving); coarse-grained multithreading (block interleaving); chip multiprocessor (CMP / multi-core); simultaneous multithreading (SMT).]

Conventional Multithreading
- Zero-overhead context switch
- Duplicated contexts (register files) for threads; memory shared by threads

Cycle-Interleaving MT
- Per-cycle, per-thread instruction fetching
- Examples: HEP, Horizon, Tera MTA, MIT M-machine
- Interesting questions to consider: does it need a sophisticated branch predictor? Or does it need any speculative execution
at all? "Get rid of branch prediction"? "Get rid of predication"? Does it need any out-of-order execution capability?

Tera Multi-Threaded Architecture
- Cycle-by-cycle interleaving: the MTA can context-switch every cycle (3 ns)
- As many as 128 distinct threads, hiding 384 ns
- 3-wide VLIW instruction format (M | ALU | ALU/Br)
- Each instruction has 3 bits of dependence lookahead: determine whether there is a dependency with subsequent instructions; execute up to 7 future VLIW instructions before a switch
Example (schematic; the operators are illegible in the source):
  loop: nop            | r1 = r2 op r3  | r5 = r6 op 4     (lookahead = 1)
        nop            | r8 = r9 op r10 | r11 = r12 op r13 (lookahead = 2)
        r5 = r1 op r4  | r4 = r4 + 1    |                  (lookahead = 0)

Block-Interleaving MT
- Context switch on a specific event (dynamic pipelining)
- Explicit switching: implement a switch instruction
- Implicit switching: triggered when a specific instruction class is fetched
- Static switching (switch upon fetching):
  - switch-on-memory-instructions (Rhamma processor)
  - switch-on-branch, or switch-on-hard-to-predict-branch
  - the trigger can be an implicit or an explicit instruction
- Dynamic switching:
  - switch-on-cache-miss (switch in a later pipeline stage): MIT Sparcle (the MIT Alewife node), Rhamma
  - switch-on-use (a lazy variant of switch-on-cache-miss): a valid bit is needed for each register, cleared when the load issues and set when the data returns
  - switch-on-signal (e.g., interrupt)
  - predicated switch instruction, based on conditions
- No need to support a large number of threads

Simultaneous Multithreading (SMT)
- The SMT name was first used by UW researchers; an earlier version came from UCSB (Nemirovsky et al.)
- Commercial examples (text partly illegible in the source): Intel's Hyper-Threading (2-way SMT); Power5 (2 cores, each 2-way SMT, 4 chips per package); Power6 (in-order cores); Nehalem
[Figure: basic idea — conventional MT vs. SMT issue; functional units such as the FP divider shared across threads in the same cycle.]

Instruction Fetching Policy
- FIFO, round-robin: simple, but may be too naive
- Adaptive fetching policies:
  - BRCOUNT (reduce wrong-path issuing): count of branch instructions in the decode/rename/IQ stages; give top priority to the thread with the least BRCOUNT
  - MISSCOUNT (reduce IQ clog): count of outstanding D-cache misses; give top priority to the thread with the least MISSCOUNT
  - ICOUNT (reduce IQ clog): count of instructions in the decode/rename/IQ stages; give top priority to the thread with the least ICOUNT
  - IQPOSN (reduce IQ clog): give lowest priority to threads with instructions closest to the head of the INT or FP instruction queues (threads with the oldest instructions will be most prone to IQ clog); no counter needed

Resource Sharing
- Could be tricky when threads compete for the resources
- Static partitioning: less complexity; could penalize threads (e.g., instruction-window size); P4's Hyper-Threading
- Dynamic sharing: complex; what is fair, and how do we quantify fairness?
- A growing concern in multicore processors: shared L2, bus bandwidth, etc.
- Issues: fairness; mutual thrashing

Alpha 21464 (EV8) Processor: Technology
- Leading-edge process technology: 1.2-2.0 GHz; 0.125 um CMOS; SOI-compatible; Cu interconnect; low-k dielectrics
- Chip characteristics: 1.2 V Vdd; 250 million transistors; 1100 signal pins in flip-chip packaging

Alpha 21464 (EV8) Processor: Architecture
- Enhanced out-of-order execution (that giant 2Bc-gskew predictor we discussed before is here)
- Large on-chip L2 cache; direct RAMBUS interface; on-chip router for system interconnect
- Glueless directory-based ccNUMA for up to 512-way SMP
- 8-wide superscalar; 4-way simultaneous multithreading (SMT)
- Total die overhead: 6% (allegedly)

SMT Pipeline
[Figure: fetch, decode, queue, register read, execute, D-cache / store buffer, register write, retire; registers and D-cache shared.]

EV8 SMT
- In SMT mode, it is as if there are 4 processors on a chip that share their caches and TLB
- Replicated hardware contexts: program counter; architected registers (actually just the renaming table, since architected registers and rename registers come from the same physical pool)
- Shared resources: rename-register pool (larger than needed by 1 thread); instruction queue; caches; TLB; branch predictors
- Deceased before seeing the daylight

P4
- Observations for dynamically scheduled processors:
  - They have large register sets with support for renaming
  - Tag support enables tracking of instructions across threads
  - Schedulers and execution units track dependencies
- What resources must be partitioned on a per-unit basis?
- Conflicts on shared resources should not degrade performance substantially
- Idea: provide support for sharing resources across threads with little additional hardware support → Hyper-Threading
- Abstraction: logical processors — this is what the programmer and the operating system see

Hyper-Threading in the Xeon Processor Family
[Figure: 2 CPUs without Hyper-Threading vs. 2 CPUs with Hyper-Threading, each exposing two logical processors over one set of processor execution resources.]
- Goals: minimize die-area cost; independent forward progress for each logical processor; do not penalize single-thread performance → minimize static allocation of resources
- The implementation of Hyper-Threading adds less than 5% to the chip area
- Principle: share major logic components by adding or partitioning buffering logic

P4 Microarchitecture
(From "The Microarchitecture of the Pentium 4 Processor")
[Figure: instruction decoder, microcode ROM, trace cache, uop queue; quad-pumped 3.2 GB/s bus interface unit; L2 cache; 2x ALUs, slow ALU, 48 GB/s L1 path; further details garbled in the source.]

The Xeon Pipeline
[Figure: trace cache, uop queue, rename, schedulers, register read, execute, L1 cache, register write, retire. Annotations (partly illegible in the source): queues partitioned between logical processors with fairness enforced; schedulers oblivious to logical processors; independent register state; ROB and load/store-queue entries partitioned; ITLBs and PCs duplicated for each logical processor; independent logic for decode; return-address stacks duplicated; branch-history buffers duplicated with the history array shared.]

Performance
- 65% performance increase for high-end server applications on a 4-way server platform
- 20-30% performance improvement for categories such as transactions, web serving, and server-side Java environments
- The operating system can optimize the scheduling of threads across logical/physical processor combinations

Power 5
[Figure: Power5 instruction pipeline — branch prediction and branch redirects, instruction fetch, group formation and dispatch; instructions are tracked through the pipeline as a group, in order. From "Power5 System Architecture," B. Sinharoy et al., IBM J. Res. & Dev., vol. 49, no. 4/5, July-September 2005.]

Power 5 Key Features
- Shared I-cache, fetching 8 instructions/thread/cycle
- Shared BHT
- Group dispatch and commitment: instructions tracked as a group via the GCT
- Register renaming dynamically shares registers between threads, as do the LRQ and SRQ
- Issue is independent of group membership
- I- and D-caches fully shared, via increased associativity
- Resource-balancing logic to prevent starvation

Consequences of SMT over Power 4
- Increased L1 I-cache associativity
- Sharing of address-translation tables
- Addition of per-thread load and store queues (facilitate ordering checks within a thread)
- Increased size of the L2 and L3 caches
- Separate instruction prefetch and buffering
- Increased number of virtual registers
- Increased size of the issue queues
- Dynamic resource throttling between threads

Review of the Superscalar Datapath
- In-order fetch and issue logic; out-of-order execution core; in-order completion logic
- Instruction completion, sample tasks: enable waiting instructions; retire from the reorder buffer; forward from the reorder buffer
- Instruction execution, sample tasks: data-driven execution → all dependencies have been resolved
- Instruction issue, sample tasks: renaming; allocate reservation stations; allocate reorder-buffer entry and LSQ entries; check for structural hazards
- Instruction execution, continued: issue to the functional unit; deallocate reservation stations; forwarding; check load/store dependencies

ECE 4100/6100, S. Yalamanchili, Fall 2003

Improving the Performance of the Cache Hierarchy
  AMAT = hit time + miss rate x miss penalty
- Reductions in the miss penalty
- Reductions in the miss rate
- Reductions in the hit time
- Compiler optimizations

Reducing Miss Penalty — 1. Multilevel Caches
- Reading: Section 5.4
- Goal: balance fast hits vs. slow misses; techniques for the former are distinct from those for the latter
- Goal: keep up with the processor vs. keep up with memory; reduces to small vs. large

Analysis
  AMAT = HitTime(L1) + MissRate(L1) x MissPenalty(L1)
  MissPenalty(L1) = HitTime(L2) + MissRate(L2) x MissPenalty(L2)
  AMAT = HitTime(L1) + MissRate(L1) x (HitTime(L2) + MissRate(L2) x MissPenalty(L2))
- Local miss rate: defined with respect to the accesses reaching that cache
- Global miss rate: defined with respect to the total number of memory references

Performance
[Figure: local, global, and single-cache miss rates vs. second-level cache size (4 KB - 4096 KB); relative execution time vs. L2 size.]
- Note: the L2 hit time is not that important (why?)
- The miss-rate behavior of a large L2 is indistinguishable from that of a single cache
- The global miss rate is a good indicator of performance

Design Issues
- Speed: L1 is coupled to the CPU; L2 is coupled to the L1 miss penalty
- L2 design: reduce the miss rate to main memory; associativity → reduction in conflict misses; size → reduction in capacity misses; match the main-memory design
- Handling writes: use a write-through design with write buffers
- Multilevel inclusion/exclusion: with inclusion, L1 data is always contained in the L2; maintenance requires L1 invalidations by the L2; size- or cost-constrained designs may use exclusion

Multilevel Inclusion/Exclusion
[Figure: L1/L2 tag-and-data arrays; under exclusion, a replaced line is swapped with the L2 entry; under inclusion, a line replaced in the L2 invalidates the corresponding L1 entry.]
- Simplifies coherence maintenance
- Increase in miss rate, but reduced cost

2. Critical Word First / Early Restart
[Figure: memory word boundaries 0x40, 0x44, 0x48, 0x4C; the word referenced by the CPU within the line.]
- Critical word first: fetch the referenced word first and the remainder of the line in the background
- Early restart: standard line fetch, but the referenced word is forwarded to the CPU as soon as it arrives
- Gains improve for larger line sizes
- Complexity: multiple successive references to the same block

3. Priority of Reads over Writes
[Figure: write buffer between the cache and main memory.]
- Give reads priority over writes to main memory
- Check for RAW hazards in the write buffer and stall on a conflict
- Write buffers are used in write-through designs as well as write-back designs: overlap the write-back with CPU operation

4. Merging Write Buffer (write combining)
[Figure: write addresses and valid bits in a merging vs. a non-merging write buffer.]
- Improves the efficiency of write buffers
- Combine sequential writes into a burst transaction to memory
- Amortize the transfer startup overhead

Performance of Write Combining
[Figure: effects of write combining on PIO injection bandwidth vs. injection burst size; write-combining PIO vs. plain PIO.]
- Close to 90% bus-bandwidth utilization

5. Victim Cache
[Figure: small fully associative victim buffer (tags + data) between the cache and lower-level memory.]
- Reduce the conflict misses of a
direct-mapped cache
- An effective backup: capture a slightly larger cache footprint; capture recent discards

Reducing the Miss Rate
- Reading: Section 5.5
- Focus on reducing: compulsory misses (e.g., larger block size); capacity misses (e.g., larger cache); conflict misses (e.g., higher associativity)
- Trade off the miss rate against: hit time (e.g., higher associativity can increase hit time); miss penalty (e.g., a larger block size can increase the miss penalty)

1. Larger Block Size
[Figure: miss rate vs. block size (16-256 bytes) for cache sizes 4K-256K.]
- A larger block size increases spatial locality at the eventual expense of temporal locality (compare Figures 5.17 & 5.18)
- Reduces compulsory misses, but eventually increases conflict misses
- Reductions in the miss rate are accompanied by an increase in the miss penalty; what happens to AMAT?
- High/low-latency and high/low-bandwidth memory: impact on the block size choice

2. Increase Associativity
- The 2:1 rule: a direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
- Trade off against increased hit time

3. Way Prediction and Pseudo-Associative Caches
- Way prediction: use a set-associative cache, but predict the line within the set; the multiplexer is set early; makes the common case fast
- A natural affinity with I-cache behavior: the 2-way set-associative 21264 I-cache keeps a predictor bit; 1-cycle hit vs. 3-cycle hit
- Also usable for activity management (power)
- Variant: the pseudo-associative cache: each set has a fast-hit line and a slow-hit line (fixed); maintaining a block in the fast-hit position requires transfers from the slow-hit position; performance degrades with too many slow hits

4. Compiler Techniques
- The memory hierarchy is exposed to the compiler
- We can schedule for execution performance — why not schedule for miss rate or miss penalty?
- Examples: reordering instructions to improve locality; reordering data accesses to improve locality; reducing conflict misses by remapping instructions or data in memory

4.1 Loop Interchange

  /* before */
  for (j = 0; j < 100; j++)
      for (i = 0; i < 5000; i++)
          x[i][j] = 2 * x[i][j];

  /* after interchanging the loops */
  for (i = 0; i < 5000; i++)
      for (j = 0; j < 100; j++)
          x[i][j] = 2 * x[i][j];

- Improve spatial locality by matching the order of traversal with the order of storage
- Principle: maximize the use of the data in a line before it is discarded
- This optimization does not affect the dynamic instruction count

4.2 Blocking

  /* original code */
  for (i = 0; i < N; i++)
      for (j = 0; j < N; j++) {
          r = 0;
          for (k = 0; k < N; k++)
              r = r + y[i][k] * z[k][j];
          x[i][j] = r;
      }

  /* transformed (blocked) code, block size B */
  for (jj = 0; jj < N; jj += B)
      for (kk = 0; kk < N; kk += B)
          for (i = 0; i < N; i++)
              for (j = jj; j < min(jj + B, N); j++) {
                  r = 0;
                  for (k = kk; k < min(kk + B, N); k++)
                      r = r + y[i][k] * z[k][j];
                  x[i][j] = x[i][j] + r;
              }

[Figure: access patterns — one block in a column; all blocks in a column; all columns of blocks.]
- Restructure the loops to improve fit in the cache and improve temporal locality
- Solutions now become machine-dependent

4.2 Blocking (cont.)
- Each (jj, kk) pair computes a partial product for one block, a row of the block at a time; completing all kk blocks finishes Block (0,0), the blocks in a column, and finally all columns
- What is the miss behavior?
- Decompose the computation to operate on BxB blocks such that three blocks fit in the cache
- Reduce the overall number of worst-case misses by a factor of B

4.3 Loop Fusion

  /* three sequential arrays of the same size */
  int a[SIZE];
  int b[SIZE];
  int c[SIZE];

  /* before: two passes over a */
  for (i = 0; i < SIZE; i++)
      a[i] = b[i];
  for (i = 0; i < SIZE; i++)
      c[i] = a[i] * K;

  /* fused: one pass */
  for (i = 0; i < SIZE; i++) {
      a[i] = b[i];
      c[i] = a[i] * K;
  }

- Reduce conflicts between the two loops' accesses
- Exploit temporal locality across the arrays
- It is not always obvious whether loops can be fused
- What about merging the arrays into a single structure?

4.4 Compiler-Controlled Prefetching

  /* original */
  for (i = 0; i < 3; i++)
      for (j = 0; j < 100; j++)
          a[i][j] = b[j][0] * b[j+1][0];

  /* with software prefetch, 7 iterations ahead */
  for (j = 0; j < 100; j++) {
      prefetch(b[j+7][0]);
      prefetch(a[0][j+7]);
      a[0][j] = b[j][0] * b[j+1][0];
  }
  for (i = 1; i < 3; i++)
      for (j = 0; j < 100; j++) {
          prefetch(a[i][j+7]);
          a[i][j] = b[j][0] * b[j+1][0];
      }
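As a runnable sanity check on the blocking transformation shown earlier in this section, here is a self-contained sketch (N, B, and the function names are illustrative; B divides N here, so the min() bounds are dropped) comparing the naive and blocked matrix multiplies:

```c
#include <string.h>

#define N 8
#define B 4   /* illustrative: three BxB blocks should fit in the cache */

void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Same computation, restructured so each BxB block of y and z is reused
   while it is still cache-resident. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    memset(x, 0, sizeof(double) * N * N);
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* partial product for this block */
                }
}
```

Both versions compute the same x; blocking only changes the traversal order, which is why the transformation is legal and why its benefit is machine-dependent (it depends on the cache size relative to B).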
- Reading in Section 5.6
- Register prefetch or cache prefetch instructions
- Behavior of the prefetch instructions: faulting vs. non-faulting
- The investment should be worthwhile
- Optimization hints: allocate on write, but do not load
- What about prefetching in the presence of pointers?

5. Non-Blocking Caches
[Figure: percentage of the average stall time remaining under hit-under-1-miss, across the SPEC benchmarks.]
- Continue operation through a miss: "hit under miss" or "miss under miss"
- Hide the miss penalty, effectively reducing its impact on CPI → latency hiding
- Implies the memory system can service multiple misses concurrently
- Handling multiple misses to the same line
- The complexity of the cache controller and the memory-system interface increases

6. Hardware Prefetching
- Hardware engines prefetch into stream buffers resident outside the cache
- Profile-driven prefetch; runtime statistics to drive the prefetch mechanisms
[Figure: I-stream buffer and D-stream buffers between the caches and lower levels.]

Reducing the Hit Time — 1. Small, Simple Caches
[Figure: address split into tag | index | byte; overlap of tag lookup and line access.]
- Reading: 5.7
- Components of a hit: decode time (related to cache size); tag match; line access and word extraction
- Cache size: can it fit on chip? For L2 caches, keep the tags on chip

Hit Time Analysis
[Figure: access time (ns) vs. cache size (4 KB - 256 KB) for 1-way, 2-way, 4-way, and fully associative organizations.]
- Direct-mapped caches tend to be 1.2-1.5 times faster than two-way set-associative caches
- Note the trends in L1 cache size: speed is not keeping up with processor pipelines; sizes grow slowly or stay constant to keep hit times low

Pipelined Caches
- Pipeline the cache into multiple levels (stages)
- Slower hits, but increased instruction bandwidth
- Impact: penalties due to hazards increase; exception handling can be complicated

Trace Caches
- The cache line contains instruction traces, determined dynamically by the processor rather than statically by the compiler
- Header and trailer utilization around branches is better in a trace cache
- Redundant instruction storage in a trace cache
- Addresses are no longer aligned to power-of-2 multiples of the word size

Summary
- Optimizations focused on three main cache attributes: miss penalty; miss rate; hit time
- General strategies include: latency tolerance — hide latencies by overlapping misses with useful work; concurrency of operation
- Focus on each of the parameters of the expression for cache access time
- Compiler strategies vs. hardware strategies

Study Guide
- Given a complete memory-system design, understand the design: depending on the data given, compute the number of sets, the associativity, and the address breakdown at each level of the cache hierarchy
- Given a sequence of memory addresses and a specific optimization (such as a victim cache), be able to update the contents of the cache directory — tags and state such as the dirty bit and valid/invalid bit — and any associated data structures
- Assess the impact of the memory hierarchy on the CPI, for single- and multilevel caches
- Be able to translate given miss data from misses/reference to/from misses/instruction; similar ability for stall cycles
- Assess the impact of specific optimizations, given the relevant data, on the CPI, the average memory access time, and the stall cycles
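For the study-guide items on AMAT, the two-level expression from the Analysis slides can be turned into a small calculator. This is a sketch: the function names and the sample numbers in the usage note are made up for illustration.

```c
/* Two-level AMAT, as in the Analysis slides:
     AMAT = HitTime(L1) + MissRate(L1) * MissPenalty(L1)
     MissPenalty(L1) = HitTime(L2) + LocalMissRate(L2) * MissPenalty(L2)
   local_mr_l2 is the LOCAL L2 miss rate: L2 misses per L1 miss. */
double amat_two_level(double hit_l1, double mr_l1,
                      double hit_l2, double local_mr_l2,
                      double penalty_l2) {
    double miss_penalty_l1 = hit_l2 + local_mr_l2 * penalty_l2;
    return hit_l1 + mr_l1 * miss_penalty_l1;
}

/* The GLOBAL L2 miss rate is per CPU memory reference, not per L1 miss. */
double global_miss_rate_l2(double mr_l1, double local_mr_l2) {
    return mr_l1 * local_mr_l2;
}
```

For example, with a 1-cycle L1 hit, a 25% L1 miss rate, a 10-cycle L2 hit, a 50% local L2 miss rate, and a 100-cycle L2 miss penalty: AMAT = 1 + 0.25 x (10 + 0.5 x 100) = 16 cycles, and the global L2 miss rate is 0.125.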