Computer Architecture EEL 4768
University of Central Florida
Popular in Course
Popular in Electrical Engineering
This 21 page Class Notes was uploaded by Isaac Hauck on Thursday October 22, 2015. The Class Notes belongs to EEL 4768 at University of Central Florida taught by Staff in Fall. Since its upload, it has received 11 views. For similar materials see /class/227664/eel-4768-university-of-central-florida in Electrical Engineering at University of Central Florida.
Reviews for Computer Architecture
Report this Material
What is Karma?
Karma is the currency of StudySoup.
Date Created: 10/22/15
Computer Von Neumann The architecture of a computer system is what the user sees or has access to The structure is called the implementation EEL 4768 Von Neumann ISA Instruction Set Architecture is rst described Computer System Design II in 1946 Lecture 2 Von Neumann ISA is an accumulator based ISA similar to Motorola 68HC11 used in EELA767 Von Neumann Model Instructions data and IID share the same address space Prof Taskin Kocak httplwwwcsucfedutkocaklee4768html Arithm ed e Logie Unit Instruction Set Architect urn Instruction Set Architecture Data Types Neumann ISA had only 21 instructions Some authors considered it an early Rlsc architecture Fractmns 12er igrr lgl39lts of the instruction set are Address Data Types A 40 bnwordtypes as a 25 complement when Addresses 39 e ordanizaiidn di Memnlyr ndw drierands are reiereneed and resdiis sidred u smemn E 5 Fraction n r mzcces rieisnMI accumuialirrand emniy 7 Mn mndl mllnns In lhe address such as register reialive nr indexmii Luca I slnrme is a smile accumuialnr Instr 39ons Memory organized in 4096 words 2m Two 20 ebit insauedons An 87b operation eode A 12bit address are alloeatedto the Aoebitmemory word 39 22 27 2 19 Left Address Left 0 27 3 OP InstructionSetquot kquot 39 Registers Instruction set Arrhitnrtllrn n o39 Has seyen registers for interpreting insauetions from memory Three types Move ALU and Control N0 10 Show Move meyes data belwem the aeeumulater multiplier duetientregster MQ and memery ALU eperatiens sueh as add subtiaet multiply and diyi cirnmrl Uneenditienal and cenditienal hianeh instrueti de uns that redireet pregam uw MQ indiesmearing Registers Instruction Interpretation Cycle Amhsz gregarious Registers Mzmaxylntzn ace Ragislus w i um i M i i Cammlegisms it i Inslru inn Executinn CyEIE r BENCH Example Inslru inn Executinn CyEIE 7 quotCH Example mum Mum Mamba MthCude strum on Memory Lo Machine Code 3 m m 2 LDAAsznnu scuun E6 2 un 315 1 Warm is mama gamma Mamae m mm mm awn lnslrucl39nn Execminn Cyclz am1c A x PCsetmCDDD cmmrmememoryinmmuszuumsszi 2 Emma maukmmmcmwpwae ia momma ta mm A New mm mums m rmmmanma im hmmka v and addnsg mfannanan mummaas a m a pc n Firm 2 i m fetchth mandamus madam mmuxmi It performs the apn mnspci edbyh upcadz 3 rcuxmjlimufogms h min n wntzsbackthz result in enhznegistzmxmemaxyincman W wgfh i v u mummies M as mm aim mm mum Pcscm3 amasssmsuaammm new a new We momma ecu mesylxedmkaddmsbmwnha an mm ems mm 5 M an a tum mummtcc A Limitations of the Von Neumann ISA No automatic address modi cation he address in the inslru inns must he mnd led hy nlher instruct ns In indexlhrnugh an 2 th s Ihesellr mndny ng cndelhzl is ver pruneln prngrzmming ermr No base register mode The program counter is an implemented register All mndem pincessnls have an zrchilecled PC EEL 4768 Computer System Design 2 Lecture 5 Designing a smgle cycle Datapath MINarm lam nitnannies 1 Analyze instruction set gt datapath reguirements the meaning of each instruction is given by the register transfers datapath must include storage element for ISA registers 7 possibly more datapath must support each register transfer 2 Select set of datapath components and establish clocking methodology 3 Assemble datapath meeting the requirements 4 Analyze implementation of each instruction to determine setting of control points that affects the register transfer 5 Assemble the control logic MINarm lam brimice All MIPS Instructions are 32 bits long Thethree instruction formats 31 5 2i lo ii mp9 Is the me We so We 6a er a it n quotWe H In 6a so We was Wpe 3i 2 n n at allde 5 hrs 25 hrs The dlfferent elds are op operation of the Instruction rt rd the sourc rs e and destination register speci ers ount sham hilt am funct selects the address immediat variant of the operation in the op field e address offset o immediate value r target address target address oftheiump instruction mum and harmice Ste 1a 1 Mine r todav i lo ii ADD and sue 3 26 2 6 addu rd 539 n 6 hrs 5 hrs 5 hrs 5 hrs 5 hrs 6 hrs U rdv rs rt a 2 2i is u R quotquot39quot 39 a 39 quot39 nvmv39m39Ms 6hrs Shrs Shrs 16 hrs a 2 2i is a MD and ST REW N In rt rs Innquot 6 ohrs Shts Shts 16hrs 3i 2 2i n ERAN H Is mm io n bequr rImm16 ohrs Shts Shts 16hrs MINarm lam nitnannies 39 39 Transfers RTL gives the meaning of the instructions All start by fetching the instruction up lrslnlm lshamll mclMEMH7C 1 uplrslnl lmml MEMPC insi Rgzgter Transfers ADDLV Rhine Rxx Rn susu Rhine RxxrRn oiu Rn lt7 Rxs zemiexl mml pc lt7 pc c LOAD eri lt7 MEM Rxs sig3xi1mn161 PC lt7 pc c STORE MEMRxssig1Lexllmml lt7RnFC ltel2c BEQ iI Rxs RnvhenPC ltel2c Sigiiexl mml ll W e pc lt7 pc c Pcltepc MINarm lam brimice Steo 139 mory instruction amp data Registers 32 x 32 read RS read RT Write RT or RD PC Add and Sub register or extended immediate Add 4 or extended immediate to PC mum and harmice Step 239 the Datanath Combinational Elements Storage Elements clocking methodology newtarmy lam nitnannies CarryIn Adder A 3 3 Sum B 3 Carry Sale NUX A 3 g B 5 3 Y 3 0 ALU A 1 3 3 Result 3 newtarmy lam brimice Keuislel Register W Similar to the D Flip Flop except Nblt input and output Damn Write Enable input Write Ena negated 0 Data Out will not change asserted 1 Data Out will become Data In mwonrm nvnn nmmaavce Redister File Register File consists of 32 registers RWRARB Two 32bit output busses W i EEmh E 5 5 busA and busB One 32bitinput bus busw Register is selected by RAnu mber selects the register to put on busA data RB number selects the register to put on busB data RW number selects the registerto be written via busw data when Write Enable is 1 Clock input CLK The CLK input is afactor ONLY during write operation During read operation behaves as a cornbinational logic block 7 RA or RB valid gt busA or busB valid after access ti mequot newtarmy lam nitnannies Memory idealized ne inp bus Data in One output bus Data Dut Memory word is selected by Address selects the word to put on Data Dut Write Enable 1 address selects the memory word to be written viathe Data in bus Clock input CLK The CLK input is afactor ONLY during write operation During read operation behaves as a cornbinational logic block 7 Address valid gt Data Out valid after access timequot newtarmy lam brimice All storage elements are clocked by the same clock edge cycle Tlme cLKtod Longest Delay Path setup clock Skew cLKtod shortest Delay Path clock skew gt Hold Time mwonrm nvnn nmmaavce t 1h mm 11 e m Sten 3 Register Transfer Reguirements gt Datapath Assembly Instruction Fetch Read Operands and Execute Operation mama am VINnanICE 3a39 Overview on The common RTL operations Fetch the instruction memPC Update the program c unter a Sequential Code PC lt PC 4 Branch and Jump Pc lt something elsequot mama am mamame RrdltRrs op Rrt Example addu rd rs rt Ra Rb and Rw come from instruction s rs rt and rd elds ALUctr and RegWr control logic arter decoding the instruction 3 25 zl 16 u nn sllzml aa saa saa ours 5 bits 5 bits 5 ms mmaa am vlhrlanIlCE 39MI Example lw rt rs immls mama am VINnanICE 3c39 Loaical 39 Rm lt Rrs op zeroExqimmlsn mama am a mamame Rim lt MemRrs SlgnExtlimlMS 16 mmaa am vlhrlanIlCE may Datanatn IUl Datapath generates condltlon lequall 3239 MemRrssrgnExtllmmlsltRrt1 Examplesw rtrs1mmls 15 21 25 21 mm a 5 ms Shus 51m 11mph m lean nthnannies mop ALUch 15 hrs 3f39 21 25 21 15 a 5 hrs 5 bus 51m 15 hrs beq rsrt1mmls memFc Equal lt Rrs Rm calculate the branch condltlo lrlcoun eqol calcu 7 PC lt Fc415lgnExt1lmm1Sx4 else 7 PC lt PC 4 mm mm normma Fetch the lnstmctlon from memory late the next lnstmctlon39s address n beq rsrt1mmls 31 25 21 a rs 11mm nvnn vmmmca 15 5 155 s 1 1r RTL gt Control Puttina it All Tnnpthpr A E E mm mm nthnannies lmumltz1 11gt ALLan Meme MemmReg 5x105 ALUch An Reglster le and 1deal memory Dunng read operatlon b r 0 Address valld tput valld after acces Rzg39slszib39s An ALI 1m Memury m Selup Time m R Cluck Sluw mm mm normma The cLK unput 1s aractor ONLY durlng wrlte operatlon ehave comblnatlonal loglc u s tlm 1mm Memnxy39s mes Time essTmIz mTime g39slszih war 11mm nvnn vmmmca nPcisleegW39xRegDSLEx39Op Amsc ALLvm Mezan MemmReg Equal 1mm 1mm Meanina or Rs Rk Rd and Imed1s hardwlred Imo dakapa h nFc sel ogt Pclt lc4 1gt Pclt lc4srgnsxulm1sllloo MemW wme memory zero Sign MemtoReg 1 gt Mem ALUsrc ogt regE1 mulled RegDst 0 gt rt 1 gt rd ALUctr add sub or Regwr wme dest reglster chu 11le MemmReg Illa REM ADD Rxd lt7RxsRx1 PC PC ALUm Rng ALUcn add Rngsl m Rngr quotraise w 5quot RI39llltRSlRI ll PClt7PC Um p Quack 4124a 7Rngx7 MmmReg7 Memwm 7 Em om an1ltenrrs1umjxramra rcqcu ALUm 7 my 7 Alch 4 Rena e 7 mgwun MmmReg7r mme nPcise 10m Rn Mallll ers n39glile mml rc lterc 1 m iJIxmp insular ngnsr gnaw Mrmmkega Memwm sroRE mumpsagnyrrrmraryerqrs rcqcu Emp R st RzgmeznmkegmMenwm Alum Rrl mrc we Wiemmmrm u m use ya we 1 Mm helm 7 Ragwm Munmkegm MenWm umvhamm um Wm M MUS m um rmlea Control Sianals Sten 539 Louis for k 39 39 39 sun 5 99 g 9 am mum 5 mi ans Rzggr Transit ADD mm lt42qu Rn PC you ALUm Rng Awm add Rngsl m Rngr urged 1 sun mm lt42qu dun PC you ALUm R12 ALUnr suh RegDsl quotLReannPcisel 4 on Rrl lt7Rrsuxniexvlmml PC you ALUm 1 Exhwp z MLvm h Rz 5 RegVVrmPCiszl 10m Rrl Mullll Rrs n39glile mml rc ltrrc u mp 1quotAl m andquot 11 R12 Wr nPC721 1 noRE mumpsKerriemmlanemrs Pcltrrc AU 7 mExth 1quotAl nr HddquotMemWrnPciszl no ifRrs RrldlznPCltr Csiglizxvflmml umelgPCltrPCt CseIEQUALALUm 511 mum um nilnannies nFc sel lt IHOF EEOjthenEQUAL elseo ALUsrc lt oooooo lmerr rega else lrrrrrrea ALUctr lt oooooo lmerr funct 0er then ok my men sub else addquot Extop lt Merrrwr MemtoReg lt Regwr RegDst lt mum um mumma nPc sel lt IHOF aEalmerr EQUAL elseo ALUsrc lt IHOF oooooo l then raga else mulled ALUctr then sub else add Extop 0er men zero else slgrr MemWr lt 1017 smrel MemmReg lt1ol Load Regwr lt If 1101 srorelll 101 EEQnthenoelse1 RegDst ltlrllol LoadHHOF 0Rlmen0else1 mwonrm um vlhnanluc mmmmuwm mmnmnmnlumn armmama Summary 5 steps to design a processor 1 Anaiyzeinstmmuns at awe uirements 2 seem set m atapatn mmpunerts A smiish ciuck methu uiugy a Assemhie atapatn meetingme requirements 4 Anaiyze impiememaiun m eam insmctiuniu aaermme satiny m conirui paintsin etiedsme regiseriranster 5 Assemmeme cummi iugic MIPS makes it easier instructinnssame sze Source registers aiways in me piace immediassamesize iucatmn Operations aiways an registers mme iates quotSingle cycle datapathgt CP cc r gt long Nexttime implementing control mmmmuwm EEL 4768 Computer System Design 2 Lecture 8 Memory Systems 43mm in DAVntersan e uee The Big Picture Where are We Now The Five Classic Components of a Computer Processor Mmmw mammowmm euee Technology Trends from 1st lecture Capacity Speed latency Logic2x in 3years 2x in 3years DRAM 4x in 3 years 2x in 10 years Disk 4x in 3 years 2x in 10 years DRAM Year Size Cycle Time 1930 39 39 64 Kb 39 39 250 ns 1933 256 Kb 220 ns 1936 1 Mb 190 ns 1939 4 Mb 165 ns 1992 16 Mb 145 ns 1995 64 Mb 120 ns 2004 1 Gb 35 ns 43 mm DAV ersan e uee Who Cares About the Memory Hierarchy ProcessorDRAM Memory Gap latency prroc 60lyr 2X15yr 39P essorMemory Performance Gap 9r9ws 50 I year I DRAM WW 9yr 2X10 yrs 1000 Moore s Law 1 O O Performance OHNnvmgerOvtaggggggg a aaaaaaaa 2000 Time 43mm in DAVntersan e uee Today s Situation Microprocessor Rely on caches to bridge gap MicroprocessorDRAM performance gap time ofafull cache miss in instructions executed 1st Alpha 7000 34 nslS0 ns 68 clks X 2 or 2nd Alpha 8400 266 nsl33 ns 80 clks X 4 or 3rd Alphatbd 180 nsl17 ns 108 clks X6or 1I2X latency X 3X clock rate X 3X Instrclock 55X mammowmm euee 136 instructions 320 instructions 648 instructions Impact on Performance Suppose a processor executes at mg Mm Clock Rate 200 MHz 5 ns per cycle 15 H m CPI 11 50 aritniogic 30 idst 20 control Suppose that 10 of memory Dames operations get cyc e W miss penal y CPI ideal CPI average stalls per instruction 11cy 0309 atamopsinsa x 10 miss datamop x 5 cyclemiss 116cycle 15 cycle 58 of the time the processor Is stalled waiting for memory a 1 instruction miss rate would add an additional 05 cycles to the CPI 43 mm DAV ersan e uee Why hierarchy works The Goal illusion of large fast cheap memory FacgflLarge memories are slow fast memories are sm How do we create a memory that is large cheap and fast most of the time Hlerarcny Parallelism mum mm m An Expanded View of the Memory System Speed Fawn S We n est Etggest Cost Htgnst WW mum nAlauml lc The Princlple of Locality Program access arelatlvely small portlon ofthe address space at any Instant of time Probability of referenc U Address space 2quot 391 mmmwam lc Memory Hierarchy of a Modern Computer System Memory Hierarchy How Does it Work Temporal Locality Locality in Time gt Keep most recen y accessed data ltems cl Spatial Locality Locality in Space gt Mov e blocks conslsts of contlguous words to the upperlevels mum mm m oserto the processor Memory Hierarchy Terminology Hit data a ears in some block in the upper level example lock X Hlt Rate thefractlon of memory access found lntne upperlevel Hlt Time Time to access the upper level whlch c n Ists of RA e stlmeTlmeto determrne hltmlss Niss data needs to be retrieved from a block in the lower level Block Y Hlt Rat 9 eto replace a block In the upper level Mlss Rat Mlss Penal Tllll Tlmeto dellvertne block the processor Hit Time ltlt Miss Penalty mum nAlauml lc By taking advantage of the principle of locality Present the userwlth as much memory as Is available In the l cneapesttecnno ogy Provlde access at the speed offered by tnerastest tecnnolo mmmwam lc Teamy sang Disk seems is ms inns manaumsmauuamauns n iytes mm mm mm K5 M5 Gs Ts How is the hierarchy managed Registers ltgt Memo by compiler lprogrammenl cache ltgt memory by the hardware memory ltgt disks by the hardware and operating system wirtual mem oryj by the programmer files imam mmquot iu Memory Hierarchy Technology Random Access Random Is good access time Is the samefor all locations M Dynamic Random Access Memory High density low power cheap slow 7 Low densl high power expenslve fast Static content will last forever39luntll lose power Nonsorandomquot Access Technology Access time varies from location to location and from time to time Examples Disk CDROM Sequenti I Access Technology access time linearin location egTape A wuam imam ic Main Memory Background Performance of Main Memory Latency cache Miss Penalty Accass Time time between request and Word arrives Cycle Time timebetween re uests eandWidth lO amp Large Elock Miss Penalty 1L2 Main Memory is DRAM Dynamic Random Access Memory Dynamic since needsto be refreshed periodically 15 ms Addresses dlvlded into2 halves Memory as a2D matnxj RAE orRowAccess Strobe CA5 or Column Access Snob Cache uses SRAM Static Random Access Memory No refresh 6 transistorsbit vs 1 translstorbltl Address not dIVIded 2E9 DRANIISRAM 4398 CostCycle time SRAMDRAM 8 Random Access Memory RAM Technology Wh do computer designers need to know about RAM tec nology Processorperformance is usually limited by memory bandwidth As lc densities increase lots of memory will fit on processorchip Talloronchlp memory to specific needs 7 Instruction cache What makes RAM different from a bunch of ip ops Densioy RAM is much more den er imam mmquot iu Static RAM Cell sTransistorsRAM cell word word lrow selectl K bit Write 1 Drive bitlines lbit1 blt0 bquot bquot select row mead 1 Precharge bit and bitto Vdd 2 select row a cell pulls one line low 4 sense amp on column detects difference between bit and bit A wuam imam ic Typical SRAM Organizatio 16word x 4bit Wm xapnra 5mm mm 2 mm i mm n mm moanvnmnAmwm ic Logic Diagram of a Typical SRAM Write Enable is usually active low WEL Din and Dout are combined to save pins A new control signal output enable DE L is needed WE L is asserted Low DE L is disasserted High 7 D serves as the data input pin WE L is disasserted High DE L is asserted Low 7 D is the data output pin BothWE L and DE L are asserted Result is unknown Don t do that Althou h could change VHDL to do what desire must 0 the best with what you ve got vs w Memengemnem euce hat you Typical SRAM Timing Write Timing MavistlamDAHtersan euce Problems with SRAM Select 1 bit0 Six transistors use up a lot of area an aim DAVatelsan e uce 1Transistor Memory Cell DRAM Write 1 Drive bit line 2 Select row Read 1 Precharge bit line to Vdd 2 Select row 3 Cell and bit line share charges a Very small voltage changes on the bit line 5 Sense rancy sense am a can detect changes of1milion electrons 5 Write restore the value Refresh 1Just do a dummy read to every cell MawdtamDAVmersan euce row select Classical DRAM Organization square bit data lines Each intersection represents 2 LT DRAM Cell word row select Column Selectoramp column 0 C rcults Address row address 39 Row and Column Address data together Select1 bit atirne uaummoarm uca DRAM logical organization 4 Mbit Column Decoder A0A10 Address Buffer Row Decoder Square root of bits per RASICAS an aim DAVatelsan e uce Out Data In Date Logic Diagram of a Typical DRAM control Signals RAsiL cAsiL WEiL OEiLj are all active low Din and Dout are combined lDl wE Us lumior Lurkmedullaquot n sire mn nllrul am Row and column addresses snaretne same pins lAl m L weslm PlnsA Melmumquot smmms L weslm PlnsA Melmumquot Isululnl m1 a minimum timerrom RAs line Talling to the valid data output Quma1smemmDRAM Af Mnkmtm snm quot t minimum time from the start of one row access to the start of the next um 11nnsfnrWnRAMmhImmsnns v minimum timerrom cAs line Talling to id data output unsignm RAMmmtmmsnns lt A an minimum timerrom the start ofone column access to the start of the next ansfnrWDRAM r m 5 us A so ns it DRAM can airnu mccsmlyuluy nns l Tnesetime o not includetnetimeto drive the addresses off the microprocessor nortne memory controller overnead 13quot um 25a minmimwmmm minimum nan my my mum Main Memory Performance Time mum DRAMA Readwritelc cleTime DRAM lRead ritelAccess ime 1mm DRAMlReadwritelcycleTime mm inns mimenines i w a lull in an only it hlsfntlu fur nmu nn sutumly DRAM lReadwritel Access Time m4 mumquot ynu F vlmynu w mcey l mm m Ices Ilnw ii sum s re as hlsfntlu wlll are mine nme DRAM aandWidtn Limitation analogy Wlm Itqu ii re mm ml or nnm nu Wlemmly A mmmw DI am hi AnaSax DI Main Memory Performance Timing model 8 o 2 to E 8 s s o r 4x16132 Wide 1e1 8 InterleavedMP164x111 ad d Fewer DRAMslSystem over Time rmm MacWiliams in DRAM Generation 86 89 92 96 99 02 w 1Nb4Mb16Mb64Mb256Mb1Gb 4 MB 32 gt 8 M DRAM growth gt 15 4 2 3 MB 60 year 16 MB 8 gt 2 u 32 MB 4 l n 64 MB Memoryper 8 2 E System growth 128 MB 25963096 year 4 gt 1 C E 256 MB 8 2 A wuam Imam Ic Page Mode DRAM Motivation Regular DRAM Organization 33 l Ems Each Mblt access requlres a RAS cAs cycle Fast Page Mode DRAM N X M register to save a row mm 0qu quot IserhilAccm P quot MMrhilAcoess N 2 Fast Page Mode Operation Fast Page Mode DRAM N X M SRAM to savearow After a row is read into the i mwvr Rm register g mums only OAS IS needed to access g other Mblt blocks on that row RAS cAs L Is toggled M 395 om Iserhil Access 2m Mrhil m Mrhil m Mrhil ms I 39I 39I 39I 39I I I I I CA5 I I I I AXRowmidnssx CulMdm x CulMdm x CulMdm x CulMdm 1 Mom Imam Iu DRAM V quot quot Standards pinout package bina comgatibility refresh rate IEEE 54quot bus capacity Sources Multiple Single Fi ures 1 ca aci 1 lbit 1 SPECs eed gr Merit 2 B intrate cy 2 cost p lm rove 1 601 25 1 60 ateyear 2 20 3 quotA 2 little change A wuam Imam Ic Reduce cell size 25 increase die size 15 3 phases engineerin samples first customer shIpFCS mass pro uctIon Fastest F05V mass production wlns share size testing time yield gt pro t Yield 0 redundant rowscolumns to repalrflawsj mpmvnmnAmwm Ic DRAMs capacity 60lyr cost 30lyr 25x cellsarea15 dle size In lt5 years DRAM fab line costs 13 to 23 M only density leakage v speed Rely on increasing no of computers amp memory per computer 60 market SIMM or DIMM IS replaceable unlt gt computers use any generation DRAM Commodit second source industry gt hlgh vo ume low profit conserv th tle organization innovation in 20 years page mode EDo syncn DRAM Order of importance 1 Costbit 1a Capacity RAMEUS 10 EW 30 cost llttle Impact mum in we To second source industry profit conservative rsl Commodi gt high vo ume low Little organization innovation lvs processo in 20 years page mode EDO syncn DRAM DRAM industry at a crossroads FewerDRAMs computerovertime tn hpDRAM5060yr 7 Nathan Myrvold Ms mature software growtn rfor rm growth ME of DRAM 125 50 yri starting to question buying largerDRAMs Aweam imam ic Mulhmix DRAM Revenue per Quarter sznnnn 15nnn 1nnnn ssnnn Sn 1Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q 1Q 2Q 3Q 4Q 1Q 94 94 94 94 95 95 95 95 96 96 96 96 97 ntnl By taking advantage of the prin Present the userwltti 5 much memo cneapesttecnnolo Provide access at the speed offered by the fastest tecnnology ciple of locality ry as is available intne DRAM is slow but cheap and dense 6 SRAM is fast but expensive and not very dense Good cnoiceforproviding the user FAST access time mum in we oiceforpresenting tne userwitn a EIG memory system Summary 5 1 quotTaxquot TWo Different Types of Locality Temporal Locality Locality in Timel If an item is referenced itwill be Mame 39quotquot 5 quot Processor quotA Area Transistors Spatial Locality Locality in Space If an item is referenced items wnose addresses are close by tendto be referenced soon 9059 power Alpha 21164 37 77 61 94 StrongArm SA110 Pentium Pro 64 88 2 dies per package FrocISDS L23 Caches have no inherent value only try to close performance gap Aweam imam ic 30lyear since 1987 13 income profit Example 1 KB Direct Mapped Cache with 32 B Blocks The Art of Me y Fora 2 A N byte cache The uppermos 2 Mi bits are always the cache ag EEL 4768 Computer System Lecture 9 Memory Systems cont mum in m Design 2 meme diam onmt ltaaaargtltaomgtltaaoargt no limit rem wit ooomlze the memory system organization to minimize the average memory access me for wical workloads he iowestM bits are the Eyte select leiock Size 2 A M mpwvnmnl mwm ic mum imam ic Another Extreme Example Fully Associative Extreme Example single big line Cache om che Fully Associa ve Ca Forget aboutthe cache Index Block Size Tradeofr locality BUT Largerblock size means larg few cache blocks In general Average Access Time Hitrim x11 Miss Rahal Mss Penalty x Miss Rate ermiss penalty In general larger block size take advantage of spatial relative to cache size miss ratewni go up Average van on Cache Tag El I mm mm Em n Cache Size 4 bytes Block Size 4 b es only ONE entry inthe cache If an item is accessed likely that it will be accessed a ain soon sed again immediatei xt access will likely to be a iss again 7 continually loading data into the cache but discard force out them before they are used again t ightmare of a cache designer Ping Pong Effect Eut itis unlikely that itwni be acces The he in compare the cache Tags of all cache entnes in parallel Example Elock SI1e 2 E blocks we need 27bit comparators Con ict Miss 0 for a fully associa Miss Miss 55 Penalty R e Exp ails Spatial Locality 1139 Feweiblueks mmeif x sazm 39 cumpmmises a Emma mm Con ict Misses are misses caused by l Different memory locations mapped to the same cache index mck u mck n Ehuk Sin 7 Solution ill ethe cache Slle bl r mm a mm Newmanquot 2 Multiple entries forthe same Cache Index WWWW A Summary on Sources of Cache Misses Disadvantage of Set Associative Cache cold start or proceis migration first c Compulsory reference lrst access to a bio A Twoway Set Associative Cache Nway set associative N entries for each Cache Index N direct mapped caches operates in paralle Example Twoway set associative cache cache I Cache Index selects a set fro The two tags in the set are compared in parallel Data is selected based on the tag resul Cache index Adapts an DAValtersan e uca Nway Set Associative Cache versus Direct Mapped ac e N comparators vs1 Extra MUX delay forthe data Data comes AFTER HitMiss decision and set selection In a direct mapped cache Cache Block Is avallable BEFORE HitMiss Possible o assume a hit and continue Recoverlater if miss Cache index Cold fact of life not a whole lot you can do about it If you are going to run billions orinstmction s are insigni cant Note Compulsory Misse Conflict collision Multiple memorylocations mapped to the s me cache location Solution1increase cache size Solution 2 increase associativity Capacity Cache cannot contain all blocks access by the program Solution increase cache size lnvalidation other process eg lO updates emo AdavedtlamDAVnersan uCa Adamedtlam DAVatersan e uca How Do you Design a Cache Sources of Cache Misses Answer Set of Operations that must be supported Source of Cache Misses Quiz r a ltMemPhysical Address write MemPhysical Address lt Data inside it has TagData Storage Muxes Physical Address Com parators ReadIWrite Data Deterimine the internal register transfers Design the Datapath Design the Cache Controller Null Choices Zero Low Medium High Same quothm M39 m n Cache 2quot Address Controller we a a in Walt Data Out lawmaker uCa melanomam uCa madamme uCa Impact on Cycle Time cache Hitriiiie dlrectly tied to clock rate increases With cache size In reases With associatiVity Average Memom Access time Hit Time Miss Rate X Miss Penalg Time IC X CT X ideal CPI memom stalls Example direct map allows miss signal after data mum mm iu Improving Cache Performance 3 general options 1 Reduce the miss rate 2 Reduce the miss penalty or 3 Reduce the time to hit in the cache A wuam imam ic 4 Questions for Memory Hierarchy 01 Where can a block be placed in the upper level Block placement 02 How is a block found if it is in the upper level Block identification 03 Which block should be replaced on a miss Block replacement 04 What happens on a write Write strmegy mmmuam ic 01 Where can a block be placed in the upper level Block 12 placed in 8 block cache Fully associative direct mapped2way set associative sA Mapping eiock Number Modulo Mumbersa mum mm iu 02 w39 w level Tag on each block N n to check Index or block olfset Increasing associ ty shrinks index expands tag A wuam imam ic Q3 Whirh hlnrk 39 Easy for Direct Mapped Set Associative or Fully Associative LRU LeastRecently Used Associativity 2way 4way 8way Size LRURandomLRURandom LRURandom 16 KB 52 57 47 53 44 50 64 KB 19 20 15 17 14 15 256 KB 115 117113 113 112 112 mmmuam ic 04 39t n Wr e through The information is written to both e oc In e cache and to the block in the lower level memory Wr e back The information is written only to the 5 ocR In the cache The modi ed cache block 39 written to main memory only when it is replaced Is block clean or dlrty Pros and Cons of each WT a nIIsses cannot result In wntes WE no writes of repeated Wnte WT always combined with write buffers so that don39t walt for lower level memory mum Damn m Write Buffer for Write Through Write Mr A Write Buffer is needed between the Cache and Memory Processor wntes data Into the cache and the Write bu39fer Memory controller wnte contents onne bulrerto memory Write buffer is just a FIFO Typlcal number of entnes 4 Works ne It storefrequency1wrttlmeltlt1 DRAerme cycle Memory system designer s nightmare store frequency mm tlmel a 1 DRAM wnte cycle erte buffer saturation mum nAlauml lc Write Buffer Saturation Write am Store frequency wrt time gt 1 I DRAM write cycle If tnls condmon exlstfora long penod oftlnle lcPu cycle umetoo qulck andortoo many store lnstructlons In a row 7 store bufferwlll overflow no matter how big you make It The cPu cycle Tlme lt DRAMerte cycle Tlme Solutlon for write buffer saturation back cacne Install a second level an cacne mmmamm lc Writemiss Policy Write Allocate versus Not Allocate Assume a 16bit write to memory location 0x0 and causes a mlss Do we read In the block 7 Yes erte Allocate No Wnte MotAllocate mum Damn m Recall Levels of the Memory Hierarchy why u pper Level Actor 1 Stlgmg Dart my hat faster cplRagnar 1 s was lt1 quot5 us DngcolIullu 1 n bytes x was 1n 1nn us sm nnlnn mm all n m bytes m M was mm 1 sm nnl 05 512 AR ms on c was quot1 la la can usemem Whytes m Larger 1 Luwer Level m 39539quot mum mam la Basic Issues in Virtual Memory System Design slze of Information blocks that are transferred from secondary to maln storage my block oflnfomlatlon brought Into M and M Is full then some region of M must be released to make room forthe new block gt replacement pollcy wnlcn reglon ofM ls to hold the new block gt placementpollcy mlsslng ltenl fetched from secondary memory only ontne occurrence of afault gt demand load pollcy mem Paging Organization v al and phy lcal address space partltloned Into blocks of equal slze 1 page frames mam may Address Map n 1 virtual address space H gt m m 1 physical address space MAP v M u a address mapping function MAPa a39 if data at virtual addressais present in physical add 39 and a39inM ifdata at virtual address a is not present in M missing item fault Name Space v Secondary Memory physical address as performs this transfer mam an DAVallersan e uca Paging Organization P om m 1024 1 o m A w24 T 5quot 7168 7 1K Physical W 31744 1 1 Virtual Memory Address Mapping 4 1 0 D VA a unit of M mapping K also unit of transfer from Page Table Base Reg actually concatenation is more likely H lto M t bl l td we a e ocae physical ll l physical memory memory address Madam DAVatersan e uca Virtual Address and a Cache Trans Iation It takes an extra memory access to translate VA to PA This makes cache access very expensive and this is the quotinnermost loopquot that you want to go as fast as possible ASIDE Why access cachewith PA at all VA caches have a problem synonym lalias problem two different virtual addresses map to same physical address gt two different cache entries holding data for the same physical address forupdate must update all cache entries with same physical address ormemory becomes inconsistent determining this requires significant hardware essentially an associative lookup on the physical address tags to see if you have multiple hits or software enforced alias boundary same lsb ofVA ampPAgt cache size AdavedtlamDAVmersan euca TLBs A way to speed up translation is to use a special cache of recently used page table entries is has many s but the most name frequently used is Translation Lookaside Bufferor L TLB access time comparable to cache access time much less than main memory access 39 mam an DAVallersan e uca Translation LookAside Buffers 39 cache the TL set associative or direct mapped TL y r n y I a high end machines This permi ciative ts fully asso lookup on these machines Most midd39ange machines use small nway set associative organizations Translation Wi h a TLB Madam DAVatersan e uca Reducing Translation Time Machines with TLBs go one step further to reduce cyclescache access They overlap the cache access with the TLB access orksbecau1sleBhigh order bits of the VA are used to look In the while low order bits are used as index into cache AdavedtlamDAVmersan euca overlapped Cache amp TLB Access IF cacne nlt AND lcacne tag My tnen dellver data to CPU ELSE IF cacne mlss OR lcacnetag PM and TLE hltTHEN mory wltn LE access me the PA from theT ELSE do Standard VA translatlon mum mm lu Problems With overlapped TLB Access overlapped access only works as long as the address blts used to Index lnto the cache do not change as the result of VA translatlon rnls usually llmlts tnlngs to small caches large page slzes ornlgn nway set assoclatlve cacnes Ifyou want a large cacne Example suppose everytnlng the same excepttnattne cacne ls Increased toE K bytes Instead on K lt 11gt 2 rnls blt ls changed 20 12 by VA translatlon but ls needed forcache Vlrtpagerx m lookup Solutlons go to BK byte page slzes go to 2 way set assoclatlve cache or SW guarantee VA1llA1l 1K 2 way set assoc cacne mummum Summary 1 4 The Prin 39ple of Locality Program likely to access a relatlvely small portlon onne address space at any lnstant of tlme 7 Temporal Locallty Locallty ln Time A Spatlal Locallty Locallty ln Space Three Major Categories of Cache Nisses compulsory Mlsses Sad lacts ofllfe Example cold start mlsses confllct Mlsses Increase cache Size andor aSSoclatlvlty NI ntmare Scenano plng pong effect capaclty Mlsses lncrease cacne slze Cache Design Space total Size block Size aSSoclatlvlty replacementpollcy wntellt pollcy lwrltetnrougnwntebackl wntemlss pollcy mpwvnmnAmwm lcl Summary 2 I 4 The Cache Design Space Several interacting dimensions CzcheSize cache Size block Size assoclathlty replacement po cy writethrough VS wnteback erhe allocation The optimal choice is acompromise depen n access cnaractenstlcs Assnciztivity Black Size A workload r usellcacnencacnerLel depends on technology cost and Gun new rm Simplicity often wins LN More mum mm lu Summary 3 I 4 TLB Wrtual Memory Cache TLBs Virtual Mem ryall understoodvlwI examinlng how the deal with 4 questions 1 ere can block be place 2 How Is b ock found 3 What lock ls repalced on mlss 4 How are wntes handled Page tables map virtual address to physical address TLBs are important for fast translation TLB misses are signi cant in processor performance u nny times as most stems can t access all of 2nd evel cache without TL misses mum nAlauml lcl Summary 4 I 4 Memory Hierachy Vlrtual memory was controversial at the time can SW automatlcally manage 64KB across many programs 1000X DRAM growth removed the controversy Today VM allows man rocesses to share single memory without having to swap all processes o disk VlIl pro ectlon ls more Important than memory hlerarchy Today CPU time is afunction of ops cache misses vs just fops W at does this mean to Compilers Data structures Algorlthms mpwvnmnAmwm lcl