New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here


by: Dorothea McGlynn


Dorothea McGlynn
GPA 3.98


Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Class Notes
25 ?




Popular in Course

Popular in ComputerScienence

This 53 page Class Notes was uploaded by Dorothea McGlynn on Monday October 19, 2015. The Class Notes belongs to CS 472 at Oregon State University taught by Staff in Fall. Since its upload, it has received 53 views. For similar materials see /class/224504/cs-472-oregon-state-university in ComputerScienence at Oregon State University.




Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/19/15
Chapter Seven 1998 Morgan Kaufmann Publishers 1 Memories Review SRAM value is stored on a pair of inverting gates very fast but takes up more space than DRAM 4 to 6 transistors DRAM value is stored as a charge on capacitor must be refreshed very small but slower than SRAM factor of 5 to 10 Word line Pass transistor Am D B B Capacitor V Bit line 1998 Morgan Kaufmann Publishers 2 Exploiting Memory Hierarchy Users want large and fast memories SRAM access times are 2 25ns at cost of 100 to 250 per Mbyte 1997 DRAM access times are 60120ns at cost of 50 per Mbyte 2003 Disk access times are 10 to 20 million ns at cost of 002 per Mbyte2003 Try and give it to them anyway build a memory hierarchy CU Increasing distance from the CPU in access time Levels in the memory hierarchy Level n Size of the memory at each level 1998 Morgan Kaufmann Publishers 3 Locality A principle that makes having a memory hierarchy a good idea If an item is referenced temporal locality it will tend to be referenced again soon spatial locality nearby items will tend to be referenced soon Why does code have locality Our initial focus two levels upper lower block minimum unit of data hit data requested is in the upper level miss data requested is not in the upper level 1998 Morgan Kaufmann Publishers 4 Cache Two issues How do we know if a data item is in the cache If it is how do we find it Our first example block size is one word of data quotdirect mappedquot For each item of data at the lower level there is exactly one location in the cache where it might be eg lots of items at the lower level share locations in the upper level 1998 Morgan Kaufmann Publishers 5 Direct Mapped Cache Mapping address is modulo the number of blocks in the cache 00001 00101 01001 01101 10001 10101 11001 11101 Memory 1998 Morgan Kaufmann Publishers 6 Direct Mapped Cache For MIPS 3130u131211u210 20 Data Index Valid 0 1 2 What kind 0fl0caliljy are we taking advantage of 1998 Morgan Kaufmann Publishers 7 Direct Mapped Cache Taking advantage of spatial locality 3115154 3210 12 Byte offs et Block offset 16 bits 128 bits V Tag Data 4K entries 1998 Morgan Kaufmann Publishers 8 Hits vs Misses Read hits this is what we want Read misses stall the CPU fetch block from memory deliver to cache restart Write hits can replace data in cache and memory writethrough write the data only into the cache writeback the cache later Write misses read the entire block into the cache then write the word 1998 Morgan Kaufmann Publishers 9 Hardware Issues Make reading multiple words easier by using banks of memory Memory Memory Memory bank 0 bank 1 bank 3 c Interleaved memory organization b Wide memory organization a On ewordwide memory organization It can get a lot more complicated 1998 Morgan Kaufmann Publishers 1 0 Performance Increasing the block size tends to decrease miss rate 40 35 30 25 20 Miss rate 15 10 5 0 4 16 64 Block size bytes l 1 KB 0 8 KB 0 16 KB 964 KB 256 KB Use split caches because there is more spatial locality in code Block size in Instruction Data miss Effective combined 1998 Morgan Kaufmann Publishers 1 1 Performance Simplified model execution time execution cycles stall cycles cycle time stall cycles of instructions miss ratio miss penalty Two ways of improving performance decreasing the miss ratio decreasing the miss penalty What happens if we increase block size 1998 Morgan Kaufmann Publishers Decreasing miss ratio with associativity dir e ct39 i e39 39 W Block Tag Data 0 Twoway set associative Set Tag Data Tag Data 0 1 2 3 lmU39IwaA Fourway set associative Set Tag Data Tag Data Tag Data Tag Data Eightway set associative fully associative Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Tag Data Compared to direct mapped give a series of references that results in a lower miss ratio using a 2way set associative cache results in a higher miss ratio using a 2way set associative cache assuming we use the least recently used replacement strategy 1998 Morgan Kaufmann Publishers 1 3 An implementation 3130121110983210 4to1 multiplexer Hit Data 1998 Morgan Kaufmann Publishers Performance 15 12 9 6 Miss rate 3 0 I I I I On e way Twoway Fourway Eightway Associativity l 1 KB 0 16 KB I 2 KB 0 32 KB 0 4 KB I 64 KB 0 8 KB 128 KB 1998 Morgan Kaufmann Publishers 1 5 Decreasing miss penalty with multilevel caches Add a second level cache often primary cache is on the same chip as the processor use SRAMs to add another cache above primary memory DRAM miss penalty goes down if data is in 2nd level cache Example CPI of 10 on a 500Mhz machine with a 5 miss rate 200ns DRAM access Adding 2nd level cache with 20ns access time decreases miss rate to 2 Using multilevel caches try and optimize the hit time on the 1st level cache try and optimize the miss rate on the 2nd level cache 1998 Morgan Kaufmann Publishers 1 6 Virtual Memory Main memory can act as a cache for the secondary storage disk Address translation Disk addresses Advantages illusion of having more physical memory program relocation protection 1998 Morgan Kaufmann Publishers 1 7 Pages virtual memory blocks Page faults the data is not in memory retrieve it from disk huge miss penalty thus pages should be fairly large eg 4KB reducing page faults is important LRU is worth the price can handle the faults in software instead of hardware using writethrough is too expensive so we use writeback 3130292827 15141312 111098 3210 Virtual page number I Page offset l Translation 292827 15141312 111098 3210 Physical page number Page offset l Physical address 1998 Morgan Kaufmann Publishers 1 8 Page Tables number Page table Physical page or disk address Physical memory Valid Disk storage 1998 Morgan Kaufmann Publishers 1 9 Page Tables I Page table register I Virtual address 3130 29 28 27 151413121110 9 a 3 21 o Virtual page number I Page offset l 30 12 Valid Physical page number Page table 18 f0 then page is not present in memory 29 23 27 151413121110 9 3 3 21 o Physical page number Page offset Physical address 1998 Morgan Kaufmann Publishers Making Address Translation Fast A cache for address translations translation lookaside buffer Virtual page quot Physical page number Valid Tag Physical memory Page table Page Disk storage 1998 Morgan Kaufmann Publishers 2 1 TLBs and caches rtual address TLB miss exception Write data into cache update the tag and put the data and the address into the write buffer Cache miss sta Deliver data to the CPU 1998 Morgan Kaufmann Publishers Modern Systems Very complicated memory systems a a four way set associative two way set associative B 1 28 entries TLB 84 entries TLB 128 entries 1993 Morgan Kaufmann 2 3 Some Issues Processor speeds continue to increase very fast much faster than either DRAM or disk access times Design challenge dealing with this growing disparity Trends synchronous SRAMs provide a burst of data redesign DRAM chips to provide higher bandwidth or processing restructure code to increase locality use prefetching make cache visible to ISA 1998 Morgan Kaufmann Publishers Chapters 8 amp 9 partial coverage 1998 Morgan Kaufmann Publishers Interfacing Processors and Peripherals lO Design affected by many factors expandability resilience Performance access latency throughput connection between devices and the system the memory hierarchy the operating system A variety of different users eg banks supercomputers engineers Interrupts Processor Memory O bus I O O Main controller controller controller O l memory Disk Disk Graph39cs output 1998 Morgan Kaufmann Publishers IIO Important but neglected The difficulties in assessing and designing lO systems have often relegated NO to second class status courses in every aspect of computing from programming to computer architecture often ignore NO or give it scanty coverage textbooks leave the subject to near the end making it easier for students and instructors to skip it GUILTY we won t be looking at NO in much detail be sure and read Chapter 8 in its entirety you should probably take a networking class 1998 Morgan Kaufmann Publishers IIO Devices Very diverse devices behavior ie input vs output partner who is at the other end data rate 1998 Morgan Kaufmann Publishers IIO Example Disk Drives E Platter To access data seek position head over the proper track 8 to 20 ms avg rotational latency wait for desired sector 5 RPM transfer grab the data one or more sectors 2 to 15 MBsec 1998 Morgan Kaufmann Publishers IIO Example Buses Shared communication link one or more wires Difficult design may be bottleneck length of the bus number of devices tradeoffs buffers for higher bandwidth increases latency support for many different devices cost Types of buses processormemory short high speed custom design backplane high speed often standardized eg PCI lO lengthy different devices standardized eg SCSI Synchronous vs Asynchronous use a clock and a synchronous protocol fast and small but every device must operate at same rate and clock skew requires the bus to be short don t use a clock and instead use handshaking 1998 Morgan Kaufmann Publishers 3 0 Some Example Problems ReadReq Data DataRdy Let s look at some examples from the text Performance Analysis of Synchronous vs Asynchronous Performance Analysis of Two Bus Schemes 1998 Morgan Kaufmann Publishers 3 1 Other important issues Bus Arbitration daisy chain arbitration not very fair centralized arbitration requires an arbiter eg PCI self selection eg NuBus used in Macintosh collision detection eg Ethernet Operating system polling interrupts DMA Performance Analysis techniques queuing theory simulation analysis ie find the weakest link see lO System Design Many new developments 1998 Morgan Kaufmann Publishers Multiprocessors Idea create powerful computers by connecting many smaller ones good news works for timesharing better than supercomputer vector processing may be coming back bad news its really hard to write good concurrent programs many commercial failures Processor Processor I Single bus I V0 N etwork 1998 Morgan Kaufmann Publishers 3 3 Questions How do parallel processors share data single address space SMP vs NUMA message passing How do parallel processors coordinate synchronization locks semaphores built into send recieve primitives operating system protocols How are they implemented connected by a single bus connected by a network 1998 Morgan Kaufmann Publishers Some Interesting Problems Cache Coherency l Processor I l Processor Snoop Cache tag Snoop Cache tag tag and data tag and data Snoop Cache tag tag and data I Single bus Synchronization provide special atomic instructions testandset swap etc Network Topology 1998 Morgan Kaufmann Publishers 3 5 Concluding Remarks Evolution vs Revolution More often the expense of innovation comes from being too disruptive to computer users a Q II 0 0 II n L a 8 3 5 8 9 9 w 5 a e 5 E 8 3 3 9 S E tn 9 8 9 g E 37 E 239 0 D E 3937 E E E gt a E lt 39E 3 E E 8 E E 3 g 8 m D E 3 m o ifn m n n E lt Z 5quot I a 9 E a E lt E o o x n c E 3 0 m 2 2 9 j c w 1 o gt Z 0 w w 1 o n 3 w u 1 w w 2 9 m E E 0 0 o n m m E n o 5 gt n o 0 Z E E n Evolutionary Revolutionary Acceptance of hardware ideas requires acceptance by software people therefore hardware people should learn about software And if software people want good machines they must learn more about hardware to be able to communicate with and thereby influence hardware engineers 1998 Morgan Kaufmann Publishers 3 6 Chapter 5 The Processor Multicycle Implementation C5472 Single Cycle Implementation Review 0 This implementation is combinational It is not a series of steps it all happens in l clock cycle 0 The signals Within the datapath can vary unpredictably during the clock cycle 0 All instructions take as long as the longest one If we added fp instructions w instruction would be m long 0 Each functional unit can be used only once per clock cycle Some functional units must be duplicated C5472 Multicycle Implementation 0 An instruction is a series of steps Each step will take 1 clock cycle Different instructions require a different number of steps clock cycles 0 A functional unit may be used more than once per instruction as long as it is used on different clock cycles A single memory unit for instructions and data A single ALU rather than an ALU and two adders Registers are added after every major functional unit to hold the output of that unit until it is needed in a subsequent clock cycle C5472 3 Multicycle Datapath Instruction PC Address register gt Data I IR gt gt Register Instruction Registers Out or data Memory data register P i t r Data I MDR We have additional registers to hold the output of the major functional units IR MDR A B ALUOut Memory b pRegister C5472 4 During 1 Clock Cycle We want to balance the work in each step so we restrict each step to contain at most 1 of the following 1 memory access instruction fetch or data access I register file access 2 reads or 1 write 0 or 1 ALU operation C5472 5 Saving the Intermediate Results At the end of a clock cycle any data value we need in a later step must be stored in 0 a major state element PC register file or the memory 0 a temporary register written on every clock cycle A B MDR or ALUOut 0 or a temporary register with write control 1R C5472 6 Sharing Functional Units One memory unit for both instructions and data we need a muX to select which memory address to use as an input 0 One ALU instead of 2 adders and an ALU we need need more and bigger muxes for that o More muxes mean more control lines Multiplexors muxes and registers are relatively small compared to memory units and adders so we have saved some chip real estate C5472 7 MIPS Multicycle DiVisions 0 Instruction fetch Decode and register fetch Execution memory address computation or branch completion 0 Memory access or Rtype instruction completion 0 Write back step C5472 8 Step 1 Instruction Fetch IR MemoryPC Since the ALU isn t being used to fetch an instruction multiplex the PC onto 1 side of the ALU and multiplex 4 onto the other side Put the sum back into PC PCPC4 C5472 9 Step 2 Decode Instr Reg Fetch A Reg IR2521 B Reg IR2016 We don t know what the instruction is yet so we do things that we might need later Calculate a new value for PC in case this is a branch instruction ALUOut PC signext IR150 ltlt 2 CS472 10 Step 3 Most of the Work This is the 1st instruction speci c step The ALU is involved in any case Memory ref ALUOut A signext IR150 Rtype ALUOut A op B Branch if AB PC ALUOut Zero signal from ALU used Jump PC PC 3128 IR 250 ltlt 2 CS472 11 Step 4 Mem Ref Rtype Completion Loads and Stores access memory OR an arithmetic logical instruction writes its result Memory reference Load MDR Memory ALUOut or Store Memory ALUOut B R pe Completion Reg IR 1511 ALUOut C5472 12 5 i Memory Read Completion Load is the longest instruction Reg IR2016 MDR C5472 13 Summary Action for Rtype Action for memoryreference Action for Action for Reg lR1511 Load MDR MemoryALUOut ALUOut or Varying Steps Not all steps are involved in all instructions Some may take 3 cycles some may take 5 cycles Steps Rtype Memref Branches Jumps Ld St 1 X X X X X 2 X X X X X 3 X X X X X 4 X X X 5 X C5472 15 Iw t2 0t3 Iw t3 4t3 beq t2 t3 Label assume no branch add t5 t2 t3 sw t5 8t3 What are we doing in the 11th cycle How about the 17th cycle C5472 16 Chapter 5 The Processor Datapath and Control C5472 1 Performance instruction count determined by compiler and instruction set arch CIOCk CYCIC time determined by the implementation CPI of the processor We will develop 2 different implementations of the MIPS instruction set 0 Single cycle Multicycle C5472 2 Our Implementation includes a subset of the core MIPS instruction set memory reference instructions lw and sw arithmeticlogical instructions add sub and or and slt branch instructions beq and j doesn t include integer multiplication or division oating point instructions C5472 3 Overview 1 Fetch the instruction from memory 2 Decode the instruction Read 1 Any instruction or 2 registers us1ng fields of the class instruction 1W requires just 1 reg 3 Address calculation operation exec or branch comparison Instruction class 4 Memory access Speci c 5 Write back C5472 4 Two Types of Logic Elements 0 elements that operate on data values like ALU combinational must wait for signals to settle propagation delays no internal storage elements that contain state like registers and mem some internal storage memory have a data input and a clock input clock is used to determine when data is written we assume an edgetriggered clocking methodology state elements can be read at any time Digital Logic Review State element combinational element Signals coming out of combinational logic must be settled before the rising edge of the clock The propagation delay times in the combinational logic will determine the clock period C5472 6 Digital Logic Review State element Combinational l We can read the contents of a register send the value through some combinational logic and write that register all in the same clock cycle because we are using edgetriggering With edgetriggering there is no feedback within a single cycle and the above circuit operates correctly C5472 7 Single Cycle Instr Implementation 0 Every instruction begins execution on one clock edge and completes execution on the next clock edge 0 All logic is combinational 0 Control circuitry is relatively simple 0 CPI is 10 by de nition but clock period is high 0 Load instr takes the longest It determines the clock period 0 Wastes chip area need multiple ALU s C5472 8 Register File built using D ip ops Feed 1235121 nunbevl Read registev mmbevl Read aweauuaai dam Read 72 my rumba magma W012 veglstev Read veglstev Rm nunbevZ Wm d 2 9 data Wm Read dataZ C5472 Functional Units lnslmman addvess pc lnslm mn mamaan memaW Memes a lnstm mnmemmv in ngmm mumev 9 may Addvess mu m th2 Data data quotternary MemFeai a Dmamemawmn mm 3 Ramsey h ALU C5472 h Slgnrextensmnunn Put Datapath Together Registers l Read I Memvvme h P ad regiSteM P M address Read dam Memeeg registerZ instruction Write Read Read i stmction register data 2 data M memory vpfe Data x 5quot Write memory RegWHt data The control lmes 6 Sig 32 extend come from decoding instructlon 31 26 and 5 0 c5472 11 RegDst 31 26 mammal Read reg 1 Read reg 2 dRead Instruction am 1 31 0 Re isters g Read Write reg data 1 Write data Sign R Type Instruction p A Fetch from instruction memory and increment PC N Read the 2 source registers from the register file Main Control unit decodes the op code to determine the control line settings The ALU operates on the data Result is written into a register PC is also updated at the end of this step 9 CS472 13 add Instruction add 1 2 3 Rfonnat 3126 2521 2016 1511 106 50 0 2 3 1 0 32 Trace its datapath Which functional units are used for this instruction C5472 14 1w Instruction 1W 1 1002 Iformat 3126 2521 2016 150 35 2 1 100 Trace its datapath How is sw different beq Instruction beq 1192100 Iformat 3126 2521 2016 150 1 1 2 25 Trace its datapath C5472 16 Summary Using multiplexers we can tie together all the required datapaths for all the instructions The value of the control signals is based only on the instruction being executed and can be implemented with combinational logic The cycle time for the singlecycle implementation is determined by the worstcase path for all instr A machine with more powerful instructions fp would result in an imbalance in the amount of time required Thus the singlecycle implementation is not very practical C5472 17


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Steve Martinelli UC Los Angeles

"There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."

Anthony Lee UC Santa Barbara

"I bought an awesome study guide, which helped me get an A in my Math 34B class this quarter!"

Jim McGreen Ohio University

"Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."


"Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.