These 168-page class notes were uploaded by Vito Kilback on Wednesday, September 23, 2015. They belong to CS282 (Systems Architecture II) at Drexel University, taught by Staff in the Fall term.
Systems Architecture II
Lecture 1: Performance Evaluation and Benchmarking
This lecture was derived from material in the text, Chapter 2. Notes courtesy of Jeremy R. Johnson.

Introduction
Objective: to quantify performance, relate performance to design parameters, and understand the role of benchmarking.
Execution Time (sec) = (Inst/Program) x (Cycles/Inst, i.e., CPI) x (Sec/Cycle)
Topics: performance definition; performance parameters and equation; benchmarking; fallacies and pitfalls; Amdahl's law.

Performance Definition
Response time, throughput, cost.
Performance_X = 1 / Execution Time_X
"X is n times faster than Y" means Performance_X / Performance_Y = n.

Airplane            Passengers   Range (mi)   Speed (mph)   Throughput (passenger-mph)
Boeing 777          375          4630         610           228,750
Boeing 747          470          4150         610           286,700
BAC/Sud Concorde    132          4000         1350          178,200
Douglas DC-8-50     146          8720         544           79,424

Measuring Performance
Execution time: wall-clock (elapsed) time, or CPU time (system vs. user); limited accuracy.
Instruction count: simulator or hardware counters.
Cycle count: simulator or hardware counters.
Memory performance: simulator or hardware counters.

Performance Parameters and Equation
Instruction count depends on the program, the compiler (optimization flags), and the instruction set architecture.
Cycles per instruction (CPI) depends on the implementation of the architecture: datapath, pipelining, parallelism, etc.
Clock rate depends on the implementation, design, and technology.
Execution Time (sec) = (Inst/Program) x (Cycles/Inst, CPI) x (Sec/Cycle)

Performance Equation Example
Suppose we have two implementations of the same instruction set architecture. Machine A has a clock cycle time of 1 ns and CPI 2.0; machine B has a clock cycle time of 2 ns and CPI 1.2. Which machine is faster for this program?
CPU_A = I x 2.0 x 1 ns = 2.0 x I ns
CPU_B = I x 1.2 x 2 ns = 2.4 x I ns
Perf_A / Perf_B = 2.4 / 2.0 = 1.2, so A is 1.2 times faster than B.
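The machine A vs. machine B comparison above can be checked numerically. A minimal sketch (the instruction count I is arbitrary, since it cancels in the ratio):

```python
def cpu_time_ns(instructions, cpi, cycle_time_ns):
    # Execution time = (Inst/Program) x (Cycles/Inst) x (Sec/Cycle)
    return instructions * cpi * cycle_time_ns

I = 1_000_000                          # any instruction count works
time_a = cpu_time_ns(I, 2.0, 1.0)      # machine A: CPI 2.0, 1 ns cycle
time_b = cpu_time_ns(I, 1.2, 2.0)      # machine B: CPI 1.2, 2 ns cycle
print(time_b / time_a)                 # 1.2 -> A is 1.2 times faster
```

Note that the machine with the lower CPI is not the faster one here; only the full product of the three factors decides.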
Comparing Code Segments
Compare the efficiency of two code sequences, given the CPI for each instruction class: A = 1, B = 2, C = 3.
Instruction counts per class:
Sequence 1: A = 2, B = 1, C = 2 (5 instructions)
Sequence 2: A = 4, B = 1, C = 1 (6 instructions)
CPU clock cycles (1) = 2x1 + 1x2 + 2x3 = 10 cycles
CPU clock cycles (2) = 4x1 + 1x2 + 1x3 = 9 cycles
CPI (1) = 10 cycles / 5 instructions = 2.0 cycles/inst
CPI (2) = 9 cycles / 6 instructions = 1.5 cycles/inst
The second sequence is better, even though it has more instructions.

Benchmarking
Use sample programs that approximate actual usage. Beware of small artificial kernel benchmarks and synthetic benchmarks (Whetstone, Dhrystone). Beware of "peak performance" claims and of parameters other than execution time (e.g., program size, MIPS). Make sure results are reproducible.
SPEC (System Performance Evaluation Corporation): a collection of real-world integer and floating-point programs (http://www.specbench.org). CPU95 (SPECint95, SPECfp95) originated in 1989; CPU2000 adds graphics, web, and other benchmarks.

[Figure: SPEC95 benchmark list; not legible in these notes.]

SPEC95: doubling the clock rate does not double performance.
[Figure: SPECint and SPECfp vs. clock rate (50-250 MHz) for the Pentium and Pentium Pro.]

SPEC89: compiler enhancements and performance.
[Figure: SPEC performance ratios for doduc, nasa7, eqntott, matrix300, fpppp, and tomcatv with the base and enhanced compilers; matrix300 shows by far the largest compiler-driven gain.]

Summarizing Results
Example:       Computer A   Computer B
Program 1      1            10
Program 2      1000         100
Total time     1001         110
Perf_B / Perf_A = 1001 / 110 = 9.1
Possible summaries: total execution time; arithmetic mean (possibly weighted); geometric mean of ratios (used by SPEC).
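The summary statistics above can be sketched in a few lines. Note how the geometric mean of the per-program ratios reaches a different verdict than total time for the same data, which is why the choice of summary matters:

```python
import math

times_a = {'program1': 1, 'program2': 1000}    # seconds on computer A
times_b = {'program1': 10, 'program2': 100}    # seconds on computer B

total_a = sum(times_a.values())                # 1001
total_b = sum(times_b.values())                # 110
by_total = total_a / total_b                   # ~9.1: B is 9.1x faster by total time

# Geometric mean of per-program ratios (the SPEC approach):
ratios = [times_a[p] / times_b[p] for p in times_a]   # 0.1 and 10
geo = math.prod(ratios) ** (1 / len(ratios))          # 1.0: the machines tie
```

By total execution time B wins decisively, but the geometric mean of ratios calls the machines equal; the geometric mean is independent of which machine is used as the reference, which is why SPEC uses it for normalized ratios.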
SPECint95: geometric mean of ratios relative to a SPARCstation 10 Model 40.
[Table: SPECint95 base and peak ratios for Dell Precision Workstation 410, 420, and 610 systems at clock rates from 450 to 733 MHz; ratios range from about 13.4 to 35.8.]

Amdahl's Law
Execution time after improvement = execution time unaffected + (execution time affected / amount of improvement).
Example: suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? How about making it 5 times faster?
(Running 4 times faster means 25 seconds total, so 20 + 80/n = 25 gives n = 16. Running 5 times faster would mean 20 seconds total, which is impossible: even an infinite multiply speedup leaves the unaffected 20 seconds.)
Principle: make common cases fast.

Remember
Performance is specific to particular programs; total execution time is a consistent summary of performance. For a given architecture, performance increases come from: increases in clock rate (without adverse CPI effects); improvements in processor organization that lower CPI; and compiler enhancements that lower CPI and/or instruction count. Pitfall: expecting improvement in one aspect of a machine's performance to improve total performance proportionally.

Systems Architecture II
Lecture 2: Overview of Pipelining; Pipelined Datapath and Control
This lecture was derived from material in the text, Chapter 6. Notes courtesy of Jeremy R. Johnson.

Topic 1:
Overview of Pipelining

Introduction
Objective: to understand pipelining and the enhanced performance it provides. Pipelining is an implementation technique in which multiple instructions are overlapped in execution. Instructions are broken down into stages; while one instruction is executing one stage, another instruction can simultaneously execute another stage.
Topics: introduction to pipelining; speedup; designing instruction sets for pipelining; pipeline hazards; superscalar and dynamic pipelining.

A Simple Example
Stages in the laundry example: wash clothes, dry clothes, fold clothes, put clothes away. A single load takes the same time as before; speedup requires multiple loads. While drying the first load, one can begin washing the second. When the first load is being folded, the second can be dried while a third begins washing. When the first load is being put away, four loads are in progress simultaneously. With sufficiently many loads, the total time is reduced by a factor of 4.

[Figure: laundry pipelining timing diagram.]

Pipeline Stages for MIPS Instruction Execution
There are five stages:
1. Fetch the instruction from memory
2. Read registers while decoding the instruction
3. Execute the operation or calculate an address
4. Access an operand in data memory
5. Write the result into a register
Delays for the functional units: 2 ns for a memory access, 2 ns for an ALU operation, 1 ns for a register file read or write.

Pipelined Execution for the Single-cycle MIPS Implementation
Instructions: lw, sw, add, sub, and, or, slt, beq.

Instruction class                   Inst fetch   Reg read   ALU op   Data access   Reg write   Total
Load word (lw)                      2 ns         1 ns       2 ns     2 ns          1 ns        8 ns
Store word (sw)                     2 ns         1 ns       2 ns     2 ns                      7 ns
R-format (add, sub, and, or, slt)   2 ns         1 ns       2 ns                   1 ns        6 ns
Branch (beq)                        2 ns         1 ns       2 ns                               5 ns

Timing Diagram
[Figure: three lw instructions, non-pipelined vs. pipelined. Non-pipelined, each lw takes 8 ns and the next starts only when the previous finishes; pipelined, a new lw starts every 2 ns.]
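The timing diagram above can be reproduced arithmetically; a minimal sketch using the figures from the lw table (8 ns per non-pipelined instruction, five 2 ns pipeline stages):

```python
def nonpipelined_time_ns(n_instructions, inst_time_ns=8):
    # Each lw takes 8 ns start to finish; instructions run back to back.
    return n_instructions * inst_time_ns

def pipelined_time_ns(n_instructions, stages=5, stage_time_ns=2):
    # The first instruction occupies all five stages; each later instruction
    # finishes one stage time (2 ns) after its predecessor.
    return (stages + n_instructions - 1) * stage_time_ns

print(nonpipelined_time_ns(3))   # 24 ns for three lw instructions
print(pipelined_time_ns(3))      # 14 ns, as in the pipelined diagram
```

For many instructions the ratio approaches 8/2 = 4, not 5: the pipelined clock must accommodate the slowest stage (2 ns), while the non-pipelined 8 ns instruction time already omits the unused stage slack.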
Speedup
Pipelined time = startup time + number of operations x time per stage = (number of stages + number of operations - 1) x time per stage.
Speedup = non-pipelined time / pipelined time. If the sequential time per operation = time per stage x number of stages, then
Speedup = (S x N x t) / ((S + N - 1) x t) ≈ S for large N,
where S = number of stages, N = number of operations, and t = time per stage.

Designing Instruction Sets (MIPS) for Pipelining
Want to break instruction execution into a reasonable number of stages of roughly equal complexity.
All instructions are the same length: easier to fetch and decode.
Few instruction formats, with the source register fields always in the same place: register reads can begin while the instruction is still being decoded.
Memory operands appear only in loads and stores: the address is calculated during the execute stage and memory is accessed in the following stage; otherwise execution would have to expand into separate address, memory, and execute stages.
Operands must be aligned in memory: a single data transfer instruction never requires two data memory accesses, so it fits in a single pipeline stage.

Pipeline Hazards
Situations in which the next instruction cannot execute in the following clock cycle:
Structural hazards: the hardware cannot support the combination of instructions that we want to execute in the same cycle.
Control hazards: we need to make a decision based on the result of one instruction while others are executing.
Data hazards: an instruction depends on the result of a previous instruction still in the pipeline.
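The cost of hazards can be folded into the performance equation from Lecture 1 as extra stall cycles per instruction. A sketch; the frequencies and penalty below are illustrative assumptions, not figures from the text:

```python
def effective_cpi(base_cpi, hazard_freq, stall_prob, penalty_cycles):
    # Each hazard that actually stalls the pipeline adds penalty_cycles
    # of bubbles, amortized over all instructions.
    return base_cpi + hazard_freq * stall_prob * penalty_cycles

# Assumed numbers: 20% of instructions are branches, 10% of them stall
# the pipeline, each stall costs 2 cycles.
print(effective_cpi(1.0, 0.20, 0.10, 2))   # 1.04
```

Even a modest stall rate pushes the ideal CPI of 1.0 upward, which is why the hazard-avoidance techniques below matter.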
Structural Hazards
Problem: a conflict over resources. Example: suppose instruction and data memory were shared in the single-cycle pipeline; a data access would conflict with an instruction fetch. Solution: remove the conflicting stages, redesign to separate the resources, or replicate the resources.

Control Hazards
Problem: the next instruction to enter the pipe may depend on the currently executing instruction, or we may have to wait until a stage completes to determine the next instruction. Example: a branch instruction.
Solutions:
Stall: operate sequentially until the decision can be made (wastes time).
Predict: guess what to do next. If the guess is correct, operate normally; if it is wrong, clear the pipe and begin again.

[Figure: timing diagram with a stall; the instruction after beq waits until the branch is decided.]

[Figure: timing diagram with branch prediction; on a correct prediction the pipeline proceeds without a bubble, on a misprediction the fetched instruction is discarded.]

Delayed Branch
Always execute the next sequential instruction, with the branch taking effect after that one-instruction delay. The compiler tries to put an instruction in the branch delay slot that is always executed, independent of the branch outcome; compilers typically fill about 50% of branch delay slots.

[Figure: timing diagram for a delayed branch; an add fills the delay slot after beq.]

Data Hazards
Problem: an instruction depends on the result of a previous instruction still in the pipeline. Example:
add $s0, $t0, $t1
sub $t2, $s0, $t3
Solutions: forwarding (bypassing); instruction reordering to remove dependencies.
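The dependence in the add/sub pair above is a read-after-write (RAW) hazard, and detecting one is just a register-name comparison. A minimal sketch (the tuple encoding of instructions is an assumption for illustration):

```python
def raw_hazard(producer, consumer):
    # producer and consumer are (dest, src1, src2) register-name tuples.
    dest, _, _ = producer
    _, src1, src2 = consumer
    return dest in (src1, src2)

add_inst = ('$s0', '$t0', '$t1')        # add $s0, $t0, $t1
sub_inst = ('$t2', '$s0', '$t3')        # sub $t2, $s0, $t3
print(raw_hazard(add_inst, sub_inst))   # True: sub reads $s0 before add writes it back
```

Forwarding hardware performs essentially this comparison between the pipeline registers of adjacent stages to decide when to route an ALU result directly to a following instruction.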
Timing Diagram (forwarding)
[Figure: the ALU result of add is forwarded directly to sub; in the load-use case (lw $s0, 20($t1) followed by sub $t2, $s0, $t3), the loaded value must be forwarded after a one-cycle stall.]

Summary
Pipelining increases the number of simultaneously executing instructions and the rate at which instructions are started and completed. Pipelining does not reduce the time it takes to complete an individual instruction: in the five-stage pipeline it still takes 5 clock cycles per instruction. Pipelining improves instruction throughput rather than individual instruction execution time. Good pipeline design requires an appropriate number of stages of comparable complexity (speedup ≈ number of stages). A pipeline must cope with structural, control, and data hazards; techniques include branch prediction, forwarding, and stalls.

Improving Pipeline Performance
Superpipelining: very long pipelines.
Superscalar: replicate resources so that multiple instructions can execute in the different stages of the pipeline; issue multiple instructions in the same cycle.
Dynamic pipeline scheduling: dynamically reorder instructions to avoid hazards; usually combined with resource replication.
lw   $t0, 20($s2)
addu $t1, $t0, $t2
sub  $s4, $s4, $t3
slti $t5, $s4, 20

Superscalar MIPS
Issue two instructions per cycle: one ALU operation or branch, and one load or store. This requires fetching and decoding 64 bits per cycle. To simplify the hardware, require that the instructions be paired and aligned.

Superscalar Example
Loop: lw   $t0, 0($s1)        # $t0 = array element
      addu $t0, $t0, $s2      # add scalar in $s2
      sw   $t0, 0($s1)        # store result
      addi $s1, $s1, -4       # decrement pointer
      bne  $s1, $zero, Loop   # branch if $s1 != 0

Scheduled for dual issue:
ALU or branch inst           Data transfer inst    Clock cycle
Loop:                        lw  $t0, 0($s1)       1
addi $s1, $s1, -4                                  2
addu $t0, $t0, $s2                                 3
bne  $s1, $zero, Loop        sw  $t0, 4($s1)       4

Loop Unrolling
Unroll several iterations of the loop to provide additional work. Introduce additional temporaries to remove artificial dependencies.
The previous example achieves a speedup of 2:

ALU or branch inst             Data transfer inst    Clock cycle
Loop: addi $s1, $s1, -16       lw  $t0, 0($s1)       1
                               lw  $t1, 12($s1)      2
addu $t0, $t0, $s2             lw  $t2, 8($s1)       3
addu $t1, $t1, $s2             lw  $t3, 4($s1)       4
addu $t2, $t2, $s2             sw  $t0, 0($s1)       5
addu $t3, $t3, $s2             sw  $t1, 12($s1)      6
                               sw  $t2, 8($s1)       7
bne  $s1, $zero, Loop          sw  $t3, 4($s1)       8

Dynamic Pipeline Scheduling
Go past stalls to find later instructions to execute while waiting for the stall to be resolved: out-of-order execution with in-order completion. Speculative execution combines dynamic scheduling with branch prediction. Register renaming removes artificial dependencies and can accomplish automatic loop unrolling.

Design of a Dynamically Scheduled Pipeline
[Figure: an in-order instruction fetch, issue, and decode unit feeds reservation stations in front of the functional units (integer units, load, store); execution is out of order, and an in-order commit unit retires results.]

Topic 2: Pipelined Datapath and Control

Introduction
Objective: to understand the modifications to the datapath and control of the single-cycle implementation of MIPS that are needed to support pipelining. Key idea: separate the datapath into 5 pieces, one for each stage of the pipeline, and add registers between the pieces to hold the information specific to the currently executing instruction. Note: we ignore the difficulties due to hazards in this lecture.
Topics: review of the single-cycle implementation of MIPS; pipelined datapath; graphical representation of pipelines (multiple-clock-cycle and single-clock-cycle pipelining diagrams); adding control to the pipelined datapath.

Pipeline Stages for MIPS Instruction Execution
There are five stages:
IF: instruction fetch
ID: instruction decode and register file read
EX: execution or address calculation
MEM: data memory access
WB: write back

Single-cycle Datapath
IF: instruction fetch. ID: instruction decode / register file read. EX: execute /
address calculation. MEM: memory access. WB: write back.
[Figure: the single-cycle datapath with instruction memory, register file, ALU, and data memory.]

Instruction Execution
[Figure: multiple-clock-cycle pipeline diagram showing lw instructions flowing through the stages, one stage per clock cycle.]

Pipeline Registers
[Figure: the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.]

Load Instruction (stages 1-3)
[Figure: lw instruction fetch, decode, and execute in the pipelined datapath.]

Load Instruction (stages 4-5)
[Figure: lw memory access and write back.]

Store Instruction (stages 3-5)
[Figure: sw execute, memory access, and (empty) write back.]

Correction to Datapath for Load
[Figure: the write register number must be carried along through the pipeline registers to MEM/WB, rather than taken from the instruction currently being decoded.]

Datapath with Control Signals
[Figure: the pipelined datapath with control signals added.]

ALU Control
Instruction   ALUOp   Operation     funct    ALU action   ALU control
lw            00      load word     xxxxxx   add          010
sw            00      store word    xxxxxx   add          010
beq           01      branch eq     xxxxxx   subtract     110
R-type        10      add           100000   add          010
R-type        10      sub           100010   subtract     110
R-type        10      and           100100   and          000
R-type        10      or            100101   or           001
R-type        10      slt           101010   slt          111

Control Signals
RegDst: selects the register destination number; if deasserted, the rt field (bits 20:16); if asserted, the rd field (bits 15:11).
RegWrite: write to the register file when asserted.
ALUSrc: selects the second ALU input; if deasserted, the second register file output; if asserted, the sign-extended lower 16 bits of the instruction.
PCSrc: selects the input to the PC; if deasserted, PC+4; if asserted, the branch target.
MemRead: read from memory when asserted.
MemWrite: write to memory when asserted.
MemtoReg: selects the source of the register write data; if deasserted, the ALU output; if asserted, the data memory output.

Passing Control through the Pipeline
Control signals are grouped by the stage that uses them: execution/address calculation (ALU controls), memory access (memory controls), and write back (register write controls). Each instruction's control bits are carried along with it in the
IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.

Datapath with Control Signals
[Figure: the full pipelined datapath with control passed through the pipeline registers.]

Example
lw  $10, 20($1)
sub $11, $2, $3
and $12, $4, $5
or  $13, $6, $7
add $14, $8, $9

[Figures, clocks 1-9: single-clock-cycle pipeline diagrams showing the five instructions entering and draining the pipeline. In clock 1, lw is in IF; by clock 5 all five instructions occupy the pipeline (IF: add, ID: or, EX: and, MEM: sub, WB: lw); by clock 9, add completes write back.]

Systems Architecture II
Lecture 6: Interfacing I/O Devices to Memory, Processor, and Operating System; Memory-mapped I/O and Interrupts in SPIM
This lecture was derived from material in the text, Chapter 8. All figures from Computer Organization and Design: The Hardware/Software Approach, Second Edition, by David Patterson and John Hennessy, are copyrighted material (copyright 1998, Morgan Kaufmann Publishers, Inc.; all rights reserved). Notes courtesy of Jeremy R. Johnson.

Topic 1: Interfacing I/O Devices to Memory,
Processor, and Operating System

Introduction
Objective: to learn how an I/O device communicates with a user program. How is a user I/O request transformed into a device command and communicated to the device? How is data actually transferred to or from a memory location? What is the role of the operating system?
Topics: role of the OS; giving commands to the I/O system (I/O commands, memory-mapped I/O); communicating with the processor (polling, interrupts); transferring data between a device and memory (direct memory access, DMA); designing an I/O system.

Characteristics of I/O
The responsibilities of the OS arise from three characteristics of I/O systems:
The I/O system is shared by multiple programs using the processor.
I/O systems often use interrupts (externally generated exceptions) to communicate information about I/O operations. Because interrupts cause a transfer to kernel (supervisor) mode, they must be handled by the OS.
The low-level control of an I/O device is complex, because it requires managing a set of concurrent events and because the requirements for correct device control are often very detailed.

Functions of the OS
The OS guarantees that a user's program accesses only the portions of an I/O device to which the user has rights.
The OS provides abstractions for accessing devices by supplying routines that handle low-level device operations.
The OS handles the interrupts generated by I/O devices.
The OS tries to provide equitable access to the shared I/O resources, and schedules accesses to enhance system throughput.

Types of Communication Required
The OS must be able to give commands to the I/O device (e.g., read, write, disk seek).
The device must be able to notify the OS when it has completed an operation or has encountered an error.
Data must be transferred between memory and an I/O device.

Giving Commands to I/O
Devices
Dedicated I/O instructions (e.g., Intel 80x86): the command and device number are specified in the instruction; the processor communicates the device address via a set of wires included as part of the I/O bus; these instructions are illegal to execute while in user mode.
Memory-mapped I/O: portions of the address space are assigned to I/O devices. Commands and data are written to special addresses, and data and status information are read from special addresses. The memory system ignores the operation (as determined by the address); the I/O controller sees the operation and transmits it to the device.

Communicating with the Processor
Polling: the simplest way for an I/O device to communicate with the processor. The I/O device simply puts information in a status register, and the processor must come and get it: the processor periodically checks the status bits to see if it is time for the next I/O operation.
Interrupt-driven I/O: the disadvantage of polling is that it wastes a lot of time. Instead, when a device wants to notify the processor that it has completed some operation or needs attention, it causes the processor to be interrupted. An interrupt is similar to an exception, except that it is asynchronous with respect to instruction execution; the processor must be notified of the device causing the interrupt; and interrupts must be prioritized according to the devices that caused them.

Overhead of Polling
Determine the impact of polling on three different devices. Assume 400 cycles per polling operation and a 500 MHz clock. Determine the fraction of CPU time consumed in the following three cases, assuming that you poll often enough that no data is lost and that the devices are potentially always busy:
Mouse: must be polled 30 times per second.
Floppy disk: transfers data to the processor in 16-bit units at a rate of 50 KB/sec.
Hard disk: transfers data in 4-word chunks at 4 MB/sec.

1. Mouse: 30 accesses per second. 30 x 400 = 12,000
cycles per second for polling. Fraction of processor clock cycles = 12 x 10^3 / (500 x 10^6) ≈ 0.002%.
2. Floppy drive: 50 KB/sec at 2 bytes per polling access = 25K accesses/sec. 25K x 400 = 10^7 cycles per second for polling. Fraction of processor clock cycles = 10 x 10^6 / (500 x 10^6) = 2%.
3. Hard drive: 4 MB/sec at 16 bytes per polling access = 250K accesses/sec. 250K x 400 = 10^8 cycles per second for polling. Fraction of processor clock cycles = 100 x 10^6 / (500 x 10^6) = 20%.

Transferring Data between a Device and Memory
Using polling: initiate the transfer and periodically check for completion, or periodically check for updates from the device (e.g., a mouse).
Interrupt-driven: the OS initiates the transfer and waits for an interrupt to indicate that the transfer has completed or an error has occurred. The OS still transfers data in small chunks and must handle interrupts many times during the complete I/O operation.
Direct memory access (DMA): also interrupt-driven, but the transfer is controlled by the device without intervention by the OS; an interrupt occurs only when the entire transfer is complete or an error occurs. Appropriate for high-bandwidth devices with relatively large blocks of data.

Overhead of Interrupt-Driven I/O
Assume the hard disk transfers data in 4-word chunks at 4 MB/sec, with a 500 MHz clock; the overhead of a transfer, including the interrupt, is 500 cycles; and the disk is transferring data only 5% of the time. The interrupt rate when the disk is busy equals the polling rate: 250K x 500 = 125 x 10^6 cycles per second for the disk, a fraction of 125 x 10^6 / (500 x 10^6) = 25% of processor clock cycles. Since the disk is transferring only 5% of the time, the fraction of processor clock cycles = 25% x 5% = 1.25%. Compared to polling, the absence of overhead when the disk is not active is the major advantage of an interrupt-driven interface.
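The polling and interrupt overhead calculations above follow one pattern (events per second times cycles per event, divided by the clock rate), which a short sketch makes explicit:

```python
CLOCK_HZ = 500e6   # 500 MHz clock from the examples above

def overhead_fraction(events_per_sec, cycles_per_event):
    return events_per_sec * cycles_per_event / CLOCK_HZ

# Polling, 400 cycles per poll:
mouse  = overhead_fraction(30, 400)          # ~0.002% (2.4e-5)
floppy = overhead_fraction(50e3 / 2, 400)    # 2%: 16-bit (2-byte) units at 50 KB/s
disk   = overhead_fraction(4e6 / 16, 400)    # 20%: 4-word (16-byte) chunks at 4 MB/s

# Interrupt-driven disk I/O, 500 cycles per interrupt, disk busy 5% of the time:
disk_intr = overhead_fraction(4e6 / 16, 500) * 0.05   # 1.25%
```

The hard-disk case shows the crossover: polling always pays 20%, while interrupts pay 25% only while the disk is busy, for 1.25% on this workload.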
Overhead of DMA
Assume the hard disk transfers data in 4-word chunks at 4 MB/sec, with a 500 MHz clock; the initial DMA setup takes 1,000 cycles; and the overhead of the interrupt at completion is 500 cycles. If the average transfer is 8 KB, what fraction of the CPU is consumed if the disk is active 100% of the time (ignoring processor/DMA-controller bus contention)?
Each DMA transfer takes 8 KB / (4 MB/sec) = 2 x 10^-3 sec.
Cycles/sec for the disk = (1000 + 500 cycles/transfer) / (2 x 10^-3 sec/transfer) = 750 x 10^3 clock cycles/sec.
Fraction of processor clock cycles = 750 x 10^3 / (500 x 10^6) = 0.15%.

Issues with DMA
With DMA there is another path to memory, which raises difficulties with virtual memory and caches.
Should physical or virtual addresses be used? If virtual, the DMA unit must translate them to physical addresses. If physical, we must ensure that a transfer does not cross page boundaries, since the physical addresses would not be contiguous; the transfer can be broken into a sequence of page-sized transfers. The OS must not remap memory during a DMA transfer.
The value of a memory location as seen by the DMA unit and by the processor may differ: stale data, or a coherency problem (the value in the cache differs from memory). This is solved by routing DMA through the cache or by flushing the cache.
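The DMA overhead example above can be sketched as a function; note that, like the text, it treats 8 KB as 8 x 10^3 bytes so that the transfer time comes out to exactly 2 x 10^-3 sec:

```python
CLOCK_HZ = 500e6

def dma_overhead_fraction(transfer_bytes, rate_bytes_per_sec,
                          setup_cycles, interrupt_cycles):
    transfer_time = transfer_bytes / rate_bytes_per_sec       # sec per DMA transfer
    cycles_per_sec = (setup_cycles + interrupt_cycles) / transfer_time
    return cycles_per_sec / CLOCK_HZ

# 8 KB transfers at 4 MB/sec, 1000-cycle setup, 500-cycle completion interrupt:
frac = dma_overhead_fraction(8e3, 4e6, 1000, 500)   # 0.0015, i.e. 0.15%
```

Because the fixed per-transfer cost is amortized over a large block, DMA overhead falls as transfers grow; halving the block size here would double the fraction.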
Designing an I/O System
Goals: design the I/O system so that latency is bounded by a certain amount, or so that it meets a set of bandwidth constraints for a given workload.
Consider the following system: a 300 MHz CPU; 50,000 OS instructions per I/O operation; a memory backplane bus with a transfer rate of 100 MB/sec; SCSI-2 controllers with a transfer rate of 20 MB/sec, each accommodating up to seven disks; and disk drives with a read/write bandwidth of 5 MB/sec and an average seek plus rotational latency of 10 ms. If the workload consists of 64 KB reads and the user program needs 100,000 instructions per I/O operation, find the maximum sustainable I/O rate and the number of disks and SCSI controllers required (ignoring disk conflicts).
To find the maximum I/O rate, find the rate of the two fixed components to determine which is the bottleneck:
Maximum I/O rate of the CPU = instruction execution rate / instructions per I/O = 300 x 10^6 / ((50 + 100) x 10^3) = 2000 I/Os per sec.
Maximum I/O rate of the bus = bus bandwidth / bytes per I/O = 100 x 10^6 / (64 x 10^3) = 1562 I/Os per sec.
The bus is the bottleneck. To determine the number of disks, we need the time per I/O operation: 10 ms + 64 KB / (5 MB/sec) = 22.8 ms, so each disk sustains 1000 / 22.8 = 43.9 I/Os per sec, and we need 1562 / 43.9 = 36 disks.
To compute the number of SCSI buses, we need the transfer rate per disk: transfer size / transfer time = 64 KB / 22.8 ms = 2.74 MB/sec. Assuming that disk accesses are not clustered, so the full bandwidth of each bus can be used, 2.74 x 7 = 19.18 MB/sec < 20 MB/sec, so we can use seven disks per SCSI bus. This calculation required several simplifying assumptions; in practice, where they do not hold, simulation is used.

Topic 2: Memory-mapped I/O and Interrupts in SPIM

Coprocessor 0
Exceptions: internal vs. external.
Coprocessor 0 registers:
Register name   Number   Usage
BadVAddr        8        Memory address at which a memory exception occurred
Status          12       Interrupt mask and enable bits
Cause           13       Exception type and pending interrupt bits
EPC             14       Address of the instruction that caused the exception
Coprocessor 0 instructions: lwc0, swc0, mfc0, mtc0.

The Status Register
[Figure: the Status register, with an interrupt mask field and old/previous/current kernel/user and interrupt-enable bits in the low-order bits.]

The Cause Register
[Figure: the Cause register, with pending interrupt bits in bits 15-10 and the exception code in bits 5-2.]

Exception Codes (in the Cause register)
Number   Name      Description
0        INT       External interrupt
4        ADDRL     Address error exception (load or instruction fetch)
5        ADDRS     Address error exception (store)
6        IBUS      Bus error on instruction fetch
7        DBUS      Bus error on data load or store
8        SYSCALL   Syscall exception
9        BKPT      Breakpoint exception
10       RI        Reserved instruction exception
12       OVF       Arithmetic overflow exception
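Given the Cause register layout described above (pending interrupts in bits 15-10, exception code in bits 5-2), an exception handler's first step is a pair of shift-and-mask operations; a sketch in Python:

```python
def decode_cause(cause):
    # Pending interrupt bits occupy bits 15-10 (6 bits);
    # the exception code occupies bits 5-2 (4 bits).
    pending  = (cause >> 10) & 0x3F
    exc_code = (cause >> 2) & 0xF
    return pending, exc_code

# An arithmetic overflow (code 12, OVF) with interrupt line 0 pending:
print(decode_cause((1 << 10) | (12 << 2)))   # (1, 12)
```

In a real SPIM handler the same masking is done in MIPS assembly on the value read with mfc0 from register 13.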
The address of the interrupt handler (exception routine) is 0x80000080.

MIPS Terminal I/O
Receiver control (0xffff0000): interrupt-enable and ready bits; the remaining bits are unused.
Receiver data (0xffff0004): the received byte in the low 8 bits.
Transmitter control (0xffff0008): interrupt-enable and ready bits; the remaining bits are unused.
Transmitter data (0xffff000c): the transmitted byte in the low 8 bits.

Systems Architecture II
Lecture 5: Exploiting Memory Hierarchy (Virtual Memory); I/O Devices and Communication Buses
This lecture was derived from material in the text, Chapters 7 and 8. All figures from Computer Organization and Design: The Hardware/Software Approach, Second Edition, by David Patterson and John Hennessy, are copyrighted material (copyright 1998, Morgan Kaufmann Publishers, Inc.; all rights reserved). Notes courtesy of Jeremy R. Johnson.

Topic 1: Exploiting Memory Hierarchy (Virtual Memory)

Introduction
Objective: to use main memory as a cache for accessing secondary storage (typically magnetic disk). Additional motivation: to allow multiple programs to efficiently and safely share main memory, and to provide the programmer with the illusion of an unbounded memory.
Topics: virtual memory; mapping virtual addresses to physical addresses; hardware support for address translation (TLB); page faults; implementing protection with virtual memory; a common framework for memory hierarchies.

Virtual Memory
In virtual memory, blocks of memory called pages are mapped from one set of addresses, called virtual addresses, to another set, called physical addresses. A physical page resides in memory, while a virtual page may reside on disk. Sharing is accomplished by pointing two virtual addresses at the same physical address. The mapping process is called address translation.

Mapping from a Virtual to a Physical Address
In virtual memory, an address is broken into a virtual page number and a page offset. The virtual page number is mapped to a physical page number; the number of addressable virtual pages need not be the same as the number of physical pages. [Figure: a 32-bit virtual address — virtual page number in bits 31-12, page offset in bits 11-0 — is translated to a physical address consisting of the physical page number and the unchanged page offset.]

Design Decisions
Many design decisions for virtual memory are motivated by the high cost of a miss, called a page fault; processing a page fault takes millions of cycles.
- Pages should be large enough to amortize the long access time and exploit the principle of locality: 4 KB to 64 KB page sizes.
- Reducing the page-fault rate is crucial: use a fully associative scheme.
- Page faults can be handled in software, so clever replacement algorithms can be used to choose replacement pages, reducing the number of misses.
- Using write-through to manage writes in virtual memory will not work — too costly; a write-back strategy is used instead.

Locating a Physical Page
Use a fully associative scheme to reduce page faults and allow flexible replacement policies. Since search is too costly, table lookup is used to find the physical location of a page; the index table is called a page table. The page table is indexed by the virtual page number, and each program has its own page table. The size of the page table is determined by the number of bits in the virtual address. Since page tables are large and there can be many of them, the entire table is not kept in memory: dynamic tables, hash functions, multiple levels, and virtual memory itself are used.

Page Table
[Figure: the page table register points to the page table; the virtual page number (bits 31-12) indexes the table, each entry holding a valid bit and a physical page number. If the valid bit is 0, the page is not present in memory. The physical page number is concatenated with the page offset to form the physical address.]

Page Faults
If the valid bit for a virtual page is off, a page fault occurs and control is transferred to the OS through an exception. The OS must get the page from disk; disk locations can be stored in the page table or in an auxiliary data structure. If there is no room for the new page, a page currently residing in memory must be replaced. To minimize page faults, the least recently used page, or some approximation to it, is replaced. Implementing true LRU is too costly, since a data structure would have to be updated on every memory access; it is approximated with a reference bit, set whenever the page is accessed, periodically cleared by the OS, and recorded to determine which pages were used during a particular time period.

Write Strategy and Page Replacement
The difference in access time between cache and main memory is tens of cycles, so a write-through strategy, with a write buffer to hide the latency of the write, can be used there. Since writes to disk take millions of cycles, this approach is impractical for virtual memory. Instead a write-back policy is used: perform individual writes to the page in memory and copy the page back to disk when it is replaced. Copying an entire page is more efficient than the sum of the individual writes. A write-back operation, though more efficient, is still costly, so a page is written back only when its contents have changed, as determined by the dirty bit.

Making Address Translation Fast
With virtual memory, accessing memory requires two memory references: one to determine the physical address and one to access the contents of the desired location. The key to improving access performance is the principle of locality: once a page is accessed, it is likely to be accessed again in the near future, so the address translation need only be performed once, with the result saved in a special cache called the translation-lookaside buffer (TLB). Since the TLB may replace access to the page table, reference and dirty bits may be required in it. Write-back is used, since the miss rate should be very small.
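The translation path just described — check the TLB first, fall back to the page table, and fault only when the page is not in memory — can be modeled in a few lines. The dict-based TLB and page table, the addresses, and the `translate` helper are all hypothetical; the 4 KB page size matches the notes.

```python
PAGE_BITS = 12                            # 4 KB pages, as in the notes
page_table = {0x12345: 0x00042}           # VPN -> PPN, for pages in memory
tlb = {}                                  # cached VPN -> PPN translations

def translate(vaddr):
    """Translate a virtual address, filling the TLB on a miss."""
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                        # TLB hit: no page-table access
        ppn = tlb[vpn]
    elif vpn in page_table:               # TLB miss, but not a page fault
        ppn = tlb[vpn] = page_table[vpn]
    else:
        raise LookupError("page fault")   # OS must bring the page from disk
    return (ppn << PAGE_BITS) | offset

paddr = translate(0x12345ABC)             # -> 0x00042ABC, and the TLB is filled
```

Note the asymmetry the notes emphasize: a TLB miss that hits in the page table costs only a refill, while a true page fault escapes to the OS.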
A TLB miss does not necessarily imply a page fault.

Translation-Lookaside Buffer
Typical values: TLB size 32-4096 entries; block size 1-2 page table entries; hit time 0.5-1 cycles; miss penalty 10-30 cycles; miss rate 0.01%-1%. [Figure: TLB entries — valid bit, tag, physical page number — backed by the page table, which maps virtual pages to physical memory or disk storage.]

The TLB is a cache that holds only page mappings; it also includes the reference and dirty bits for each entry. On every reference, the virtual page number is looked up in the TLB. If found (a hit), the physical page number is used to form the address and the reference bit is turned on; if the processor is performing a write, the dirty bit is turned on as well. On a TLB miss, check whether it is a page fault or merely a TLB miss: if the page is not present in memory, it is a true page fault, and the OS must be invoked through an exception. TLB misses are much more frequent than true page faults and are handled in software or hardware.

An Example: DECStation 3100
- 32-bit addresses, 4 KB pages, 20-bit virtual page number.
- The TLB contains 64 entries and is fully associative.
- Exception on TLB miss; an average TLB miss costs about 16 cycles.
- Special instructions update the TLB; replacement of a TLB entry is determined by hardware and is random.
- The write access bit is checked on writes: if the bit is off, the process has only read privileges for the page.
[Figure: the 32-bit virtual address is split into a 20-bit virtual page number and a 12-bit page offset; the TLB produces the physical address, which is then used to index the cache.]

Processing a Read/Write in the DECStation 3100 TLB and Cache
[Figure: flowchart — TLB access; on a TLB miss, raise an exception; on a hit, either read (deliver data to the CPU on a cache hit, stall on a cache miss) or write (write data into the cache, update the tag, and put the data and address into the write buffer).]

Interaction with OS
- Maintain the memory hierarchy: data cannot be in the cache unless it is in memory; flush cache entries from a page replaced to disk, and update the page tables and TLB so that an attempt to access data on the replaced page will generate a page fault.
- Protection: prevent one program from writing to another program's portion of memory. Each program has its own virtual address space; organize the page tables to map these to distinct sets of physical pages. Only the OS may modify the page tables, enforced by putting the page tables in the address space of the OS. There are two modes of execution, user and supervisor, and special instructions to modify the TLB, the page table register, and the user/supervisor mode bit. On a context switch, flush the TLB entries or use a PID to distinguish them.
- Sharing: allow programs to share memory (e.g., editor code). The OS points a virtual page at a shared physical page and uses the write access bit to restrict sharing to reads.

Handling Page Faults
Handling a page fault requires using the exception mechanism to interrupt the active process, transfer control to the OS, and later resume execution of the interrupted process.
- The exception must be asserted before the instruction completes, so that the state remains as it was before the instruction; otherwise the instruction cannot be properly restarted (making instructions restartable is difficult).
- The instruction address is placed in the EPC and the cause of the exception in the Cause register. The virtual address is determined from the EPC, or from the instruction the EPC points to, depending on whether an instruction or data access caused the fault.
Steps:
1. Save the entire state of the process, including all registers.
2. Look up the page table entry and find the location of the referenced page on disk.
3. Choose a replacement page; if it is dirty, write it back.
4. Start a read to bring the referenced page from disk to memory.
Since the last step takes millions of cycles, the OS usually starts another process while waiting for the read to complete.

Summary
- Virtual memory: a cache between main memory and secondary storage; supports a large address space and allows sharing.
- Misses (page faults) are very costly: large blocks (pages), fully associative placement via a page table, LRU or an approximation for replacement, and a write-back strategy.
- A TLB is used to reduce the cost of the extra address translation.
- The virtual-to-physical mapping provides a mechanism for protection and simplifies memory allocation.

Common Framework for Memory Hierarchies
- Question 1: Where can a block be placed? One place (direct mapped), a few places (set associative), or any place (fully associative).
- Question 2: How is a block found? Four methods: indexing, limited search, full search, or table lookup.
- Question 3: Which block should be replaced on a miss? Typically the least recently used or a random block.
- Question 4: What happens on writes? Write-through or write-back.

Model for Cache Misses
- Compulsory (cold-start) misses: caused by the first access to a block that has never been in the cache.
- Capacity misses: caused when the cache cannot contain all the blocks needed during execution of a program.
- Conflict (collision) misses: occur in set-associative or direct-mapped caches when multiple blocks compete for the same set; these misses are eliminated by a fully associative cache.

Systems Architecture II
Topic 2: I/O Devices and Communication Buses

Introduction
Objective: to understand the basic principles of different I/O devices, to develop protocols for connecting I/O devices to processors and memory, and to analyze and compare the performance of I/O devices and communication protocols.
Topics: design issues and the importance of I/O; I/O devices (keyboard and monitor, mouse, magnetic disk, network); buses (synchronous vs asynchronous, handshaking protocol, bus arbitration).

Interfacing Processors and Peripherals
- Design issues: performance, resilience, expandability, different devices.
- Performance: access latency; throughput (transfer bandwidth, transfers per second).
- Interface: bus protocols, interrupts.
[Figure: processor, main memory, and I/O controllers connected by a bus.]
Types and Characteristics of I/O Devices
Diverse devices, characterized by behavior (input vs output), partner (human vs machine), and data rate.

Magnetic Disk
- Rotating disk with a magnetic surface, 3600 to 10000 RPM, roughly $0.10 per MB at the time of the text.
- A hard disk is organized into platters; each surface is made up of 1000-5000 tracks; tracks are divided into 64-200 sectors, typically 512 bytes per sector.

Disk Performance
Average disk access time = average seek time + average rotational delay + transfer time + controller overhead.
- Average seek time: the time to move the head to the track; due to locality, the real average may be only about 25% of the manufacturer-reported time.
- Average rotational delay: 0.5 rotation / RPM.
- Transfer time: depends on rotation speed, sector size, and track density; caching is used to improve the transfer rate.
Example: what is the average time to read a 512-byte sector from a typical disk rotating at 5400 RPM, with average seek time 12 ms, transfer rate 5 MB/sec, and controller overhead 2 ms?

Disk Performance (solution)
- Average seek time: 12 ms advertised (averaged over all possible seeks), about 3 ms measured (typically 25% of advertised).
- Average rotational delay: 0.5 rotation / 5400 RPM = 0.5 / (5400/60 rotations per sec) = 0.0056 sec = 5.6 ms.
- Transfer time: 0.5 KB / 5 MB/sec = 0.0001 sec = 0.1 ms.
- Controller time: 2 ms.
- Average disk access time: 12 + 5.6 + 0.1 + 2 = 19.7 ms advertised, or 3 + 5.6 + 0.1 + 2 = 10.7 ms measured.

Buses
A shared communication link that uses one or more wires to connect multiple subsystems: versatile and low cost, but can be a bottleneck; physical limits on the length of the wire; conflicting goals (fast bus access vs high bandwidth); must support a range of devices.
Types of buses:
- Processor-memory: short, high-speed, custom.
- Backplane: high speed, often standardized (e.g., PCI).
- I/O: lengthy, not directly connected to memory, multiple types of devices, often standardized (e.g., SCSI).

Bus Configurations
[Figure: a processor-memory bus with I/O devices attached directly or through a backplane bus.]

Bus Input and Output
[Figure: control and data lines connecting the processor, memory, and disks during input and output operations.]
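The average-disk-access-time example above can be checked with a short sketch. The helper name and argument order are mine; the numbers are the ones from the worked example.

```python
def disk_access_ms(seek_ms, rpm, sector_bytes, rate_bytes_per_sec, ctrl_ms):
    """Average access time in ms: seek + rotational delay + transfer + controller."""
    rotational_ms = 0.5 / rpm * 60 * 1000            # half a rotation, on average
    transfer_ms = sector_bytes / rate_bytes_per_sec * 1000
    return seek_ms + rotational_ms + transfer_ms + ctrl_ms

adv = disk_access_ms(12, 5400, 512, 5e6, 2)          # ~19.7 ms, advertised seek
meas = disk_access_ms(3, 5400, 512, 5e6, 2)          # ~10.7 ms, measured seek
```

The spread between the two results shows why the notes stress that seek time dominates and that the advertised figure overstates it.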
Output operation: (a) read request, (b) memory access, (c) memory transfer. Input operation: (a) write request, (b) memory transfer.

Synchronous vs Asynchronous
- Synchronous: uses a clock and a synchronous protocol; fast and small, but every device must operate at the same rate, and clock skew requires the bus to be short.
- Asynchronous: uses handshaking instead of a clock; can accommodate a wide variety of devices and can be lengthened.

Handshaking Protocol
- ReadReq: indicates a read request for memory; the address is put on the data lines at the same time.
- DataRdy: indicates that data is now ready on the data lines; the data is placed on the data lines at the same time. Set by either memory or the device, depending on whether it is an output or input operation.
- Ack: acknowledges the ReadReq or DataRdy signal. ReadReq and DataRdy are asserted until the other party has seen the control lines and read the data lines, which it indicates by asserting Ack.
[Figure: timing diagram of ReadReq, Data, Ack, and DataRdy.]

FSM Control for Handshaking Protocol
[Figure: finite-state machines for the I/O device and memory — (1) put the address on the data lines and assert ReadReq; (2) memory records the address and asserts Ack; (3) release the data lines and deassert ReadReq; (4-5) read memory and assert DataRdy; (6) the device reads the data and asserts Ack; (7) release the data lines, deassert DataRdy, deassert Ack, and wait for a new I/O request.]

Performance Comparison
Synchronous bus: 50 ns clock, 32 data bits, 200 ns memory access. Send address: 50 ns; read memory: 200 ns; send data: 50 ns. Total time 300 ns; bandwidth 4 bytes / 300 ns = 13.3 MB/sec.
Asynchronous bus: 40 ns per handshake, 32 data bits, 200 ns memory access. Step 1: 40 ns; steps 2-4 overlap the memory access, max(3 × 40 ns, 200 ns) = 200 ns; steps 5-7: 120 ns. Total time 360 ns; bandwidth 4 bytes / 360 ns = 11.1 MB/sec.
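The synchronous-vs-asynchronous comparison above reduces to a few lines of arithmetic; this sketch just mirrors the step timings stated in the example.

```python
WORD = 4  # bytes transferred per operation

# Synchronous bus: 50 ns to send the address, 200 ns memory access,
# 50 ns to send the data back.
sync_ns = 50 + 200 + 50
sync_mb_per_s = WORD / (sync_ns * 1e-9) / 1e6        # ~13.3 MB/sec

# Asynchronous bus: seven 40 ns handshake steps, with steps 2-4
# overlapped against the 200 ns memory access.
async_ns = 40 + max(3 * 40, 200) + 3 * 40
async_mb_per_s = WORD / (async_ns * 1e-9) / 1e6      # ~11.1 MB/sec
```

The overlap in steps 2-4 is why the asynchronous bus loses only 20% of the bandwidth despite seven handshake steps.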
Improving Bus Performance
- Data bus width: by increasing the width of the data bus, transfers of multiple words take fewer bus cycles.
- Separate vs multiplexed address and data lines: separate data and address lines improve the performance of writes, since the address and data can be sent at the same time.
- Block transfers: allowing the bus to transfer multiple words in back-to-back bus cycles, without sending an address or releasing the bus, reduces the time to transfer a large block.

Performance Example
Memory supports block access of 4 to 16 32-bit words; a 64-bit synchronous bus is clocked at 200 MHz (5 ns clock), with each 64-bit transfer taking 1 cycle and 1 cycle needed to send an address; there are 2 idle cycles between bus operations; memory access time is 200 ns for the first 4 words and 20 ns for each additional set of 4 words. Find the sustained bandwidth and latency to read 256 words using 4-word blocks and using 16-word blocks.

Performance Example (solution)
4-word block transfers:
- 1 clock cycle to send the address
- 200 ns / 5 ns per cycle = 40 cycles to read the data from memory
- 2 cycles to send the 4 words of data
- 2 idle cycles before the next transfer
- Total time: 45 cycles × (256/4 = 64 transfers) = 2880 cycles = 14400 ns
- Bandwidth: 256 × 4 bytes / 14400 ns = 71.11 MB/sec
16-word block transfers:
- 1 clock cycle to send the address
- 40 cycles to read the first 4 words of data from memory
- each of the remaining sets of 4 words takes 4 cycles (20 ns) to read, overlapped with the 2 cycles to send the previous 4 words and 2 idle cycles
- Total time: (1 + 40 + 4 × 4 = 57) cycles × (256/16 = 16 transfers) = 912 cycles = 4560 ns
- Bandwidth: 256 × 4 bytes / 4560 ns = 224.56 MB/sec

Bus Arbitration
A mechanism is needed to determine which device can use the bus at a given time. A bus master is used to initiate and control all bus requests. The simplest scheme uses a single bus master; multiple bus masters are more efficient. Deciding which bus master gets to use the bus next is called bus arbitration. Priority schemes are used, but fairness must also be maintained.
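The block-transfer performance example above can be reproduced numerically; the cycle decomposition in this sketch follows the solution's accounting (1 address cycle, a 40-cycle first read, then four 4-cycle send slots with the remaining reads overlapped).

```python
CLOCK_NS = 5        # 200 MHz bus
TOTAL_WORDS = 256

# 4-word blocks: 1 address cycle, 40-cycle read, 2 cycles to send the
# 4 words over the 64-bit bus, 2 idle cycles before the next transfer.
cycles_4 = (1 + 40 + 2 + 2) * (TOTAL_WORDS // 4)        # 2880 cycles
bw_4 = TOTAL_WORDS * 4 / (cycles_4 * CLOCK_NS * 1e-9) / 1e6   # ~71.1 MB/sec

# 16-word blocks: 1 address cycle, 40-cycle first read, then four 4-cycle
# send slots (2 transfer + 2 idle each); the remaining 20 ns reads are
# overlapped with the sends, so they add no cycles.
cycles_16 = (1 + 40 + 4 * 4) * (TOTAL_WORDS // 16)      # 912 cycles
bw_16 = TOTAL_WORDS * 4 / (cycles_16 * CLOCK_NS * 1e-9) / 1e6  # ~224.6 MB/sec
```

The 3x bandwidth gain comes entirely from amortizing the 41-cycle address-plus-first-read overhead over four times as much data.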
Bus Arbitration (detail)
- Daisy-chain arbitration (e.g., VME): grant lines run from the highest-priority device to the lowest; a high-priority device that wants access intercepts the bus grant signal. Simple, but cannot assure fairness and may limit speed.
- Centralized parallel arbitration (e.g., PCI): devices independently request the bus through multiple request lines, and a centralized arbiter chooses which device will act as master. The central arbiter is required and may become a bottleneck.
- Distributed arbitration by self-selection (e.g., NuBus in the Mac II): devices independently request the bus through multiple request lines, identify themselves to the bus, and broadcast their priority; each device determines independently whether it is the highest-priority requestor. Drawback: requires more lines for the request signals.
- Distributed arbitration by collision (e.g., Ethernet): devices independently request the bus, which can result in a collision; a scheme is used for selecting among the colliding parties.

Single Bus Master / Daisy Chain Arbitration
[Figures: a single bus master (the processor) controlling the bus request lines for memory and disks; daisy-chain arbitration with a bus arbiter granting from the highest-priority device to the lowest.]

Systems Architecture II
Topics: Dealing with Pipeline Hazards; Exploiting Memory Hierarchy — Cache Memory
This lecture was derived from material in the text, Chapters 6 and 7. Notes courtesy of Jeremy R. Johnson.
Lec 3 Systems Architecture II 1

Topic 1: Dealing with Pipeline Hazards

Introduction
Objective: in the previous lecture we saw the modifications necessary to the single-cycle implementation of MIPS to support pipelining; however, we ignored the possibility of pipeline hazards. In today's lecture we show how to deal with them. A pipeline hazard is a situation where the next instruction cannot execute in the following clock cycle.
Topics: data hazards and forwarding; data hazards and stalls; control (branch) hazards — stalls, reducing the delay of branches, dynamic branch prediction, delayed branches; exceptions.

Data Hazards and Forwarding
Problem: an instruction depends on the result of a previous instruction still in the pipeline. Example:
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)

Timing Diagram with Data Dependencies
[Figure: pipeline diagram over clock cycles CC1-CC9; register $2 holds its old value (10) until the sub result (shown as -20 in the text) is written back in CC5, so the and, or, and add would read a stale value without forwarding.]

Hazard Detection
Conditions under which a result must be forwarded:
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
For the example: sub $2, $1, $3; and $12, $2, $5 (hazard 1a); or $13, $6, $2 (hazard 2b); add $14, $2, $2 (no hazard); sw $15, 100($2) (no hazard).

Forwarding
[Figure: with forwarding, the EX/MEM and MEM/WB pipeline registers supply the new value of $2 to the dependent instructions in CC4 and CC5.]

Pipelined Datapath with Forwarding
[Figure: (a) the datapath without forwarding; (b) the datapath with a forwarding unit that compares EX/MEM.RegisterRd and MEM/WB.RegisterRd against the source registers held in ID/EX.]

Control for Hazard Detection and Forwarding
EX hazard:
  if EX/MEM.RegWrite and EX/MEM.RegisterRd ≠ 0
     and EX/MEM.RegisterRd = ID/EX.RegisterRs
  then ForwardA = 10
MEM hazard:
  if MEM/WB.RegWrite and MEM/WB.RegisterRd ≠ 0
     and EX/MEM.RegisterRd ≠ ID/EX.RegisterRs
     and MEM/WB.RegisterRd = ID/EX.RegisterRs
  then ForwardA = 01
Mux control:
  ForwardA = 00 (ID/EX):  first ALU operand comes from the register file
  ForwardA = 10 (EX/MEM): first ALU operand forwarded from the previous ALU result
  ForwardA = 01 (MEM/WB): first ALU operand forwarded from data memory or an earlier ALU result
  ForwardB = 00 (ID/EX):  second ALU operand comes from the register file
  ForwardB = 10 (EX/MEM): second ALU operand forwarded from the previous ALU result
  ForwardB = 01 (MEM/WB): second ALU operand forwarded from data memory or an earlier ALU result

Datapath with Control for Forwarding
[Figure: the pipelined datapath with the forwarding unit driving the ALU input muxes from the ID/EX, EX/MEM, and MEM/WB registers.]

Examples 1-4
[Figures: clock-by-clock pipeline diagrams (clocks 3-6) of the sequence sub $2, $1, $3; and $4, $2, $5; or $4, $4, $2; add $9, $4, $2, showing the forwarding unit bypassing results between stages.]

Data Hazards and Stalls
Problem: sometimes a data dependency cannot be resolved with forwarding, e.g., a load immediately followed by an instruction that uses the loaded value.
[Figure: pipeline diagram for lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 — the and cannot receive $2 in time even with forwarding.]

Inserting Stalls into the Pipeline
If ID/EX.MemRead and (ID/EX.RegisterRt = IF/ID.RegisterRs or ID/EX.RegisterRt = IF/ID.RegisterRt), stall the pipeline: insert a nop by setting the control signals to 0 and holding the PC and the IF/ID register.
[Figure: the stalled pipeline diagram with a bubble inserted between the lw and the and.]

Datapath with Control for Stalls
[Figure: the datapath extended with a hazard detection unit that drives PCWrite, IF/ID Write, and the mux that zeroes the control signals.]

Examples 1-2
[Figures: clock-by-clock diagrams (clocks 2-3) of lw $2, 20($1); and $4, $2, $5; or $4, $4, $2 as the hazard detection unit prepares to stall.]
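The forwarding and stall conditions above can be sketched as two small predicates. The dict-based pipeline registers and short field names are a simplification of the notation in the notes; the register values in the demo are from the running examples.

```python
def forward_a(ex_mem, mem_wb, id_ex):
    """ForwardA mux control: '10' = EX/MEM bypass, '01' = MEM/WB bypass, '00' = none."""
    rs = id_ex["Rs"]
    if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == rs:
        return "10"                      # forward the previous ALU result
    if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == rs:
        return "01"                      # forward from memory or an earlier ALU result
    return "00"                          # read the register file as usual

def load_use_stall(id_ex, if_id):
    """True when a load's destination feeds the very next instruction."""
    return id_ex["MemRead"] and id_ex["Rt"] in (if_id["Rs"], if_id["Rt"])

# sub $2,$1,$3 ; and $12,$2,$5 -> the EX/MEM ALU result is forwarded
ctl = forward_a({"RegWrite": True, "Rd": 2},
                {"RegWrite": False, "Rd": 0},
                {"Rs": 2})                                  # -> "10"

# lw $2,20($1) ; and $4,$2,$5 -> forwarding cannot help; stall one cycle
stall = load_use_stall({"MemRead": True, "Rt": 2},
                       {"Rs": 2, "Rt": 5})                  # -> True
```

Checking EX/MEM before MEM/WB encodes the priority rule: the most recent result wins when both pipeline registers hold the destination register.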
Examples (continued)
[Figures: clock-by-clock diagrams (clocks 4-7) showing the bubble propagating while and $4, $2, $5, or $4, $4, $2, and add $9, $4, $2 complete with forwarding.]

Branch Hazards
Problem: an instruction must be fetched every cycle, yet the decision of whether to branch does not occur until the MEM pipeline stage.
[Figure: pipeline diagram for 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; ...; 72 lw $4, 50($7) — several instructions are fetched before the branch outcome is known.]

Reducing the Delay of Branches
Assume the branch is not taken, and move the branch decision earlier in the pipeline to reduce the cost when the branch is taken, so that fewer instructions need to be flushed.
[Figures: clock-by-clock diagrams (clocks 3-4) of and $12, $2, $5; beq $1, $3, 7; sub $10, $4, $8 — an IF.Flush signal turns the fetched instruction into a bubble when the branch is taken, and fetching resumes at lw $4, 50($7).]

Branch Prediction
Try to predict whether or not a branch will be taken; use a branch table to keep track of whether the branch was taken previously. Two-bit prediction improves the prediction rate, e.g., for loops, since it takes two consecutive mispredictions to change the prediction.
[Figure: the four-state diagram of the 2-bit predictor — two predict-taken states and two predict-not-taken states, with transitions on taken and not-taken outcomes.]

Branch Delays
Branch delay slot: always execute the instruction after the branch, relying on the compiler to put a safe instruction in the delay slot.
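The 2-bit prediction scheme described above can be modeled as a saturating counter; this class is a sketch of that state machine (the class name and encoding — states 0-1 predict not taken, 2-3 predict taken — are mine).

```python
class TwoBitPredictor:
    """2-bit saturating-counter branch predictor."""

    def __init__(self):
        self.state = 0                      # strongly not taken

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

p = TwoBitPredictor()
for _ in range(3):                          # a loop branch, taken three times
    p.update(True)
taken_pred = p.predict()                    # -> True
p.update(False)                             # a single loop-exit misprediction...
still_taken = p.predict()                   # -> True: prediction does not flip
```

This is exactly the loop behavior the slide highlights: a 1-bit predictor mispredicts twice per loop (entry and exit), while the 2-bit counter absorbs the single exit misprediction.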
Scheduling the Branch Delay Slot
[Figure: three ways to fill the delay slot for add $s1, $s2, $s3; if $s2 = 0 then ... — (a) from before the branch, moving an independent instruction into the slot; (b) from the branch target, copying an instruction such as sub $t4, $t5, $t6 into the slot when the branch is likely taken; (c) from the fall-through path.]

Exceptions
Problem: another control hazard is an exception, e.g., arithmetic overflow. In this case we need to transfer control to an exception handler and flush the pipeline.
[Figures: clock-by-clock diagrams (clocks 5-6) of lw $16, 50($7); slt $15, $6, $7; add $1, $2, $1 — the add overflows, the instructions behind it become bubbles, and fetching resumes at the exception handler (address 40000040 hex, whose first instruction is sw $25, 1000($0)).]

Performance Improvement
Problem: assume 2 ns for a memory access or ALU operation and 1 ns for register file access; 1/2 of load instructions are immediately followed by an instruction using the result; the branch delay on a misprediction is 1 cycle and 1/4 of branches are mispredicted; there is a full clock delay on all jumps; the instruction mix is 22% loads, 11% stores, 49% R-format, 16% branches, and 2% jumps.
- What is the CPI for the pipelined implementation?
- What is the performance gain over the single-cycle implementation?
- What is the performance gain over the multicycle implementation?

Systems Architecture II
Topic 2: Exploiting Memory Hierarchy — Cache Memory

Introduction
Objective: to design a memory system that provides the illusion of a very large memory that can be accessed as fast as a very small memory.
Topics: the memory hierarchy, temporal and spatial locality; the basics of caches; cache performance (miss rate, miss penalty); improving cache performance (associativity, multilevel caches).
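One way to work the CPI part of the exercise above: each instruction class gets a base CPI of 1 plus the stated stall penalties. The 1.17 result is what these assumptions yield, not a figure quoted in the notes.

```python
# Instruction mix from the problem statement.
mix = {"load": 0.22, "store": 0.11, "rformat": 0.49,
       "branch": 0.16, "jump": 0.02}

# Effective CPI per class: loads stall 1 cycle half the time,
# mispredicted branches (1/4 of them) cost 1 cycle, jumps always
# incur a full extra clock.
cpi = {"load": 1 + 0.5 * 1,
       "store": 1,
       "rformat": 1,
       "branch": 1 + 0.25 * 1,
       "jump": 1 + 1}

pipelined_cpi = sum(mix[k] * cpi[k] for k in mix)   # ~1.17
```

The speedup questions then follow by comparing cycle time x CPI against the single-cycle clock (set by the longest instruction) and the multicycle CPI for the same mix.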
Principle of Locality
Programs tend to access a relatively small portion of their address space at any instant of time.
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an address is referenced, items whose addresses are close by will tend to be referenced soon.
A cache is a small but fast memory used to store or prefetch items that have been recently referenced or will likely be referenced soon; by the principle of locality, this greatly speeds up memory access.