Computer Systems Architecture
Computer Systems Architecture CS 365
Popular in Course
Popular in ComputerScienence
This 114 page Class Notes was uploaded by Leland Swift on Monday September 28, 2015. The Class Notes belongs to CS 365 at George Mason University taught by Staff in Fall. Since its upload, it has received 6 views. For similar materials see /class/215126/cs-365-george-mason-university in ComputerScienence at George Mason University.
Reviews for Computer Systems Architecture
Report this Material
What is Karma?
Karma is the currency of StudySoup.
You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!
Date Created: 09/28/15
Pipelining CS 365 Lecture 12 Prof Yih Huang CS 365 Traditional Execution ll23412345123 add 1d beq CS 365 Pipelined Execution Ill2 1 NU p ANUJh HNWLU HNWLU l NWLLII law hm DJ hm U CS 365 Basic Ideas El Do not wait for an instruction to complete to start the next Start the Cycle 0 of the next instruction when the previous one enters Cycle 1 El Instruction executions are overlapped EIPipelining increases instruction throughput as opposed to decreasing the execution time of individual instructions CS 365 Easier Said Than Done El In every cycle activities of all five stages take place El Many problems arises with overlapped executions Structural hazards Control hazards Data hazards CS 365 Pipeline Hazards I El Structural Hazards The data path cannot support the combination of instructions that we want to execute in the same cycle El Consider what happens when an Rtype instruction is followed by a BBQ add IF RR RW beq IF RR CS 365 Summary of MIPS Lite Instruction Executions Action for Rtype Action for mmryrefa39eme Action for Action for IR MemoryIPC ALUOJt A op B ALUOJt A signextend if A B hen R15 0 PC ALUOJt Reg R1511 ALUOJt CS 365 Resource Con icts 0n the Multicycle Datapath CS 365 Pipeline Hazards 11 El Control Hazards When we decide to branch other instructions are in the pipeline beq add IF RR RW sub IF RR RW E Q an EEIMM Which one is the next CS 365 9 Pipeline Hazards 111 El Data Hazards data dependencies an instruction depends on the result of a previous instruction still in the pipeline Writing new value of r1 4 Addr1r2r3 IF RR RW Sub r4 r1 r10 IF RR Reading new value of r1 not available yet CS 365 10 Lessons El To achieve pipelining and avoid hazards we need to redesign the datapath and instruction execution steps CS 365 Pipeline Stages RType LD ST BEQ J Stage 1 IR lt MemPC 1F PC lt PC 4 gage 2 Read Regrs and Regrt Stage 3 RS op RT Calculate Calculate Calculate Set PC to EX RSlmmd RSlmmd PC Immd Immd Compare use ALU RS and RT Stage 4 Read Write Set PC DM memory memory accordingly Eagle s1 Write to Rd Write to Rt CS 365 Graphically Representing Pipelines eeeeeee an CS 365 13 Solving Structural Hazards Pipelined Datapath CS 365 14 CS 365 Discussions El Make sure you understand Why two extra adders are added El Is the assumption of using separate instruction and data memory reasonable CS 365 Consider Control Hazard beq mmmwm mmmwm mmmwm When is the Target IF RR EX DM RW deCiSion mmmww WM mmmwm Solution 1 Stalling The Pipeline OR CS 365 IF RR EX DM IF RR EX DM RW RR EX DM RW IF RR EX DM IF RR EX DM RW Decision is in IF EX DM RW Solution 2 Branch Prediction El Make a guess about the branch decision and start execute the guessed path before the decision is in aka speculative execution El If guess was wrong abandon those in the pipeline and jump to the right target El Branch Prediction Strategies Predict branch fails Predict branch succeeds Look into history CS 365 Predict Branch Fails EIWe guess the branch will fail That is the next IF will fetch the next sequential execution El If guess is right just proceed El If guess is wrong abandon sequential instructions and fetch the instruction from the target address CS 365 Predict branch fails Guess Is Right beq IF RR EX DM RW IF RR EX DM RW Following IF RR EX DM RW instructions 1F RR EX DM RW Decision is in Do not branch CS 365 CS 365 Predict branch fails Guess Is Wrong beq IF RR EX 2m g5 IF RR xi at 23 F IF at in 2m in ollow1ng instructions Target x instructions Decision is in D0 branch IF RR EX DM RW IF RR EX DM RW IF RR EX DM RW CS 365 EINotice that programmers are not aware of branch predictions right or wrong guesses affect only performance El Delayed Branches Make it official that CPU always executes the instr following a branch The branch determines the next next instruction Notice the programmer awareness Discussion CS 365 Target sub Example BEQ r1 r2 target A e add r10 rllr12 add r20 r21 r22 sub r30 r10 r20 0 O r30 r10 r20 Delay slot executed regardless of the branch decision CS 365 Delayed Branch with Wrong Guess beq IF RR EX DM RW Delays10t x IF RR 2m 5 ii is always I nished IF iii 13 2 iii Follow1ng instructions 1F RR EX DM RW IF RR EX DM RW Target instructions IF RR EX DM RW Decision is in D0 branch Discussions El Compilersprogrammers must be smart enough to make good use of the delay slots CI The problem is not entirely solved We still need to stall before the decision comes in CI The situations are exacerbated by deeper pipelining El Branch predictions are still important CS 365 25 Smart Branch Predictions El Observations Branches of ifelse statements are hard to predict Branches of loops typically repeat previous decisions For performance loops are more important than other control structures EIMany modern processors use special purpose hardware to remember the targets of recent branch instructions CS 365 26 Sub Or Add CS 365 Recall Data Hazards 2 1 3 RW 12 2 5 DM 13 6 2 EX DM 14 2 2 EX 15 1002 E RR EEE 2 available Sub RR EX RW Solution 1 Stalling El Simple but inefficient CS 365 And RR EX RW Or RR EX El Postpone subsequent instructions until data is available Add RR E gtlt RM SW RR EX RW1 Solution 2 Internal Forwarding EINew value of 2 is available after the EX of Sub but not in 2 yet El Use special circuits to forward the new value to subsequent instructions Sub r2rlr3 And r12r2r5 Or r13r6 r2 Add r14 r2 r2 SW r15 100r2 9 Types of Forwarding EIALU forwarding forward an ALU output to subsequent instructions This is the case we have seen El Memory Forwarding forward a memory output 1d result to subsequent instructions CS 365 30 Memory Forwarding Ld 142 1d RR EX DM RW Or 1013 Or RR EX Rwl Sub 20831 10 Sub RR Exl lRwl Notice that we still have to stall for one cycle CS 365 Delayed Loads El Make it official in the ISA that the result of a load will not be available to the next instruction El Have the compiler find something useful to do in load slots by reordering CS 365 MIPS Solution El Delayed result with internal forwarding LdRR EX DM RW Or RR EX Rw Sub RR EX RM Nop or something useful C5365 33 Summary EIThree types of pipeline hazards Structural hazards Control hazards Data hazards El Compilersprogrammers could reorder the code to avoid hazards and eliminate bubbles Elldeal cases no bubbles one instr per cycle El Stalling bubbles is the last resort but must be supported by hardware CS 365 CS 365 Exercise Code Reordering for i0 iltN i Zi Xi yi El Registers r1 points to Xi r4 holds Xi r2 points to yi r5 holds yi r3 points to Zi r6 holds Zi r7 holds 1 r8 holds N CS 365 loopzlw r40rl lw r50r2 add r6 r4 r5 sw r6 0r3 addi rlrl4 addi r2r24 addi r3r34 addi r7r7l bne r7r8 loop Chapter Three Building an ALU to support MIPS integer arithmetic Material from Appendix B5 on CDROM Review Boolean Algebra amp Gates Problem Consider a logic function with three inputs A B and C Output D is true if at least one input is true Output E is true if exactly two inputs are true Output F is true only if all three inputs are true Showthe truth table for these three functions Showthe Boolean equations for these three functions Show an implementation consisting of inverters AND and OR gates See Appendix B of textbook on CDROM An ALU arithmetic logic unit Let39s build an ALU to support the andi and ori instructions we39ll just build a 1 bit ALU and use 32 ofthem Operation op a b res result Possible Implementation sumofproducts i i Review The Multiplexor Selects one of the inputs to be the output based on a control input S note we call this a 2inpnt mwc even though it has 3 inputs Jgt Lets build our ALU using a MUX Different Implementations Not easy to decide the best way to build something Don39t want too many inputs to a single gate Don t want to have to go through too many gates for our purposes ease of comprehension is important Let39s look at a1bit ALU for addition Carryln c abacinbcin sum a xor b xor cin CarryOul How could we build a1bit ALU for add and and or How could we build a 32bit ALU Building a 32 bit ALU Carivln Kl allw gtResull ovum l IE In l Carryout 39 gtResulll gtResull2 Resuim What about subtraction a b Two39s complement approch just negate b and add How do we negate A very clever solution B m ljpeialio Carrle I I Carryout Adding a NOR function Can also choose to invert A How do we get A nor B Operl nn CIlryln l ARuult CunyOul Tailoring the ALU to the MIPS Need to support the setonlessthan instruction slt remember slt is an arithmetic instruction produces a 1 if rs lt rt and 0 otherwise use subtraction ab lt 0 implies a lt b Need to support test for equality beq t5 t6 t7 use subtraction ab 0 implies a b Supporting slt Use this ALU for bits 030 Use this ALU for most significant bit bit 3 1 Supporting slt Test for equality Notice control lines 0000 and 0001 or 0010 add 0110 subtract 0111 slt 1100 nor 39Note zero is a 1 when the result is zero Arithmetic and Logic Unit CS 365 Lecture 5 Prof Yih Huang C3365 Inside 21 Processor Branch W Floating Point Arithmetic Circuits 03 8 COHFIOI a Elimgar m ihjmu Loglc Internal Bus Data Cache Instruction Cache C3365 Arithmetic and Logic Unit ALU CI The part of a processor circuit that actually gets the computations done operation result C3365 Numbers EIBits are just bits no inherent meaning EIBinary numbers base 2 gt decimal 02nl DASCII codes El Of course it gets more complicated numbers are finite over ow fractions and real numbers negative numbers EIHow do we represent negative numbers C3365 Possible Representations El Sign Magnitude One39s Complement Two s Complement 00 0 000 0 000 0 001 1 001 1 001 1 0102 0102 0102 0113 0113 0113 1000 100 1004 1011 101 1013 1102 1101 1102 1113 1110 1111 El Most of the modern architectures use two s complement C3365 Two s Complement Numbers 23 22 21 20 00010 01010 EI10 in 8bit two s complement C3365 0000 0000 0000 0111 0111 1000 1000 1000 1111 1111 1111 C3365 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 1111 32bit Signed Numbers Two s Complement 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 1111 0000 0000 0000 1111 1111 0000 0000 0000 1111 1111 1111 0000 0001 0010 1110 1111 0000 0001 0010 1101 1110 1111 Decimal 0 1 2 2l47483646 2l47483647 2l47483648 2l47483647 2l47483646 3 2 1 Two39s Complement Operations EINegating a two39s complement number invert all bits and add 1 remember negate and invert are different El Exercises in 6 bits Negate 12 Negate 5 C3365 Sign Extensions EIMIPS 16 bit immediate gets converted to 32 bits for arithmetic Elcopy the most significant bit the sign bit into the other bits 0010 3 0000 0010 1010 gt 11111010 4 bit number 8 bit equivalent C3365 9 Additions amp Subtractions El Just like regular binary numbers 0010 1111 1111 0110 0001 1111 0010 1111 1111 0110 0001 1111 C3365 10 Over ows EIResult too large to store in finitesize computer words eg adding two nbit numbers does not always yields an nbit number EIDepends on the kind of numbers you have in mind Signed or unsigned 0010 1000 0110 0001 C3365 Detecting Over ow EINo over ow when adding a positive and a negative number EINo over ow when signs are the same for subtraction El Over ows when the value affects the sign A B result AB gt0 gt0 lt0 AB lt0 lt0 gt0 A B gt0 lt0 lt0 A B lt0 gt0 gt0 C3365 Effects of Over ow EIArchitecture and case dependent El Solution 1 just remember it and leave the handing to software The condition ag register of IA32 El Solution 2 exceptioninterrupt Control jumps to predefined address for exception Interrupted address is saved for possible resumption Used by MIPS C3365 Discussion DIA32 provides an addc add with carry instruction What is its use C3365 Review Boolean Algebra amp Gates El Problem Consider a logic function with three inputs A B and C Output D is true if at least one input is true Output E is true if exactly two inputs are true Output F is true only if all three inputs are true EI Show the truth table for these three functions EI Show the Boolean equations for the three functions EI Show an implementation consisting of inverters AND and OR gates C3365 Design An Over ow Detector El Inputs SA sign of A SB 0P SA SB 0F Sign of B OP operation 0 0 0 0 for add 1 for sub 0 0 1 El Output OFO no over ow l 0 1 0 over ow 0 1 1 El Truth Table El Boolean equation for OF 1 0 0 El A circuit design of OF 1 0 1 according to the equation 1 1 0 above 1 1 1 C3365 Review The Multiplexer Select Output El Selects one of the inputs to be the output based on a control input EINote we call this 2input multiplexer even though it actually has three inputs C3365 More Inputs ABCD Select Output CI The general case Ninput multiplexer needs l logZN l select lines El You should be able to design its logic circuit C3365 Second Exercise El Let us build a onebit ALU to support addition and logic or Operation 0 for add 1 for or operation result C3365 19 Solution El Truth Table El Sum of product C3365 20 Supporting MIPS Logic Instructions EIMIPS provides bitWise and or xor and nor instructions EIInput operation 3 bits determine the output operation 3 a gt result C3365 21 32bit ALU EIBoth inputs A and B are 32 bit Wide Size of the truth table El Rather we will just cascade 32 1bit ALU How about carries We need to refine the spec of the 1bit ALU C3365 22 Two Solutions EITruth table and sum of product El Use multiplexer C3365 23 1bit Adder Ci Cout AB ACin BCin A B Sum A xor B xor Cin El How could we build a 1bit ALU for add and or Cl How could we build a 32bit ALU C3365 24 Building a 32bit ALU Operation CarryOut C3365 Resuuu Resum Resuuz 331 Resunm bat What about subtraction a b El Two s complement approach negate b and add El How do we negate Bin en Operation Carryln I I El A clever solution C3365 CarryOut C3365 Tailoring the ALU to the MIPS EINeed to support the setonlessthan instruction slt remember slt is an arithmetic instruction produces a 1 if rs lt rt and 0 otherwise use subtraction ab lt 0 implies a lt b EINeed to support test for equality beq t5 t6 offset use subtraction ab 0 implies a b Opevauan Supporting slt C3365 Ope anon Carryln Binvert C3365 Ema ate C3365 Test for equality we Resth anon lines 000 001 ZW 010 110 111 Over uw El Notice control and or add subtract slt El Ouput zero1 when result is 0 El Important points about hardware all of the gates are always working the speed of a gate is affected by the number of inputs to the gate the speed of a circuit is affected by the number of gates in series on the critical path or the deepest level of logic What is the critical path in our 32bit ALU C3365 31 Ripple Carry Adder Is Slow EILogic circuit speed is determined by the number of gates a signal have to pass in the worst case EIAssuming each lbit ALU adds x gate delay what is the delay of a 32bit ALU C3365 32 9D 9D Carryln a 5 E E E 3 Carryln Result03 39 C 39 C 39 C ALUO P0 gt PI 0 9 U U U Carrylookahead unil C1 1 Cl 0 O O 9 H 0 I carryln 4 II II II Result47 a O O 8 I I M x ALU1 J P1 gt P 1 0 G1 gt 9I1 c C O 1 x c U C C2 ci2 gt B E Carryln Q U C O O Result811 a O O U ALU2 a E O P2 gt P 2 F U O 62 gt gl 2 B A quotU 0 c3 O O c3 II II 8 5 5 C U gt gt arr I39I y Result1215 O 3 g g 39g 39g E Q 3 o o ALU3 D 0 0 P3 gt P3 U m m N N Gs gt 9I3 M m Q Q C4 ci4 FD gr gr CarryOut m LI Exercise A B Gout 0 0 0 kill Cin 0 1 Cin propagate 1 O Cin propagate S 1 1 1 generate GAandB PAxorB A3 C3365 39 39 39 35 Multiplications EIN bits X N bits 2N bits result El Paper and pencil example unsigned Multiplicand 0 0 1 1 Multiplier 0 1 0 1 C3365 36 Unsigned Combinational Multiplier EIStage i accumulates A 2 i if Bi C3365 37 Discussions EIMultiplication is expensive CIA combinational multiplier uses a great deal of silicon 32 32bit adders needed EIWe will discuss designs that are slower but less silicon demanding Due to its complexity we will first present a basic but suboptimal design and refine it tvwce C3365 38 C3365 Clock tick Clock tick Clock tick P7 Clock tick El One step per clock tick 11 clock cycles Shift and Add needed for 11bit multiplications Multiplicand Example 0101 X 0011 Product I To add to add 20 CS3 Unsigned ShiftAdd Multiplier 064bit Multiplicand reg 64bit ALU 64bit Product reg 32bit multiplier reg Mul plicand Shift Le I 64 bits V Multiplier Shift Right 32 bits 64bit ALU I IS Multiplier datapath oontrol 65 C3365 Observations EIHalf bits in multiplicand always 0 64 bit adder is a waste El Improvement Use 32 bit multiplicand Don t shift multiplicand left shift the product right instead 21 Example 0101 X 0011 ProductlHHlHl To add to add C3365 43 ShiftAdd Multiplier Version 2 D bit Multiplicand reg 32 bit ALU 64bit Product reg 32bit Multiplier reg Mul plier39 Shift Right c3355 44 22 A Second Example Product Multiplier Multiplicand 0000 0000 001 1 0010 0010 0000 0001 0000 0001 0010 0011 0000 0001 0010 0001 1000 0000 0010 0000 1100 0000 0010 0000 0110 0000 0010 Observations C3365 EIProduct register wastes space that exactly matches size of multiplier EIImprovement combine Multiplier register and Product register 23 C3365 Example 0101 X 0011 To add or not to add Product I Multiplier C3365 A Second Example Multiplicant Initial Product Product a er 1 5 shi Product a er 2nd shi Product a er 3rd shift Product a er 4th shift Product a er 5th shift Ill 131331 133311 133311 133311 133311 24 Where we are headed Single Cycle Problems what if we had a more complicated instruction like floating point wasteful of area One Solution use a smaller cycle time have different instructions take different numbers of cycles a multicycle datapath instructlun register Data PC Address A Registemt l J Memuryms g a Registers Mammy Register data B Data EB S E Registemt M ultlcycle Approach We will be reusing functional units ALU used to compute address and to increment PC Memory used for instruction and data Our control signals will not be determined solely by instruction eg what should the ALU do for a subtract instruction We ll use a finite state machine for control Review finite state machines Finite state machines a set of states and next state function determined by current state and the input output function determined by current state and possibly input inputs We ll use a Moore machine output based only on current state Review finite state machines Example B 21 A fn39end would like you to build an electronic eye for use as a fake security device The device consists o 395 39 l 39 1 Left Middle and Right which ifamerted indicate that a lightshould be on Only one light is on at a time and the light moves from lgt to tight and then from light to left thus scaring away thieves who believe that the device is monitoring their activity Draw 39 g l 39 l 39 39 39 Wecify the electronic eye Note that the rate of the eye s movement will be controlled by the clock Weed which Mould notbe too great and that thwe are ementially no inputs Multicycle Approach Break up the instructions into steps each step takes a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit At the end of a cycle store values for use in later cycles easiest thing to do introduce additional internal registers Multicycle Datapath with control signals quot5quot mm 57 u Multicycle Datapath amp Control Five Execution Steps Instruction Fetch Instruction Decode and Register Fetch Execution Memory Address Computation or Branch Completion Memory Access or Rtype instruction completion Writeback step INSTRUCTIONS TAKE FROM 3 5 CYCLES High level view of finite state machine control Start Instruction fetchdecode and register fetch Figure 5 37 l I l I emery access Instructions Figure 5 38 Retype instructions Branch instruction Jump instruction gure 5 39 gure 5 40 gure 5 M Step 1 Instruction Fetch Use PC to get instruction and put it in the Instruction Register Increment the PC by 4 and put the result back in the PC Can be described succinctly using RTL quotRegisterTransfer Languagequot IR MemoryPC PC PC Can we gure out the values of the control signals What is the advantage of updating the PC now Step 2 Instruction Decode and Register Fetch Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch RTE A RegIR25 21 B RegEIREZOlSJJF ALUOut PC sign extendIR15 0 ltlt 2 We aren39t setting any control lines based on the instruction type we are busy quotdecodingquot it in our control logic Instruction Fetch amp Decode instructlun denude instructan rctcn REESE Em M crchad ALUSrcA El lurD El Mammy rcrcrcncc FSM Retype FSM Branch FSM Jump FSM Frgurcsaa FlgurEESB Frgurcszm Frgurcsm Step 3 instruction dependent ALU is performing one of three functions based on instruction type Memory Reference ALUOut A sign extendIR15 0 Rtype ALUOut A op B Branch if AB PC ALUOut Step 4 Rtype or memoryaccess Loads and stores access memory MDR Memory ALUOut Memory ALUOut B Rtype instructions finish RegIR15 ll ALUOut The write actually takes place at the end of the cycle on the edge Writeback step I RegIR2016 MDR What about all the other instructions Finite state machine for memory access instructions Op LW ur Op SW Mammy address umputatmn ALUSrcA 1 ALU Src 1n ALU 0p nu M emvvnte MemRead Equot 1nrD1 D 1 Writerback step Tu state m Figure 5 a7 REQWHIE MemtuREg 1 REgDst u Finite state machine for Rformat instructions From state 1 OP R39iYPe Execution ALUSrcA 1 ALUSch on ALUOp 10 R type completion RegDst 1 RegWrite MemtoReg 0 To state 0 Figure 537 Finite state machine for branch instruction From state 1 Op 39BEQ39 Branch completion ALUSrcA 1 ALUSch 00 ALUOp 01 PCWriteCond PCSource 01 To state 0 Figure 537 Finite State Machine for jump From state 1 OP 39J39 Jump comple ion PCWrite PCSource 10 To state 0 Figure 537 Summary Action for Rtype Action for Action for Regmmsmu Load MDR ALUOut Complete Finite State Machine Wmcimn 2th lnsliuclmn decade quotmm mm Start Mammy Eddies campulnlmn Pcvvme PCSmAme m inierhack step RegDs u Renge Memla eg 1 Simple Questions How many cycles will it take to execute this code lw t2 0t3 lw t3 4t3 beq t2 t3 Label add t5 t2 t3 sw t5 8t3 assume not Label What is going on during the 8th cycle of execution WWW In what cycle does the actual addition of t2 and t3 takes place Exceptions Hardest part of control is to implement exceptions amp interrupts lO device request External Interrupt Invoke the operating Internal Exception system from user program Arithmetic over ow Internal Exception Using an unde ned Internal Exception instruction Internal or Exception or malfunctions External interrupt How are exceptions handled In our design we will consider two types of exceptions Arithmetic overflow Execution of an undefined instruction Actions on exception Save address of offending instruction in the Exception Program Counter lEPCl Transfer control to the operating system at a prespecified address exception handler OS then takes appropriate action Exception handling For the OS to take appropriate action it must know the reason for the exception Two ways to communicate reason to OS Have a Status register which holds a field that indicates the reason for the exception Vectored interrupts Address to which control is transferred depends upon the cause of the exception MIPS uses first method above it has a register called Cause in addition to the EPC register Datapath amp Control with support for exceptions Exception Handling Datapath additions EPC Cause for undefined instruction Cause 0 arithmetic overflow Cause 1 Control Signals EPCWrite CauseWrite lntCause sets the Cause register PCSrc has to be modified so that one of its sources is the OS entry point Three steps 1 Write Cause 2 EPC PC 4 Have to use ALU so need to expand multiplexors for ALUSrcA and ALUSch 3 Write PC Datapath amp Control with support for exceptions States for handling exceptions To State 0 to begin next instruction PCWriie PCSource 11 Complete FSM including support for exceptions insimcimn decade Registev rat2h Memawaddvess campmaimn Implementing the Control Value of control signals is dependent upon what instruction is being executed which step is being performed Use the information we ve acculumated to specify a finite state machine specify the finite state machine graphically or use microprogramming Implementation can be derived from specification Graphical Specification of FSM insrmciiun mm mm Wm Wm gw mnpuli lzm Pcvvme ALUSrcA ALUSch PcSaurce m 1 an ALUOP m wmmm as Finite State Machine for Control Implementation camm iagm Outputs 33331331 323 WW 532 mm upmmm PLA Implementation lfl picked a horizontal or vertical line could you explain it ROM Implementation ROM quotRead Only Memoryquot values of memory locations are fixed ahead of time A ROM can be used to implement atruth ta e ifthe address is mbits we can address 239quot entries in the ROM our outputs are the bits of datathat the address points to Hwooooow Howoooow m is the quotheightquot and n is the quotwidthquot ROM Implementation How many inputs are there 6 bits for opcode 4 bits for state 10 address lines ie 210 1024 different addresses How many outputs are there 16 datapathcontrol outputs 4 state bits 20 outputs ROM is 210 x 20 20K bits and a rather unusual size Rather wasteful since for lots of the entries the outputs are the same ie opcode is often ignored MIPS Programming CS 365 Lecture 3 Prof Yih Huang CS 365 Exercise 1 El Create a routine at address 1000 to add up 1 to 10 Store the result in 1FFO CS 365 Procedure Call Conventions El Passing argumentsparameters 4 parameters are passed in registers 4 7 a0 a3 Additional ones in memory El Returning results Results are returned in register V0 EIThose are conventions CS 365 Procedure Frames El A procedure frame contains all data pertaining to one instance of a procedure one clone one procedure frame Local variables in registers or memory Parameters in registers or memory Return address El Procedure frames are stored in a stack El Register 29 sp for stack pointer points to the top location of the stack CS 365 Example 1 F GO HO int ab char X int aX GO HO GO F s Frame G s Frame H s Frame int ab char X int aX return addr return addr return addr Example 2 Fabint n CS 365 if n0 return 0 if n1 return 1 return Fabn 1 Fabn 2 Fab s Frame int n Storage for Fabn1 F abn2 return value return addr Procedure Frames on MIPS Higher memory addresses Argument 6 Argument 5 fp gt Saved registers Stack grows Local variables 1 sp gt Lower memory addresses CS 365 Dutles of a MIPS Caller 1 Save on the stack registers aOa3 tOt9 they are called callersaved registers 2 Pass arguments First 4 are in aOa3 Remaining are pushed on to stack and appear at the beginning of called procedure s stack frame 3 Execute jal CS 365 CS 365 Duties of a MIPS Callee El Before it starts running 1 Allocate memory for stack frame 2 Save s0s7 fp ra in the stack frame they are called calleesaved 3 Establish frame pointer El Before returning 1 Place return value in v0 2 Restore all calleesaved registers 3 Pop the stack frame 4 Return by jumping to ra CS 365 Example The Frame of G on the Platform High Memory D has fp saved fp 1 t 4fp ra parame er ppp 8fp 53 3 local var1ables 12fp xx XX uses t1 16fp yy yy uses t2 20fp zz zz uses s3 sp A return value Low Memory The Entrance Code of G sw fp 0sp save the fp of my claller mv fp sp setup my fp sw ra 4fp save the ra of my caller sw s3 8fp save the s3 value of my caller addi sp fp 24 setup my sp Function body of G starts variable XX is accessed by 12fp variable yy is accessed by l6fp variable 22 is accessed by 20fp CS 365 11 High Memory Example the Frame of F 12fp p7 8fp p6 4fp p5 El has fp 39 saved fp 7 parameters Egg 3 p S p1p2p3p4p5p6p7 12fp X 2 local variables 16fp y X uses t2 513 39 y uses s3 Low Memory CS 365 12 When 0fp saved fp My fp 4fp ra calls 8fp s3 12fp XX l6fp yy 20fp 22 Default sp in G 1 2 Caller saved registers of G a0 p7 6 Arguments P to F p5 sp when G calls F CS 365 The Code for G to Call F sw tl 0sp backup my tl sw t2 4sp backup my t2 sw a0 8sp backup my a0 ppp copy pl to p4 to a0 to a3 Store p7 p6 p5 in 12sp 16sp 20sp jal f LabelX stored in ra addi sp sp 24 Delay slot of jal Label X CS 365 addi sp sp 24 restore the normal sp lw tl Osp restore my tl lw t2 4sp restore my t2 lw a0 8sp restore my a0 G continues fp of G s caller fp ofG H ra of G s caller 53 of G s caller Stack Grows XX yy ZZ tl ofG sp ofG t2 of G a0 ofG P7 ofF P6 ofF P5 ofF sp of G fp of F fp ofG quotwhen ra of G calling F 53 ofG X y sp of F CS 365 Exercise Give the Entrance Code of F CS 365 Chapter 2 Performance Measure Report and Summarize Make intelligent choices See through the marketing hype Key to understanding underlying organizational motivation Why is some harztware better than athersfar m erentpragrams What factors of system performance are harztware related e g Do we need a new machine or a new operating system How does the machine s instruction set a ect performance Which of these airplanes has the best performance lme passequot 239s Rm 2 m 5 22d Eaemg 731mm 1m 53m 592 Eaemg 747 Am mu m EACSvdcammdz 132 Aann 135m DmghsDCVXVSEI 146 272 544 Hnw rrum user Is the Cnncnme cnmmmd m the 7m Hnw rrum hluuer Is the m Ihm the Dnunlzs non TWO no 39 ns of performance le2 NEW Pans Shem passengeus 39 39 quot F quot Enema m a 5mm mumph om 28mm BADSud cnmm 3 huuvs 135 mph 32 17mm Which has higher performance ncy Tzsls er dz hnur week see us Pe nrmznce pu 7 through ban W R espunse me and hvuughpm u en ave m uppusmun Computer Performance TIME TIME TIME Response Time latency How long does it take for my job to run How long does it take to execute ajob How long must I wait for the database query Throughput Howmany jobs can the machine run at once What is the average execution rate How much work is getting done If we upng a machine with a new processor what 110 we increase If we add a new machine to the lab what 110 we increase Execution Time Elapsed Time counts everything disk and memory accesses IO etc a useful number but often not good for comparison purposes CPU time doesn39t count IIO or time spent running other programs can be broken up into system time and user time Our focus user CPU time time spent executing the lines of code that are quotinquot our program Book39s Definition of Performance For some program running on machine X PerformanceX 1 Execution timeX quotX is n times faster than Yquot PerformanceX I PerformanceY n Problem machineA runs a program in 20 seconds machine B runs the same program in 25 seconds Clock Cycles Instead of reporting execution time in seconds we often use cycles seconds cycles seconds X program program cycle Clock ticks indicate when to start activities one abstraction Lime cycle time time between ticks seconds per cycle clock rate frequency cycles per second 1 Hz 1 cyclesec A 200 Mhz clock has a 6 x109 5 nanoseconds cycle time 200 X 10 How to Improve Performance seconds 7 cycles X seconds program program cycle So to improve performance everything else being equal you can either reduce the of required cycles for a program or decrease the clock cycle time or said another way Increase the clock rate How many cycles are required for a program Could assume that of cycles of instructions st instruction nd instruction r i rd instruction This assumption is incorrect dz erent instructions take different amounts of time on dz erent machines Why hint remember that these are machine instructions not lines of C code Different numbers of cycles for different instructions IIIl tlme Multiplication takes more time than addition Floating point operations take longer than integer ones Accessing memory takes more time than accessing registers Impartantpaint changing the cycle time often changes the number of cycles required for various instructions more later Example Our favorite program runs in 10 seconds on computer A which has a 400 Mhz clock We are trying to help a computer designer build a new machine B that will run this program in 6 seconds The designer can use new or perhaps more expensive technology to substantially increase the clock rate but has informed us that this increase will affect the rest of the CPU design causing machine B to require 12 times as many clock cycles as machine A for the same program What clock rate should we tel the designer to targetquot Example Let C number of cycles Execution time C X clock cycle time CI clock rate On computer A CI400 MHz CI400 X 106 10 seconds gt C 400 X 107 On computer B number of cycles 12 X C What should be B s clock rate so that ourfavorite program has smaller execution time 12 X CI clock rate lt10 gt12 X400 X 10710 lt clock rate e clock rate gt 480 MHz Now that we understand cycles A given program will require some number of instructions machine instructions some number of cycles some number of seconds We have a vocabulary that relates these quantities cycle time seconds per cycle clock rate cycles per second CPI cycles per instruction 3 39 H L39LCPI CPI Average cycles per instruction for the program Consider a program with 5 instructions Instruction Another way of saying it is 11 5 X 22 OR CPU cycles instructions gtlt CPI Aspects of CPU Performance seconds instructims cycles seconds cpu tlme program program in tmctim vr le Instruction CPI Clock cycle Count ime Program X X Compiler X X Instruction X X Set Organization X X Technology X Program Performance seconds instructims cycles seconds cpu t1me program program 1nstruct1m cycle Measuring the components of CPU time for a given program Instruction Count Pro ler or simulator CPI Depends on hardware implementation AND the given program s instruction mix Clock rate published Assignment 3 Find the components of CPU time for given program salts and the program you wrote for Assignment 2 matrix multiplication program You re given Clock rate cycles for different categories of instructions arithmetic data handling etc You have to find Number of instructions executed by your program CPI for your program Use the profiling technique explained in class Performance Performance is determined by execution time Do any of the other variables equal performance of cycles to execute program of instructions in program of cycles per second average of cycles per instruction average of instructions per second Common pitfall thinking one of the variables is indicative of performance when it really isn t CPI Average cycles per instructionquot CPI CPU Time x Clock Rate Instruction Count Clock Cycles Instruction Count CPU time Clock cycle time x ZCPIJ XIJ 51 1 I CPI Z CPI X F Where F 1 J J J Instructlon Count quotinstruction frequencyquot CPI Example Suppose we have two implementations of the same instruction set architecture ISA For some program Machine A has a clock cycle time of 10 ns and a CPI of 20 Machine B has a clock cycle time of 20 ns and a CPI of 12 What machine is faster forthis program and by how much If two machines have the same ISA which afaur quanti es e g clock rate CPI execution time of instructions will always be identical CPI Example For machine A CPU time IC gtlt CPI gtlt Clock cycle time CPU time IC X 20 X 10 ns 201C us For machine B CPU time IC X 12 X 20 ns 241C ns of Instructions Example A compiler designer is trying to decide between two code sequences for a particular machine Based on the hardware implementation there are three different classes of instructions Class A Class B and Class C and they require one two and three cycles respectively The first code sequence has 5 instructions 2 of A 1 of B and 2 ofC The second sequence has 6 instructions 4 of A 1 of B and 1 of C Which sequence will be faster How much What is the CPI for each sequence of Instructions Example A compiler designer is trying to decide between two code sequences for a particular machine Based on the hardware implementation there are three different classes of instructions Class A Class B and Class C and they require one two and three cycles respectively The first code sequence has 5 instructions 2 of A 1 of B and 2 ofC The second sequence has 6 instructions 4 of A 1 of B and 1 of C Which sequence will be faster How much What is the CPI for each sequence 2 1 1 2 2 3 10 CPI for sequence 1 w 212 CPI for sequence 2 W 4 1 1 CPU cycles for sequence 1 105 x 5 10 CPU cycles for sequence 2 96 x 6 9 9 6 MIPS example Two different compilers are being tested for a 100 MHz machinewith three different classes of instructions Class A Class B and Class C which require one two and three cycles respectively Both compilers are used to produce code for a large piece of software The first com piler39s code uses 5 million Class A instructions 1 million Class B instructions and 1 million Class C instructions The second com piler39s code uses 10 million Class A instructions1 million Class B instructions and 1 million Class C instructions Which sequence will be faster according to MIPS Which sequence will be faster according to execution time MIPS example Two different compilers are being tested for a 100 MHz machine with three different classes of instructions Class A Class B and Class C which require one two andthree 39 39 a ag b software The first Compiler39s code usess million Class A instructions 1 million Class B instructions and 1 million Class C instructions u u A million Class B instructions and 1 million Class C instructions Which sequence will be faster according to MIPS Which sequence will be faster according to execution time m Number of Instmcticns Executiorltirrlegtlt106 1C ICgtlt clockrate clockrate ICgtlt CPIgtlt c1ockcyc1etimex106 ICgtlt CPIgtlt106 CPIgtlt106 Forsequence A 5gtlt 1 1gtlt21gtlt3gtlt106 710 CPI 6 511gtlt10 7 10 Executlontlme 5 1 1gtlt 106 XTX 0 lseconds 1 100x105 100x106 7 MIPamp 6 e70 137x10 Forsequence B executiontime01556conds butMIPS80 lll Profiling a program Identifying the basic blocks while P do if chen A else B fi if R then breakfi C od and Profiling a program 1 Find basic blocks of program 2 Create a counter variable for each basic block 3 Insert code that increments the counter for a basic block at the beginning ofthat block 4 Print out counters at the end of the program 5 Count instructions in each basic block 6 From steps 5 and 6 you have info about the instructions executed by the program Bas s of Evaluat on Pvus vepvesemawe u a 2 Ac ua Tamel kamau mmcuMu mm m measuve wamm memw cause 39Puname mm used dessvwvesemawe umpmvements FuHApphcahun Benchmavks usem m veamv wasw mm 23W m was desmn owe Mdenmv peak pEak mav be 3 mm capammv and W 1mm anphcauun pmerma bm enecks penuvmance Bench marks Performance best determined by running a real application Use programs typical of expected workload Or typical of expected class of applications eg compilerseditors scientific applications graphics etc Small benchmarks nice for architects and designers easy to standardize can be abused SPEC System Performance Evaluation Cooperative companies have agreed on a set of real program and inputs can still be abused Intel s other bug valuable indicator of performance and compiler technology SPEC 89 Compiler enhancements and performance SPEC pennrmance ralin Eco espressu splcE duduc quot3537 l Ennlml malnxSEIEI mum lnmcalv Benchmark Cummlev I mm mm SPEC 95 spscm SPEC 95 Does doubling the clock rate double the performance Can a machine with a slower clock rate have better performance 5n mu 15m cm m2 MHI 2m I I Pemum p emum Pm spscw 1 an an Inn cm m w H a I p 2mm Amdahl39s Law Execution Time After Improvement Execution Time Unaffected Execution Time Affected Amount of Improvement Example quotSuppose a program runs in 100 seconds on a machine with multiply responsible for 80 seconds of this time How much do we have to improve the speed of multiplication if we want the program to run 4 times fasterquot How about making it 5 times faster Principle Make the common case fast Example Suppose we enhance a machine making all oatingpoint instructions run ve times faster If the execution time of some benchmark before the oatingpoint enhancement is 10 seconds what will the speedup be if half of the 10 seconds is spent executingfloating point instructions We are looking for a benchmark to show off the new floatingpoint unit described above and want the overall benchmark to show a speedup of 3 One benchmark we are considering runs for 100 seconds with t e d oatingpoint hardware How much of the execution time would oating point instructions have to account for in this program in order to yield our desired speedup on this benchmark CS 365 Instruction Set Architectures CS 365 Lecture 2 Prof Yih Huang CS 365 Instruction Set Architecture De nition El Data types and structures Encoding and representation El Instruction set El Instruction format EIAddressing modes accessing data and instructions El Exception conditions Examples El IBM 3 603 70 El Motorola PowerPC El DEC VAX Alpha El HP PARISC El Sun Sparc El SGI MIPS El Intel X86 CS 365 Registers El Registers are a small set of storage cells inside the processor MIPS provides 32 X86 provides 8 for arithmetic El Registers are directly accessible through machine instructions El Registers are much much faster than memory CS 365 MIPS Registers Name Register Usage zero 0 The constant value 0 at 1 By assembler only vovl 23 Values for results and expression evaluation a0a3 47 Arguments t0t7 815 Temporaries s0s7 1623 Saved t8t9 2425 More temporaries k0kl 2627 Reserved for the operating system gp 28 Global pointer sp 29 Stack pointer fp 30 Frame pointer ra 31 Return pointer CS 365 Discussion EIUsages are software conventions They are not built into the hardware El Except registers 0 and 31 all others are treated the same by the hardware Register 0 is hardwired 0 Writing to it will not changed its value Register 31 is implied in the jal instruction CS 365 MIPS Arithmetic El Allinstructions have 3 operands all of them registers EIExample Ccode ABCD E F A MIPScode add tO 31 32 add 80 tO s3 sub 54 85 80 El Variables must be moved between registers and memory before and after computation CS 365 Arithmetic Instruction Format CS 365 I opcode l rt 5 bits rs l 5 bits El opcode000000 El rd rs funct rt El Functions see the third table of Fig 318 ADDSUBMULTDIV SignedUnsigned AND OR XOR NOR Set On Less SignedUnsigned EIWe ignore shamt shift amount for now l rd 5 bits lshamtl funct I 6 bits 5 bits 6 bits Exercises El add t0 s1 s2 000000 10001 100101010001000001 100000 opcode rs rt rd shamt funct Elxor a2 t8 V0 opcode rs rt rd shamt funct CS 365 Arithmetic with Immediates Immediate I I opcode rs rt 6 bits 5 bits 5 bits 16 bits El opcode001XXX Elrt rs op Immediate EIXXX determines operations addi addiu add immd signedunsigned andi ori xori lui Load Upper Immediate El See the 2nd row in 1st table of Fig 225 CS 365 Exercises Eladd t1 s3 10 001000 110011 101001 1 0000000000001010 XXX000 rss3 rttl Immediate10 Eladd t1 s3 10 0010001100111010011 XXX000 rss3 11t1 Immediate10 CS 365 11 Load 32bit Constants El Example to load the number below into s0 00000000 00111101 00001001 00000000 V V 61 2304 El Step 1 lui s0 61 opcode 001111 The value of s0 becomes 00000000 00111101 00000000 00000000 Cl Step 2 addi 0 s0 2304 CS 365 12 Memory CIA set of data entries indexed by addresses El Typically the basic data unit is byte Elln 32 bit machines 4 bytes grouped to words El Have you seen the DRAM chips in your PC 7 CS 365 0000 0001 0002 0003 0004 0005 0006 0007 0008 0009 000A OOOB 000C 000D 000E 000F one Where Are the Variables El Each variable has a home in memory A particular location in the memory is assigned to store the content of the variable CI The address of a variable is determined by The compiler in high level languages You if you do machine programming EIVariables are moved from memory to registers before computations CS 365 Example C AB Memory 0000 0001 0002 Processor 0003 so 0004 0005 X 31 0006 32 0007 0008 9 3333 i B 0008 Reglsters 0000 0000 000E C OOOF a 3 CS 365 15 D1scuss10ns El Registers are part of the processor Fast but limited in numbers EIMemory is slooooooooooooooooooow El For best performance Use registers for important data Minimize the traffic tofrom memory CS 365 16 Load and Store Instructions El Move data between memory and registers El Example Ccode A7 h A8 MIPScode lw t0 3253 add t0 52 t0 sw tO 28s3 El Store word sw has destination last CS 365 17 Instruction Format I opcode rs rt Offset 6 bits 5 bits 5 bits 16 bits El Load opcode100XXX rt Memoryrs Offset El Store opcode 101XXX Memoryrs Offset rt EIXXX determines data sizes byte half word word El See the 5th and 6th rows in 1st table of Fig 225 CS 365 18 Example Ellw t2 24s3 100011 1100111010101 0000000000001100 XXX011 rss3 Itt2 Offset24 El sw Vl 8sp XXX rsV1 Itsp Offset 8 CS 365 19 Stored Program Concept El Instructions are bits EIPrograms are stored in memory just like data EIHow does the processor distinguish instructions from data It does not until recently AMD64 does make the distinction CS 365 20 Instruction Execution Cycles CIA special register called Program Counter PC points to the address of the next instruction for execution El CPU reads from memory the instruction pointed by PC CI The control logic makes the designated function carried out El PC is increased to point to the next instruction PC 4 in the case of MIPS CS 365 21 Example El Assembly Instructions 1w t2 2483 add 80 81 82 El Machine instructions in memory 1004 10001110 01101010 00000000 00001100 1008 00000010 00110010 10000000 00100000 CS 365 22 Example Continued Memory E Processor 1004 10001110 SO 01101010 SI 00000000 32 00001100 5 1008 00000010 i O O l l O O l O Registers 10000000 00100000 3 00001004 PC CS 365 23 Control Instructions El Sequential execution is achieved by increasing PC 4 EINonsequential executions If switch loops in C Need to give a different next instruction address other than PC4 El Control instructions update the PC EIThey are also called branch instructions for they allow the processor to branch from thede n texecuuonpa i CS 365 24 Conditional Braches El MIPS conditional branch instructions bne tO tl Label beq tO tl Label EIExample if ij h i j bne 50 51 Label add 53 50 51 Label CS 365 25 Instruction Format I opcode rs rt Offset 6 bits 5 bits 5 bits 16 bits El opcode000XXX where XXX 000 El PCPCOffset if test conditiontrue El Test conditions XXX100 rsrt beq XXX101 rs rt bne XXX110 rslt0 blez XXX111 rsgt0 bgtz CS 365 26 Unconditional Branches EIAssembly j label I opcode 26bit Immediate El Action CS 365 27 Control El MIPS unconditional branch instructions j label El Example if ilj beq 84 85 Labl hij add 83 84 85 j Lab2 el8e hi j Labl 8ub 83 84 85 Lab2 CS 365 Exercise El Give the machine instruction of the beq in the previous page CS 365 29 Supporting Procedure Calls F0 GO HO GO HO El Observe the control ow thru the 3 proc EINeed machine instructions to remember the return point CS 365 30 JAL Jump and Link EIJump to address of procedure While storing the return address PC4 in register 31 ra and jump to target 000011 26bit Immediate EITo return from the procedure execute j r 3 1 CS 365 j r jump register instruction El j 1 51 jump to address in s1 Iopcode rs rt rd shamt funct I 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits El Opcode000000 funct001000 El Action PCers CS 365 Summary MIPS Instruction Formats Rformat IopcodeI rs I rt I rd IshamtI funct I Iformat IopcodeI rs I rt I 16bitlmmediate I Jformat I opcode I 26bit Immediate I CS 365 33 Assembly vs Machine Languages CIAssembly provides convenient symbolic representation much easier than writing down numbers eg destination first EIMachine language is the underlying reality eg destination is no longer first CS 365 34 Floating Point Arithmetic CS 365 Lecture 8 Prof Yih Huang C5365 Scienti c Notation exponent decimal point Sign magnitude 602 x 1023 Mantissa radix base Sign magnitude C5365 Floating Point Numbers EIWe need a way to represent numbers with fractions eg 31416 very small numbers eg 000000001 very large numbers eg 315576 gtlt109 El Representation sign exponent significand 1Sign gtlt significand x 2 more bits for significand gives more accuracy more bits for exponent increases range exponent C5365 Issues EIArithmetic X El Representation Normal form El Range and Precision El Rounding El Exceptions eg divide by zero over ow under ow El Errors C5365 C5365 IEEE 754 Standards El single precision 8 bit exponent 23 bit significand El double precision 11 bit exponent 52 bit significand C5365 IEEE 754 Single Precision 398 E Exponent M Magnitude 1 8 bits 23 bits El Exponent biased 127 binary integer Real exponent e E 127 El Mantissa Signed magnitude normalized significant with a hidden 1 EIValue I p p I I p 1 1Mfgtlt 215127 Discussions EILeading 1 bit of signi cand is implicit EIEXponent is biased bias of 127 all Os is smallest exponent all Is is largest El Magnitude of numbers that can be represented 239126 to 21272 23923 EIOr in decimal numbers 18X103938 to 34X1038 C5365 Exercise El Give the single precision representation of 10 CI Give the single precision representation of 0875 El Give the single precision representation of 01 C5365 Exercise El Calculate the value of 010l00l0l1l0l00l1l1l0l0l0lol0l0l0l0lol0l0l0l0l0l0l0l0l0l0l0 C5365 C5365 Floating Point Arithmetic 112 00000012 El Step 1 alignment of decimal point El Step 2 add significands El Step 3 normalization
Are you sure you want to buy this material for
You're already Subscribed!
Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'