COMPUTER ORGANIZATION CSCI 2500
These 83 pages of class notes were uploaded by Santos Fadel on Monday, October 19, 2015. The notes belong to CSCI 2500 at Rensselaer Polytechnic Institute, taught by James Teresco in Fall. Since upload, they have received 25 views.
Date Created: 10/19/15
Computer Science 2500 — Computer Organization
Rensselaer Polytechnic Institute, Spring 2009

Topic Notes: Data Paths and Control

We have spent time looking at the MIPS instruction set architecture and building up components from digital logic primitives. Our next goal is to see how we can use the physical devices we have studied to construct the hardware that can execute MIPS instructions.

A MIPS Subset Implementation

To keep things manageable, we will consider a subset of key MIPS instructions:

1. memory access: lw, sw
2. arithmetic/logical: add, sub, and, or, slt
3. control flow: beq, j

The same ideas and techniques that we will use to implement this basic subset can be used to build a machine to implement the entire MIPS ISA, and in fact most modern ISAs.

Let's think about what needs to be done to implement these instructions. First, recall the loop that our machine will execute:

1. Fetch the instruction from memory at the location indicated by the program counter.
2. Update the program counter to point to the next instruction to be executed.
3. Decode the instruction.
4. Execute the instruction.
5. Go to 1.

The first two steps are done the same way regardless of the instruction we're going to execute. During the decode and execution steps, the implementation becomes instruction-specific. At that point, the instruction may result in a register value being written into memory, a register value being read from memory, two registers being set as inputs to the ALU and the ALU result written back to another register, or the PC possibly modified by a conditional branch or unconditional jump.

We'll follow the basic implementation in P&H Chapter 4. We start with an abstract view and fill in the details. The first view is in Figure 4.1. Let's understand what's in this diagram.

Functional units:

- Separate instruction and data memories: this allows the instruction to be fetched and a data value to be read or written from memory in the same instruction cycle.
- Register file: the 32 32-bit
registers we saw earlier in the semester.
- Program counter (PC) register.
- Main ALU.
- Two additional adders: one that always adds 4 to the PC, for when we are simply going to advance to the next instruction, and another that computes branch targets.

Data paths:

- The PC gets passed to the instruction memory and to the +4 adder.
- The fetched instruction is decoded and appropriate bits are sent: to the input of the second (branch target) adder, when PC offsets are part of the instruction; to the register file, to determine which registers are to be used by the instruction (needed by nearly all instructions); and directly as an ALU input, for immediate-mode operands.
- The result of the PC adders is sent back to update the PC.
- Register file outputs are fed into the ALU and into the data memory.
- Main ALU outputs can determine the address for a main memory access, or can be fed back to the register file for storage.
- A value read from the data memory may also be passed back to be stored in the register file.

This view is too simplistic for several reasons. In our first refinement of the original abstract diagram, we add some multiplexers and control lines. This refinement is shown in P&H in Figure 4.2. What have we added in this refinement?

- Multiplexors replace the "wired or" points in the diagram — those places where two possible inputs come together. The MUX at the top selects which value is used to update the PC. The MUX whose output goes to the data input of the register file selects between an ALU result and a value read from memory. The MUX whose output goes to the main ALU input selects between a second register and an immediate value taken from a bit field of the instruction.
- Control lines to determine the operation of the individual components. The control structure is guided mainly by the instruction, hence the new communication path from the instruction to the Control oval. That Control decodes the instruction and determines which of our functional units are involved in this instruction and what operations they need to perform.
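The five-step instruction cycle above can be sketched as a small simulation. Here `decode` and `execute` are hypothetical stand-ins for the hardware's instruction-specific work (they are not SPIM or textbook code), and instruction memory is modeled as a simple address-to-instruction mapping:

```python
def run(imem, state, decode, execute):
    """Sketch of the five-step instruction cycle: fetch, update PC,
    decode, execute, repeat. imem maps addresses to instruction words."""
    while state["pc"] in imem:
        instr = imem[state["pc"]]   # 1. fetch at the location in the PC
        state["pc"] += 4            # 2. update the PC
        op = decode(instr)          # 3. decode
        execute(op, state)          # 4. execute (a branch may overwrite pc)
                                    # 5. go back to step 1
```

A taken branch would simply have `execute` overwrite `state["pc"]` — which is exactly the job of the PC multiplexor in the data path.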
- If the instruction involves storing a value in a register, the RegWrite line is set, and the value sent to the Data input of the register file is stored in the destination register.
- If the instruction is lw, the MemRead line is set, causing the data memory to retrieve the value at the address computed by the ALU and send it to the data input of the register file; this also requires that the MUX select the memory output to be passed to that input.
- If the instruction is sw, the MemWrite line is set, causing the value retrieved from the source register to be written to the memory location determined by the output of the main ALU.
- The main ALU is always computing something, and in those cases where its result is important, a set of control lines tells the ALU which of its functions to compute.
- Finally, if it is a branch instruction, the Branch line is set. If the main ALU also produced a zero result, which would cause the ALU to set the Zero line, the PC MUX selects the value from the branch target adder instead of the +4 adder to be passed to the PC.

Building these Components

Our design consists of:

1. combinational units — ALUs
2. state elements — memories and registers
3. data signals — information that is stored in memory or registers, or used as inputs and outputs of ALUs
4. control signals — information that controls what the combinational units are to compute, which values should be passed through by the multiplexors, and when state elements should assert their values as data signals (drive the bus) or update their values based on input data signals

All of our components need to be synchronized properly to ensure that inputs and outputs are available at appropriate times. For example, we have seen flip-flops, and registers built from those flip-flops, that load new values only on the leading edge of the clock. In those cases we need to make sure that the input is presented and control lines set appropriately when the clock that controls those flip-flops goes high.
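As a rough sketch (an illustrative table, not the textbook's exact encoding), the control lines named above can be modeled per instruction class, along with the PC-update rule that combines Branch with the ALU's Zero output:

```python
# Illustrative control table for the subset; the signal names follow the text.
CONTROL = {
    "rformat": dict(reg_write=1, mem_read=0, mem_write=0, branch=0),
    "lw":      dict(reg_write=1, mem_read=1, mem_write=0, branch=0),
    "sw":      dict(reg_write=0, mem_read=0, mem_write=1, branch=0),
    "beq":     dict(reg_write=0, mem_read=0, mem_write=0, branch=1),
}

def next_pc(pc, instr_class, zero, branch_target):
    """The branch target is selected only when Branch is set AND the
    ALU's Zero line is high; otherwise the PC advances by 4."""
    if CONTROL[instr_class]["branch"] and zero:
        return branch_target
    return pc + 4
```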
This is sometimes easy — the value is ready and available when we need it. Other times the timing is more subtle. Since combinational units are always computing, we need to make sure the input values are presented, and the control lines remain correct, long enough for the output to be computed and captured.

We consider subsets of the design proposed earlier. First, the PC and instruction memory:

- Memory is usually thought of as a state element, but the instruction memory is never modified by our simple data path, so it is always producing the instruction value at the location specified on the instruction address lines. Of course, the program has to get there somehow, so there must be a write capability, but we will not consider it for now.
- The PC is just a single register. It can always be writing its value to the instruction address input, and should read a new value at the end of the instruction execute cycle, once we have computed the new PC value.
- The adders are combinational units along the lines of those we constructed. One is hardwired to add 4 and could be replaced with a simpler circuit than a ripple-carry adder if we wanted to save some gates and delay — remember your lab problem.

Next we consider the implementation of R-format instructions:

    op $t1, $t2, $t3

This will write a value to register $t1 as a result of applying the specified operation to $t2 and $t3. Thus we need our register file to be able to produce two output data values and receive one input data value. We also need to be able to determine which of the 32 registers is to be used for each operand. This information comes directly from the bits of the instruction that specify the two source registers (rs and rt) and the destination register (rd). To achieve this, we can first decode those 5-bit values using 3 5-to-32 decoders, calling the decoded signals rs0, rs1, ..., rs31; rt0, rt1, ..., rt31; and rd0, rd1, ..., rd31. We can then implement each register i as a 32-bit register.
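Behaviorally, the register file just described — two read ports selected by rs and rt, one write port gated by RegWrite — can be sketched as a functional model (the decoder-level circuit is abstracted away):

```python
class RegisterFile:
    """Functional model: 32 32-bit registers, two read ports (rs, rt),
    one write port (rd) gated by the RegWrite control line."""
    def __init__(self):
        self.regs = [0] * 32

    def read(self, rs, rt):
        return self.regs[rs], self.regs[rt]

    def write(self, rd, value, reg_write):
        # Register $0 is hardwired to zero in MIPS, so writes to it
        # have no effect.
        if reg_write and rd != 0:
            self.regs[rd] = value & 0xFFFFFFFF
```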
[Figure: register i, showing the CLK line, the data input bus, and outputs to the rs and rt buses]

That takes care of the register file. What about the main ALU? Appendix C (on the CD with the text) describes the ALU construction. We have seen just about everything we need, so we will look at the book's figures to see how an ALU tailored to this MIPS subset can be constructed. Key ideas:

- Construct circuits to compute each needed output.
- Multiplex the outputs based on an operation-selection control line set.
- AND and OR are trivial.
- We know how to build an adder/subtractor; this ALU works similarly.
- We also need slt support (set on less than). We can tell that a < b if we find that a - b < 0; however, the bit of the ALU that detects whether this value is negative is the high-order (sign) bit, but we want to set the low-order bit in this case. Appendix C shows the special circuitry needed to accomplish this: all bits have a "Less" input, which will be 0 on all but the low-order bit, where it is connected to a copy of the sign bit.
- And to support beq, we need an output that subtracts one of the registers being compared from the other, then checks if the result is zero.
- The ALU has 4 control lines but only 6 meaningful combinations, as seen in Figure C.5.13. In this table, the first control line is Ainvert, which is only used for the NOR functionality; the second is Bnegate, used when we want subtraction, either for the sub instruction, or because we need the result of a subtraction for slt, or for NOR, where we only care about the invert part of the negate; the last two control the multiplexor that selects among the outputs of the AND gate, OR gate, full adder, or Less.

Implementing Remaining Instructions in our Subset

First, the load and store instructions, which are in the I-format:

    lw $t0, 1200($t1)
    sw $t2, 32($t1)

In either case, we need to retrieve the value of $t1 from the register file and add to it the offset, which is part of the instruction itself. We'll use the main ALU for this. The value computed is the effective address for the memory access.
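Stepping back for a moment, the ALU behavior described above can be checked with a functional sketch. This is a simplification: Bnegate is modeled as full two's-complement negation (so the invert-only NOR case is omitted), and the slt overflow corner cases are ignored:

```python
MASK = 0xFFFFFFFF

def alu(a, b, b_negate, op):
    """Functional sketch of the subset ALU: Bnegate plus a 2-bit
    operation select (0=AND, 1=OR, 2=add, 3=slt). The second value
    returned models the Zero output line used by beq."""
    if b_negate:
        b = (~b + 1) & MASK        # two's complement: add becomes subtract
    if op == 0:
        result = a & b
    elif op == 1:
        result = a | b
    elif op == 2:
        result = (a + b) & MASK
    else:
        # slt: the Less input of bit 0 is a copy of the sign bit of a - b
        result = ((a + b) & MASK) >> 31
    return result, int(result == 0)
```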
We can't just take the 16 bits from the instruction and add them directly to the contents of the base register. The offset may be negative, so we will need a sign extension unit that copies the contents of the high-order bit of the offset into all higher bits, giving us the 32-bit equivalent.

For a lw instruction, we instruct the memory to retrieve the value at the effective address, and it will be stored back in the register file at the destination. For a sw instruction, we need to take the value in the source register from the register file, present it as input to the memory, and tell the memory to write. This gives the data path shown in Figure 4.10 of P&H, which ignores branch and jump instructions.

For the beq instruction, we also need to sign extend the offset value, but then shift it left by 2 before feeding it to the branch target adder. The left shift by 2 accounts for the fact that the branch offset is a number of words, not bytes, that must be added to the PC+4 value to obtain the branch target location. Other than that, we need to send the two register values to the ALU to see if they are equal, which we accomplish by testing whether the difference produces a 0 result. The data path for this part is shown in Figure 4.9 of P&H.

If we put together everything we have seen so far, we get the data path of Figure 4.11 of P&H. This handles all of our instructions except j.

Adding Control

Now we want to add the details of the control to the data path. First we consider the ALU. We saw that it has 4 control lines. When do we want to set these lines? This process is a familiar one for us: based on the instruction's opcode field, and, if the opcode indicates an R-format instruction, the funct field, we can compute 4 expressions, and hence circuits, that set the ALU control lines appropriately. Figures 4.12 and 4.13 in P&H show some details of this, but we will not worry about those details at this point.

Next we consider how the fields of the instruction are used to construct the rest of the needed control signals.
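The sign extension unit and the shift-left-2 branch-target arithmetic described above can be checked numerically — this models the arithmetic, not the wiring:

```python
def sign_extend16(value):
    """Copy the high-order bit of a 16-bit offset into all higher bits,
    producing the 32-bit equivalent."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value

def branch_target(pc, offset16):
    """Target = (PC + 4) + (sign-extended offset << 2): the branch
    offset counts words, not bytes."""
    off = sign_extend16(offset16)
    if off & 0x80000000:            # reinterpret as a signed value
        off -= 1 << 32
    return pc + 4 + (off << 2)
```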
A refinement is shown in Figure 4.15 of P&H.

- Our instruction memory produces the 32-bit instruction word.
- Bits 15-0, the address field for an I-format instruction, are sent to the sign extension unit to be used as potential input values by the ALU.
- Bits 5-0, the funct field for an R-format instruction, are sent to the ALU control to compute the appropriate ALU control lines.
- Bits 25-21, the source register rs, are sent to the register file to select read register 1.
- Bits 20-16, the source register rt, are sent to the register file to select read register 2.
- The write register is more complex. For lw, we use the rt field in bits 20-16. For R-format instructions, we use the rd field in bits 15-11. An additional multiplexor and a control line, RegDst, control which field is passed to the write register selection input.
- The other control lines are computed from the opcode in bits 31-26, as shown in Figure 4.17 of P&H. The details of the conversion of the opcode are just combinational logic, which again we can figure out or look up in the text.

P&H has a series of figures (4.19, 4.20, and 4.21) that show how each type of instruction uses this data path.

Adding Jump

The final refinement is to add the data path and control to implement the j instruction, as seen in Figure 4.24.

Using Multiple Cycles to Implement Instructions

The design we've been studying is a single-cycle implementation, meaning that one clock cycle results in one instruction being executed. This is not used in real life, mainly because of its inefficiency:

- Every instruction takes the same amount of time — we don't make the common case fast.
- We have redundant elements: the two memory systems, multiple ALU/adder units.

If we break our instructions down to operate over a series of shorter clock cycles, we can use only the number of cycles we need and potentially reuse some components. There are a number of ways to do this — we will consider a common
approach called pipelining.

Computer Science 2500 — Computer Organization
Rensselaer Polytechnic Institute, Spring 2009

Topic Notes: MIPS Programming

We spent some time looking at the MIPS Instruction Set Architecture. We will now consider how to program in MIPS assembly language. Our subset of the MIPS assembly language:

    add $a, $b, $c      sub $a, $b, $c      addi $a, $b, n
    lw $a, n($b)        sw $a, n($b)
    sll $a, $b, n       srl $a, $b, n
    and $a, $b, $c      andi $a, $b, n
    or $a, $b, $c       ori $a, $b, n       nor $a, $b, $c
    beq $a, $b, L       bne $a, $b, L       j L

Control Structures in MIPS Assembly

Suppose we want to implement a while loop in MIPS assembly. Consider the high-level language code:

    while (a != b)
      a = a + i;

where a is stored in register $s0, b in $s1, and the increment i is in $s2. Our MIPS assembly code might look like this:

    Loop:    beq  $s0, $s1, EndLoop    # if a == b goto EndLoop
             add  $s0, $s0, $s2        # a = a + i
             j    Loop                 # goto Loop
    EndLoop: ...                       # code after loop

Note that we use a beq instruction here, as we want to branch in the case where we no longer wish to execute the contents of the loop.

Now consider a slightly more complicated loop:

    while (a[i] == k)
      i = i + j;

We need to deal with an array access here. Suppose we have made the following register assignments: the start of array a is in $s3, i is in $s0, k is in $s1, and j is in $s2. All are int values. One way to implement this code would be:

    LoopTop: sll  $t0, $s0, 2          # $t0 = i * 4
             add  $t0, $t0, $s3        # $t0 = address of a[i]
             lw   $t1, 0($t0)          # $t1 = contents of a[i]
             bne  $t1, $s1, EndLoop    # if a[i] != k goto EndLoop
             add  $s0, $s0, $s2        # i = i + j
             j    LoopTop              # jump back to the top of the loop
    EndLoop: ...                       # code after loop

A few notes:

- We need to multiply i by 4 to get the correct offset, since we're assuming a is an array of word-sized values.
- We might be tempted to write

      lw $t1, $t0($s3)

  to access the value at an offset of $t0 from our base register $s3. But that is not valid MIPS: the offset part of the lw and sw instructions needs to be a constant, not a register. The MIPS designers could have provided such an instruction (it would be R-format instead of I-format), but they chose not to.

Before we can complete our
next example, we need a couple of additional instructions, for reading and writing single bytes from memory. These instructions, lb (load byte) and sb (store byte), work just like the lw and sw instructions, except that only single-byte values are processed.

    lb $a, n($b)    # loads the single byte at an offset of n from register $b
                    # and stores it, sign extended, into register $a
    sb $a, n($b)    # stores the byte in the bottom 8 bits of register $a into
                    # memory at an offset of n from register $b

String Processing Examples

Armed with these instructions, we can write our next example: a string copy function like C's strcpy:

    strcpy(dest, src);

Recall that C strings are terminated with a null (0) character. For now, we'll just look at the main loop of this function. Assume register $s1 holds the address of the start of the dest string and that $s2 holds the address of the start of the src string. Our task is to write a loop that copies characters (bytes) from the source string to the destination string.

    LoopTop: lb   $t0, 0($s2)          # temp reg gets char at $s2
             sb   $t0, 0($s1)          # char at $s1 gets temp reg
             addi $s2, $s2, 1          # increment $s2
             addi $s1, $s1, 1          # increment $s1
             bne  $t0, $zero, LoopTop  # branch if we didn't just copy a null

For another example:

    char a[11];
    /* put something in a */
    for (i = 0; i < 10; i++)
      a[i+1] = a[i];

Assuming $s1 contains the address of a, here's one way to write this:

             add  $s0, $zero, $zero    # i = 0
    ForLoop: slti $t1, $s0, 10         # i < 10?
             beq  $t1, $zero, LoopEnd  # if done, branch out
             add  $t2, $s0, $s1        # $t2 gets address of a[i]
             lb   $t3, 0($t2)          # $t3 gets a[i]
             addi $t2, $t2, 1          # $t2 gets address of a[i+1]
             sb   $t3, 0($t2)          # a[i+1] gets $t3
             addi $s0, $s0, 1          # i++
             j    ForLoop              # back to check loop condition
    LoopEnd:

MIPS Subroutines and Programs

You are all familiar with function/method calls in high-level languages. In assembly language, we usually refer to these as subroutines. The idea is the same as a function or method call: the program branches from its sequence of instructions to execute a chunk of code elsewhere, then returns to continue where it left off. We'll now look at how to write and call subroutines in MIPS assembly.
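As a quick check on the lb/sb copy loop above, its behavior can be mirrored in a short simulation, with memory modeled as a flat byte array:

```python
def strcpy_sim(mem, dest, src):
    """Mirror the lb/sb loop: copy bytes from address src to address
    dest in mem, up to and including the null terminator."""
    while True:
        byte = mem[src]          # lb $t0, 0($s2)
        mem[dest] = byte         # sb $t0, 0($s1)
        src += 1                 # addi $s2, $s2, 1
        dest += 1                # addi $s1, $s1, 1
        if byte == 0:            # bne $t0, $zero, LoopTop
            return
```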
Special Registers and Instructions

Recall that there were several registers reserved to help support subroutine calls:

- $a0-$a3: argument registers — a place for the caller to place values to send to the subroutine.
- $v0, $v1: return value registers — a place for subroutines to return values to the caller.
- $ra: return address register — where to resume executing the caller when the subroutine is finished.

We also have a couple of special jump instructions:

- Jump and Link:

      jal address

  This instruction puts the address of the next instruction (PC+4) into register $ra, then jumps to the address. This is a J-format instruction, just like the standard jump instruction j.

- Jump to Register:

      jr $a

  Jumps to the address specified in the given register. This is an R-format instruction. Assuming the subroutine hasn't changed the $ra register, this can be used to return from the subroutine.

Registers and the Stack

We said previously that the s registers ($s0-$s7) are the ones assigned to variables, and the t registers ($t0-$t7) are temporary values. This becomes important when we start looking at subroutines. The accepted convention for register use by subroutines:

- $t0-$t7 are always available for use by a subroutine. If a subroutine calls another subroutine, it must assume that the called subroutine will modify $t0-$t7. If this is a problem for the calling subroutine, it should save any values it has in $t0-$t7 to memory, and restore them after the subroutine call.
- $s0-$s7 should be unchanged by a subroutine call. If a subroutine calls another subroutine, it can expect its values in $s0-$s7 to remain upon return. If the called subroutine wishes to make use of $s0-$s7, it should save the caller's values in any of these registers it will use to memory, then restore them before return.

Since subroutines can be called by anyone, we don't know which s registers, if any, are important to the caller. So we have to save these if we use them. So where do we save values in memory when we need to
save them? On the stack. The stack is a section of memory dedicated to saving registers to manage subroutine calls. We have a special register, $sp, called the stack pointer, that indicates the top of the stack. The stack grows and shrinks as registers are saved and restored during a program's execution.

If we have a subroutine that will need to make use of $s0, $s1, and $s2, we need to do the following at the start of the subroutine's code:

    addi $sp, $sp, -12    # make room for 3 4-byte values
    sw   $s0, 0($sp)      # push $s0
    sw   $s1, 4($sp)      # push $s1
    sw   $s2, 8($sp)      # push $s2

Then, before returning:

    lw   $s2, 8($sp)      # pop $s2
    lw   $s1, 4($sp)      # pop $s1
    lw   $s0, 0($sp)      # pop $s0
    addi $sp, $sp, 12     # restore the stack pointer

Note that we pop in the opposite order as we push.

A First Complete Subroutine

Let's return to our string copy code:

    LoopTop: lb   $t0, 0($s2)          # temp reg gets char at $s2
             sb   $t0, 0($s1)          # char at $s1 gets temp reg
             addi $s2, $s2, 1          # increment $s2
             addi $s1, $s1, 1          # increment $s1
             bne  $t0, $zero, LoopTop  # branch if we didn't just copy a null

In order to make this a subroutine, we need to get values from the subroutine argument registers, and save and restore the values of any s registers we decide to use. Our code becomes:

    strcpy:  addi $sp, $sp, -8         # make space for 2 words on the stack
             sw   $s2, 4($sp)          # save $s2
             sw   $s1, 0($sp)          # save $s1
             add  $s1, $a0, $zero      # copy arg0 into $s1
             add  $s2, $a1, $zero      # copy arg1 into $s2
    LoopTop: lb   $t0, 0($s2)          # temp reg gets char at $s2
             sb   $t0, 0($s1)          # char at $s1 gets temp reg
             addi $s2, $s2, 1          # increment $s2
             addi $s1, $s1, 1          # increment $s1
             bne  $t0, $zero, LoopTop  # branch if we didn't just copy a null
             lw   $s1, 0($sp)          # restore $s1
             lw   $s2, 4($sp)          # restore $s2
             addi $sp, $sp, 8          # restore $sp
             jr   $ra                  # return from subroutine

Note that we could do something simpler here: save and restore $a0 and $a1, and use those in place of $s1 and $s2 throughout.

A Recursive Example

We will develop a MIPS assembly subroutine to implement the following C function:

    int factorial(int x) {
      if (x < 1) return 1;
      return x * factorial(x - 1);
    }

Since this subroutine calls another subroutine (itself), we need to save $ra and any
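The push/pop discipline above (move $sp down, store; load, restore $sp; pop in the opposite order) can be sketched abstractly, with a dictionary standing in for memory:

```python
def push_registers(sp, mem, values):
    """Prologue sketch: move $sp down one word per register, then store
    each value on the stack (mem maps addresses to words)."""
    sp -= 4 * len(values)             # addi $sp, $sp, -4*n
    for i, v in enumerate(values):
        mem[sp + 4 * i] = v           # sw ..., (4*i)($sp)
    return sp

def pop_registers(sp, mem, count):
    """Epilogue sketch: reload the saved words, then restore $sp."""
    values = [mem[sp + 4 * i] for i in range(count)]
    return sp + 4 * count, values
```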
temp registers we care about before making the recursive call. We will assume a subroutine multiply exists, and we will use that to do our multiplication, to get extra practice with subroutines. Here is some code for this:

    factorial:
             # make space for 2 words on the stack
             addi $sp, $sp, -8
             # save $a0 and $ra on the stack
             sw   $a0, 4($sp)
             sw   $ra, 0($sp)
             slti $t0, $a0, 1          # is x < 1?
             beq  $t0, $zero, L1       # if no, skip over "if" part
             # x < 1: just return 1
             addi $v0, $zero, 1        # return value is 1
             # we could restore $a0 and $ra, but we know they haven't changed
             # when we take this quick exit, so let's not -- but we still
             # need to put $sp back the way we found it
             addi $sp, $sp, 8
             jr   $ra                  # return to caller
    L1:      # here x >= 1, so we need to make a recursive call
             addi $a0, $a0, -1         # get x - 1 for the parameter
             jal  factorial            # recursive call will put answer in $v0
             # we now want to set up for the multiply, but we destroyed $a0
             # above -- we have it on the stack, so let's load it
             lw   $a0, 4($sp)
             add  $a1, $v0, $zero      # put factorial(x-1) result into $a1
             jal  multiply             # multiply $a0 * $a1, result in $v0
             # $v0 is where we want our answer, so no work there, but
             # multiply could have changed $a0 and did change $ra
             lw   $ra, 0($sp)          # restore $ra
             lw   $a0, 4($sp)          # restore $a0
             addi $sp, $sp, 8          # restore $sp
             jr   $ra                  # return to caller

Trace through this with the call factorial(3).

Complete Programs and SPIM

We will use a simulator called SPIM to execute MIPS programs:

- SPIM reads MIPS assembly language programs.
- SPIM simulates the execution of each instruction.
- SPIM displays values of registers and memory.
- SPIM allows breakpoints and step-by-step execution.
- SPIM provides primitive I/O to allow interactive processing.

Three SPIM versions are available:

- spim — the command-line version
- xspim — the X11 version (Macs, Unix workstations)
- PCspim — the Windows version

We will grade submissions with spim, but you are welcome to use any version you wish when developing MIPS programs.

A complete program consists of several parts, including the MIPS assembly code we have been considering in our previous examples.
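Returning to the suggested exercise of tracing factorial(3): the recursion can be mirrored in Python with an explicit stack playing the role of the saved $a0/$ra frames:

```python
def factorial_sim(x):
    """Mirror the assembly's recursion: each saved $a0 becomes an entry
    on an explicit stack; the multiplies happen as the calls return."""
    stack = []
    while x >= 1:          # recursive calls: save x, call factorial(x-1)
        stack.append(x)    # sw $a0, 4($sp)
        x -= 1             # addi $a0, $a0, -1
    v0 = 1                 # base case reached: $v0 = 1
    while stack:
        v0 *= stack.pop()  # jal multiply on the way back out
    return v0
```

For factorial(3), the stack grows to [3, 2, 1], the base case returns 1, and the multiplies on the way out produce 1·1·2·3 = 6.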
Our first complete example will show a few of the parts of a MIPS program beyond what we have seen. See /cs/terescoj/shared/cs2500/examples/spim-simple/simple.s. Things to notice about this example:

- Comments in MIPS assembly are any part of a line after a # character.
- We can define variables in the .data portion of the program. Here we define three word-sized variables with initial values. The names are the labels; the initial values are given after the .word directive. These labels are local labels, meaning that they exist only within this assembly source file.
- We also define an uninitialized word-sized variable with the .space directive, with 4 for the number of bytes of space to allocate.
- The program is defined in the .text portion — the code section. We include .align 2 to make sure our code starts on a word boundary. Note that .align n forces the next defined item to align on a 2^n-byte boundary.
- .globl main makes the main label an external, or global, label, which allows the main subroutine to be callable from outside of this file — in this case, by the simulator.
- The program itself is defined starting at the main label. Note that we can use lw and sw to load registers from and store registers to named memory locations.
- Our first example of SPIM's primitive I/O is the syscall. syscall can perform one of several functions, as determined by the value in $v0. Here we have a 1 in $v0, which specifies print_int. The int to print must be placed in $a0.
- Finally, when main completes, we return from the subroutine with jr $ra.

To run this at the command line:

    spim simple.s

This should print our answer. We can learn much more about the execution of the program by running SPIM in interactive mode. Run SPIM without any command-line parameters to get the (spim) prompt, and type help to see the options available.

Some things to notice when stepping through our program:

- Execution starts at 0x00400000 by default, which is the code that sets up the call to main.
- After a few
steps, main gets invoked with a jal call.

- When we get to main, we find that we are executing a lui instruction rather than the expected lw. Why? The lw instruction requires a base register and offset:

      lw $t0, offset(base)

  The assembler allows us to write lw $t0, num1, which constitutes a different addressing mode, and inserts appropriate instructions to perform a load from a labelled memory location. It computes the appropriate address, loads it into the reserved assembler register $at, then issues a lw instruction in the machine's memory addressing mode.

Another simple example: see /cs/terescoj/shared/cs2500/examples/spim-simple/hello.s. A few things to note:

- We can define a string constant as part of our data segment with an .asciiz directive.
- la is a pseudoinstruction that directs the assembler to load the given register with the address of the given label.
- A syscall with $v0 set to 4 is the print_string operation, and will print the null-terminated string starting at the address in $a0.

Next we look at a complete version of the factorial program. See /cs/terescoj/shared/cs2500/examples/spim-factorial/factorial.s.

- The factorial subroutine is identical to what we looked at last time.
- We fill in the missing multiply routine with the one you did for the lecture assignment, changed to use $a0 and $a1 as the parameters and to put the answer into $v0.
- The main subroutine will use two s registers and will call subroutines, so we begin by pushing $s0, $s1, and $ra onto the stack. Note that SPIM initialized $sp for us appropriately.
- We print a prompt string with the print_string syscall, and then read in an int from the keyboard with the read_int syscall. Note: a complete list of syscall codes is in Figure B.9.1.
- After the call to factorial, we use a series of syscalls to print the answer.
- Finally, we pop the stack and return from main.

MIPS Pseudoinstructions

We have mentioned the idea of pseudoinstructions a few times. These are instructions that exist in MIPS assembly
but do not exist in machine language.

- The pseudoinstructions map to one or more MIPS machine instructions.
- These exist for the convenience of the programmer (human or compiler).

Here are a few common MIPS pseudoinstructions:

- move $a, $b — move (copy) the contents of register $b to register $a. This is assembled into:

      add $a, $b, $zero

- blt $a, $b, L — branch on less than: if $a is less than $b, branch to label L. This is assembled into two machine instructions:

      slt $at, $a, $b
      bne $at, $zero, L

  Note the use of the assembler reserved register $at.

- li — load immediate
- la — load address
- sgt, sle, sge — set on greater than, less than or equal, greater than or equal
- bge, bgt, ble, blt — conditional branches

Computer Science 2500 — Computer Organization
Rensselaer Polytechnic Institute, Spring 2009

Topic Notes: Building Memory

We'll next look at how we can use the devices we've been looking at to construct memory.

Tri-state Buffer

We've seen and used inverters. We also mentioned, when we first saw inverters, that we can put two together in series and have a buffer. We can draw one of these as a triangular symbol, but with no circle at the end. What use is that? It will slow down, but also boost the current in, the circuit. Recall that large fanout can cause trouble, with not enough current to drive subsequent gates. A buffer can take care of that.

We think about a wire or an output as having values of 0 or 1. It can be easy to think about the value 0 as the absence of a value, but that's not true. Connected to ground or to a low gate output is different than not connected. You can think of the wire as transmitting some value: a 0 value means the wire is like a pipe spilling out 0's; a 1 value is like a pipe spilling out 1's. Or like a pipe with hot water flowing vs. a pipe with cold water flowing. But a disconnected wire, or a pipe with no water source, isn't spilling out anything. So even in our circuits so far, there could be a wire or a gate not connected to anything.

Now we'll look at a new device called a tri-state buffer:

[Figure: a tri-state buffer, with an input, an output, and a control line]

When the control line is high, we have a wire with some amplifying properties, like a buffer. When control is low, it looks like a broken wire: physically disconnected.
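The rule that at most one tri-state buffer may drive a shared wire at a time can be sketched behaviorally; here None stands for a floating (disconnected) bus:

```python
def bus_value(drivers):
    """Each driver is a (control, value) pair of tri-state buffer inputs.
    With control low, the buffer is disconnected; at most one control
    line may be high at a time."""
    active = [value for control, value in drivers if control]
    assert len(active) <= 1, "bus conflict: two buffers driving at once"
    return active[0] if active else None   # None models a floating bus
```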
Why would we want such a device? Well, we have seen circuits where we knew that exactly one of a number of outputs would be high, as in the MUX. There, we couldn't just tie the outputs of the AND gates together, because we can't feed the outputs of some gates directly into the output lines of others; we had to feed them into an OR gate to produce the output. But since no more than one of a set of tri-state buffers will be connected at a time, we can safely tie their outputs together, no OR gate needed.

Driving the Bus

We can use this idea to allow any one of many inputs to be placed on a shared wire — a bus — for transmission somewhere else. This will be useful when we connect our CPUs to chunks of memory. In that case, the tri-state buffers can be used to determine whether the bus is being used to send data from the CPU to memory, or from memory to the CPU. Both can't drive the bus at once. By wiring tri-state buffers between each potential driver and the bus, we can make sure no one is sending a signal backwards through our gates.

Building Memory Devices

Suppose we wish to build a device that holds 4x1 of memory:

[Figure: a 4x1 memory chip, with log2(n) address lines A, a data bus D, a RD/WRT line, and a chip select line CS]

What does all this mean?

1. 4x1: 4 is the number of units of memory; 1 is the addressability — the smallest number of bits we can address.
2. D is a bidirectional data bus. When we want to store a value in the memory, we put that value on D; when we want to retrieve a value from memory, its value is put on D. D consists of a number of wires equal to the addressability of the memory.
3. A is the address lines. We need to be able to select from among all n units of memory, so we need log2(n) address lines. A specifies which bits are being read from or written to the data bus.
4. RD/WRT selects whether the chip should be reading or writing — who's driving the bus.
5. CS is chip select: are we doing anything with this chip right now?

How can we build this out of components we have seen? We use 4 D-type flip-flops and connect them up to our inputs as appropriate.
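Before looking at the internal wiring, the chip's external behavior can be summarized in a functional model (the flip-flop and decoder internals are abstracted away):

```python
class Mem4x1:
    """Functional model of the 4x1 chip: 4 one-bit cells, 2 address
    lines, RD/WRT (1 = read, 0 = write), and chip select CS."""
    def __init__(self):
        self.cells = [0, 0, 0, 0]

    def access(self, addr, rd_wrt, cs, d_in=None):
        if not cs:
            return None              # not selected: tri-state, off the bus
        if rd_wrt:
            return self.cells[addr]  # read: drive the data bus
        self.cells[addr] = d_in      # write: latch the bus value
        return None
```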
[figure: the 4x1 memory built from 4 D-type flip-flops, with a decoder on the address lines selecting 00, 01, 10, or 11]

How it works:

1. The correct address line is set high by the decoder. This will disable 3 of the 4 CLK inputs and 3 of the 4 AND gates that mask the Q outputs of the flip-flops.

2. The data line is fed as input unconditionally to all flip-flops.

3. When CS (chip select) is low, all 4 input AND gates are low, inhibiting any CLK input. Also, the AND gate at the output is low, disconnecting the tri-state buffer.

4. When RD/WRT is high, an input to all 4 input AND gates is low, so CLK is inhibited.

5. When RD/WRT is low, the AND gate controlling the tri-state buffer is low, meaning the output is not being sent to D.

6. When both CS and RD/WRT are high, the output being sampled from one of the D flip-flops is passed through to D. One of the tri-state buffers is driving the bus.

7. When CS is high and RD/WRT is low, we assume someone else is driving the bus and is putting the values on the bus that we wish to store in one of our flip-flops.

We have a 4-bit memory.

4 x 4 Memory

We normally don't think about storing our memory just one bit at a time. How about building a device that can store 4 4-bit values?

[figure: a 4x4 memory chip with 2 address lines A, a 4-wire data bus D, RD/WRT, and CS]

Here we have the same 2 address lines, a RD/WRT line, and a chip select, but instead of a single data line we have 4 data lines.

We will build our 4x4 device out of 4 4x1 devices:

[figure: data lines D0-D3 each wired to one 4x1 device; A, RD/WRT, and CS wired to all four]

Note that the data bus D is actually 4 wires, and we only connect one of the 4 to each of the 4x1's.

4 x 8 Memory

So far we've gone from the D-type flip-flop, which we can think of as a 1x1 memory device, to a 4x1 by adding 2 address lines and using a decoder to have those address lines activate one of 4 1x1 devices. Then we went from 4x1 to 4x4 by having 4 data lines, each of which goes to one of the 4x1 devices.
which is fed into a separate 4x1 device. But if we already have 4x4 devices, we can expand to 4x8:

[figure: data lines D0-D3 wired to one 4x4 device and D4-D7 to another; A, RD/WRT, and CS wired to both]

That's 4 bytes of memory.

1KB Memory

Now let's consider how we might build a kilobyte of memory out of our 4x8 devices. The device we're trying to build will look like this:

[figure: a 1Kx8 memory chip with 10 address lines]

We'll need 256 4x8 devices, each of which looks like this:

[figure: a 4x8 device with 2 address lines, 8 data lines, RD/WRT, and CS]

We refer to each of these 4x8 devices as a bank of memory. Of our 10 address lines, 8 are used to select among the 256 banks, and the other 2 select among the 4 bytes on the bank chosen.

If we think of our address lines as bits A9 A8 ... A0, two main possibilities come to mind for how to organize things:

- Option 1: A1 and A0 select the byte within a bank, and A9-A2 select the bank.
- Option 2: A9 and A8 select the byte within a bank, and A7-A0 select the bank.

In either case, we want all 8 data lines wired to each bank, and we want RD/WRT wired to each bank. Our two lines to select a byte within a bank are wired to the address lines of the bank. The other 8 address lines are passed through a decoder, and the decoder outputs are ANDed with the CS input of the whole circuit, then wired to the CS inputs of each bank.

[figure: an 8-to-256 decoder whose outputs, ANDed with CS, drive the CS inputs of banks 0 through 255]

Which bytes are stored on which chips with the two layout options?

Bank | Option 1  | Option 2
   0 | 0-3       | 0, 256, 512, 768
   1 | 4-7       | 1, 257, 513, 769
   2 | 8-11      | 2, 258, 514, 770
 ... | ...       | ...
 254 | 1016-1019 | 254, 510, 766, 1022
 255 | 1020-1023 | 255, 511, 767, 1023

Each possible configuration has advantages and disadvantages. Option 1 means we can lose a chip and still have large chunks of contiguous memory available. Option 2 has advantages for chip layout.

SIMM Layout

You have probably seen (and maybe used) memory chips of the SIMM, or single in-line memory module, type.

[figure: a SIMM module with 7 address lines and 64 data lines]

Suppose we have 1 KB of memory of 8-bit bytes.
This requires 10 bits for full addressing and 8 data lines. But this is not normally how it would be set up. More likely, you would find 64 data lines and only 7 address lines. What does this mean? It's really an 8-byte-addressable memory. We load/store memory in chunks of 8 bytes at a time, so there are only 128 addressable chunks.

When memory is requested, say address 1:

A9 A8 ... A0 = 0000000001

the 7 high bits of the address (A9...A3 = 0000000) are sent to the SIMM on the address lines. We get back bytes 0-7, even though we only wanted byte 1. It's up to the CPU, which still has the full address including A2 A1 A0, to pick out the byte it's interested in.

To access location 517 (1000000101), we request 8-byte chunk 64 and take byte 5 of those that come back.

This memory is organized using Option 2 from our discussion about how to arrange memory among multiple banks.

This may at first seem wasteful, but the memory chip can get all 8 bytes easily, and more wires in and out means we can transfer more memory more quickly. Plus, there's a good chance that any memory access will be followed by additional memory accesses to nearby locations (think: local variables, an array, sequential program text). This locality is a natural feature of most programs.

All machines today have memory that is addressable in some chunk larger than one byte. The decisions about how we break this down have ripple effects throughout the architecture. We will soon see this in much detail.

Error Detection and Correction

You have probably heard about error-correcting memory. If we want to do this, we need to build in some redundancy.

We can detect a single-bit memory error by adding a parity bit. For example, if we have an 8-bit data value and we want to maintain even parity, we add a 9th bit that makes the total number of bits that are set an even number:

byte     | even parity bit
00000000 | 0
00000001 | 1
01110100 | 0
11000111 | 1
11111111 | 0

From this we can check, each time we retrieve a byte, that it has even parity. If not, we know that
something is wrong. But that's all it tells us: since we don't know which of the 9 bits is wrong, we can't fix it.

Some of you may have had computers that crash with a "Blue Screen of Death" saying that a memory parity error was detected. This is not just a Windows thing; I have had Unix systems crash with a kernel error that a memory parity error was detected.

To fix it, we need more extra bits. Here's one error correction scheme that can fix a single-bit error, but at the expense of 4 extra bits for each byte of memory: 50% overhead.

We have 12 bits used to represent the 8-bit value. We number them 12 down to 1 and use the ones whose numbers are powers of 2 as parity bits:

  12   11   10    9    8    7    6    5    4    3    2    1
1100 1011 1010 1001 1000 0111 0110 0101 0100 0011 0010 0001

We use the parity bits as follows:

1. Position 1 stores the even parity of odd-numbered bits.
2. Position 2 stores the even parity of bits whose number has the 2's bit set.
3. Position 4 stores the even parity of bits whose number has the 4's bit set.
4. Position 8 stores the even parity of bits whose number has the 8's bit set.

So to store the value 94 (01011110), we first fill in the data bits:

   0    1    0    1    _    1    1    1    _    0    _    _
  12   11   10    9    8    7    6    5    4    3    2    1

Position 1 stores the even parity of the bits at 3, 5, 7, 9, 11; 4 of those are set to 1, so we set that bit to 0. Position 2 stores the even parity of the bits at 3, 6, 7, 10, 11; 3 of those are set to 1, so we set that bit to 1. Position 4 stores the even parity of the bits at 5, 6, 7, 12; 3 of those are set to 1, so we set that bit to 1. Position 8 stores the even parity of the bits at 9, 10, 11, 12; 2 of those are set to 1, so we set that bit to 0.

   0    1    0    1    0    1    1    1    1    0    1    0
  12   11   10    9    8    7    6    5    4    3    2    1

When we retrieve a value from memory, we can make sure it's OK by computing the 4 parity bits and comparing to the stored parity bits. If they all match, we're OK. If there's any mismatch, we know there's an error.

Let's introduce an error into our stored value. We'll change the third bit from the left (position 10) to a 1:

   0    1    1    1    0    1    1    1    1    0    1    0
  12   11   10    9    8    7    6    5    4    3    2    1

So we recompute the parity bits on the value we retrieved and compare to the stored bits:

bit | computed | stored | mismatch
P1  |    0     |   0    |    0
P2  |    0     |   1    |    1
P4  |    1     |   1    |    0
P8  |    1     |   0    |    1

We can quickly detect the mismatches in P2 and P8. Reading the mismatch column as P8 P4 P2 P1 = 1010, this means that the bit at position 1010 (decimal 10) has an error and must be flipped. Convenient!

Think about how this might be implemented:

- the memory itself doesn't even need to know
- we can drop in some XORs to generate parity bits to be stored in memory
- we XOR again to regenerate parity bits for retrieved values
- still more XOR to do the correction

This works even if the parity bit is the one that has an error: it just ends up fixing the parity bit. It does not work for 2-bit errors.

Computer Science 2500
Computer Organization
Rensselaer Polytechnic Institute, Spring 2009
Topic Notes: Parallel Programming Intro

Given a multicore/SMT processor, or a computer with multiple processors on separate chips (a symmetric multiprocessor, or SMP), how can we make use of the multiple processing units?

This level of parallelism is at a much higher level than the instruction-level parallelism we looked at before. There, the compiler and/or architecture takes a single program, made up of a sequential series of instructions, and executes those instructions in parallel in a way that produces the same result as a one-by-one sequential execution of the instructions.

For a computer with multiple processors, we need to provide multiple streams of instructions to be executed by the processors. A single stream of instructions will only make use of one of our processors at a time.

The easiest way to program these systems is to program them just like a regular single-processor system, but to run multiple programs at once. Each program being run will be assigned to a CPU by the operating system. However, we would like to consider an approach where a single
program can make use of these multiple CPUs. If we are going to do this, we first need to think about how we would break down the problem to be solved into components that can be executed in parallel, then write a program to achieve it. Consider some examples:

- Taking a census of Troy. One person doing this would visit each house, count the people, and ask whatever questions are supposed to be asked. This person would keep running counts. At the end, this person has gathered everything.

  If there are two people, they can work concurrently. Each visits some houses, and they need to report in along the way, or at the end, to combine their information. But how to split up the work?

  - Each person could do what the individual was originally doing, but would check to make sure each house along the way had not yet been counted.

  - Each person could start at city hall, get an address that has not yet been visited, go visit it, then go back to city hall to report the result and get another address to visit. Someone at city hall keeps track of the cumulative totals. This is nice because neither person will be left without work to do until the whole thing is done. This is the master-slave method of breaking up the work.

  - The city could be split up beforehand. Each could get a randomly selected collection of addresses to visit. Maybe one person takes all houses with even street numbers and the other all houses with odd street numbers. Or perhaps one person would take everything north of Hoosick St. and the other everything south of Hoosick St. The choice of how to divide up the city may have a big effect on the total cost: there could be excessive travel if one person walks right past a house that has not yet been visited, and one person could finish completely while the other still has a lot of work to do. This is a domain decomposition approach.

- Grading a stack of exams. Suppose each has several questions. Again, assume two graders to start.

  - Each person could take half of
the stack. Simple enough, but we still have the potential of one person finishing before the other.

  - Each person could take a paper from the ungraded stack, grade it, then put it into the graded stack.

  - Perhaps it makes more sense to have each person grade half of the questions instead of half of the exams, maybe because it would be unfair to have the same question graded by different people. Here we could use variations on the approaches above. Each takes half the stack, grades his own questions, then they swap stacks. Or we form a pipeline, where each exam goes from one grader to the next to the finished pile. Some time is needed to start up the pipeline and drain it out, especially if we add more graders.

  These models could be applied to the census example if different census takers each went to every house to ask different questions.

  Suppose we also add in a grade totaler and recorder person. Does that make any of the approaches better or worse?

- Adding two 1,000,000 x 1,000,000 matrices. Each matrix entry in the sum can be computed independently, so we can break this up any way we like. We could use the master-slave approach, though a domain decomposition would probably make more sense. Depending on how many processes we have, we might break it down by individual entries, or maybe by rows or columns.

In each of these cases, we have taken what we might normally think of as a sequential process and taken advantage of the availability of concurrent processing to make use of multiple workers (processing units).

Some Terminology

Sequential Program: a sequence of actions that produce a result (statements + variables), called a process, task, or thread (of control). The state of the program is determined by the code, the data, and a single program counter.

Concurrent Program: two or more processes that work together. Big difference: multiple program counters. To cooperate, the processes need communication and synchronization, which can be achieved through shared variables or message
passing.

How to Achieve Parallelism

- We need to determine where concurrency is possible, then break up the work accordingly.
- This is easiest if a compiler can do it for you: take your sequential program and extract the concurrency automatically. This is sometimes possible, especially with fixed-size array computations.
- If the compiler can't do it, it is possible to give hints to the compiler to tell it what is safe to parallelize.
- But often the parallelization must be done explicitly: the programmer has to create the threads or processes, assign work to them, and manage the necessary communication.

Finding Concurrency

We find opportunities for parallelism by looking for parts of the sequential program that can be run in any order. Before we look at the matrix-matrix multiply, we step back and look at a simpler example:

1. a = 10;
2. b = a + 5;
3. c = a + 3;
4. b = 7;
5. a = 3;
6. b = c + a;
7. print a, b, c

Which statements can be run in a different order, or concurrently, but still produce the same answers at the end?

- 1 has to happen before 2 and 3, since they depend on a having a value.
- 2 and 3 can happen in either order.
- 4 has to happen after 2, but it can happen before 3.
- 5 has to happen after 2 and 3, but can happen before 4.
- 6 has to happen after 4 (so 4 doesn't clobber its value) and after 5 (because it depends on its value).
- 7 has to happen last.

This can be formalized into a set of rules, called Bernstein's conditions, to determine if a pair of tasks can be executed in parallel. Two tasks P1 and P2 can execute in parallel if all three of these conditions hold:

1. I1 and O2 do not intersect
2. I2 and O1 do not intersect
3. O1 and O2 do not intersect

where Ii and Oi are the input and output sets, respectively, for task i (Bernstein, 1966). The input set is the set of variables read by a task, and the output set is the set of variables modified by a task.

Back to our example, let's see what can be done concurrently:

/* initialize matrices (just fill with junk) */
for (i=0; i<SIZE; i++)
  for (j=0; j<SIZE; j++) {
    a[i][j] = i + j;
    b[i][j] = i - j;
  }

/* matrix-matrix multiply */
for (i=0; i<SIZE; i++) {      /* for each row */
  for (j=0; j<SIZE; j++) {    /* for each column */
    /* initialize result to 0 */
    c[i][j] = 0;
    /* perform dot product */
    for (k=0; k<SIZE; k++)
      c[i][j] = c[i][j] + a[i][k]*b[k][j];
  }
}

sum = 0;
for (i=0; i<SIZE; i++)
  for (j=0; j<SIZE; j++)
    sum += c[i][j];

The initialization can all be done in any order: each (i, j) combination is independent of every other, and the assignments of a[i][j] and b[i][j] can be done in either order.

In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before the sum can start to be accumulated. Also, iteration k of the inner loop can only be done after row i of a and column j of b have been initialized.

Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed (and after sum has been initialized to 0).

That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.

But all of these dependencies, in this case, can be determined by a relatively straightforward computation. Seems like a job for a compiler. And in this case, it can be. Unfortunately, not everything can be parallelized by the compiler. If we change the initialization code so that each entry depends on previously computed entries, for example:

for (i=0; i<SIZE; i++)
  for (j=0; j<SIZE; j++) {
    if ((i == 0) || (j == 0)) {
      a[i][j] = i + j;
      b[i][j] = i - j;
    } else {
      a[i][j] = a[i-1][j-1] + i + j;
      b[i][j] = b[i-1][j-1] - i - j;
    }
  }

it can't be parallelized, so no matter how many processors we throw at it, we can't speed it up.

Approaches to Parallelism

Automatic parallelism is great when it's possible. We got it for free (at least once we bought the compiler). It does have limitations, though:

- some potential parallelization opportunities cannot be detected automatically (we can add directives to help)
- bigger complication: this executable cannot run on distributed-memory systems

Parallel programs can be categorized by how the cooperating processes communicate with each other:
- Shared Memory: some variables are accessible from multiple processes. Reading and writing these values allows the processes to communicate.

- Message Passing: communication requires explicit messages to be sent from one process to the other when they need to communicate.

These are functionally equivalent, given appropriate operating system support. For example, one can write message-passing software using shared memory constructs, and one can simulate a shared memory by replacing accesses to non-local memory with a series of messages that access or modify the remote memory.

The automatic parallelization we have seen to this point is a shared memory parallelization, though we don't have to think about how it's done. The main implication is that we have to run the parallelized executable on a computer with multiple processors.

Our first tool for explicit parallelization will be shared memory parallelism using threads.

A Brief Intro to POSIX threads

Multithreading usually allows for the use of shared memory. Many operating systems provide support for threads, and a standard interface has been developed: POSIX Threads, or pthreads. A good online tutorial is available at https://computing.llnl.gov/computing/tutorials/pthreads — read through it and remember that it's there for reference. A Google search for "pthread tutorial" yields many others.

Pthreads are available on the Solaris nodes in the cluster and are standard on most modern Unix-like operating systems.

The basic idea is that we can create and destroy threads of execution in a program, on the fly, during its execution. These threads can then be executed in parallel by the operating system scheduler. If we have multiple processors, we should be able to achieve a speedup over the single-threaded equivalent.

We start with a look at a pthreads "Hello, world" program.

See: /cs/terescoj/shared/cs2500/examples/pthreadhello

The most basic functionality involves the creation and destruction of threads:

- pthread_create(3THR) — This creates a new thread.
It takes 4 arguments. The first is a pointer to a variable of type pthread_t; upon return, this contains a thread identifier that may be used later in a call to pthread_join(). The second is a pointer to a pthread_attr_t structure that specifies thread creation attributes; in the pthreadhello program, we pass in NULL, which requests the system default attributes. The third argument is a pointer to a function that will be called when the thread is started; this function must take a single parameter of type void * and return void *. The fourth parameter is the pointer that will be passed as the argument to the thread function.

- pthread_exit(3THR) — This causes the calling thread to exit. It is called implicitly if the thread function called during the thread creation returns. Its argument is a return status value, which can be retrieved by pthread_join().

- pthread_join(3THR) — This causes the calling thread to block (wait) until the thread with the identifier passed as the first argument to pthread_join() has exited. The second argument is a pointer to a location where the return status passed to pthread_exit() can be stored. In the pthreadhello program, we pass in NULL and hence ignore the value.

Prototypes for pthread functions are in pthread.h, and programs need to link with libpthread.a (use -lpthread at link time). When using the Sun compiler, the -mt flag should also be specified to indicate multithreaded code.

A slightly more interesting example:

See: /cs/terescoj/shared/cs2500/examples/proctree_threads

This example builds a "tree" of threads to a depth given on the command line. It includes calls to pthread_self(). This function returns the thread identifier of the calling thread. Try it out, and study the code to make sure you understand how it works.

A bit of extra initialization is necessary to make sure the system will allow your threads to make use of all available processors. It may, by default, allow only one thread in your program to be executing at any given time. If your program
will create up to n concurrent threads, you should make the call:

pthread_setconcurrency(n+1);

somewhere before your first thread creation. The +1 is needed to account for the original thread plus the n you plan to create.

You may also want to specify actual attributes as the second argument to pthread_create(). To do this, declare a variable for the attributes:

pthread_attr_t attr;

initialize it with:

pthread_attr_init(&attr);

and set parameters on the attributes with calls such as:

pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

I recommend the above setting for threads in Solaris. Then you can pass in &attr as the second parameter to pthread_create().

Any global variables in your program are accessible to all threads. Local variables are directly accessible only to the thread in which they were created, though the memory can be shared by passing a pointer as part of the last argument to pthread_create().

Brief Intro to Critical Sections

As you may have been shown in other contexts, concurrent access to shared variables can be dangerous. Consider this example:

See: /cs/terescoj/shared/cs2500/examples/pthread_danger

Run it with one thread and we get 100000. What if we run it with 2 threads? On a multiprocessor, it is going to give the wrong answer! Why? The answer is that we have concurrent access to the shared variable counter. Suppose that two threads are each about to execute counter++; what can go wrong?

counter++ really requires three machine instructions:

1. load a register with the value of counter's memory location
2. increment the register
3. store the register value back in counter's memory location

Even on a single processor, the operating system could switch the process out in the middle of this. With multiple processors, the statements really could be happening concurrently.

Consider two threads running the statements that modify counter:

Thread A             Thread B
A1: R0 = counter;    B1: R1 = counter;
A2: R0 = R0 + 1;     B2: R1 = R1 + 1;
A3: counter = R0;    B3: counter = R1;

Consider one possible ordering: A1 A2 B1 A3 B2 B3, where counter = 17 before starting. Uh oh.

What we have here is a race condition that can lead to interference of the actions of one thread with another. We need to make sure that when one process starts modifying counter, it finishes before the other can try to modify it. This requires synchronization of the processes.

When we run it on a single-processor system, the problem is unlikely to show itself: we almost certainly get the correct sum when we run it. However, there is no guarantee that this would be the case. The operating system could switch threads in the middle of the load-increment-store, resulting in a race condition and an incorrect result.

We need to make those statements that increment counter atomic. We say that the modification of counter is a critical section.

There are many solutions to the critical section problem, and this is a major topic in an operating systems course. But for our purposes, at least for now, it is sufficient to recognize the problem and use available tools to deal with it.

The pthread library provides a construct called a mutex (short for the mutual exclusion that we want to enforce for the access of the counter variable) that allows us to ensure that only one thread at a time is executing a particular block of code. We can use it to fix our danger program:

See: /cs/terescoj/shared/cs2500/examples/pthread_nodanger

We declare a mutex like any other shared variable. It is of type pthread_mutex_t. Four functions are used:

- pthread_mutex_init(3THR) — initialize the mutex and set it to the unlocked state
- pthread_mutex_lock(3THR) — request the lock on the mutex. If the mutex is unlocked, the calling thread acquires the lock; otherwise, the thread is blocked until the thread that previously locked the mutex unlocks it.
- pthread_mutex_unlock(3THR) — unlock the mutex
- pthread_mutex_destroy(3THR) — destroy the mutex (clean up its memory)

A few things to consider about this:

Why isn't the access to the mutex a problem? Isn't it just a shared
variable itself? Yes, it's a shared variable, but access to it is only through the pthread API. Techniques that are discussed in detail in an operating systems course (and that we may discuss more here) are used to ensure that access to the mutex itself does not cause a race condition.

Doesn't that lock/unlock have a significant cost? Let's see: we can time the programs we've been looking at.

See: /cs/terescoj/shared/cs2500/examples/pthread_danger_timed
See: /cs/terescoj/shared/cs2500/examples/pthread_nodanger_timed

Perhaps the cost is too much if we're going to lock and unlock that much. Maybe we shouldn't do so much locking and unlocking. In this case, we're pretty much just going to lock again as soon as we jump back around through the for loop. Here's an alternative:

See: /cs/terescoj/shared/cs2500/examples/pthread_nodanger_coarse

In this case, the coarse-grained locking (one thread gets and holds the lock for a long time) should improve the performance significantly. But at what cost? We've completely serialized the computation! Only one thread can actually be doing something at a time, so we can't take advantage of multiple processors. If the computation were something more significant, we would need to be more careful about the granularity of the locking.

Computer Science 2500
Computer Organization
Rensselaer Polytechnic Institute, Spring 2009
Topic Notes: MIPS Instruction Set Architecture

von Neumann Architecture

Modern computers use the von Neumann architecture. Idea: a set of instructions and a loop:

1. Fetch an instruction
2. Update the next instruction location
3. Decode the instruction
4. Execute the instruction
5. GOTO 1

Basic picture of the system:

[figure: the microsequencer/control (the "brain," with its microcode store), a scratchpad, and an arithmetic logic unit (ALU)]

The ALU knows how to do some set of arithmetic and logical operations on values in the scratchpad. Usually the scratchpad is made up of a set of registers. The microsequencer ("brain") controls what the ALU reads from the scratchpad, where it might put results, and when.
We will get into details of the microsequencer later. This is what makes up the central processing unit (CPU).

Expand this idea a bit:

[figure: the CPU chip (microsequencer/control with its microcode store, scratchpad, ALU, and lots of pins) connected by an address bus and a data bus to memory, disk, mouse, and other devices]

The CPU interacts with memory and other devices on buses. These buses just carry signals that represent the data. More on these later, too. We'll have to worry about how we can connect the CPU, memory, and other devices to these buses. There are a variety of speeds (startup, rates):

- mouse, keyboard: slow
- disk, network: fast

MIPS Instruction Set Architecture

We will look at the MIPS instruction set architecture (ISA). Recall that the ISA is a key abstraction:

- interface between hardware and low-level software
- standardizes instructions, machine language bit patterns, etc.
- advantage: different implementations of the same architecture
- disadvantage: sometimes slows innovation
- the instructions are the language of the machine
- discussion: how important is binary compatibility?

MIPS processors are in extensive use by NEC, Nintendo, Cisco, SGI, Sony, etc.

MIPS is an example of a reduced instruction set computer (RISC) architecture. RISC architectures have a smaller number of simpler instructions than complex instruction set computer (CISC) architectures. We will discuss the relative advantages more later. For now:

- the good news: not many instructions or addressing modes to learn
- the bad news: a single instruction performs only a very simple operation, so programming a task takes many instructions

All modern ISAs have similar types of instructions.

MIPS Arithmetic Instructions

MIPS arithmetic instructions have three operands:

add a, b, c

This instruction takes the sum of scratchpad values b and c and puts the answer into scratchpad value a. It is equivalent to the C code:

a = b + c;

What if we want to code the following?

a = b + c + d;

We need to do
it in two steps:

add a, b, c
add a, a, d

Note that multiple operands may refer to the same scratchpad location. The sub instruction is similar.

MIPS Registers and Memory

In MIPS, the operands must be registers:

- 32 registers are provided
- each register stores a 32-bit value
- compilers associate variables with registers
- registers are referred to by names such as $s0 and $t1
- we use the $s registers for values that correspond to variables in our programs
- we use the $t registers for temporary values (more on this later)

For example, consider this example from the text:

f = (g + h) - (i + j);

We choose (or, better yet, a compiler chooses) registers to store the values of our variables: f in $s0, g in $s1, h in $s2, i in $s3, and j in $s4. We'll also need two temporary values, which we will store in $t0 and $t1. The MIPS code:

add $t0, $s1, $s2
add $t1, $s3, $s4
sub $s0, $t0, $t1

What if you need more variables than there are registers? Access memory:

- consider it as a large one-dimensional array, with an address
- a memory address is an index into this array of values
- memory is a much larger storage space than registers, but access to that space is slower
- MIPS arithmetic and other instructions can't operate directly on values in memory
- data must be transferred first from memory into a register, then the answer transferred back

Since registers are 32-bit (4-byte) values, we often access memory in words instead of bytes:

- 2^32 bytes, with byte addresses from 0 to 2^32 - 1
- 2^30 words, with byte addresses 0, 4, 8, ..., 2^32 - 4
- words must be aligned on 4-byte boundaries

So suppose we have the following C code to translate to MIPS:

A[12] = h + A[8];

where A is an array of word-sized values. We have the address of the first element of A in register $s3, and h has been assigned to $s2.

First, note that the values in the array A are word-sized, so each entry takes 4 bytes. We can find entries in the array:

A[0] at $s3+0
A[1] at $s3+4
A[2] at $s3+8
...
A[8] at $s3+32
...
A[12] at $s3+48

The notation to get the value at location $s3+4, for example, is 4($s3). So what we'd like to write: add
48($s3), $s2, 32($s3)

But we can't, since MIPS arithmetic instructions can't operate on values in memory. We'll have to copy the value of A[8] into a temporary register, add into a temporary register, then store the value into A[12]. The code:

lw $t0, 32($s3)
add $t0, $s2, $t0
sw $t0, 48($s3)

The new instructions are to load a word (lw) and store a word (sw).

Aside: why is it OK to overwrite the value in $t0 in the add instruction, even though our original C code doesn't change A[8]?

The address in $s3 is called a base register, and the constants we add to it are called the offsets.

Immediate Addressing Mode

We often need to deal with constants. So far, the only way we'd be able to add a constant to a register is by having that constant in a register or a memory location (and how exactly would we get it there?). So there is a special format of the add instruction, add immediate, specified as addi. Its third operand is a constant value used in the addition:

addi $s2, $s2, 4

MIPS Machine Language

MIPS assembly instructions correspond to 32-bit MIPS machine instructions. For example:

add $t0, $s1, $s2

This corresponds to the machine instruction:

00000010001100100100000000100000

Somehow, the fact that this is an add instruction, and which registers are involved, is encoded in this particular 32-bit value. We interpret the 32-bit value in this case by breaking it down into fields according to the instruction format:

    op |     rs |     rt |     rd |  shamt |  funct
6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits
     0 |     17 |     18 |      8 |      0 |     32
000000 |  10001 |  10010 |  01000 |  00000 | 100000

The meanings of the fields:

- op: the opcode — 6 bits
- rs: the first register source operand — 5 bits (why?)
- rt: the second register source operand — 5 bits
- rd: the register destination operand — 5 bits
- shamt: the shift amount — 5 bits (more later)
- funct: the variant of the operation — 6 bits

This is an example of an R-type (register) instruction, which is encoded in the R format. These are the instructions that require three registers to be
specified.

The 32 registers are encoded as follows:

    name      number  usage
    $zero     0       the constant value 0
    $v0-$v1   2-3     values for results
    $a0-$a3   4-7     arguments
    $t0-$t7   8-15    temporaries
    $s0-$s7   16-23   saved (program variables)
    $t8-$t9   24-25   more temporaries
    $gp       28      global pointer
    $sp       29      stack pointer
    $fp       30      frame pointer
    $ra       31      return address

Other instructions don't need three registers. Immediate mode instructions, for example, need 2: these are I-type instructions, stored in the I format, with two registers and a constant value:

    addi $s1, $s2, 100

    op      rs      rt      address
    6 bits  5 bits  5 bits  16 bits
    8       18      17      100
    001000  10010   10001   0000000001100100

Here, three of the fields are replaced by a single 16-bit field called address. For the addi instruction, this stores the constant value to be added. The load and store instructions use this format as well:

    lw $t0, 1200($t1)

This instruction's function is to retrieve the value from memory at the address pointed to by the contents of $t1, offset by 1200, and store the value in $t0.

    op      rs      rt      address
    35      9       8       1200
    100011  01001   01000   0000010010110000

The sw instruction is similar, with opcode 43.

MIPS Logical Instructions

We will look quickly at the logical shift instructions sll and srl, which stand for "shift left logical" and "shift right logical", respectively. These instructions use the shamt field in an R-format instruction:

    sll $t2, $s0, 4

    op      rs      rt      rd      shamt   funct
    0       0       16      10      4       0

Note that rs is not used. Recall that these are quick ways to multiply and divide by powers of 2. Bitwise and, or, and nor follow the R format much like add; the immediate versions andi and ori follow the I format much like addi.

MIPS Control Flow Instructions

Any nontrivial program needs to make decisions, hence the conditional branch instructions:

    beq reg1, reg2, label
    bne reg1, reg2, label

beq will cause a branch to the statement labeled label if the values of reg1 and reg2 are equal, and continue to the next instruction otherwise. bne branches when not equal. These use the I format for the machine instruction:

    bne $s0, $s1, Exit

The address has to fit in 16 bits, so does this mean we can only branch to a 16-bit address? No: we usually use the address field as a relative offset to the program counter (PC), called PC-relative addressing.
If the label Exit is 44 away in the positive direction from the current program counter, we store 11 in the address field. We divide by 4 since we know the bottom 2 bits are 0's anyway (instructions are word-aligned). This means we can branch anywhere from -2^17 to +2^17 from the current PC.

Note: the PC is incremented by 4 (as instructions are word-sized) early in the execution of a given instruction (step 2 of Fetch, Update, Decode, Execute). Therefore, by the time we're really executing an instruction, the PC contains the address of the next instruction to be executed. A branch that is taken simply needs to modify the PC before we fetch the next instruction.

There is also an unconditional jump instruction:

    j label

No registers here, so we have more bits available for the address. This is a J-format instruction. If we want to jump to memory location 4848, the instruction is the j opcode (2) with an address field of 4848/4 = 1212. Again, the bottom 2 bits are always 0, so we divide our intended jump target by 4 when encoding the instruction.

We can also perform inequality comparisons with two more instructions:

    slt  $t1, $s2, $s1
    slti $t2, $t4, 8

These are "set on less than" instructions, and set the value of the target register to 1 if the second operand is less than the third, 0 otherwise. We can use these to implement all of the conditional and looping constructs we are used to in high-level languages.

Suppose i is in $s0, j is in $s1, and h is in $s3.

    if (i == j) h = i + j;

MIPS assembly:

           bne $s0, $s1, Label
           add $s3, $s0, $s1
    Label:

Slightly more complex:

    if (i != j) h = i + j;
    else h = i - j;

assembles to:

              beq $s0, $s1, ElsePart
              add $s3, $s0, $s1
              j OverIf
    ElsePart: sub $s3, $s0, $s1
    OverIf:

And an inequality:

    if (i < j) h = i + j;
    else h = i - j;

assembles to:

              slt $t0, $s0, $s1
              beq $t0, $zero, ElsePart
              add $s3, $s0, $s1
              j OverIf
    ElsePart: sub $s3, $s0, $s1
    OverIf:

Larger Constants in MIPS

So far, we have seen how to get 16-bit constants to use in immediate mode instructions. But what if we want a 32-bit constant? MIPS requires that all instructions fit in a single 32-bit word, so we can't have an opcode and the whole 32-bit constant at once. It takes two
instructions. First, load upper immediate:

    lui $t0, 0xa532

This sets $t0 to 0xa5320000. It is an I-format instruction, using the address field of that format to get the 16 bits for the top half of the specified register. We can then put in appropriate lower-order bits:

    ori $t0, $t0, 0x8d7e

This will "or" in the bottom bits to have the constant specified, leaving the upper bits that we've already set alone. $t0 is now 0xa5328d7e.

What Else is Missing?

The MIPS ISA doesn't provide instructions for operations that can easily be expressed as an existing operation. For example, you might want to copy a value in one register to another:

    move $t0, $t1

This is valid MIPS assembly language, but not valid MIPS machine language. This is a pseudoinstruction. An assembler would encode this as:

    add $t0, $t1, $zero

In this case, there's no extra cost: it's still just one instruction. Other pseudoinstructions may translate to more than one instruction. For example, the pseudoinstruction bgt, which branches on "greater than":

    bgt $s1, $s2, Label

would likely translate to:

    slt $t0, $s2, $s1
    bne $t0, $zero, Label

When determining relative costs of different translations of high-level language into assembly, this pseudoinstruction should be considered to cost twice as much as a regular instruction (or a pseudoinstruction that corresponds directly to a single machine instruction).

The text goes into more detail about the MIPS ISA, including the mechanisms for procedure calls, I/O, and more. We may return to some of this later.

Computer Science 2500: Computer Organization
Rensselaer Polytechnic Institute, Spring 2009
Topic Notes: C and Unix Overview

This course is about computer organization, but since most of our programming is in C in a Unix environment, we'll spend some time getting you up to speed on C and Unix.

C for C++ Programmers

C is (roughly) a subset of C++, so much of the familiar syntax, if you're a C++ programmer, is the same in C:

- built-in base types like char, short, int, long, float, double
- the preprocessor, for #include,
#define, etc.
- familiar control structures: if, while, do, switch, for
- familiar operators (!, <, >, etc.)
- execution begins by calling a function named main
- function definition syntax is the same
- multiple files and headers handled similarly
- comments: blocks with /* */, lines with //

Main differences:

- no classes
- no constructors/destructors
- no inheritance
- no operator overloading
- no new or delete: memory is allocated through system calls
- no templates
- no String or Boolean types
- no cout or cin: use stdio library functions such as printf, scanf, fopen, fclose

Object-oriented programming is still possible, it's just not directly supported by the language.

A Very Simple C Program

We begin by compiling and running a very simple C program, hello.c, in a Unix environment. More about some of these steps later, but for now we will just look at the example.

See /cs/terescoj/shared/cs2500/examples/hello

Things to note from this simple example:

- We run a program named gcc, which is a free C compiler.
- gcc, in its simplest form, can be used to compile a C program in a single file:

      gcc hello.c

  In this case, we're asking gcc to compile a C program found in the file hello.c. Since we didn't specify what to call the executable program produced, gcc produces a file a.out. The name is a.out for historical reasons.
- When we want to run a program located in our current directory in a Unix shell, we type its name. For example, when we wanted to run gcc, we typed its name, and the Unix shell found a program on the system in a file named gcc. How does it know where to find it? The shell searches for programs in a sequence of directories known as the search path. Try: env
- So if we want to run a.out, we should be able to type its name. But our current directory (always referred to in a Unix shell by ".") is not in the search path. We need to specify the "./" as part of the command to run ./a.out.
- Of course, we probably don't want to compile up a bunch of programs all named a.out, so we usually ask gcc to put its output in a file named as one of the parameters to gcc:
      gcc -o hello hello.c

  Here, the executable file produced is called hello.
- And in the program itself, let's make sure we understand everything. At the top of the file, we have a big comment describing what the program does, who wrote it, and when. Your programs should have something similar in each C file.
- We are going to use a C library function called printf to print a message to the screen. Before we can use this function, we need to tell the C compiler about it. For C library functions, the needed information is provided in header files, which usually end in .h. In this case, we need to include stdio.h. Why? See man 3 printf. More on the Unix manual later.
- A C program starts its execution by calling the function main. Any command-line parameters are provided to main through its first two arguments, traditionally declared as argc, the number of command-line parameters (including the name of the program itself), and argv, an array of pointers to character strings, each of which represents one of the command-line parameters. In this case, we don't use them, but there they are.
- Our call to printf results in the string passed as a parameter being printed to the screen. The \n results in a new line.
- Our main function returns an int value. A value of 0 returned from main generally indicates a successful execution, while a nonzero return indicates an error condition. So we return a 0.

A Bit More Complex Example

We next consider an overly complicated C program that simply computes the greatest common divisor of two integer values.

See /cs/terescoj/shared/cs2500/examples/gcd.c

Lots of things to notice here:

- We have four files: gcd.c, the implementation of the gcd function; gcd.h, a header file with a prototype for the gcd function; gcdmain.c, a main program that determines the input numbers, computes the GCD, and prints the answer; and Makefile, a makefile that gives a set of rules for compiling these files into the executable program gcdmain. When executing, functions from both gcdmain.c
(main) and gcd.c (gcd) will be used. Both of these are included in our executable file gcdmain.
- Start with gcd.c. This is a very simple recursive function to compute the greatest common divisor using the Euclidean Algorithm. There is no main function here, so if we try to compile this by itself as we did with hello.c, we will get an error. Instead, we have gcc use compile-only mode to generate an object file gcd.o from gcd.c:

      gcc -c gcd.c

  gcd.o is a compiled version of gcd.c, but it cannot be executed. C (and many other languages) requires two steps for source code to be converted into an executable: the first step compiles source code into object code; the second takes a collection of object code files and links together the references in those files into an executable file. There's much more to discuss here, but this should suffice for now.
- Next up: gcd.h. Much like stdio.h tells the compiler what it needs to know about printf (among other things), we have gcd.h to tell other C functions what they need to know about the function gcd: namely, that it's a function that takes two ints as parameters and returns an int. Any C file that contains a function that calls gcd should include "gcd.h".
- The driver program, gcdmain.c. We include several header files to tell the compiler what it needs to know about the C library functions and our gcd function that are called by functions defined here. This is where our main function is defined. We can define local variables in functions, just like local variables in a Java or C++ method. In this case, we look at the arguments to main that provide the command-line parameters of our program: argc and argv. If we have fewer than three command-line parameters (including the program name itself, which is always there), we prompt the user for two numbers with printf, then read in two numbers from the terminal with scanf.
- scanf is a very strange thing. It will make a bit more sense when you are more familiar with printf, but for now we can summarize what we see there as: read in two integer values (represented by the %d's in the format string), put them into the places pointed at by "the address of a" and "the address of b", then return the number of values that matched the input with the correct format. Right.
The scanf call forces us to think a bit about pointers, which are the key to understanding so much of how C works. scanf's parameters after the format string are always a list of pointers to places in memory where there is room to put the values being read in. In this case, we want the two int values to end up in the local variables a and b, so we have to take the addresses of those variables with the & operator. Don't worry, it will make better sense when you see more examples.

- Next, we check to make sure that the input to scanf did in fact represent two int values. If not, we print an error message and exit. Otherwise, we continue. Some things to notice in the error condition:
  * We use fprintf instead of printf. This is because we want to give this output special significance. Rather than sending it to the standard output, which is what printf would do, we send it to standard error, by using fprintf and specifying stderr as the first parameter.
  * Other than that, it works just like printf. We give it a format string. In this case, it includes one specifier, a %s, which means to expect an additional parameter which is a character string. Here, the string is argv[0], the first command-line parameter, which is always the name of the program. This labels the error message with the program name.
  * Once we have detected the error, we don't want to continue, so we call the exit function with an error code of 1 to terminate execution. We could also use the call return 1.
- In the case where at least two command-line parameters were provided, we try to convert them (argv[1] and argv[2]) to integer values. This is done with the overly complicated strtol function, which we use and then check for error conditions.
  * The man page for strtol tells us we need to include two additional header files: stdlib.h and limits.h. It also tells us about the parameters to strtol, which are: the string which we would like to convert to a number; a pointer into the string at the point beyond which we matched a number (which we don't care about, so we pass in NULL); and the base to use for the conversion. We also see that the number is the return value.
  * Error checking for strtol is messy: we need to check the variable errno (defined in errno.h) to see if an error condition was encountered. If so, errno will be a nonzero value, and we print an error message and exit.
  * Note that the error check here has two %s's, so we have two additional parameters to fprintf.
- Finally, we're ready to check that the numbers entered are nonnegative, and if so, we print out the answer, obtained by the gcd function call inside of a printf parameter.
- This file includes a main function, so we might think we could compile it to an executable as we did with hello.c, but if we try, we'll find that it doesn't know how to find the gcd function. Again, we'll have to compile but not link:

      gcc -c gcdmain.c

  This produces the object file gcdmain.o.
- We need to link together our two object files, which together have the function definitions we need:

      gcc -o gcdmain gcdmain.o gcd.o

  This gives us gcdmain, which we can run.

The Makefile, which you will learn about as part of Lab 1, contains rules to generate a sequence of calls to gcc that will correctly compile and link the gcdmain executable.

The bad news: that was a lot of trouble just to write a simple program. The good news: you will have a lot of examples to go on, and you can ask a lot of questions.

Example: Matrix Multiplication

We'll get started by using a matrix-matrix multiply as a running example. We start with a simple implementation of a matrix-matrix multiply and use it to learn/review a bit about C and Unix.

See /cs/terescoj/shared/cs2500/examples/matmult

This is another example of separate
compilation.

- The function in timer.c will be useful throughout the semester. We tell matmult.c about it with the line:

      #include "timer.h"

  This provides a prototype of the function in timer.c. In many cases, this file would also define any data structures or constants/macros used by the functions it defines. This is a good model to use as you move forward and develop more complicated C programs: group functions as you would group methods in a Java class or member functions in a C++ class.
- Along those same lines, the include files in angle brackets:

      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/time.h>

  specify system-wide header files. By convention (though most compilers don't really make a distinction), system-wide header files are in angle brackets, while your own header files are in double quotes.
- Each file can then be compiled separately to create an object file (.o file) from the C source. These object files are all listed at the linking step. What happens for the function diff_gettime at compile time? At link time?
- The program uses two library calls, printf and gettimeofday. To see how these work, we can look at their man pages:

      man printf

  to see everything we wanted to know about a particular call. But if you do this, you might get a man page for a command-line utility called printf instead of the library function printf. Not what we were looking for. The Unix manual is divided up into sections. The most important of these sections for our purposes are Section 1 (User Commands) and Section 3 (Library Functions). If we don't ask for a section, we get section 1. Since section 1 contains an entry for printf, that's what it produced. To force it to give you the library function's manual page, you can use (in Linux, on a Mac, or with FreeBSD):

      man 3 printf

  This tells it to look in section 3, which contains the calls in the C library. How did I know to look in section 3? Mainly because the printf man page in section 1 told me so, at the bottom, under the "See Also" section. In Solaris, the
syntax is a bit more complex:

      man -s 3C printf

  Again, I knew to request section 3C of the manual by looking at the bottom of the printf(1) man page. Fortunately, you only need to concern yourself with which section of the manual to use when you look something up that is in more than one section. For example:

      man gettimeofday

  brings up the man page we want for the gettimeofday system call: in section 3C when requested in Solaris, and in section 2 (the system calls section) under FreeBSD, Mac OS X, or Linux.
- If you see a reference to something like ctime(3C) in the "See Also" section of a man page (such as that in gettimeofday's man page), that means the ctime man page is in section 3C. I will use this notation frequently throughout the semester. You will find the Unix manual very helpful as we move forward.
- So what does gettimeofday(3C) do? See the man page and look at the usage in the example program.
  * What's going on with memory management?
  * What would happen if we declared struct timeval variables instead of struct timeval pointers?
- gettimeofday(3C) returns wall clock times. This is the amount of elapsed real time. So if our process is taking turns on the CPU with other processes (see the Operating Systems course) and it is not always running, it continues to accumulate wall clock time, but not CPU usage time. There are also system calls to examine CPU usage time, which we may consider later.
- The Makefile is using the GNU compiler gcc with the option -O for optimization. If you want to run this with a different compiler or optimization flags, you can change the CC line in the Makefile.

If we compile and run this program, it reports initialization and matrix multiplication times. Initialization is just filling in matrices a and b. Then we compute the value of each element of c using the dot product of the corresponding row of a and column of b. Remember your data structures and algorithms: what is the complexity of matrix-matrix multiply? How long does it take you to run this program?
Example: Reversing a String

Solution to be developed in class. Focus on:

- strings
- dynamic and stack memory allocation

More Unix Basics

To access a Unix system, you need to have an account. A Unix account includes:

- username and password
- userid and groupid
- home directory
- shell

About the username:

- a username is typically a sequence of alphanumeric characters of length no more than 8
- the username is the primary identifying attribute of an account
- the username is often used as the basis of an email address
- the name of your home directory is usually related to your username

About the password:

- a password is a secret string that only the user knows
- not even the system knows it, since the password is stored in an encrypted form and there is no method to decrypt it
- when you enter your password, the system encrypts it and compares it to a stored string
- it's a good idea to include numbers and/or special characters; don't use an English word

About the userid:

- a userid (uid) is a number (an integer) that identifies a Unix account
- each userid is unique within a system
- it is easier and more efficient for the system to use a number than a string like the username
- you don't need to know your userid, but you can find it by typing id at the command prompt

Groups and the groupid:

- Unix includes the notion of a group of users
- members of a Unix group can share files and active processes
- each account is assigned a primary group
- the groupid (gid) is a number that corresponds to this primary group
- a single account can belong to many groups, but has only one primary group

Home directory:

- a home directory is a place in the file system where files related to an account are stored
- a directory is like a Windows or Mac folder
- many Unix commands and applications make use of the account home directory as a place to look for customization files, for example

The Unix shell:

- a shell is a Unix program that provides an interactive session: a text-based
user interface
- when you log in to a Unix system, the program you initially interact with is your shell
- there are a number of popular shells available: bash, csh, tcsh, ksh, etc.
- the shell presents you with a customizable prompt; this prompt (and much more) can be changed by startup files, or "dot files"
- every Unix process has a current working directory; the shell starts with the home directory as the current working directory
- the shell runs programs on your behalf

Files and file names:

- a file is a basic unit of storage, usually storage on a disk
- every file has a name
- Unix file names can contain any characters, although some make it difficult to access the file
- Unix file names can be long; how long depends on your specific flavor of Unix (1024 on my Mac)
- each file can hold some raw data; Unix does not impose any structure on files, which can hold any sequence of bytes
- many programs interpret the contents of a file as having some special structure: a text file, a sequence of integers, database records, etc.

Directories:

- a directory is a special kind of file; Unix uses a directory to hold information about other files
- we often think of a directory as a container that holds other files (or directories)
- Mac and Windows users: a directory is the same idea as a folder
- folders are used as a GUI interface to directories, and are not unique to Unix/Linux/FreeBSD
- each file in the same directory must have a unique name
- files that are in different directories can have the same name

Unix File System:

- the filesystem is a hierarchical system of organizing files and directories
- the top level in the hierarchy is called the "root" and holds all files and directories
- the name of the root directory is /

File paths:

- the pathname of a file includes the file name, and the name of the directory that holds the file, and the name of the directory that holds the directory that holds the file, and so on, up to the root
- the pathname of every file in a Unix filesystem is
unique
- to create a pathname, you start at the root (so you start with /), then follow the path down the hierarchy, including each directory name, and you end with the filename
- in between every directory name you put a /
- absolute pathnames start with a / and fully specify the file
- relative pathnames do not, and are relative to the current working directory
- . is a relative pathname that always refers to the current working directory
- .. is a relative pathname that always refers to the parent of the current directory

File attributes:

- access/modification times
- size
- ownership: owner, group
- permissions (hey, it's octal!)

Computer Science 2500: Computer Organization
Rensselaer Polytechnic Institute, Spring 2009
Topic Notes: Memory Hierarchy

Our next topic is one that comes up in both architecture and operating systems classes: memory hierarchies. We have thought of memory as a single unit, an array of bytes or words. From the perspective of a program running on the CPU, that's exactly what it looks like. In reality, what we think of as main memory is just part of a hierarchy:

    registers                     (small, fast, expensive)
    cache
    main memory
    disk/virtual memory
    tape, remote access, etc.     (large, slow, cheap)

We have already considered how the use of registers is different from main memory, in particular for the MIPS ISA, where all computation must operate on values from, and store results in, registers.

From P&H, we have the technologies, access times, and costs (as of 2008) for these types of memory:

    Technology   Access time   Cost per GB
    SRAM         0.5-2.5 ns    $2000-$5000
    DRAM         50-70 ns      $20-$75
    disk         5-20 ms       $0.20-$2

Note in the last line that this is the equivalent of 5,000,000 to 20,000,000 ns.

We will consider:

- the layer (or layers, in most cases) of cache: faster memory between registers and main memory
- virtual memory, which allows us to pretend to have more memory than we really have by using disk space to store parts of memory not currently in use

Caches

We have noted from time to time in the course that memory access is very slow compared to registers and
the ALU. In reality, the difference between processor speeds and memory speeds is even greater than we have considered, and growing:

- For decades, CPU performance (clock speed) doubled every 18-24 months.
- Memory speeds have increased linearly at best.
- Improvements in memory technology have focused on increased density rather than speed: address space increases by 1.5 bits per year, a factor of four every 18-24 months.

Early von Neumann architectures could rely on memory producing a value nearly as quickly as the CPU could operate on it. Today, we are very far from that. Caches help keep the necessary values close to the CPU for faster access:

- A small, fast memory is placed between the CPU and main memory (likely multiple levels).
- Whenever a value is needed for an address, we first look for it in the cache; only if it is not there do we look to main memory (and copy it to cache).
- When we write a value, we write it to cache, then update it in memory later, concurrently with other computations.
- Terminology: data found in cache is a cache hit; not found is a cache miss.

Our goal is to avoid memory accesses as much as possible. The cache is an associative memory, where we can look up a memory value by its address in a table:

    CPU <--> cache (associative memory: key -> value) <--> memory (slow)

The key is the address; the value is the contents of memory at that address. When the CPU makes a memory request, the cache needs to find the appropriate value very quickly:

- Ideally, we can look up the entries that match the key instantly.
- Realistically, there is a hash function that maps addresses to a set of possible cache entries.
- Depending on the hash function, we may or may not need to store the key itself in the cache.
- The hash function depends on how we organize the cache.
- Generally, there is an organization into lines of cache: groups of addresses that are stored and read/written in the cache together, typically 16 or 32 bytes. This works well with the idea of a wider
memory bus (recall the discussion when we built memory circuits and organized them into banks); it is very convenient if the memory can read/write cache-line-size chunks.

Direct Mapped Caches

The first cache organization we consider allows for direct access to a cache line from memory, by breaking the address down into three parts:

    | tag: 101 | line: 10111 | entry: 1011 |

- the first chunk is the key (tag)
- the middle chunk is the cache line (or cache address, or index)
- the last chunk specifies the entry within the cache line

In this case, we would have 16 addresses per line in the cache and 32 lines of cache:

- only a small fraction of our memory actually fits in the cache, but each address, if it is in the cache, is in a specific location
- we need only check whether the entry at the desired line holds the actual address we want: this is a fast comparison
- if the desired line is currently in the cache, we just use it; if not, we have to go get it, evicting the old value from the cache line
- if we repeatedly use values from two different parts of memory that happen to map to the same cache line, we will have a lot of cache misses

Fully Associative Caches

Here, a cache entry is not restricted to a single location: it can be in any line. We break our address into only two chunks for a fully associative cache:

    | tag: 10110111 | entry: 1011 |

The key is all bits above the number of bits needed to specify an entry within a line.

- when looking up, we have to check every entry in the cache to see if it's in there; this sounds like a search (think: expensive), but we can build a circuit for this
- if we find it, use it; otherwise, we need to go get it
- but then, which line do we replace? It can go anywhere. We can consider many approaches to select a "victim" for eviction:
  * LRU (least-recently used)
  * LRU approximations
  * random
  * saturating counters: keep track of frequency of recent access
  * replace non-dirty (unmodified) lines
- The decision must be fast, but a poor decision will lead to another cache miss in the near future, which is also expensive.
- We are much less susceptible to an unfortunate access pattern than with direct mapping, since the associativity means flexibility.
Set Associative Caches

In between the other approaches: each address maps to a small set of possible lines.

- for 2-way, each address could be in one of two places
- easier to check than in the fully associative case, but if we are using 3 or 4 conflicting lines repeatedly, we will end up with lots of cache misses
- but we are still better off than with direct mapped

We would almost definitely do lookups in parallel, as in a 4-way example.

There are tradeoffs, and all three approaches (and various levels of associativity within set associative) are really used.

Cache Discussion

There are three ways that a memory value will fail to be found in cache:

1. a compulsory miss: the first time a value is needed, it must be retrieved from memory
2. a conflict miss: since our last access, another value that maps to the same cache location has evicted the value we want
3. a capacity miss: the value we want has been evicted, not because of a direct conflict, but because it was the least recently used value

Notice that, looking at a stream of memory references, it is difficult to identify the reasons for misses; they are equally difficult to avoid.

See diagrams of cache organization for direct-mapped (Figure 5.7) and 4-way set associative (Figure 5.17).

Some cache statistics:

- Most primary caches hold a small amount of data, 8 to 32K bytes; amazingly, this value has been fairly constant for 40 years.
- A good primary cache delivers better than 95% of requests. Why? Locality, locality, locality.
- Spatial locality and temporal locality: it's why caches work so well, and why computers are anywhere near as fast as they are today, given the discrepancy between CPU speeds and memory speeds.
  * spatial locality: when we access a memory location, we're likely to access other items in nearby memory locations in the near future (groups of local variables, arrays, code for a subroutine or loop)
  * temporal
locality: when we access a memory location, we're likely to access that location again in the near future (code in a loop, loop index or accumulator variables)
- Missing 2% of requests, on to a memory that is 50 times slower (and it is often much worse), means the average access time is 0.98 + 0.02 * 50 = 1.98 cycles, or half of the desired speed. We must have a high hit rate (i.e., a low miss rate).
- Secondary and tertiary caches can be effective at reducing miss rate, but they must be much larger: secondary caches might be 256K bytes or larger, while tertiary caches are measured in megabytes.

Comparing the cache organizations:

- Fully associative is much more flexible, so the miss rate will be lower.
- Direct mapped requires less hardware (cheaper), and will also be faster.
- It's a tradeoff of miss rate vs. hit time.
- Levels of associativity are the compromises.

Consider this extreme example:

    for (i = 0; i < 10000; i++)
      a[i] = a[i] + a[i+64] + a[i+128];

a[i], a[i+64], and a[i+128] belong to the same set for a direct-mapped cache with 64 entries, but if we instead organize it as a 4-way set associative cache, we're OK, since we can hold all 3 in the cache at the same time.

Cache management:

- Instructions are read-only: this is a motivation for a Harvard cache, where data and instructions are separated, avoiding the need to worry about write policies (see below), at least for instructions.
- Data is mutable (usually): we need to keep track of dirty lines, those that have been changed.
- If a value has changed, it must be written back to memory some time, at least before it gets evicted.
- If the line never changes, there is no need to write it back: avoid unnecessary writebacks.

Write policies:

- Write-through: memory is always written back, but usually only when the MMU is not otherwise busy.
- Write-back: memory is updated only when the line is evicted.
- We probably need some sort of write queue in either case.
- And what if you write before you read (variable initialization)?
  * read the line into the cache first
  * perhaps write-around for writes to lines not in cache: how likely are we to be reading them soon anyway?
in cache:
  * how likely are we to be reading them soon anyway?
- Another technique: a victim cache, a place to keep recently evicted lines.
  * these are good candidates for reinsertion
  * at least keep them here until written back
  * especially useful for direct-mapped caches, but can benefit set associative caches

Multiprocessor issues: cache coherency. For example, the Power 4 architecture has 4 cores, each with its own L1 cache, but the 4 share an L2 cache.

- If a line is in 2 L1 caches, how do we maintain consistency?
- Do lines in an L1 also exist in L2?

This topic would be investigated extensively in an advanced architecture course, and somewhat in operating systems.

Handling Cache Misses in the Datapath and Control

When we have a cache hit, our datapath and control (be it single-cycle or pipelined) works as we saw previously. In the case of a cache miss, a requested memory value (instruction or data) is not immediately available. We must wait an appropriate amount of time for the value to be placed into the cache before we can obtain the desired value.

The basic idea in either datapath/control approach is to stall the processor when a cache miss is detected. We wait for the memory values to populate the appropriate cache line, then restart the instruction in question. The processor can continuously attempt to fetch the value until it works (it's a hit). During this time, bubbles are inserted into the pipeline.

Virtual Memory

Another place caching is very common is the use of main memory as a cache for data being read from and written to a disk. Main memory is slow compared to registers and cache, but is very fast compared to disks, where we have to wait for (gasp) actual mechanical parts.

But we'll now consider just the opposite -- the use of disk to store values that we want to pretend exist in main memory, but for which there is not enough main memory.

When the memory requirements of a program or the sum of the requirements of
the many programs which may be running on a system are larger than physical memory, we depend on this idea -- virtual memory.

We will just consider the basics -- look forward to Operating Systems for a more thorough discussion.

- Programs generate (that is, the CPU puts onto the address bus) logical addresses.
- Each program in a multiprogrammed environment may have its own private address space.
- Even a single program can have a larger address space than the available physical memory.
- Basic goal: assign logical addresses to physical addresses, and translate logical addresses to physical addresses when accessed.
- Logical memory is broken into pages, usually around 4K in size, often stored on secondary store (disk).
- Pages are brought into frames of physical memory of the same size.
- The operating system keeps track of the map, called a page table.

(figure: pages of logical memory mapped through a page table to frames of physical memory)

- Entries in a page table either give a disk address for a non-resident page, or the information to translate the logical address to the appropriate physical address.
- Accessing a non-resident page causes a page fault -- analogous to a cache miss, but much more costly: we're talking millions of CPU cycles.
  * bring the page of logical memory into a frame of physical memory, possibly displacing a page currently in physical memory
  * takes long enough that the OS is likely to let a different process use the CPU while the page fault is being serviced
- Looking up page table entries is expensive (it is, after all, an additional memory access for each real memory access), and often is supported by special hardware.
- Specifically, we usually keep a very small cache of translations, called a translation lookaside buffer (TLB), that is responsible for servicing a majority of lookup requests to the page table. A TLB hit is still more
expensive than a direct memory access (if we had no paging at all), but much better than the two references from before.
  * a TLB is typically around 64 entries -- tiny, but good enough to get a reasonable hit rate (locality is your friend)
- In large systems, the page table might not be memory resident, which leads to things like multi-level page tables, inverted page tables, and other ideas -- come back for Operating Systems.

Computer Science 2500: Computer Organization
Rensselaer Polytechnic Institute, Spring 2009
Topic Notes: Pipelines

We have all seen and experienced examples of pipelining in our daily lives. The book uses a laundry analogy (Figure 4.25), but any kind of assembly-line type of operation might be a good example.

The laundry analogy is a good one. Consider how much more quickly laundry can be finished if we take advantage of the fact that the washer and dryer (and, in the book's example, the folder and the storer) can all operate in parallel, and each stage can start doing its work as soon as the previous stage completes its work. We don't process a single load of laundry any more quickly, but the overlap in successive loads leads to a faster overall completion time.

Similar ideas can be used to create a pipeline of instructions being executed by a processor. MIPS instructions can be executed in phases, and these phases can be used to create such a pipeline. We'll consider a pipeline for MIPS, which is typical of many RISC pipelines.

We have seen that the instructions in our MIPS subset include five steps:

1. IF: instruction fetch
2. ID: register read and instruction decode
3. EX: execute or calculate address
4. MEM: memory read or write
5. WB: write back register

For our example, we will assume that the register access stages cost 100 time units each, while memory access and ALU operations cost 200. Figure 4.26 in the text shows how long the different instructions in our MIPS subset will take given these latencies.

Figure 4.27 in the text shows the single-cycle and a simple pipelined execution. Note that the speed of our pipeline in
this case is limited to the speed of the slowest component. A single instruction now takes as much as 900ps to complete. Note also that the register accesses are strategically set up so that a register read takes place in the second half of a 200ps slot, and the register write takes place in the first half. This will be beneficial later on.

Figure 4.28 shows a symbolic representation of the execution of an add instruction, showing which components are used in each step.

The MIPS instruction set was designed with pipelines in mind, so it has features that make pipelining easier:

1. All instructions are the same length, allowing the next IF in the pipeline to proceed immediately.
2. There are very few instruction formats, allowing both register read and instruction decode to be in the ID pipeline stage.
3. The limitation of memory access to just the lw and sw instructions allows EX to combine execution and address calculation.
4. Memory alignment means we always retrieve the entire instruction in a single memory access.

A typical RISC system might have a 5-stage pipeline like this one. A system with a more complex instruction set may have a 12-18 stage pipeline. Goal: 100-200 stage pipelines to get very significant speedups.

Many architectures now have multiple pipelines as well. The original Pentium had two pipelines, and a smart compiler could keep both pipelines busy, effectively doubling the number of instructions completed in the same number of cycles. If 2 is good, why not 4 or more? Too much duplication of hardware. Another option: just have multiple functional units, not all stages of the pipeline. This is especially useful if the execute stage takes longer anyway. This is a superscalar processor.

Hazards

In the ideal situation, we can always be executing an instruction at each stage of the pipeline to achieve maximum throughput. However, we need to make sure that the end result of a pipelined execution is identical to a
purely sequential execution. Several things, called hazards, can happen that introduce problems that will prevent the ideal throughput. Hazards fall into three categories:

1. Structural hazards occur when two instructions executing in a pipeline need the same physical hardware (memory or ALU) at the same time -- the implementation and instruction set design can avoid these.

2. Data hazards occur when an instruction needs to access data produced by another instruction further ahead in the pipeline, but which has not yet completed its execution. For example:

     add  $s0, $t0, $t1
     sub  $t2, $s0, $t3

   Here, the value in register $s0 needs to be read for the sub instruction before it has been written by the add instruction. A simple solution is to have the compiler introduce dummy instructions, or bubbles, to stall the pipeline. This is costly, and there's really no reason to wait to store the result of the add into $s0 and then retrieve it from $s0 for the sub instruction. It has been computed in time, so we can forward or bypass the value from one internal state register to another, as shown in Figure 4.29 of P&H. Figure 4.30 shows that even this is not always sufficient to avoid a bubble -- using a value being read from memory in one instruction as an operand in the next is impossible without stalling.

3. Control hazards occur when a branch instruction is executed, but subsequent instructions are already in the pipeline behind it -- instructions that are not supposed to be executed at all. If we notice a branch, we need to go back into our pipeline and cancel any instructions that we've started into the pipeline that are no longer going to happen because there was a branch. This will introduce bubbles into the pipeline. We can't start doing more useful work in that slot in the pipeline, because we'd already have had to fetch the instruction, and we don't know what instruction that will be. The bubbles might be inserted by a compiler, in which case we don't have to worry during execution, since the
items in the pipeline after the branch are not going to do any real work. The big danger is that the cancelled instruction could destroy some context before we realize it's not supposed to happen. In this simple pipeline we're probably safe, since nothing is written back to a register or memory until the last step, but longer pipelines may require more care.

Delayed Branching and Branch Prediction

We can try to minimize the effect of control hazards with two techniques: delayed branching and branch prediction.

Consider the execution of this code in a pipelined system:

     j    somewhere
     add  $3, $4, $5

The jump/branch is in the pipeline, but by the time we know it's a branch, the add is already in the pipeline. A compiler can do this on purpose -- we know the add is going to happen, even though we're taking a branch before we get there.

Most modern architectures have a delayed branch of 1 or 2 instruction cycles to allow this optimization. If we don't have something to do in those delay slots, the compiler may have to fill them with nops. This helps eliminate some bubbles in the pipeline, but when we do have nops, that's still a bubble.

Compilers or programmers can also unroll loops to help eliminate branches and keep pipelines full. This is generally a good thing anyway, because branches aren't doing useful work -- just wasted time.

But what about conditional branches?

     bne  ..., loop
     add  ...

If we take the branch, then the add instruction should never have happened, and we have to kill the instruction.

Branch prediction is very useful -- try to determine which instruction is most likely to be executed after a branch, in an attempt to keep the pipeline going. Consider:

     if (C)
       S1
     else
       S2

Which is more likely? Programmers probably make the "then" part the more likely case, so a compiler might want to set things up to start pipelining S1 after the condition is checked.

How about a while loop or a for loop?

     while (C)
       S1

Here, C will be false only
once for the life of the while loop, so the best assumption is to predict a successful branch -- another time around the loop.

The UltraSPARC III actually has special branch instructions that a compiler can use when it knows a certain branch is likely to be taken the vast majority of the time.

Some rules of thumb:

1. If a branch leads deeper into code, assume the branch will fail.
2. Otherwise, assume the branch will be taken.

This gives about an 80% success rate for human-written code. Today's branch prediction techniques in optimizing compilers are more intelligent and clever, and can get more like 98%.

No matter how good our branch prediction is, it will sometimes fail, and we need to be able to make sure instructions can be cancelled. One possibility: allow instructions to do everything but store their result until we're absolutely sure. Another headache: multiple conditional branches in the pipeline.

Pipelined Datapath and Control

We will now consider how to construct a datapath and control to manage the 5-stage pipeline for our MIPS subset.

In Figure 4.33, we see the single-cycle datapath we looked at before, redrawn to show the pipeline phases. For the most part, information flows left-to-right in this diagram. The exceptions (in blue) represent hazards:

- WB puts a result back into the register file -- this is a data hazard.
- MEM may replace the PC with a branch/jump target -- a control hazard.

Figure 4.34 shows instructions being executed by a pipeline:

- Stages are labeled by the components being used in each.
- Note that the register file is written in the first half of a cycle and read in the second half; this reasonable assumption helps us avoid some potential hazards later on.
- In this case, no hazards arise.

We will need to add registers to our datapath to support pipelining. These registers are shown in Figure 4.35:

- Each set of registers holds the values passed between each pair of adjacent stages.
- Each is large enough to hold the necessary values.

The text
presents a series of figures showing the active parts of the pipeline during the execution.

- The top half of Figure 4.36 shows the IF stage:
  * the instruction from memory is retrieved and stored in the IF/ID pipeline registers
  * PC+4 is computed and stored in the IF/ID pipeline registers
  * the PC is updated with either PC+4 or the result of a branch instruction
- The bottom half of Figure 4.36 shows the ID stage:
  * the instruction stored in the IF/ID pipeline registers is used to retrieve 2 values from the register file, which are both sent to the ID/EX pipeline registers
  * the immediate field of the instruction is sign-extended to 32 bits and stored in the ID/EX pipeline registers
  * the PC+4 value is passed along from the IF/ID pipeline registers to the ID/EX pipeline registers for use later
  * we don't need all of these values, but we don't necessarily know which yet, so we pass them all along
- Figure 4.37 shows the EX stage for a lw instruction:
  * the sign-extended immediate value is added to the base register, both of which come from the ID/EX pipeline registers
  * this sum, the effective address for the memory access, is stored in the EX/MEM pipeline registers
- Figure 4.38 (top) shows the MEM stage for a lw instruction:
  * the effective address stored in the EX/MEM pipeline registers is used to retrieve a value from the data memory
  * this value is stored in the MEM/WB pipeline registers
- Figure 4.38 (bottom) shows the WB stage for a lw instruction:
  * the value retrieved from memory, saved in the MEM/WB pipeline registers, is sent back to the register file for storing
- Figure 4.41 adds extra values to the pipeline registers, in recognition of the fact that the register number needs to be retained for the WB stage; if we don't, we'd be using the destination register from a different instruction.
- Figure 4.42 highlights the parts of the datapath that are used for lw.
- Figures 4.39 and 4.40 show the completion of a sw instruction:
  * here we need to remember the value from the register file that is
to be stored in memory; it must be passed along during the EX phase from the ID/EX registers to the EX/MEM registers
  * during MEM, we store the value at the location specified by the effective address, both coming from the EX/MEM pipeline registers
  * the WB stage does nothing for sw
- Figure 4.43 shows a sequence of instructions in a pipeline, and Figure 4.45 shows the instructions in execution at the fifth step of this execution sequence.

Augmenting control to support a pipelined control may seem daunting, but it really is not as bad as we'd expect. We can use the same control lines as we did for the single-cycle implementation, but each stage should be using the control as set for the instruction it is executing. Control values can be stored in the pipeline registers to make this happen. Figure 4.46 shows the pipelined datapath with the control added. We need only store control signals at each stage that are to be used in that or in subsequent stages (Figure 4.50). Figure 4.51 shows the complete pipelined datapath.

Dealing with Hazards

We noticed earlier that our pipelines cannot always operate at full capacity:

- some instructions do not need to use all stages of the pipeline
- instructions may depend on values computed in prior instructions that are still in the pipeline: data hazards
- instructions may begin executing before a previous jump or branch is taken: control hazards

Data Hazards

We first look at enhancements to our pipelined datapath and control that will deal with data hazards. Figure 4.52 shows the dependencies in an unfortunate instruction sequence:

- $2 is computed by the first instruction and is used by the next 4.
- The second and third instructions need the value of $2 before the first instruction has had a chance to store it.
- The last two instructions are fine, remembering that values are written to the register file in the first half of the clock cycle and read in the second half.

Upon closer inspection, though, we see that the values needed by the
second and third instructions are available in pipeline registers, as shown in Figure 4.53:

- the value for $2 in the and instruction is in the EX/MEM pipeline registers
- the value for $2 in the or instruction is in the MEM/WB pipeline registers

We need to detect when these conditions occur and account for them. There are two main cases to deal with:

1. the destination register (Rd) of the instruction in the MEM phase is one of the source registers (Rs or Rt) of the instruction in the EX phase
2. the destination register (Rd) of the instruction in the WB phase is one of the source registers (Rs or Rt) of the instruction in the EX phase

For the first case, we need to divert a value from the EX/MEM pipeline registers as an input to the ALU. In the second, we take a value from the MEM/WB pipeline registers.

The text breaks this into four cases we can check for:

1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt

In our example from Figure 4.52, the sub-and hazard is of type 1a, and the sub-or hazard is of type 2b.

So we want to detect these conditions and perform appropriate forwarding only when it matters -- that is, when the first instruction is actually going to write the ALU result into a register (i.e., its RegWrite control line is 1). We can also skip forwarding if the destination register is $0, since it will never be changed to have any value other than 0.

Figure 4.54 shows the addition of a forwarding unit that can present inputs to the ALU from pipeline registers instead of from the register file. The forwarding unit handles EX or MEM hazards by setting two new control lines, ForwardA and ForwardB. These lines can be set according to these rules:

- for an EX hazard:

    if (EX/MEM.RegWrite and EX/MEM.RegisterRd != 0
        and EX/MEM.RegisterRd = ID/EX.RegisterRs)
      ForwardA = 10

    if (EX/MEM.RegWrite and EX/MEM.RegisterRd != 0
        and EX/MEM.RegisterRd = ID/EX.RegisterRt)
      ForwardB = 10

- for a MEM hazard:

    if
(MEM/WB.RegWrite and MEM/WB.RegisterRd != 0
        and MEM/WB.RegisterRd = ID/EX.RegisterRs)
      ForwardA = 01

    if (MEM/WB.RegWrite and MEM/WB.RegisterRd != 0
        and MEM/WB.RegisterRd = ID/EX.RegisterRt)
      ForwardB = 01

However, the rules for MEM do not take into account that the forwarding from MEM should not take place if there is an EX hazard for the same destination register (a double data hazard), since that value would overwrite (in a non-pipelined world) the value in the register to be used by the EX instruction. The rules for MEM are then augmented to:

    if (MEM/WB.RegWrite and MEM/WB.RegisterRd != 0
        and not (EX/MEM.RegWrite and EX/MEM.RegisterRd != 0
                 and EX/MEM.RegisterRd = ID/EX.RegisterRs)
        and MEM/WB.RegisterRd = ID/EX.RegisterRs)
      ForwardA = 01

    if (MEM/WB.RegWrite and MEM/WB.RegisterRd != 0
        and not (EX/MEM.RegWrite and EX/MEM.RegisterRd != 0
                 and EX/MEM.RegisterRd = ID/EX.RegisterRt)
        and MEM/WB.RegisterRd = ID/EX.RegisterRt)
      ForwardB = 01

Figures 4.56 and 4.57 show the datapath augmented with additional lines and a forwarding unit that can resolve data hazards.

We next consider an even more unfortunate data hazard: a load-use hazard, as shown in Figure 4.58. This one cannot be resolved through forwarding -- the value has not yet been retrieved from the data memory by the time it is needed. In this case, we need to stall the pipeline to wait for the value to become available, as shown in Figure 4.59. This is accomplished by adding a hazard detection unit. Using our notation from before, we know a stall is necessary when:

    ID/EX.MemRead and
      (ID/EX.RegisterRt = IF/ID.RegisterRs or
       ID/EX.RegisterRt = IF/ID.RegisterRt)

Note that we are detecting this condition as early as possible -- when the offending sequence's instructions are in the IF and ID stages. This makes it easier to stall. The stall is accomplished by adding a bubble to the pipeline -- an instruction that does nothing, a nop or "no-op". This requires that:

- the PC is not updated
- the IF/ID pipeline registers are not updated
- the control entries in the ID/EX pipeline register are loaded with all 0s, which results in the
bubble/nop.

The datapath augmented with a hazard detection unit is shown in Figure 4.60.

Stalls reduce the performance -- we're spending time executing a nop instead of meaningful instructions. However, they are essential to ensure correct behavior. An optimizing compiler should be aware of the details of the pipeline, and can rearrange instructions in many cases to avoid the need for a stall at execution time.

Control Hazards

Recall that a control hazard occurs when a branch/jump instruction is being executed and subsequent, partially-executed instructions in the pipeline need to be cancelled, since they never should have been executed. The specifics of our pipeline mean that a conditional branch's outcome is not known until the MEM phase.

Figure 4.61 shows a branch instruction that results in a control hazard -- instructions already in the pipeline that are not to be executed are flushed: their control lines are set to 0.

Figure 4.62 shows how we can reduce the cost of a control hazard by adding hardware to determine the result of a conditional branch sooner, during ID:

- The branch target adder is moved to ID.
- An equality checker is added that compares the values coming out of the register file, so it can quickly determine the result of a beq or bne.

Another technique, which works well with short pipelines such as our 5-stage pipeline, is the use of a branch delay slot:

- Here, we always execute the instruction immediately following a branch or jump.
- This way, a programmer (or, hopefully, a compiler) can reorder instructions so that some useful work can be done in the slot after the branch.
- This is not always possible, so a nop instruction may need to be inserted in some cases.
- However, it is a quite effective way to reduce the cost and frequency of control hazards.

Real implementations of MIPS, among other ISAs, make use of branch delay slots. Figure 4.64 shows some examples of how code can be reordered to take advantage of branch delay slots.

We already
discussed some branch prediction techniques. The text discusses some of them in a bit more detail.
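A small worked example to accompany the cache-organization discussion earlier in these notes: the a[i] + a[i+64] + a[i+128] loop conflicts in a 64-entry direct-mapped cache but fits in a 4-way set associative one. The Python sketch below shows why; the modulo set-index mapping and the parameter values follow the notes' example, while the function name is my own.

```python
def set_index(block_addr, num_sets):
    # A cache with num_sets sets places each block at (address mod num_sets).
    return block_addr % num_sets

i = 0  # every iteration of the loop behaves the same way

# Direct-mapped, 64 entries: 64 sets of 1 line each.
direct = {off: set_index(i + off, 64) for off in (0, 64, 128)}

# 4-way set associative with the same 64 lines: 64 / 4 = 16 sets of 4 lines.
assoc = {off: set_index(i + off, 16) for off in (0, 64, 128)}

print(direct)  # {0: 0, 64: 0, 128: 0} -- all three compete for a single line
print(assoc)   # {0: 0, 64: 0, 128: 0} -- same set, but its 4 ways hold all three
```

All three blocks land in the same set either way; the difference is that the 4-way set has room for all of them at once, so the loop stops thrashing.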
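The static-prediction rules of thumb from the notes (a branch leading deeper into the code is assumed to fail; any other branch is assumed taken) amount to a backward-taken / forward-not-taken test on addresses. A minimal sketch of that reading (the function name and the interpretation as an address comparison are my own):

```python
def predict_taken(branch_pc, target_pc):
    # Backward branches (loop bottoms) are predicted taken; forward branches
    # (which skip deeper into the code) are predicted not taken.
    return target_pc < branch_pc

print(predict_taken(0x40010C, 0x400100))  # True: backward branch, likely a loop
print(predict_taken(0x40010C, 0x400120))  # False: forward branch
```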
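The forwarding-unit rules above, including the double-data-hazard adjustment, translate almost line for line into code. In the Python sketch below, the dictionary keys mirror the EX/MEM, MEM/WB, and ID/EX pipeline-register fields named in the notes, but the data layout itself is my own; only the ForwardA (Rs operand) case is shown, since ForwardB is identical with Rt:

```python
def forward_a(ex_mem, mem_wb, id_ex):
    # Returns the ForwardA mux selector: '10' takes the ALU input from the
    # EX/MEM registers, '01' from the MEM/WB registers, '00' from the
    # register file.
    if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
            and ex_mem["RegisterRd"] == id_ex["RegisterRs"]):
        return "10"  # EX hazard: the most recent result wins
    if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
            and not (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
                     and ex_mem["RegisterRd"] == id_ex["RegisterRs"])
            and mem_wb["RegisterRd"] == id_ex["RegisterRs"]):
        return "01"  # MEM hazard, with no EX hazard on the same register
    return "00"

# sub $2,... just left EX; and $4, $2, $5 now in EX: forward from EX/MEM.
print(forward_a({"RegWrite": True, "RegisterRd": 2},
                {"RegWrite": True, "RegisterRd": 2},
                {"RegisterRs": 2}))  # 10
```

In sequential code the EX-hazard test runs first and returns early, so the `not (...)` guard on the MEM case is technically redundant here; it is kept because the hardware evaluates both conditions simultaneously, and the guard is what makes the priority explicit.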
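The hazard detection unit's load-use stall condition from the notes is a single test on the ID/EX and IF/ID pipeline registers. A sketch in the same spirit (dictionary keys mirror the register fields named in the notes; the layout is my own):

```python
def needs_stall(id_ex, if_id):
    # Stall when the instruction in EX is a load (MemRead) whose destination
    # (Rt) is a source register of the instruction currently in ID.
    return (id_ex["MemRead"]
            and (id_ex["RegisterRt"] == if_id["RegisterRs"]
                 or id_ex["RegisterRt"] == if_id["RegisterRt"]))

# lw $2, 20($1) in EX; and $4, $2, $5 in ID: stall one cycle, then forward.
print(needs_stall({"MemRead": True, "RegisterRt": 2},
                  {"RegisterRs": 2, "RegisterRt": 5}))  # True
```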