### Create a StudySoup account

#### Be part of our community, it's free to join!

Already have a StudySoup account? Login here

# Introduction to Computer Architecture II CS 270

ODU

GPA 3.7

### View Full Document

## 5

## 0

## Popular in Course

## Popular in ComputerScienence

This 112 page Class Notes was uploaded by Armani Kunde on Monday September 28, 2015. The Class Notes belongs to CS 270 at Old Dominion University taught by Larry Wilson in Fall. Since its upload, it has received 5 views. For similar materials see /class/215306/cs-270-old-dominion-university in ComputerScienence at Old Dominion University.

## Reviews for Introduction to Computer Architecture II

### What is Karma?

#### Karma is the currency of StudySoup.

#### You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 09/28/15

Key 710 Notice that we have blocks of size four thus our cache thus each block row in our cache holds four words These word offsets are labeled 00 01 10 and 11 There are four rows labeled 00 01 10 and 11 The sequence we are given is of word addresses To convert this into a sequence of block numbers we divide by 4 and throw away the remainder which is the offset Word 2 will be found in block 0 it is actually the third word in that block Word 3 is also found in block 0 A 0 plus a remainder of 3 It is the 4 h word in that block Word 11 is in block 2 and will the 3rd word in that block Thus ourword reference sequence 231116211364481911322427611 When divided by 4 using integer division becomes our block number reference sequence 002 45 3161242051612 We map these block nos into 4 cache blocks direct mapping mod 4 to get the cache block number sequence 0020130002011212 Thus we get the following map where the contents of each cache block is read L to R timewise CBN is cache block number and BN is block nummmmmber CBN BN 0 0 4 16 12 4 0 1 5 1 2 42 6 2 3 3 To see this considerthe following explanation Word 2 maps to block 0 this gives us a miss and block 0 which contains words 03 is loaded into the cache block 0 mod 4 which is cache block 0 Word 3 maps to block 0 which is in CBN 0 so we have a hit Word 11 maps to block 2 as are all words 811 2 mod 4 is 2 miss so block 2 is stored in CBN 2 Word 16 maps to block 4 as are all words 1619 4 mod 4 is 0 This is a miss so block zero is removed from CBN 0 and replaced by block 4 Word 21 maps to block 5 2023 5 mod 4 is 1 miss so CBN 1 now contains block 5 Word 13 maps to block 3 3 mod 4 is 3 miss so CBN 3 now contains block 13 Word 64 maps to block 16 which in turn maps to CBN 0 miss Word 48 maps to block 12 which in turn maps to CBN 0 miss Word 19 maps to block 4 which in turn maps to CBN 0 miss Word 11 maps to block 2 which in turn maps to CBN 2 hit Word 3 maps to block 0 which in turn maps to CBN 0 miss Word 22 maps to block 5 which in turn maps to CBN 1 hit Word 4 maps to block1 which in turn maps to CBN 1 miss Word 27 maps to block 6 which in turn maps to CBN 2 miss Word 6 maps to block1 which in turn maps to CBN 1 hit Word 11 maps to block 2 which in turn maps to CBN 2 miss Thus we had 4 hits and 121 misses Final contents of Cache CBN 0 contains block number 4 or words 03 CBN 1 contains block number 1 or words 47 CBN 2 contains block number 2 or words 911 CBN 3 contains block number 3 or words 12 15 Whewii 714 The miss penalty is the time to transfer one block from main memory to the cache Assume that it takes 1 clock cycle to send the address to the main memory a Con guration a requires 16 main memory accesses to retrieve a cache block and words of the block are transferred 1 at a time Miss penalty 1 16 X 10 16 X l 177 clock cycles 7 Con guration b requires 4 main memory accesses to retrieve a cache block and words ofthe block are transferred 4 at a time Miss penalty 1 4 X 10 4 X l 45 clock cycles Con guration c requires 4 main memory accesses to retrieve a cache block and words of the block are transferred 1 at a time 0 Miss penalty 1 4 X 10 16 X l 57 clock cycles 728 Two principles apply to this cache behavior problem First a twoway setassociative cache of the same size as a directmapped cache has half as many sets as the directmapped has blocks Second LRU replacement can behave pessimally as poorly as possible for access patterns that cycle through a sequence of addresses that reference more blocks than will fit in a set managed by LRU replacement Consider three addressesicall them A B Cithat all map to the same set in the twoway setassociative cache but to two different sets in the directmapped cache Without loss of generality let A map to one set in the directmapped cache and B and C map to another set Let the access pattem be A B C A B C A and so on The directmapped cache will then have miss miss miss hit miss miss hit and so on With LRU replacement the block at address C will replace the block at the address A in the twoway setassociative cache just in time for A to be referenced again Thus the twoway setassociative cache will miss on every reference as this access pattern repeats Here is a second answer The example we did in class 8 cached blocks and thus used mod 8 for direct mapping and mod 4 for two way lfA has block nos 0 B has nos 4 and C has nos 8 Then for direct mapped A and C go to cached block 0 will be goes to Cache block 4 thus A B C A B C has the miss pattern M M M M H M since B is a hit the second time around While A B and C all are zero mod 4 they all map to set 0 which can hold only two block 80 the ABCABC patter results in all misses This is not quite an example ofthe rst answer since that would require B and C to map to the same cache location rather than A and B 732 Here are the computations for each machine Lecture 3 Chapter 2 CS 270 spring 2008 skip section 27 to save time SecWON 2 7 Sm 0 Pctf vMe J39er vN0 0 j quws skm res JaSSWCJ Section 28 Ascii Table and copy string code Char llln Hu Char Bin Ilu CIIIII Bln HEX Chnr BIII Hex NL I um 0006 00 st mu now an a DD mm 40 IIn mm 60 s 000 I In I x A mom 4 Immm m srx mooln 02 quot0mm 12 a 100mm 42 x 0mm 2 ET 00090 03 II IIIIII c IouuoII 43 c IIUOUH EoT UDUOIIX H s mumnu 4 D 10001 44 a IInquo 64 ENQ 000mm 05 II IquI 1 a 100mm 45 IMHO 9 ACK 000mm no r 010mlquot 25 1 100mm 40 HULIIIU so I351 00001 07 v moon 7 i g Imolll n g llD lll 67 BS 0001000 us I mu moo 2x 3 Ion mm as I III mm mI II I DIXIImI on not a 1 0000 49 X mum at LF 000 mm M mu mm 2A 1 I00 Iqu M 1 III mm M j 0001011 08 4 0w Inn x 101mm 4 k Iu InII FF 000 I mu Hon 2 1 I00 I 100 4c 1 III 1le at CB non IInI 0m Hm 1 mm I H m IIu IIoI an so omIIIu UE ow IIID 2E I1 um um as 11 III In III S omuII DIOIIH 2r a noun a IIuIIIji D I m n on 0000 3 I7 101mm 1 III on0 In DC ml onm 1 le II o IIII mm 5 1 II 0001 I DC loam I2 2 DH 0010 12 1 Im mm 5 2 III mm 7 DC m1 mu 1 3 GIIOI39IH I1 9 10100 53 logy A I mm 14 4 ll DI 1 IOI mm 2 III 01 74 NAK OOI mm I5 5 I ll umI V quotII IIIIII 5 I III 0101 s SYN 001mm In s all one 36 v Im mm 55 v IIImIn 7n man 1 IJIInIII 17 I II III I IIIIIIII 77 CAN ml moo Is a U I000 18 x I01 mm x III Icon 7x I mm W 9 I1 mo 2 Im IUUI 50 y 001 79 SUB 0m Imn IA I OH 1010 1A 2 Im Iulo 5A 3 III mm M ESC IIIII NH 13 l 011 um I E IN I ll 5 7 I OH 78 m I lllm Ic lt on me C III SC I II as om llul ID 0H um 10 1 Im um 50 Il IIuI 7n as um um IE an Inn NE IDI mo in I ma 715 us II1II F 7 ml IIII F IDI IIII 517 URL III III 1F Ahhminuom Iur Cnnlrol Chamnus NlII quImnIInms FF rm and CAN camel son smn ufMmling cu curring mum RM and ufmcdlum 39rx smnuflexl so wIIi qu sun rIIIhinIuh x and flux SI xhi in EC m EOT muarmnsmmkm DLE um link cup is me stpamur ENQ mquiry dzvicl nnlml I GS gmup xLpnmm39 ACE acknowlqu ncz Ihzvimcnulmll RS mmrd Imam El ml DC dnvictcunInl Us unilscpmmr hmbpacc nca dcvicccmllmll 5139 spun at hurimmnl mmlmim NAK IIcguIivI acknnwlcdg um Mae m Amman 5mm Cm nquot L liM cc mummud Iv IniormIIu onInmwhungelASCII F I VT mica tabulullun 1539quot and 0f Imusmmwn block Section 29 9 R655 jig ml camp N0 ad 290 lax ED Ila Sb nll SQJPPD Lpumq all 0 uomppv u spumdo mu A39Jpnds umu uugpmlsug puruq runmpum up uoguunsu duln 3111 91mm 11mm Jul ml pur f gapmdo dmn sq Jo 31111 all Mon 39OU Sm 9a Sm 9 a 00001 s vd lxnu mp uo aalt 1 gm wuuuj 511p mm palqluns x u m1 mum 391 M Sc pmu quotmum aJmu 111 I Aunnn39 mum MLngme N M J mjmm Decoding Machine Code What is the assembly langtmgc statement Corresponding in this machine instructiunf 1114 Mt 39Jlmt Thu lirst step in converting lit decimal to binary is tn lind thi up lik39ldai Uhlx39 3 BS 16 7 m Utttm tltlt lll lllll 1111 111th tm Halt tll lllt rmine the npurntiun Referring to l ignru 1 Ve lnnk at this up licltt tn detc whcn hits 5 I729 are 000 antl hih 28720 art 000 ll 1 an er39ormait instruction Let39s reformat the binary instruction intu RI39umml FICidh t t in igi 2m rl rd Inm gtml t l 0 ltlltUU 11111111111 llltlt LNl up r HOllllUll U1 The bottom portion 0139 Figurv 2 25 determines thc npcrntinn ol39 tn Huimat 39 i are 100 and hits 270 are 101 whirl 11k 15 instructiuni In this raw hits 5 this hlnary puttcrn represents In amt instruction W decode the rest 0139 the imtruttion by looking at the eld values 39I hv dcriy mal values are I39or llic rs Cid IS l39ttr rt In for rd shanii is ltllllhk t ii Figure 218 39 these Ul bL l h 39epresent registers t6 1 H 7 and LW Now we can shnw the assembly inall ucliun rtdtl tallt1i itt nraymn on mm by M Aways mm c z 5 gm Equal esl Pcvmmuve munch PC v I1 100 Mm equal lest PCralume 5un Ench nn n31 equal 39seu on less man v7 Compate lass man 1m rm 7m m39zilan gm 311vm Comnarelessumncmslam unmm n W mm 7 man arlrlrmx an Compare lzss Ihrm ansmnl 4mm Tumquot Jumn m targfl address Jump mama Jlump and HM Fm swlchL mocadure alum h a PC 4 go 1 10000 Fm procedure cal Lectures on Chapter 2 Modi ed for CS 270 spring 270 This is meant to be a patch to provide needed background Section 22 Operations insirum a mmpum 0 KlLI lllc um variables ll md and m pm llwn39 mm in I m In ation is rigid m Hm such MIPS uiiluncuc instruction permum n one upmuon and mum alwws have Xmlly lhrcu m NHL2 lm Examplc hllpp we wnm lo plum lln39 51ml HI 39 m wlcs lquot Ll 21ml w imn dl iath m In 1111 ml n39c heng nlslilwmluly gur Ilmul mu 1 quotvariablequot is in ha um mmquot W explain in mm r6 Thu Ibllnwing sequence ul ilhll39llilm Adds lhc our variables H l 1 l 3 HM rum ul r MM r w m li JLl H m all n w W Th HHl ml H l l l Eu ml u r N Th wu Ml ll r m 39H 2 We Will now take our rst look at the green MIPS reference data card Eventually you Will understand every etail on this card At this time we Will concentrate on the registers primarily NAME NUMBER SVO VI 23 BASIC m smucnou FORMATS ll upcode 1 r1 I t I mcl 3 2a 25 1 2 6 I5 II In 6 5 I opcmlc I 5 immediate 31 26 s z zu xs 5 J opcode 1 address av 24 25 can srnucnou SET MNE OPCODI J MON FOR NET NAME 1C MAT OPERATION in Vcrilog In mid R de ers am I o I 20quot Add 1mmcdiau 1me I Rm qus SignEx mm In 8 Add Innn Unsigned mam I Rn Rn1 SignBIdmm 2 9 Add Unsigned nadn R and Rrs Rln u 21 And and R Rrd Rm amp Rln o 24 1 And Immend am I Rn ers amp ZcmEx mm 3 cm iRRIFS1RIHD rm 0quot Eq m q 39 PCPC4anchLIr I41 4quot iIIRIrsIlzkInII anch On Not Equal hue I PCzPCNmmcthr 4 5 Jump 3 J PCJumpAd lr 5 2m Jump And Link In 1 J M IPC4PCJumpAddr 5 3 Jump chisxer Jr a PcRIrsI 0 as Rl l l2439b0Mers Land syn Unstmi ru I mgnExummm 2 024 Load Halfword um l deH 1639b0MRIs o 5 Unsigned SignExllmm50n 2 39 Lnad Upper 1mm I m 1 Mn limm lh39bOl f Ian Word hi I Rn MRSign xllmm 2 o I 23m Nor m R Rrd RIrsj I Rlnl u 27m or R erdl lel I Rm 0 25m Or llllmedialc on I kin R15 IZImExllmm 3 d ScchssThnn ulr R RtdRIsltRn39 l 0 Draw 5 Less Than Innn 5 I z 1 I W I Sliwlm Bun SnI Lg Than mm Kin Rhs lt SignExllImn Umigmzd quotmquot 39 393 I 0 1x5 b quot 31 3 1m R ani 5 z Rm 139 I 0 a o I 2b x Shin Lc Lugical an R RM Rr51ltlt shun 0 I 00 Shin Righl Logical 31 R klrd Rrs gtgt shuml o I 02 Slorl aym ah I W39sps wmlm39wr zig g m 25m 1 SInm Halfwon n I RN Emma 5 m 2 29 Stan Wold rm I MRIsignixtlmln Rln 2 3 n Sublncl nub R Rer ers RIrI I 0 I 22m Submacl Unsigned auhu R thd le Rln o I 23 I May causd ovur ow exccpuuu 2 SignExIlmrII I I6IimmddiIIIe15J immudinm I 3 ZcmExllmm 1 l l lh immm mc l 439 anchdr l lAIimmcdinlcllle immediate Tho I S JumpAddr I PCIS 128 nddm Z39H a Operands cmsidzmd unsigned numbers vs 2 g camp AHITHMETIC cone INSTRUCTION SET OPCODE MNE FMTFT MON FOR FUNCT NAME 1C MAT OPERATION ch BrancIiOlIFPTrue bclt Fl ilePcondPCPC4BrancllAddr4 HIEIri Brunch 0n FP False um Fl il39FPcandP39 c4smnchAddr141 lugal Dividl du R LoRrs Rn1 RIrsRrl OIJ lla Ilvichnsignd nivu R LoRrsIRrlHiRrsRn a D lb FPAdd Single mus FR will Flfsl1 Fll39l 111040 FPAdd lFfdFde lFIfsFfsl ll D0 1 anxle FR F MFmH llll IO FR FPcond Ffs up Full 239 l 2 n lllOI Ii FP Compare FPcond l FIsll l39sI op Double d39 FR F F l39 l 0 H x is eq 1 ur Le a is lt orlt y is 32 3c uric FP Divide Slnglc 11 FR FLfd Hm I pm HID13 FF Divide r lFrd1Ffdll IFIstFUsH 1 Double dh d FR FW FU H 1111 13 FP Multiply Single muLS FR Ffd Ffs F l 11042 FP Muluply V Frdllelll lFlfle lsz ll Double mum FR Wmmhm 1111 12 FP Subtmct Single suns FR FI39dFfs F lulu41 FP Subtract lFfdFrdl Fl sFfsl Double subd FR 39 F m39m n Lilli ll LoadFPSingIc M l FnlMRrsSignExunun 2 31i 4 LoadFP n Ruxl5ign umim 2 3544 1 39 F39nlMersSignExllmm4 Move From Hi m R R rd Hi Mle Move From Lo mflo R R rd Lu 0 l l2 Move From Cunlml ruch R R rd Cers 16101 10 mult R iLol Rm 39 Rm 04418 Multiply Unsigned multu R lHiLo Rls Rm a 01449 Slum FP singl s Mrers1SigiiExllnunl 1111 2 saw Fl M R rs Si Exllmm n 2 Double 5 I MIRIrsIsigExllmml4l Hm I WM FLOATING POINT INSTRUCTION FORMATS FR encode vim l n I rs raj mcl 3 2625 05 0 ll 20 III I II III Fl rupcodc l full I n immcdimn l 31 Ill E II JD IE IS 0 PSEUDO INSTRUCTION SET NAME MNEMONIC OPERATION Branch Less 39Hlxm n1 z in RlsltRn PC Lubcl Branch Greater Than b9 i RrsgtRrl PC Label Branch Less Than or Equal 1 in Rr ltR Label b 9 PC Branch realcr ml or Equal bga II RrsgtRI1I PC Label Load Immediate 11 R n1 Immedime Move move R rd Rrs Note that items 3 and 4 have been removed as not pertinent to CS 270 spring 2008 Section 23 Operands of the Computer Hardware 5 An example from the text rquotac 39a Usingquot quot instance the assignment statement from our earlier example Tl tlttrllll 5 3 and t 11 respectively What is the compiled M ll S code inlil tuttiw Fti39ltl lrtl snli listl lv 5 if r tier i tti rontains ti ll r it contains l 1 ll t tll rm 7 Mi Wllltll i it ltl39ll39 It is the compilers iob to associate program variables with registers Take or The variables I g ti i and ii are assigned to the registers liO la il 39IWZ The compiled program is verv similar to the prior example except we replace tie 39 39ables with the register names mentioned above plus two temporary registers lt U and l39 twhich correspond to the temporary variables above 6 Memory operands a The MIPS architecture is sometimes classi ed as registerregisterloadstore b Memory storage uses alignment for various fractions of words as illustrated below and explained in class c Big Endian versus little Endian a Slack b Accumulator c Registerrmamory 1d RegtsleHegisrerlloadrstore Processor 39 Mammy Figure 21 Operand locations for four instruction set architecture classes The arrows indicate whether the oper and is an input or the result of the ALU operation or both an input and result Lighter shades indicate inputs and the dark shade indicates the result In a a Top Of Stack register T05 points to the top input operand which is com bined with the operand belowThe rst operand is removed from the stack the result takes the place of the second operand and T05 is updated to point to the result All operands are implicit In b the Accumulator is both an implicit input operand and a result ln c one input operand is a register one is in memoryand the result goes toa register All operands are registers in d and like the stack architecture can be transferred to memory only via sepa rate instructions push or pop for a and load or store for d Value of 3 loworder bits of byte address Width ofobiect 0 1 2 3 4 5 6 7 lulc ilmw ligiicnl Iigiinl Aligned Iigiiml ligiilil Iigiicrl Ilglll39il I IigriLr Ilium iliiill mi39ili llh lt Aligiicil lllit Iljllll39ll Z Mics iliiill mi ilr Mmilignerl I Misiiligiic d I Misiiligncd Misuligned l min Milli ligiicrl igiicd 4 ii In inui39lli Misiiligncd I Nillellgllcrl 4 line ITM l39ill Misuligncll Misiiligncll J MIC immli Misiiligncd illigquot0d X39l silliniihlc mnli Iigiiril x h ll iiluiililc miili Misziligncd x IL lllllllhll39 mrili Misaligiicd Nilulcs rrlnulwlc mrili Misuligncd S MILs lllrilll lC mi39di Misiiligiied S lulu illrllNk39 Mil39lll H i1IClrlllllr39 mrili Misuligncd N lulu irlniililc liii39di Misuligncd Figure 25 Aligned and misaligned addresses of byte halfword word and doubleword objects for byte addressed computers Fm each misaligned example some objects require two memory accesses to complem Every aligned object can always cornplelc in one memory attess as long as the memory is as wide as the objeclThD gure shows the memory organized as 8 byles wideThe byte offsets that label the columns speiify the lowcider 3 bits ol the address Thri c ii c l39o Llil39l39crcnl convcnlions 01 ordcriiig Ilic byes within i lurgcr objccl LilIr Emliui liylc order piils llic byte WhUNC llddll is willquot ii th lcuslsigiiilicuni position iii lhc loublc mrll the liillc cndl Thu hle ii39u iiiiiii bcrcd 39Ji l 7 i 4 3 I l fig furan hyc order puts Ilic byic how uddi css is quotx xlllllquot ail lhc iiiml sigiiilicaiiil pmilion in UK LllelL word the big Glid Tilt hylcs arc llllllllk l39L Ll 7 We will walk through and further illustrate the example on page 55 in class Be sure to pick up the concepts of base register and offset Compiling an Assignment When an Operand Is in Memory Let s assume that 39 is an array of ltlt words and that the compiler has associated the variables Li and it with the registers Isl and 1737 as betore let39s also assume that the starting address or WSL39 address of the array is in t Compile this X assignment statement o ti Ai8lv Although there is a single operation in this 39signment statement one of the operands is in memory so we must rst transfer AM to a register The ad dress of this array element is the sum of the base of the array A found in reg ister quotb5 3 plus the number to select element 8 The data should be placed in a temporary register or use in the next instruction Based on Figure 22 the rst compiled instruction is lw NOAH s3 it Tenant try req tttl gets M8 0n the next page we39ll make a slight adjustment to this instruction but we ll use this simpli ed version for nowt The following instruction can operate on the value in t0 which equals M31 since it is in a register The instruction must add it contained in if 2 to M8 l Lt 0 and put the sum in the register corresponding to 9 associated with lml l atlcl slsZttJ l g r i M8 fer instruction is called the offset and the register i called the base register The constant in a data tr added to form the atldrcn 8 This example uses 1w and sw Page 57 Compiling Using Load and Store Assume varitihle ll is Associated with register 82 and the use address of the army A is in 71 What is the MIPS assembly rude hr the C ssignment state ment below Alli l1t Although there is u single operation in the C staleman now two of the oper ands are in memory so we need even more MIPS instructions The tirst two instructions are the sannc t5 the prior example except this time we use the proper offset For hyte addressing in the loud word instruction to select ME 1 and the add instruction places the sum in rU ll 39l etnmmrv re t MU gel s M8 in ttu this Mitt ti Tetttru39irarv rely lift nets ll MM a PM it H The final instruction stores the sum into AL 11139 using 48 as the offset and register l S 5 as the base register nxtf1j1mt into AUX sw lutt39lvttit bwl 10 We use immediate operands to avoid some memory load operations rvHi i 33quot3 3 i mnmliutu inxh39uctinm ilhmmc Kim lhird hdl39dudrc csign munliunud in Ihu Ilillruicx and Pumlk n L implc 1 In39Vi39QIII I Hll1 ir39ii MAL lhu unmme LIC LN human upwind omu licqucmlv and by indudmg mmmnh i1 illHlL IiUl1th m much Lister Hun i mmmvm werc Inmlui x 11 By now you know all about gure 24 on page 59 Again I suggest avoiding the check yourself material Iwill explain this suggestion in class MIPS operands Innm v in lm mam rm 1mm lh MWquot mm WM lu w l glilw39 IH pmmvm EHYFIHIH39ZN up AUNT mm m rim 1 H nan H simrlhm m MlPS MIFR mm IMP anquot mow v m flu quot rvlmnmw MummyWuawnWummW4 lvlcmmvlmrvirLlxH1mrlwxsa N1v39lwWMlLl 19613031 mullIr an Mvala MIPS assembly language mm add I 1 l l r i 4 Three operands Iain M wngvx AHHHWC HL suhtxacl V 1 r i lllre 39 operands Alam m vegmnm My wunawb 39 39 WHO 0 PM lJIIIHlJHl Fm from mvmmx my Data fmm Hymnv in menu FIGURE 24 MIPS architecture revealed through Section 23 I hghlighml pmlimh whim HI39 awmlwh Lingungr m mlum inumtumi II1 m llnn Section 24 Representing Instructions in the Computer 11 To this point we have been looking at assembler language AL Now we will get up to speed on machine language ML Translating a MIPS Assembly Instruction into a Machine Instruction let39s do the next step in the re nemth ol39 the MIPS language as an example We39ll show the real Mll language ersion ol39 the instruction represented symbolically as it ltl lii l l l 1 rst as a combination of decimal numbers and then of binary numbers The decimal representation is 0 17 8 8 o 32 Each of these segments of an instruction is called a eld The rst and last elds containing 0 and 32 in this case in combination tell the MIPS computer that this instruction performs addition The second eld gives the number of the reg ister that is the rst source operand ol39the addition operation I 17 ls l and the third eld gives the other source operand for the addition 18 l mlr39ihe fourth eld contains the number ol39the register that is to receive the sum 8 nt U The h eld is unused in this instruction so it is set to l Thus this instruction adds register L l to register l s I and places the stun in register M t This instruction can also be represented as elds of binary numbers as op posed to decimal 000000 10001 10010 JlOOD 00000 100000 t3 bus 5 has 5 hits 5 tnts 5 Dll S 6 hits mmug LU In 41m mi quot111quot not U Vpllulhh am me E lg o 32 4 rm veg N quot144 Iquot a 4 E La in i 7 i l 39 1 veg m mr mlmvc mus mix eld day um uppqu in 12 Be sure that you can convert the results below into binary as the machine would see it and to heX as we prefer it You should be very agile translating amongst decimal binary and heX numbers at this time Translating MIPS Assembly Language into Machine Language We can now take an example all the way from what the programmer writes to what the computer executes ll tl has the base ol39the array A and l corre sponds to li the assignment statement M will 4l1 tl2ttttl is compiled into In WU ljt39ttu Etlt l Terrapin at u irwii tilt net L AHMU t ll 39l tt tz tstt ff Templar m w tun Mt lrl39 ll Msitttt l39lllllllllllrll ll Blut r i w ll r r tl39ttitt39l haul into ADM What is the MIPS machine language code for these three instructions For convenience let s rst represent the machine language instructions using decimal numbers From Figure 26 we can determine the three machine lair guage instructions address shall 7 is 35 it 7 739 r 1209 o t 18 8 8 o 32 7 7 43 l 9 8 1200 The Basics of Logic Design I always loved that word Boolean Claude Shannon FEEE SpectrumApril 1992 bhdjmuu 39 39 George Boole in the 18005 Could represent the workings of L electrical switchesr 144 1 31 Introduction B3 32 Gates Truth Tables and Logic Equations B4 33 Combinational Logic B78 34 Using a Hardware Description Language B720 35 Constructing a Basic Arithmetic Logic Unit B726 36 Faster Addition Carry Lookahead B738 37 Clocks B747 38 Memory Elements Flipflops Latches and Registers B749 39 Memory Elements SRAMs and DRAMs B757 310 Finite State Machines B767 311 Timing Methodologies B772 312 Field Programmable Devices B777 313 Concluding Remarks B778 314 Exercises B779 Introduction This appendix provides a brief discussion of the basics of logic design It does not replace a course in logic design nor will it enable you to design signi cant working logic systems If you have little or no exposure to logic design however this appendix will provide suf cient background to understand all the material in this book In addition if you are looking to understand some of the motivation behind how computers are implemented this material will serve as a useful intro duction If your curiosity is aroused but not sated by this appendix the references at the end provide several additional sources of information Section B2 introduces the basic building blocks of logic namely gates Section 133 uses these building blocks to construct simple combinational logic systems which contain no memory If you have had some exposure to logic or digital sysi tems you will probably be familiar with the material in these rst two sections Section 135 shows how to use the concepts of Sections B2 and 133 to design an ALU for the MIPS processor Section 136 shows how to make a fast adder and may 34 Appendix B The Basics of Logic Design asserted signal A signal that is logically true or I deasserted signal A signal that is logically false or 0 be safely skipped if you are not interested in this topic Section 137 is a short intro duction to the topic of clocking which is necessary to discuss how memory ele ments work Section 138 introduces memory elements and Section 139 extends it to focus on random access memories it describes both the characteristics that are important to understanding how they are used in Chapters 5 and 6 and the back ground that motivates many of the aspects of memory hierarchy design in Chapter 7 Section 1310 describes the design and use of nite state machines which are sequential logic blocks If you intend to read Appendix C you should thoroughly understand the material in Sections B2 through B10 But if you intend to read only the material on control in Chapters 5 and 6 you can skim the appendices but you should have some familiarity with all the material except Sec tion B11 Section 1311 is intended for those who want a deeper understanding of clocking methodologies and timing It explains the basics of how edgeetriggered clocking works introduces another clocking scheme and brie y describes the problem of synchronizing asynchronous inputs Throughout this appendix where it is appropriate we also include segments of Verilog to demonstrate how logic can be represented in Verilog which we introduce in Section B4 A more extensive and complete Verilog tutorial appears on t e CD Gates Truth Tables and Logic Equations The electronics inside a modern computer are digital Digital electronics operate with only two voltage levels of interest a high voltage and a low voltage All other voltage values are temporary and occur while transitioning between the values As we discuss later in this section a possible pitfall in digital design is sampling a signal when it not clearly either high or low The fact that computers are digital is also a key reason they use binary numbers since a binary system matches the underlying abstraction inherent in the electronics In various logic families the values and relationships between the two voltage values differ Thus rather than refer to the voltage levels we talk about signals that are logically true or are l or are asserted or signals that are logically false or 0 or deasserted The values 0 and 1 are called complements or inverses of one another Logic blocks are categorized as one of two types depending on whether they contain memory Blocks without memory are called combinatizmal the output of a combmational block depends only on the current input In blocks with memory the outputs can depend on both the inputs and the value stored in memory which is called the state of the logic block In this section and the next we will focus only 32 Gates Truth Tables and Logic Equations B5 on combinational logic After introducing different memory elements in Section B8 we will describe how sequential logic which is logic including state is designed Truth Tables Because a combinational logic block contains no memory it can be completely speci ed by de ning the values of the outputs for each possible set of input values Such a description is normally given as a truth table For a logic block with 11 inputs there are 2 entries in the truth table since there are that many possible combinations of input values Each entry speci es the value of all the outputs for that particular input combination Truth Tables Consider a logic function with three inputs A B and C and three outputs D E and F The function is de ned as follows D is true if at least one input is true E is true if exactly two inputs are true and F is true only if all three inputs are true Show the truth table for this function The truth table will contain 25 8 entries Here it is Truth tables can completely describe any combinational logic function howe ever they grow in size quickly and may not be easy to understand Sometimes we want to construct a logic function that will be 0 for many input combinations and we use a shorthand of specifying only the truth table entries for the nonzero oute puts This approach is used in Chapter 5 and Appendix C combinational logic A logic system whose blocks do not contain memory and hence compute the same output given the same input sequential logic A group of logic elements that contain memory and hence whose value depends on the inputs as well as the current contents of the memory Appendix B The Basics of Logic Design Boolean Algebra Another approach is to express the logic function with logic equations This is done with the use of Boolean algebra named after Boole a 19th century mathee matician In Boolean algebra all the variables have the values 0 or 1 and in typie cal formulations there are three operators l The OR operator is written as as in A B The result of an OR operator is 1 if either of the variables is 1 The OR operation is also called a logical 5am since its result is 1 if either operand is 1 l The AND operator is written as as in A B The result ofan AND operae tor is 1 only if both inputs are 1 The AND operator is also called logical product since its result is 1 only if both operands are 1 l The unary operator NOT is written as A The result ofa NOT operator is 1 only if the input is 0 Applying the operator NOT to a logical value results in an inversion or negation of the value ie if the input is 0 the output is 1 and vice versa There are several laws of Boolean algebra that are helpful in manipulating logic equations l Identitylaw A0 A andA1 A ZeroandOnelawsA11andA0 0 lnverselaws AA 1andAA 0 Associative laws ABC AB CandA B C A gt19 C Distributive laws A B C A B A C and AB C AB AC In addition there are two other useful theorems called DeMorgan s laws that are discussed in more depth in the exercises Any set oflogic functions can be written as a series of equations with an output on the leftehand side of each equation and a formula consisting of variables and the three operators above on the righthand side I I l Commutative laws ABBAandAB BA I I Logic Equations Show the logic equations for the logic functions D E and F described in the previous example 32 Gates Truth Tables and Logic Equations Here s the equation for D D A B C F is equally simple F A B C E is a little tricky Think of it in two parts what must be true for E to be true two of the three inputs must be true and what cannot be true all three cannot be true Thus we can write E as E ABACBCABC We can also derive E by realizing that E is true only if exactly two of the inputs are true Then we can write E as an OR of the three possible terms that have two true inputs and one false input E ABDACEBCA Proving that these two expressions are equivalent is explored in the exercises In Verilog we describe combinational logic whenever possible using the assign statement which is described beginning on page B723 We can write a de nition for E using the Verilog exclusiveeOR operator as a S Si g n E A B A C D and F have even simpler representations which are just like the corresponding C code DAlBlCandFAampBampC Gates Logic blocks are built from gates that implement basic logic functions For exam ple an AND gate implements the AND function and an OR gate implements the OR function Since both AND and OR are commutative and associative an AND or an OR gate can have multiple inputs with the output equal to the AND or OR of all the inputs The logical function NOT is implemented with an inverter that always has a single input The standard representation of these three logic building blocks is shown in Figure 321 Rather than draw inverters explicitly a common practice is to add bubbles to the inputs or output of a gate to cause the logic value on that input line or output line to be inverted For example Figure B22 shows the logic diagram for the function AB using explicit inverters on the left and using bubbled inputs and outputs on the right Any logical function can be constructed using AND gates OR gates and invere sion several of the exercises give you the opportunity to try implementing some common logic functions with gates In the next section we ll see how an imple mentation of any logic function can be constructed using this knowledge gate A device that implements basic logic functions such as AND or OR 38 Appendix B The Basics of Logic Design NOR gate An inverted OR gate NAND gate An inverted AND gate Check Yourself decoder A logic block that has an nebit input and 271 out puts where only one output is asserted for each input combination ZED 3 FIGURE 321 Standard drawing for an AND gate OR gate and an inverter shown from left to right The signals to the left of each symbol are the inputs while the output appears on the right The AND and OR gates both have two inputs Inverters have a single input A A B B FIGURE 322 Logic gate implementation of 13 5 using explicit inverts on the left and using bubbled inputs and output on the right This logic function can be simpli ed to A i or in Verilog A 84 In fact all logic functions can be constructed with only a single gate type if that gate is inverting The two common inverting gates are called NOR and NAND and correspond to inverted OR and AND gates respectively NOR and NAND gates are called universal since any logic function can be built using this one gate type The exercises explore this concept further Are the following two logical expressions equivalent If not nd a setting of the Variables to show they are not I AB6ACEBCZi I BA6CA Combinational Logic In this section we look at a couple of larger logic building blocks that we use heavily and we discuss the design of structured logic that can be automatically implemented from a logic equation or truth table by a translation program Last we discuss the notion of an array of logic blocks Decoders One logic block that we will use in building larger components is a decoder The most common type of decoder has an nebit input and 2 outputs where only one output is asserted for each input combination This decoder translates the 11bit 33 Combinational Logic 439 Out0 439 Out1 4gt Out2 4gt Out3 a Out4 a Out5 OutB Out7 Decoder a A 3bit decoder b The truth table for a Slut decoder FIGURE 331 A 3bit decoder has 3 inputs called 12 11 and 10 and 23 8 outputs called Outo to Out1 Only the output cor responding to the binary value of the input is true as shown in the truth table The label 3 on the input to the decoder says that the input signal is 3 input into a signal that corresponds to the binary value of the 11bit input The outputs are thus usually numbered say Out0 Out1 OutZ L 1 If the value of the input is i then Outi will be true and all other outputs will be false Figure B31 shows a 37bit decoder and the truth table This decoder is called a 3 t0 8 decoder since there are 3 inputs and 8 25 outputs There is also a logic element called an encoder that performs the inverse function of a decoder taking 2 inputs and pref ducing an nebit output Multiplexors One basic logic function that we use quite often in Chapters 5 and 6 is the multi plexer A multiplexer might more properly be called a selector since its output is one ofthe inputs that is selected by a control Consider the tweeinput multiplexer The left side of Figure B32 shows this multiplexer has three inputs two data vale ues and a selector or control value The selector value determines which of the inputs becomes the output We can represent the logic function computed by a tweeinputimultiplexer shown in gate form on the right side of Figure 1332 as C ASBS Multiplexers can be created with an arbitrary number of data inputs When there are only two inputs the selector is a single signal that selects one of the inputs if it is true 1 and the other if it is false 0 If there are 11 data inputs there will need to be flegzn l selecter inputs In this case the multiplexer basically consists of three parts selector value Also called con trol value The control signal that is used to select one ofthe input values of a multiplexer as the output of the multiplexer Appendix B The Basics of Logic Design sum of products A form of logical representation that employs alogical sum OR of products terms joined using the AND operator A A 0 M u C C B 1 B S S FIGURE 332 A twoinput multiplexer on the left and its implementation with gates on the right The multiplexor has two data inputs A and B which are labeled 0 and 1 and one selector input S as well as an output C Implementing multiplexors in Verilog requires a little more work esper cially when they are wider than two inputs We show how to do this beginning on page B723 1 A decoder that generates r1 signals each indicating a different input value 2 An array of r1 AND gates each combining one of the inputs with a signal from the decoder 3 A single large OR gate that incorporates the outputs of the AND gates To associate the inputs with selector values we often label the data inputs numere ically ie 0 1 2 3 r1 7 1 and interpret the data selector inputs as a binary number Sometimes we make use of a multiplexer with undecoded selector sig nals Multiplexors are easily represented combinationally in Verilog using rifexprese sions For larger multiplexors case statements are more convenient but care must be taken so as to synthesize combinational logic TwoLevel Logic and PLAs As pointed out in the previous section any logic function can be implemented with only AND OR and NOT functions In fact a much stronger result is true Any logic function can be written in a canonical form where every input is either a true or complemented variable and there are only two levels of gatesione being AND and the other ORiwith a possible inversion on the nal output Such a rep resentation is called a two level representation and there are two forms called sum of products and product ofsurns A sumeofeproducts representation is a logical sum OR of products terms using the AND operator a product of sums is just the opposite In our earlier example we had two equations for the output E E ABACBCABC and E AB6ACEBCZi 33 Combinational Logic This second equation is in a sumeofeproducts form it has two levels of logic and the only inversions are on individual variables The rst equation has three levels of logic Elaboration We can also write Eas a product of sums E 2iEC2i6sE6A To derive this form you need to use DeMorgan39s theorems which are discussed in the exercises In this text we use the sumeofeproducts form It is easy to see that any logic function can be represented as a sum of products by constructing such a represenr tation from the truth table for the function Each truth table entry for which the function is true corresponds to a product term The product term consists of a logical product of all the inputs or the complements of the inputs depending on whether the entry in the truth table has a 0 or 1 corresponding to this variable The logic function is the logical sum of the product terms where the function is true This is more easily seen with an example Sum of Products Show the sumeofrproducts representation for the following truth table for D Appendix B The Basics of Logic Design programmable logic array PLA A structuredelogic element composed ofa set of inputs and corresponding input complements and two stages of logic the rst generating prod uct terms of the inputs and input complements and the sec ond generating sum terms ofthe product terms Hence PLAs implement logic functions as a sum ofproducts minterms Also called prod uct terms A set of logic inputs joined by conjunction AND operations the product terms form the first logic stage of the programmable logic array FLA There are four product terms since the function is true 1 for four different input combinations These are 0 0 E B N we 5 BC Thus we can write the function for D as the sum of these terms D XECAB6AE6ABC Note that only those truth table entries for which the function is true genere ate terms in the equation We can use this relationship between a truth table and a twoelevel representa tion to generate a gateelevel implementation of any set of logic functions A set of logic functions corresponds to a truth table with multiple output columns as we saw in the example on page 135 Each output column represents a different logic function which may be directly constructed from the truth table The sumeofeproducts representation corresponds to a common structured logic implementation called a programmable logic array PLA A PLA has a set of inputs and corresponding input complements which can be implemented with a set of inverters and two stages of logic The rst stage is an array of AND gates that form a set of product terms sometimes called minterms each product term can consist of any of the inputs or their complements The second stage is an array of OR gates each of which forms a logical sum of any number of the prod uct terms Figure B33 shows the basic form of a FLA A PLA can directly implement the truth table of a set of logic functions with multiple inputs and outputs Since each entry where the truth table is true requires a product term there will be a corresponding row in the PLA Each out put corresponds to a potential row of OR gates in the second stage The number of OR gates corresponds to the number of truth table entries for which the output is true The total size of a FLA such as that shown in Figure 333 is equal to the sum of the size of the AND gate array called the AND plane and the size of the OR gate array called the OR plane Looking at Figure 1333 we can see that the size of the AND gate array is equal to the number of inputs times the number of different product terms and the size of the OR gate array is the number of outputs times the number of product terms A FLA has two characteristics that help make it an ef cient way to implement a set of logic functions First only the truth table entries that produce a true Value for at least one output have any logic gates associated with them Second each diff 33 Combinational Logic lnPUtS AND gates Producttermsl l l l l l l l OR gates Outputs FIGURE 333 The basic form of a FLA consists of an array of AND gates followed by an array of OR ga es Each entry in the AND gate array is a product term consisting of any number of inputs or inverted inputsr Each entry in the OR gate array is a sum term consisting of any number of these product termsr ferent product term will have only one entry in the PLA even if the product term is used in multiple outputs Let s look at an examp e P LAs Consider the set of logic functions de ned in the example on page BS Show a PLA implementation of this example for D E and F Here is the truth table we constructed earlier 0 0 1 1 0 0 1 1 HOHOHOHO HHHHHHHO OHHOHOOO HOOOOOOO Appendix B The Basics of Logic Design readeonly memory ROM A memory whose contents are designated at creation time after which the contents can only be read ROM is used as structured logic to implement a set of logic functions by using the terms in the logic functions as address inputs and the out puts as bits in each word ofthe memo programmable ROM PROM A form of readeonly memory that can be pro grammed when a designer knows its contents Since there are seven unique product terms with at least one true value in the output section there will be seven columns in the AND plane The number of rows in the AND plane is three since there are three inputs and there are also three rows in the OR plane since there are three outputs Figure B34 shows the resulting PLA with the product terms corresponding to the truth table entries from top to bottom Rather than drawing all the gates as we did in Figure 334 designers often show just the position of AND gates and OR gates Dots are used on the intersece tion of a product term signal line and an input line or an output line when a cor responding AND gate or OR gate is required Figure B35 shows how the PLA of Figure B34 would look when drawn in this way The contents of a PLA are xed when the PLA is created although there are also forms of PLAelike structures called PALS that can be programmed electronically when a designer is ready to use them ROMS Another form of structured logic that can be used to implement a set of logic functions is a read only memory ROM A ROM is called a memory because it has a set of locations that can be read however the contents of these locations are xed usually at the time the ROM is manufactured There are also programma ble ROMS PROMs that can be programmed electronically when a designer knows their contents There are also erasable PROMs these devices require a slow erasure process using ultraviolet light and thus are used as readeonly memories except during the design and debugging process A ROM has a set of input address lines and a set of outputs The number of addressable entries in the ROM determines the number of address lines if the ROM contains 2 addressable entries called the height then there are m input lines The number of bits in each addressable entry is equal to the number of out put bits and is sometimes called the width of the ROM The total number ofbits in the ROM is equal to the height times the width The height and width are some times collectively referred to as the shape of the ROM A ROM can encode a collection of logic functions directly from the truth table For example if there are 11 functions with m inputs we need a ROM with m address lines and 2 entries with each entry being 11 bits wide The entries in the input portion of the truth table represent the addresses of the entries in the ROM while the contents of the output portion of the truth table constitute the contents of the ROM If the truth table is organized so that the sequence of entries in the input portion constitute a sequence of binary numbers as have all the truth tables we have shown so far then the output portion gives the ROM contents in order as well In the previous example starting on page B713 there were three inputs 33 Combinational Logic 315 Inputs A 21111111 Outputs D FIGURE 334 The FLA for implementing the logic function described above Inputs AND plane Outputs D OR plane E F FIGURE 335 A FLA drawn using dots to indicate the components of the product terms and sum terms in the array Rather than use inverters on the gates usually all the inputs are run the width of the AND plane in both true and complement formsr A dot in the AND plane indicates that the input or its inverse occurs in the product termr A dot in the OR plane indicates that the corresponding product term appears in the corresponding output Appendix B The Basics of Logic Design and three outputs This leads to a ROM with 25 8 entries each 3 bits wide The contents of those entries in increasing order by address are directly given by the output portion of the truth table that appears on page B713 ROMs and PLAs are closely related A ROM is fully decoded it contains a full output word for every possible input combination A PLA is only partially decoded This means that a ROM will always contain more entries For the earlier truth table on page B713 the ROM contains entries for all eight possible inputs whereas the PLA contains only the seven active product terms As the number of inputs grows the number of entries in the ROM grows exponentially In contrast for most real logic functions the number of product terms grows much more slowly see the examples in Appendix C This difference makes PLAs generally more ef cient for implementing combinational logic functions ROMs have the advantage of being able to implement any logic function with the matching nume ber of inputs and outputs This advantage makes it easier to change the ROM con tents if the logic function changes since the size of the ROM need not change In addition to ROMs and PLAs modern logic synthesis systems will also trans late small blocks of combinational logic into a collection of gates that can be placed and wired automatically Although some small collections of gates are usue ally not area ef cient for small logic functions they have less overhead than the rigid structure of a ROM and PLA and so are preferred For designing logic outside of a custom or semicustom integrated circuit a common choice is a eld programming device we describe these devices in Sec tion B12 Don t Cares Often in implementing some combinational logic there are situations where we do not care what the value of some output is either because another output is true or because a subset of the input combinations determines the values of the out puts Such situations are referred to as don t cures Don t cares are important because they make it easier to optimize the implementation of a logic function There are two types of don t cares output don t cares and input don t cares both ofwhich can be represented in a truth table Output don t cures arise when we don t care about the value of an output for some input combination They appear as Xs in the output portion of a truth table When an output is a don t care for some input combination the designer or logic optimization program is free to make the output true or false for that input combination Input don t cures arise when an output depends on only some of the inputs and they are also shown as Xs though in the input portion of the truth table 33 Combinational Logic Don39t Cares Consider a logic function with inputs A B and C de ned as follows I If A or C is true then output D is true Whatever the value of B I If A or B is true then output E is true whatever the value of C I Output F is true if exactly one of the inputs is true although we don t care about the value of F whenever D and E are both true Show the full truth table for this function and the truth table using don t cares How many product terms are required in a PLA for each of these Here s the full truth table without don t cares HHHHOOOO HHOOHHOO HOOHOHHO This requires seven product terms Without optimization The truth table written with output don t cares looks like ANSWER Appendix B The Basics of Logic Design bus ln logic design a collection of data lines that is treated together as a single logical sig nal also a shared collection of lines with multiple sources and uses If we also use the input don t cares this truth table can be further simpli ed to yield This simpli ed truth table requires a PLA with four minterms or it can be implemented in discrete gates with one twoeinput AND gate and three OR gates two with three inputs and one with two inputs This compares to the original truth table that had seven minterms and would require four AND gates Logic minimization is critical to achieving ef cient implementations One tool useful for hand minimization of random logic is Kamaugh maps Karnaugh maps represent the truth table graphically so that product terms that may be combined are easily seen Nevertheless hand optimization of signi cant logic functions using Karnaugh maps is impractical both because of the size of the maps and their complexity Fortunately the process of logic minimization is highly mechane ical and can be performed by design tools In the process of minimization the tools take advantage of the don t cares so specifying them is important The text book references at the end of this appendix provide further discussion on logic minimization Karnaugh maps and the theory behind such minimization algoe rithms Arrays of Logc Elements Many of the combinational operations to be performed on data have to be done to an entire word 32 bits of data Thus we often want to build an array of logic ele ments which we can represent simply by showing that a given operation will hape pen to an entire collection of inputs For example we saw on page 379 what a 17 bit multiplexor looked like but inside a machine much of the time we want to select between a pair of buses A bus is a collection of data lines that is treated together as a single logical signal The term bus is also used to indicate a shared collection of lines with multiple sources and uses especially in Chapter 8 where IO buses were discussed 33 Combinational Logic For example in the MIPS instruction set the result of an instruction that is written into a register can come from one of two sources A multiplexor is used to choose which of the two buses each 32 bits wide will be written into the Result register The 17bit multiplexor which we showed earlier will need to be replicated 32 times We indicate that a signal is a bus rather than a single 17bit line by showing it with a thicker line in a gure Most buses are 32 bits wide those that are not are explicitly labeled with their width When we show a logic unit whose inputs and outputs are buses this means that the unit must be replicated a suf cient number of times to accommodate the width of the input Figure B36 shows how we draw a multiplexor that selects between a pair of 327bit buses and how this expands in terms of 17bit7wide multiplexors Sometimes we need to construct an array of logic elements where the inputs for some elements in the array are outputs from earlier elements For example this is how a multibitewide ALU is constructed In such cases we must explicitly show how to create wider arrays since the indiVide ual elements of the array are no longer independent as they are in the case of a 327 bitewide multiplexor Select Select A 32 A31 M M u 32 c u C31 B 32 X B31 X A30 U C30 B30 X A0 U CO x BO a A 32rbll WIde 24071 multlplexor b The 32rbllwlde multlplexor Is actually an array of 32 lrbll multlplexors FIGURE 336 A multiplexor is arrayed 32 times to perform a selection between two 32 bit inputs Note that there is still only one data selection signal used for all 32 17bit multiplexers Appendix B The Basics of Logic Design Check Yourself hardware description language Aprog ramminglane guage for describing hardware used for generating simulations of ahardware design and also as input to synthesis tools that can generate actual hardware verilog One of the two most common hardware description languages VHDL One ofthe two most common hardware description languages Parity is a function where the output depends on the number of Is in the input For an even parity function the output is 1 if the input has an even number of ones Suppose a ROM is used to implement an even parity function with a 47bit input Which of A B C or D represents the contents of the ROM 0 1 2 3 4 5 6 7 8 HOOOOOOOO OHHHHHHHH OHOHOHOHO HOHOHOHOH Using a Hardware Description Language Today most digital design of processors and related hardware system is done using a hardware description language Such a language serves two purposes First it provides an abstract description of the hardware to simulate and debug the design Second with the use of logic synthesis and hardware compilation tools this description can be compiled into the hardware implementation In this section we introduce the hardware description language Verilog and show how it can be used for combinational design In the rest of the appendix we expand the use of Verilog to include design of sequential logic In optional sec tions of Chapters 5 and 6 which appear on the CD we use Verilog to describe prof cessor implementations Verilog is one of the two primary hardware description languages the other is VHDL Verilog is somewhat more heavily used in industry and is based on C as opposed to VHDL which is based on Ada The reader generally familiar with C will nd the basics ofVerilog which we use in this appendix easy to follow Reade 34 Using a Hardware Description Language ers already familiar with VHDL should nd the concepts simple provided they have been exposed to the syntax of C Verilog can specify both a behavioral and a structural de nition of a digital sys tem A behavioral speci cation describes how a digital system functionally opere ates A structural speci cation describes the detailed organization of a digital system usually using a hierarchical description A structural speci cation can be used to describe a hardware system in terms of a hierarchy of basic elements such as gates and switches Thus we could use Verilog to describes the exact contents of the truth tables and datapath of the last section With the arrival of hardware synthesis tools most designers now use Verilog or VHDL to structurally describe only the datapath relying on logic synthesis to generate the control from a behavioral description In addition most CAD sys tems provide extensive libraries of standardized parts such as ALUs multiplexors register les memories programmable logic blocks as well as basic gates Obtaining an acceptable result using libraries and logic synthesis requires that the speci cation be written with an eye toward the eventual synthesis and the desired outcome For our simple designs this primarily means making clear what we expect to be implemented in combinational logic and what we expect to require sequential logic In most of the examples we use in this section and the remainder of this appendix we have written the Verilog with the eventual synthe sis in mind Data Types and Operators in Verilog There are two primary datatypes in Verilog A wire speci es a combinational signal N A reg register holds a value which can vary with time A reg need not necessarily correspond to an actual register in an implementation although it often wi A register or wire named X that is 32 bits wide is declared as an array reg 3101 X or w re 3101 Xwhich also sets the index of0 to designate the least signi cant bit of the register Because we often want to access a sub eld of a register or wire we can refer to contiguous set of bits of a register or wire with the notation S ta rti n g b i t endi n g bi t where both indices must be con stant values An array of registers is used for a structure like a register le or memory Thus the declaration reg 3101 registerfi1e0311 speci es a variable register le that is equivalent to a MIPS register le where reg ister 0 is the rst When accessing an array we can refer to a single element as in C using the notation regi Ste rfi 1 e I regnum behavioral speci cation Describes how a digital system operates functionally structural speci cation Describes how a digital system is organized in terms of a hierarchical connece tion of elements hardware synthesis tools Computereaided design software that can generate a gateelevel design based on behavioral descriptions of a digital system wire In Verilog speci es a combinational sig39nal reg InVerilog aregister Appendix B The Basics of Logic Design Check Yourself The possible values for a register or wire in Verilog are l 0 or 1 representing logical false or true I 2 representing unknown the initial value given to all registers and to any wire not connected to something I 2 representing the higheimpedance state for tristate gates which we will not discuss in this appendix Constant values can be speci ed as decimal numbers as well as binary octal or hexadecimal We often want to say exactly how large a constant eld is in bits This is done by pre xing the value with a decimal number specifying its size in bits For example I 4 b 010 O speci es a 47bit binary constant with the value 4 as does 4 d4 l i 8 M speci es an 87bit constant with the value 74 in twos complement representation Values can also be concatenated by placing them within 1 l separated by come mas The notation x bi t fi el d replicates bi t fi el d x times For example I 1612 1301 l creates a 32bit value with the pattern 0101 01 l Al3116 B150 creates a value whose upper 16 bits come from A and whose lower 16 bits come from B Verilog provides the full set of unary and binary operators from C including the arithmetic operators 7 I the logical operators 8 the comparison operators 1 gt lt lt gt the shift operators ltlt gtgt and Cs conditional operator which is used in the form condition 7 exp r1 exp r2 and returns exp r1 if the condition is true and exp r 2 if it is false Verilog adds a set of unary logic reduction operators 8 A that yield a single bit by applying the logical operator to all the bits of an operand For example ampA returns the value obtained by ANDing all the bits of A together and AA returns the reduction obtained by using exclusive OR on all the bits of A Which of the following de ne exactly the same value 1 8 b11110000 2 8 hFO 3 8 d240 4 H411 blll41 bllll 5 4 b14 b01 34 Using a Hardware Description Language Structure of a Verilog Program AVerilog program is structured as a set of modules which may represent anything from a collection of logic gates to a complete system Modules are similar to classes in C although not nearly as powerful A module speci es its input and output ports which describe the incoming and outgoing connections of a mod ule A module may also declare additional Variables The body of a module con sists of l initial constructswhich can initialize reg variables I continuous assignments which de ne only combinational logic I always constructs which can de ne either sequential or combinational logic instances of other modules which are used to implement the module being de ned Representing Complex Combinational Logo in Verilog A continuous assignment which is indicated with the keyword a s s i g n acts like a combinational logic function the output is continuously assigned the Value and a change in the input Values is re ected immediately in the output Value Wires may only be assigned Values with 39 39 Using 39 assign ment we can de ne a module that implements a halfeadder as Figure B41 shows Assign statements are one sure way to write Verilog that generates combinae tional logic For more complex structures however assign statements may be awkward or tedious to use It is also possible to use the always block of a module to describe a combinational logic element although care must be taken Using an always block allows the inclusion of Verilog control constructs such as if thene else case statementsf07 statements and repeat statements to be used These state ments are similar to those in C with small changes An always block speci es an optional list of signals on which the block is sen sitive in a list starting with The always block is reeValuated if any of the listed module naltiadder ABSumCarry input AB two 17bit inputs output Sum Carry two 17bit outputs assign Sum A A B sum is A xor B assign Carry A amp B Carry is A and B endmodule FIGURE 341 A Verilog module that defines a haltadder using continuous assignments Appendix B The Basics of Logic Design sensitivitylist The list of signals that speci es when an always block should be reevaluated signals changes value if the list is omitted the a l ways block is constantly reevale uated When an a l ways block is specifying combinational logic the sensitivity list should include all the input signals If there are multiple Verilog statements to be executed in an a l ways block they are surrounded by the keywords b e gi n and end which take the place of the l and l in C An a l ways block thus looks like always list of signals tnat cause reevaluation begin Verilog statements including assignments and otner control statements end blocking assignment In Vere ilog an assignment that come pletes before the execution of the next statement nonblocking assignment An assignment that continues after evaluating the righthand side assigning the leftehand side the value only after all rightehand sides are evaluated Reg variables may only be assigned inside an a l ways block using a procedural assignment statement as 139 quot 39 L J from 39 39 we saw ear lier There are however two different types of procedural assignments The assignment operator executes as it does in C the righthand side is evaluated and the leftehand side is assigned the value Furthermore it executes like the nor mal C assignment statement that is it is completed before the next statement is executed Hence the assignment operator has the name blocking assignment This blocking can be useful in the generation of sequential logic and we will return to it shortly The other form of assignment nonblocking is indicated by lt In nonblocking assignment all rightehand sides of the assignments in an always group are evaluated and the assignments are done simultaneously As a rst example of combinational logic implemented using an always block Figure B42 shows the implementation of a 4etoel multiplexor which uses a case con struct to make it easy to write The case construct looks like a C swi tcn state ment Figure B43 shows a de nition of a MIPS ALU which also uses a case statement module Mult4t01 ln1ln2ln3ln4SelOut input 310 ln1 ln2 ln3 ln4 toun 32bit inputs input 10 Sel selector signal output reg 310 0ut 32bit output always lnl ln2 ln3 ln4 Sel case Sel a 47gt1 multiplexor 0 Out lt lnl 1 Out lt ln2 2 Out lt ln3 default Out lt ln4 endcase endmodule FIGURE 342 A Verilog definition of a 4to1 multiplexer with 32bit inputs using a case statement The c a s e statement acts like a C swi tcn statement except in Verilog only the code associe ated with the selected case is executed as if each case state had a break at the end and there is no fall through to the next statement 34 Using a Hardware Description Language modue MIPSALU ALUcti A B ALUOut Zero input 30 ALUct input 310 AB output reg 310 ALUOut output Zero assign Zero ALUOut Zero is true it ALUOut is 0 goes anywhere aways ALUcti A B reevauate it these change case ALUcti 0 ALUOut lt A amp B ALUOut lt A l B ALUOut lt A B ALUOut lt A r B 7 ALUOut lt A lt B 7 10 12 ALUOut lt A l B resuit is nor owl detauit ALUOut lt 0 deicaut to 0 shouid not happen endcase endmoduie FIGURE 343 A Verilog behavioral definition of a MIPS ALU This could be synthesized using a module library containing basic arithmetic and logical operation 5 Since only reg Variables may be assigned inside a ways blocks when we want to describe combinational logic using an aways block care must be taken to ensure that the reg does not synthesize into a register A Variety of pitfalls are described in the Elaboration below Elaboration Continuous assignment statements always yield combinational logic but other Verilog structures even when in a ways blocks can yield unexpected results during logic synthesis The most common problem is creating sequential logic by imply ing the existence of a latch or register which results in an implementation that is both slower and more costly than perhaps intended To ensure that the logic that you intend to be combinational is synthesized that way make sure you do the following 1 Place all combinational logic in a continuous assignment or an a ways block 2 Make sure that all the signals used as inputs appear in the sensitivity list of an aiways block 9 Ensure that every path through an aiways block assigns a value to the exact same set of bits The last of these is the easiest to overlook read through the example in Figure 8515 to convince yourself that this property is adhered to Appendix B The Basics of Logic Design Check Yourself ALU n Arthritic Logic Unit or rare Arithmetic Logic Unit A random number generator supplied 15 stan dard with all computer sys terns Stan KellyeBootle The Devil s DP Dictionary 1981 Assuming all values are initially zero what are the values of A and B after executing this Verilog code inside an always block C1 A lt C B C Constructing a Basic Arithmetic Logic Unit The arithmetic logic unit ALU is the brawn of the computer the device that per forms the arithmetic operations like addition and subtraction or logical opera tions like AND and OR This section constructs an ALU from four hardware building blocks AND and OR gates inverters and multiplexors and illustrates how combinational logic works In the next section we will see how addition can be sped up through more clever designs Because the MIPS word is 32 bits wide we need a 327bitewide ALU Let s assume that we will connect 32 17bit ALUs to create the desired ALU We ll there fore start by constructing a 17bit ALU A 1Bit ALU The logical operations are easiest because they map directly onto the hardware components in Figure 1321 The 17bit logical unit for AND and OR looks like Figure 1351 The multiplexor on the right then selects nAND b or 1 OR b depending on whether the value of Operation is 0 or 1 The line that controls the multiplexor is shown in color to dise tinguish it from the lines containing data Notice that we have renamed the con trol and output lines of the multiplexor to give them names that re ect the function of the ALU The next function to include is addition An adder must have two inputs for the operands and a singleebit output for the sum There must be a second output to pass on the carry called CnrryOutSince the CarryOut from the neighbor adder must be included as an input we need a third input This input is called CnrryIn Figure B52 shows the inputs and the outputs of a 17bit adder Since we know Operation FIGURE 351 The 1bit logical unit for AND and OR 5 u Constructing a Basic Arithmetic Logic Unit Carryln CarryOut FIGURE 352 A 1bit adder This adder is called a full adder it is also called a 32 adder because it has 3 inputs and 2 outputs An adder with only the a and b inputs is called a 22 adder or half adder FIGURE 353 Input and output specification for a 1bit adder what addition is supposed to do we can specify the outputs of this black box based on its inputs as Figure B53 demonstrates We can express the output functions CarryOut and Sum as logical equations and these equations can in turn be implemented with logic gates Let s do Carry Out Figure B54 shows the Values of the inputs when CarryOut is a 1 We can turn this truth table into a logical equation CarryOut b CarryIn a A CarryIn a b a b CarryIn If a b CarryIn is true then all of the other three terms must also be true so we can leave out this last term corresponding to the fourth line of the table We can thus simplify the equation to CarryOut b CarryIn a CarryIn a b Figure B55 shows that the hardware within the adder black box for CarryOut consists of three AND gates and one OR gate The three AND gates correspond 328 Appendix B The Basics of Logic Design HHHO HHOH HOHH FIGURE 354 Values of the inputs when CarryOut is a 1 Carryln CarryOut FIGURE 355 Adder hardware for the carry out signal The rest of the adder hardware is the logic for the Sum output given in the equation on page B728 exactly to the three parenthesized terms of the formula above for CarryOut and the OR gate sums the three terms The Sum bit is set when exactly one input is 1 or when all three inputs are 1 The Sum results in a complex Boolean equation recall that a means NOT a Sum a b CarryIn a b CarryIn a b Carryln a b Carryln The drawing of the logic for the Sum bit in the adder black box is left as an exercise Figure B56 shows a 17bit ALU derivedby combining the adder with the earlier components Sometimes designers also want the ALU to perform a few more sime ple operations such as generating 0 The easiest way to add an operation is to expand the multiplexor controlled by the Operation line and for this example to connect 0 directly to the new input of that expanded multiplexor 5 u Constructing a Basic Arithmetic Logic Unit Operation Carryln l AResult CarryOut FIGURE 356 A 1bit ALU that performs AND OR and addition see Figure 355 A 32Bit ALU Now that we have completed the 17bit ALU the full 327bit ALU is created by con necting adjacent black boxes Using xi to mean the ith bit of x Figure B57 shows a 32bit ALU Just as a single stone can cause ripples to radiate to the shores of a quiet lake a single carry out of the least signi cant bit ResultO can ripple all the way through the adder causing a carry out of the most signi cant bit Result31 Hence the adder created by directly linking the carries of 17bit adders is called a ripple carry adder We ll see a faster way to connect the 17bit adders starting on page B738 Subtraction is the same as adding the negative version of an operand and this is how adders perform subtraction Recall that the shortcut for negating a two s complement number is to invert each bit sometimes called the one s complement and then add 1LTo invert each bit we simply add a 21 multiplexor that chooses between b and b as Figure B58 shows Suppose we connect 32 of these 17bit ALUs as we did in Figure B57 The added multiplexor gives the option of b or its inverted value depending on Bin vert but this is only one step in negating a two s complement number Notice that the least signi cant bit still has a Carryln signal even though it s unnecessary for addition What happens if we set this Carryln to 1 instead of 0 The adder will then calculate a b 1 By selecting the inverted version of b we get exactly what we want abl abl a7b aib Appendix B The Basics of Logic Design Operation Carryln Carryln 30 b0 ALUO 4 ResultO CarryOut a1 Carryln b1 ALU1 Result1 CarryOut a2 Carryln b2 ALU2 4gt Result2 CarryOut 331 Carryln Result31 331 AL U31 FIGURE 357 A 32h ALU constructed from 32 lab ALUs CarryOut ofthe less signi cant bit is connected to the Carryln of the more signi cant bit This organization is called ripple carry The simplicity of the hardware design of a two s complement adder helps explain why two s complement representation has become the universal standard for intei ger computer arithmetic A MIPS ALU also needs a NOR function Instead of adding a separate gate for NOR we can reuse much of the hardware already in the ALU like we did for sub tract The insight comes from the following truth about NOR a b a b That is NOT a OR b is equivalent to NOT a AND NOT b This fact is called DeMorgan s theorem and is explored in the exercises in more dept Since we have AND and NOT b we only need to add NOT a to the ALU Figure B59 shows that change 35 Constructing a Basic Arithmetic Logic Unit 331 Binvert Operation Carryln a Result CarryOut FIGURE B58 A 1bit ALU that performs AND OR and addition on a and b or a and B By selecting b Binvert 1 and setting Carryln to 1 in the least signi cant bit of the ALUwe get two s compler ment subtraction of b from a instead of addition of b to a Ainvert Operation Binvert Carryln CarryOut FIGURE B59 A 1bit AILU that performs AND OR and addition on a and b or E and B By selecting a AinVert 1 and b Binvert 1 we get a NOR b instead of a AND bi Appendix B The Basics of Logic Design Tailoring the 32Bit ALU to MIPS These four operationsiadd subtract AND ORiare found in the ALU of almost every computer and the operations of most MIPS instructions can be performed by this ALU But the design of the ALU is incomplete One instruction that still needs support is the set on less than instruction S l t Recall that the operation produces 1 if rs lt rt and 0 otherwise Conse quently Sl t will set all but the least signi cant bit to 0 with the least signi cant bit set according to the comparison For the ALU to perform S l t we rst need to expand the threeeinput multiplexor in Figure B58 to add an input for the Sl t result We call that new input Less and use it only for S l t The top drawing of Figure 13510 shows the new 17bit ALU with the expanded multiplexor From the description of S l t above we must connect 0 to the Less input for the upper 31 bits of the ALU since those bits are always set to 0 What remains to consider is how to compare and set the least signi cant bit for set on less than instructions What happens if we subtract b from a If the difference is negative then a lt b since aeblt0gtaebblt0b gtaltb We want the least signi cant bit of a set on less than operation to be a 1 if a lt b that is a 1 if a 7b is negative and a 0 if it s positive This desired result corresponds exactly to the sign bit values 1 means negative and 0 means positive Following this line of argument we need only connect the sign bit from the adder output to the least signi cant bit to get set on less than Unfortunately the Result output from the most signi cant ALU bit in the top of Figure 13510 for the Sl t operation is not the output of the adder the ALU out put for the S l t operation is obviously the input value Less Thus we need a new 17bit ALU for the most signi cant bit that has an extra output bit the adder output The bottom drawing of Figure 13510 shows the design with this new adder output line called Set and used only for S l t As long as we need a special ALU for the most signi cant bit we added the over ow detece tion logic since it is also associated with that bit Alas the test of less than is a little more complicated than just described because of over ow as we explore in the exercises Figure B511 shows the 32bit ALU Notice that every time we want the ALU to subtract we set both CarryIn and Binvert to 1 For adds or logical operations we wantboth control lines to be 0 We can therefore simplify control of the ALU by combining the Carryln and Binvert to a single control line called Bnegate 35 Constructing a Basic Arithmetic Logic Unit Aim Operation Binvert Carryln l CarryOut Aimen Operation B39n ert Carryln a i 0 1 H b 7 o 1 Less Over ow a Over ow detection FIGURE 3510 Top A 1bit ALU that performs AND OR and addition on a and b or b and bottom a 1bit ALU for the most sig fieant bit The top drawing includes a direct input that is connected to perform the set on less than operation see Figure Bill the bottom has a direct output from the adder for the less than comparison called Set See Exercise 324 to see how to calculate over ow with fewer inputsi Appendix B The Basics of Logic Design Binvert Ainvert Carryln Operation llll a0 4 Carryln b0 4 ALUO Less CarryOut a1 4 Carryln b1 4 0 4 Less a2 a Carryln b2 4 ALU2 CarryOut Carryln ALU31 Less Carryln ResuItO Result1 Result2 Result31 Overflow FIGURE 351 A 32bit ALU constructed from the 31 copies of the 1bit ALU in the top of Figure 3510 and one 1bit ALU in the bottom of that figure The Less inputs are connected to 0 except for the least signi cant bit which is connected to the Set output of the most signi cant bit If the ALU performs a 7 b and we select the input 3 in the multiplexor in Figure 3510 then Result 0 i i i 001 if a lt b and Result 0 i i i 000 otherwrse To further tailor the ALU to the MIPS instruction set we must support condie tional branch instructions These instructions branch either if two registers are equal or if they are unequal The easiest way to test equality with the ALU is to subtract b from a and then test to see if the result is 0 since aib0gtab 35 Constructing a Basic Arithmetic Logic Unit 335 Thus if we add hardware to test if the result is 0 we can test for equality The simplest way is to OR all the outputs together and then send that signal through an inverter Zero Result31 Result30 Result2 Resultl ResultO Figure B512 shows the reVised 327bit ALU We can think of the combination of the 17bit Ainvert line the 17bit Binvert line and the 27bit Operation lines as 47 bit control lines for the ALU telling it to perform add subtract AND OR or set on less than Figure B513 shows the ALU control lines and the corresponding ALU operation Bnegate Operation Ainvert r r W W a0 4 Carryln R no b0 4 ALUO 55 Less T CarryOut l l l l a1 a Carryln R m b1 4 ALU1 55 I 0 4r Less CarryOut 5 zero a2 4 Carryln b2 ALU2 Result2 0 4 Less CarryOut Carryln a31 Carryln Resunm b31 ALU31 0 Less Over ow FIGURE B512 The final 32bit ALU This adds a Zero detector to Figure B511 Appendix B The Basics of Logic Design subtract 0n FIGURE 3513 The values of the three ALU control lines Bnegate and Operation and the corresponding ALU operations ALU operation Zero Result Over ow CarryOut FIGURE 3514 The symbol commonly used to represent an ALU as shown in Figure 3512 This symbol is also used to represent an adder so it is normally labeled either with ALU or Adder Finally now that we have seen what is inside a 32bit ALU we will use the unie versal symbol for a complete ALU as shown in Figure 3514 Defining the MIPS ALU in Verilog Figure B515 shows how a combinational MIPS ALU might be speci ed in Vere ilog such a speci cation would probably be compiled using a standard parts library that provided an adder which could be instantiated For completeness we show the ALU control for MIPS in Figure 3516 which we will use later when we build a Verilog version of the MIPS datapath in Chapter 5 The next question is How quickly can this ALU add two 327bit operands We can determine the a and b inputs but the CarryIn input depends on the operation in the adjacent 17bit adder Ifwe trace all the way through the chain of dependene 35 Constructing a Basic Arithmetic Logic Unit B37 module MlPSALU ALUCtl A B ALUOut Zero input 301 ALUCtl input 3101 AB output reg 310 ALUOut output Zero assign Zero ALUOut Zero is true it ALUOut is 0 always ALUCtl A B begin reevaluate it these Cnange case ALUctl o ALUOut lt A amp B 1 AtUOut lt A l B 2 AtUOut lt A B 6 ALUOut lt A B 7 B ALUOut lt A lt 7 1 O 12 ALUOut lt A l B result is nor default ALUOut lt O endcase end endmodule FIGURE 3515 A Verilog behavioral definition of a MIPS ALU module ALUControl ALUOp FuncCode ALUCtl input 101 ALUOp input 501 FunCCode output 301 reg ALUCtl always case FuncCode 32 ALUOplt2 add 34 ALUOplt6 subtract 36 ALUOPltO and 37 ALUOplt1 or 39 ALUOplt12 nor 42 ALUOplt7 sit default ALUOplt15 should not happen endcase endmodule FIGURE 3516 The MIPS ALU control a simple piece of combinational control logic cies we connect the most signi cant bit to the least signi cant bit so the most signi cant bit of the sum must wait for the sequential evaluation of all 32 17bit adders This sequential chain reaction is too slow to be used in tirneecritical hard ware The next section explores how to speed up addition This topic is not crucial to understanding the rest of the appendix and may be skipped Appendix B The Basics of Logic Design Check Yourself Suppose you wanted to add the operation NOT a AND b called NAND How could the ALU change to support it No change You can calculate NAND quickly using the current ALU since w b a b and we already have NOT a NOT b and OR N You must expand the big multiplexor to add another input and then add new logic to calculate NAND Faster Addition Carry Lookahead The key to speeding up addition is determining the carry in to the higheorder bits sooner There are a variety of schemes to anticipate the carry so that the worst case scenario is a function of the logz of the number of bits in the adder These anticipatory signals are faster because they go through fewer gates in sequence but it takes many more gates to anticipate the proper carry A key to understanding fast carry schemes is to remember that unlike software hardware executes in parallel whenever inputs change Fast Carry Using Infinite Hardware As we mentioned earlier any equation can be represented in two levels of logic Since the only external inputs are the two operands and the Carryln to the least signi cant bit of the adder in theory we could calculate the CarryIn values to all the remaining bits of the adder in just two levels of lo ic For example the Carryln for bit 2 of the adder is exactly the CarryOut ofbit 1 so the formula is CarryInZ b1 Carry1n1a1 Carrylnl a1 b1 Similarly Carrylnl is de ned as CarryInl b0 CarryInO a0 Carryan a0 b0 Using the shorter and more traditional abbreviation of ci for Carrylni we can rewrite the formulas as c2 b1c1a1c1a1b1 c1b0c0a0c0a0 b0 Substituting the de nition of c1 for the rst equation results in this formula c2 a1a0b0a1a0c0a1b0c0 b1a0b0b1ra0c0b1b0c0a1b1 36 Faster Addition Carry Lookahead You can imagine how the equation expands as we get to higher bits in the adder it grows rapidly with the number of bits This complexity is re ected in the cost of the hardware for fast carry making this simple scheme prohibitively expensive for wide adders Fast Carry Using the First Level of Abstraction Propagate and Generate Most fast carry schemes limit the complexity of the equations to simplify the hardware while still making substantial speed improvements over ripple carry One such scheme is a carry lookahead adder In Chapter 1 we said computer sys7 tems cope with complexity by using levels of abstraction A carryilookahead adder relies on levels of abstraction in its implementation Let s factor our original equation as a rst step ci1 bi ciai ciai bi ai bi ai bi ci If we were to rewrite the equation for c2 using this formula we would see some repeated patterns c2 a1 b1a1b1 a0 r b0 a0 b0 c0 Note the repeated appearance of ai bi and ai bi in the formula above These two important factors are traditionally called generate gi and propagate pi gi ai bi pi ai bi Using them to de ne ci1 we get ci1 gipi ci To see where the signals get their names suppose gi is 1 Then ci1 gipici1pici 1 That is the addergerierates a CarryOut ci1 independent of the value of Carryln ci Now suppose that gi is 0 and pi is 1 Then ci1 gipici 01ci ci That is the adder propagates Carryln to a CarryOut Putting the two together Carrylni1 is a 1 if either gi is 1 or both pi is 1 and Carrylni is 1 As an analogy imagine a row of dominoes set on edge The end domino can be tipped over by pushing one far away provided there are no gaps between the two Similarly a carry out can be made true by a generate far away provided all the propagates between them are true Appendix B The Basics of Logic Design Relying on the de nitions of propagate and generate as our rst level of abstraction we can express the Carryln signals more economically Let s show it for 4 bits c1 g0p0 c0 02 g1p1g0p1p0c0 c3 g2p2g1p2p1g0p2p1p0c0 c4 g3p3g2p339p2g1p3p239p1g0 p3p2p1p0c0 These equations just represent common sense Carrylni is a 1 if some earlier adder generates a carry and all intermediary adders propagate a carry Figure B61 uses plumbing to try to explain carry lookahead Even this simpli ed form leads to large equations and hence considerable logic even for a 16bit adder Let s try moving to two levels of abstraction Fast Carry Using the Second Level of Abstraction First we consider this 47bit adder with its carryelookahead logic as a single build ing block If we connect them in ripple carry fashion to form a 16bit adder the add will be faster than the original with a little more hardware To go faster we ll need carry lookahead at a higher level To perform carry lookahead for 47bit adders we need propagate and generate signals at this higher level Here they are for the four 47bit adder blocks P0 p3p2p1p0 P1 p7p6p5p4 P2 p11 p10p9p8 P3 p15 p14 p13 p12 That is the super propagate signal for the 47bit abstraction Pi is true only if each of the bits in the group will propagate a carry For the super generate signal Gi we care only if there is a carry out of the most signi cant bit of the 47bit group This obviously occurs if generate is true for that most signi cant bit it also occurs if an earlier generate is true and all the intermediate propagates including that ofthe most signi cant bit are also true G0 g3p3g2p3p239g1p3p2p1g0 G1 g7p739g6p7p6g5p7p6p539g4 G2 g11 p11 g10 p11 p10 g9 p11 p10 p9 g8 G3 g15p15 g14p15 p14 g13p15 p14 p13 g12 36 Faster Addition Carry Lookahead FIGURE 361 A plumbing analogy for carry lookahead for 1 bit 2 bits and 4 bits using water pipes and valves The wrenches are turned to open and close Valvesi Water is shown in color The output of the pipe Ci1 will be full if either the nearest generate Value gi is turned on or if the i propagate Value pi is on andthere is water further upstream either from an earlier generate or propagate with water behind it CarryIn c0 can result in a carry out without the help of any generates but with the help of all propagatesi Appendix B The Basics of Logic Design Figure B62 updates our plumbing analogy to show P0 and G0 FIGURE 362 A plumbing analogy for the nextlevel carrylookahead signals P0 and GO P0 is open only if allfour propagates pi are open while water ows in G0 only if at least one generate gi is open and all the propagates downstream from that generate are open 36 Faster Addition Carry Lookahead Then the equations at this higher level of abstraction for the carry in for each 47 bit group of the 16bit adder C1 C2 C3 C4 in Figure 1363 are very similar to the carry out equations for each bit of the 47bit adder c1 c2 c3 c4 on page B740 C1 C2 C3 C4 G0P0 G1 P1 G2P2 G3P3 c0 G0 P1 r P0 c0 G1P2rP1G0P2 P1P0 c0 G2P3P2 G1P3P2 P1GO P3P2P1rP0c0 Figure B63 shows 47bit adders connected with such a carryilookahead unit The exercises explore the speed differences between these carry schemes different notations for multibit propagate and generate signals and the design of a 64bit adder Both Levels of the Propagate and Generate Determine the gi pi Pi and Gi Values of these two 167bit numbers 0001 1010 0011 0011tWO 1110 0101 11101011tWO Also what is CarryOut15 C4 a b Aligning the bits makes it easy to see the Values of generate gi ai bi and propagate pi ai bi 0001 1010 0011 0011 1110 010111101011 0000 0000 0010 0011 1111111111111011 where the bits are numbered 15 to 0 from left to right Next the super propagates P3 P2 P1 P0 are simply the AND of the lowerelevel propagates a b gi pf P31 P21 P11 P01 1 1 1 1 1 1 11 11 11 Appendix B The Basics of Logic Design Carryln 30 Carryln b0 4 Result0 3 a1 4r 1 4 32 4 ALUO b2 po 4 pi a3 G0 4 gi b3 4 O I oi 1 CarryIookahead unlt a4 Carryln 34 a Result4 7 35 4r b5 4 aB 4 ALU1 35 a P1 4 pi 1 a7 a G1 4 9i 1 b7 4 CZ Cl 2 384 Carryln 33 a Results 11 39 4 b9 4 a10 4 ALU2 b10 4 p2 4 pi 2 a11 4 G2 4 gi 2 311 4 C3 Cl 3 3124 Carryln b124gt Result12 15 6134 3134 314 a ALU3 3144 p3 4 pi 3 a15 4 i 3 1215 G3 9 c4 Cl 4 CarryOut FIGURE 363 Four 4bit ALUs using carry lookahead to form a 16bit adder Note that the carries come from the carryrlookahead unit not from the 47bit ALUs 36 Faster Addition Carry Lookahead G0 G1 G2 G3 The super generates are more complex so use the following equations g3p339g2p339p2 g1p3 p239p139gO 01gt01 0 11 0 1 100000 g7p739g6p739p639g5p7 p6 p5g4 01gt01 1 11 1 1 000101 g11p11g10p11gtp10g9p11gtp10p9g8 01gt01 1 01 1 1 000000 g15p15 g14p15gtp14 g13p15 p14 p13 g12 01gt01 1 01 1 1 000000 Finally CarryOut15 is C4 G3P3 G2P3 P2 G1P3 P2 P1 G0 P3gtP2gtP1gtP0gtc0 01011111101110 0 00l00l Hence there is a carry out when adding these two 167bit numbers The reason carry lookahead can make carries faster is that all logic begins evaluate ing the moment the clock cycle begins and the result will not change once the output of each gate stops changing By taking a shortcut of going through fewer gates to send the carry in signal the output of the gates will stop changing sooner and hence the time for the adder can be less To appreciate the importance of carry lookahead we need to calculate the rela tive performance between it and ripple carry adders Speed of Ripple Carry versus Carry Lookahead One simple way to model time for logic is to assume each AND or OR gate takes the same time for a signal to pass through it Time is estimated by sime ply counting the number of gates along the path through a piece of logic Compare the number ofgate delays for paths of two 167bit adders one using ripple carry and one using twoilevel carry lookahead Figure B55 on page B728 shows that the carry out signal takes two gate def lays per bit Then the number of gate delays between a carry in to the least signi cant bit and the carry out of the most signi cant is 16 X 2 32 Appendix B The Basics of Logic Design Check Yourself For carry lookahead the carry out of the most signi cant bit is just C4 de ned in the example It takes two levels oflogic to specify C4 in terms ofPi and Gi the OR of several AND terms Pi is speci ed in one level of logic AND using pi and Gi is speci ed in two levels using pi and gi so the worst case for this next level of abstraction is two levels of logic pi and gi are each one level of logic de ned in terms of ai and bi lfwe assume one gate delay for each level of logic in these equations the worst case is 2 2 1 5 gate delays Hence for the path from carry in to carry out the 16bit addition by a carryelookahead adder is six times faster using this very simple estimate of hardware speed Summary Carry lookahead offers a faster path than waiting for the carries to ripple through all 32 17bit adders This faster path is paved by two signals generate and propae gate The former creates a carry regardless of the carry input and the other passes a carry along Carry lookahead also gives another example of how abstraction is important in computer design to cope with complexity Using the simple estimate of hardware speed above with gate delays what is the relative performance of a ripple carry 87bit add versus a 64bit add using carrye lookahead logic 1 A 64ebit carryelookahead adder is three times faster 87bit adds are 16 gate delays and 64ebit adds are 7 gate delays 2 They are about the same speed since 64ebit adds need more levels of logic in the 16bit adder 3 87bit adds are faster than 64 bits even with carry lookahead Elaboration We have now accounted for all but one of the arithmetic and logical operations forthe core MIPS instruction set the ALU in Figure 8514 omits support of shift instructions It would be possible to widen the ALU multiplexor to include a left shift by 1 bit or a right shift by 1 bit But hardware designers have created a circuit called a barrel shifter which can shift from 1 to 31 bits in no more time than it takes to add two 32bit numbers so shifting is normally done outside the ALU Elaboration The logic equation for the Sum output of the full adder on page 828 can be expressed more simply by using a more powerful gate than AND and OR An exclusive OR gate is true if the two operands disagree that is xeyjlandxyjo B1 clocks In some technologies exclusive OR is more efficient than two levels of AND and OR gates Using the symbol 9 to represent exclusive OR here is the new equation Sum a e be Carryln Also we have drawn the ALU the traditional way using gates Computers are designed today in CMOS transistors which are basically switches CMOS ALU and bar rel shifters take advantage of these switches and have many fewer multiplexors than shown in our designs but the design principles are similar Elaboration Using lowercase and uppercase to distinguish the hierarchy of gener ate and propagate symbols breaks down when you alternate notation that scales is g J and p forthe generate and propagate signals for bits i to j Thus g1 1 is generate for bit 1 g4 1 is for bits 4 to 1 and gm 1 is for bits 16 to 1 Clocks Before we discuss memory elements and sequential logic it is useful to discuss brie y the topic of clocks This short section introduces the topic and is similar to the discussion found in Section 52 More details on clocking and timing methode ologies are presented in Section B11 Clocks are needed in sequential logic to decide when an element that contains state should be updated A clock is simply a freeerunning signal with a xed cycle time the clock frequency is simply the inverse of the cycle time As shown in Figure 371 the clock cycle time or clock period is divided into two portions when the clock is high and when the clock is low In this text we use only edge triggered clocking This means that all state changes occur on a clock edge We use an edge triggered methodology because it is simpler to explain Depending on the tech nology it may or may not be the best choice for a clocking methodology Falling edge JAG Clock period Rising edge FIGURE 311 A clock signal oscillates bedween high and low values The clock period is the time for one full cycle In an edgertriggered design either the rising or falling edge of the clock is active and causes state to be changed edgeetriggered clocking A clocking scheme in which all state changes occur on a clock edge clocking methodology The approach used to determine when data is valid and stable rel ative to the clock Appendix B The Basics of Logic Design state element A memory element synchronous system A meme ory system that employs clocks and where data signals are read only when the clock indicates that the signal values are stable In an edgeetriggered methodology either the rising edge or the falling edge of the clock is active and causes state changes to occur As we will see in the next sec tion the state elements in an edgeetriggered design are implemented so that the contents of the state elements only change on the active clock edge The choice of which edge is active is in uenced by the implementation technology and does not affect the concepts involved in designing the logic The clock edge acts as a sampling signal causing the value of the data input to a state element to be sampled and stored in the state element Using an edgeetrigger means that the sampling process is essentially instantaneous eliminating prob lems that could occur if signals were sampled at slightly different times The major constraint in a clocked system also called a synchronous system is that the signals that are written into state elements must be valid when the active clock edge occurs A signal is valid if it is stable ie not changing and the value will not change again until the inputs change Since combinational circuits cannot have feedback if the inputs to a combinational logic unit are not changed the outputs will eventually become valid Figure B72 shows the relationship among the state elements and the combinae tional logic blocks in a synchronous sequential logic design The state elements whose outputs change only after the clock edge provide valid inputs to the come binational logic block To ensure that the values written into the state elements on the active clock edge are valid the clock must have a long enough period so that all the signals in the combinational logic block stabilize then the clock edge same ples those value for storage in the state elements This constraint sets a lower bound on the length of the clock period which must be long enough for all state elements inputs to be valid In the rest of this appendix as well as in Chapters 5 and 6 we usually omit the clock signal since we are assuming that all state elements are updated on the same clock edge Some state elements will be written on every clock edge while others will be written only under certain conditions such as a register being updated In such cases we will have an explicit write signal for that state element The write signal must still be gated with the clock so that the update occurs only on the clock edge if the write signal is active We will see how this is done and used in the next section Combinational logic Clock cycle J FIGURE 312 The inputs to a combinational logic block come from a state element and the outputs are written into a state element The clock edge determines when the contents of the state elements are updated 38 Memory Elements Flipdflops Latches and Registers Combinational logic FIGURE 313 An edgetriggered methodology allows a state element to be read and written in the same clock cycle without creating a race that could lead to undetermined data values Of course the clock cycle must still be long enough so that the input values are stable when the active clock edge occurs One other advantage of an edgeetriggered methodology is that it is possible to have a state element that is used as both an input and output to the same combie national logic block as shown in Figure 373 In practice care must be taken to prevent races in such situations and to ensure that the clock period is long enough this topic is discussed further in Section B11 Now that we have discussed how clocking is used to update state elements we can discuss how to construct the state elements Elaboration Occasionally designers find it useful to have a small number of state elements that change on the opposite clock edge from the majority of the state ele ments Doing so requires extreme care because such an approach has effects on both the inputs and the outputs of the state element Why then would designers ever do this Consider the case where the amount of combinational logic before and after a state element is small enough so that each could operate in onehalf clock cycle rather than the more usual full clock cycle Then the state element can be written on the clock edge corresponding to a half clock cycle since the inputs and outputs will both be usable after onehalf clock cycle One common place where this technique is used is in register files where simply reading or writing the register file can often be done in half the normal clock cycle Chapter 6 makes use of this idea to reduce the pipelining overhead Memory Elements Flip ops Latches and Registers 38 In this section and the next we discuss the basic principles behind memory ele ments starting with ipe ops and latches moving on to register les and nally to memories All memory elements store state the output from any memory ele ment depends both on the inputs and on the value that has been stored inside the memory element Thus all logic blocks containing a memory element contain state and are sequential register le A state element that consists of a set of registers that can be read and written by supplying a register number to be accessed Appendix B The Basics of Logic Design ipe op A memory element for which the output is equal to the value ofthe stored state inside the element and for which the internal state is changed only on a clock edge latch A memory element in which the output is equal to the value of the stored state inside the element and the state is changed whenever the appropri ate inputs change and the clock is asserted 6 S FIGURE 381 A pair of cross coupled NOR gate can store an internal value The value storfd on the output Q is recycled by inverting it to obtain Q and then inverting Q to obtain Q If either R or Q are asserted Q will be deasserted and vice versa The simplest type of memory elements are unclocked that is they do not have any clock input Although we only use clocked memory elements in this text an unclocked latch is the simplest memory element so let s look at this circuit rst Figure B81 shows an S R latch setereset latch built from a pair of NOR gates OR gates with inverted outputs The outputs Q and Q represent the value of the stored state and its complement When neither 5 nor R are asserted the cross coupled NOR gates act as inverters and store the previous values of Q and Q For example i the output Q is true then the bottom inverter produces a false output which is Q which becomes the input to the top inverter which produces a true output which is Q and so on If S is asserted then the output Q will be asserted and Q will be deasserted while ifR is asserted then the output Q will be asserted and Q will be deasserted When 5 and R are both deasserted the last values of Q and Q will continue to be stored in the crossecoupled structure Asserting S and R simultaneously can lead to incorrect operation depending on how 5 and R are deasserted the latch may oscillate or become metastable this is described in more detail in Section B11 This crossecoupled structure is the basis for more complex memory elements that allow us to store data signals These elements contain additional gates used to store signal values and to cause the state to be updated only in conjunction with a clock The next section shows how these elements are built FlipFlops and Latches Flip ops and latches are the simplest memory elements In both ipe ops and latches the output is equal to the value of the stored state inside the element Fur thermore unlike the SR latch described above all the latches and ipe ops we will use from this point on are clocked which means they have a clock input and the change of state is triggered by that clock The difference between a ipe op and a latch is the point at which the clock causes the state to actually change In a clocked latch the state is changed whenever the appropriate inputs change and 38 Memory Elements Flipdflops Latches and Registers the clock is asserted whereas in a ipe op the state is changed only on a clock edge Since throughout this text we use an edgeetriggered timing methodology where state is only updated on clock edges we need only use ipe ops Flipe ops are often built from latches so we start by describing the operation of a simple clocked latch and then discuss the operation of a ipe op constructed from that latch For computer applications the function of both ipe ops and latches is to store a signal A D latch or D ip op stores the value of its data input signal in the internal memory Although there are many other types of latches and ipe ops the D type is the only basic building block that we will need A D latch has two inputs and two outputs The inputs are the data value to be stored called D and a clock signal called C that indicates when the latch should read the value on the D input and store The outputs are simply the value of the internal state Q and its complement Q When the clock input C is asserted the latch is said to be open and the value of the output Q becomes the value of the input D When the clock input C is deasserted the latch is said to be closed and the value of the out put Q is whatever value was stored the last time the latch was open Figure B82 shows how a D latch can be implemented with two additional gates added to the crossecoupled NOR gates Since when the latch is open the value of Q changes as D changes this structure is sometimes called a transparent latch Figure B83 shows how this D latch works assuming that the output Q is initially false and that D changes rst As mentioned earlier we use ipe ops as the basic building block rather than latches Flipe ops are not transparent their outputs change only on the clock edge A ipe op can be built so that it triggers on either the rising positive or falling negative clock edge for our designs we can use either type Figure B84 shows how a fallingeedge D ipe op is constructed from a pair of D latches In a D ipe op the output is stored when the clock edge occurs Figure B85 shows how this ipe op operates 390 D FIGURE 382 A D latch implemented with NOR gates A NOR gate acts as an inverter if the other input is 0 Thus the crossrcoupled pair of NOR gates acts to store the state value unless the clock input C is asserted in which case the value of input D replaces the value of Q and is stored The value of input D must be stable when the clock signal C changes from asserted to deasserted D ipe op A ipe op with one data input that stores the value of that input signal in the inter nal memory when the clock edge occurs 352 Appendix B The Basics of Logic Design FIGURE 383 Operation of a D latch assuming the output is initially deasserted When the clock C is asserted the latch is open and the Q output immediately assumes the Value of the D inputr FIGURE 384 A D flipflop with a fallingedge trigger The rst latch called the master is open and follows the input D when the clock input C is asserted When the clock input C falls the rst latch is closed but the second latch called the slave is open and gets its input from the output of the master latchr FIGURE 385 Operation of a D flip op with a fallingedge trigger assuming the output is initially deasserted When the clock input C changes from asserted to deasserted the Q output stores the Value of the D inputr Compare this behavior to that of the clocked D latch shown in Figure Br8r3r In a clocked latch the stored Value and the output Q both change whenever C is high as opposed to only when Ctransitionsr 38 Memory Elements Flipdflops Latches and Registers 353 Here is a Verilog description of a module for a risingeedge D ipe op assuming that C is the clock input and D is the data input module DFFltclockDQQbar input clock D output reg Q Q is a reg since it is assigned in an always block output Qbar assign Qbar N O Qbar is always just the inverse of 0 always posedge clock perform actions whenever the clock rises endmodul e Because the D input is sampled on the clock edge it must be Valid for a period of time immediately before and immediately after the clock edge The minimum time that the input must be Valid before the clock edge is called the set up lime seteup time The minimum the minimum time during which it must be Valid after the clock edge is called the time that the input to a memory hold time Thus the inputs to any ipe op or anything built using ipe ops de dce mu bevahd baf e the must be Valid during a window that begins at time tsetrup before the clock edge and CIOCk edge39 ends at thold after the clock edge as shown in Figure 386 Section 1311 talks about hold time The minimum time clocking and timing constraints including the propagation delay through a ip during WhiCh the input must he op in more detail Valid after the clock edge We can use an array of D ipe ops to build a register that can hold a multibit datum such as a byte or word We used registers throughout our datapaths in Chapters 5 and 6 Regster Files One structure that is central to our datapath is a register le A register le consists of a set of registers that can be read and written by supplying a register number to be accessed A register le can be implemented with a decoder for each read or D Setup time Hold time C l FIGURE 386 Setup and hold time requirements for a D flipflop with a fallingedge trig ger The input must be stable a period of time before the clock edge as well as after the clock edge The L I m t u L i l minimum us L d 39 quot the setup timewhile theminimum time the signal must be stable after clock is called the hold time Failure to meet these minimum re uire ments can result in a situation where the output of the ipe op may not even be predicmble as described in Section B11 Hold times are usually either 0 or Very small and thus not a cause of worry Appendix B The Basics of Logic Design write port and an array of registers built from D ipe ops Because reading a reg ister does not change any state we need only supply a register number as an input and the only output will be the data contained in that register For writing a regise ter we will need three inputs a register number the data to write and a clock that controls the writing into the register In Chapters 5 and 6 we used a register le that has two read ports and one write port This register le is drawn as shown in Figure 387 The read ports can be implemented with a pair of multiplexors each of which is as wide as the number of bits in each register of the register le Figure B88 shows the implementation of two register read ports for a 327bitewide regise ter le Implementing the write port is slightly more complex since we can only change the contents of the designated register We can do this by using a decoder to genere ate a signal that can be used to determine which register to write Figure B89 shows how to implement the write port for a register le It is important to remember that the ipe op changes state only on the clock edge In Chapters 5 and 6 we hooked up write signals for the register le explicitly and assumed the clock shown in Figure B89 is attached implicitly What happens if the same register is read and written during a clock cycle Because the write of the register le occurs on the clock edge the register will be Valid during the time it is read as we saw earlier in Figure B72 The Value returned will be the Value written in an earlier clock cycle If we want a read to return the Value currently being written additional logic in the register le or out side ofit is needed Chapter 6 makes extensive use of such logic Read register 4 Read Read register data 1 4 number 2 Register le Write Read 4 4 register data 2 Write data Write FIGURE 381 A register file with two read ports and one write port has five inputs and two outputs The control input Write is shown in color 38 Memory Elements Flipdflops Latches and Registers Read register Register 0 Register 1 Read data 1 Read register Read data 2 FIGURE 388 The implementation of two read ports for a register file with n registers can be done with a pair of ntod multiplexors each 32 bits wide The register read number sigr nal is used as the multiplexor selector signal Figure 1389 shows how the write port is implemented Specifying Sequential Logc in Verilog To specify sequential logic in Verilog we must understand how to generate a clock how to describe when a Value is written into a register and how to specify sequential control Let us start by specifying a clock A clock is not a prede ned object in Verilog instead we generate a clock by using the Verilog notation n before a statement this causes a delay of n simulation time steps before the execue tion of the statement In most Verilog simulators it is also possible to generate a clock as an external input allowing the user to specify at simulation time the number of clock cycles to run a simulation or The code in Figure 13810 implements a simple clock that is high or low for one simulation unit and then switches state We use the delay capability and blocking assignment to implement the cloc Appendix B The Basics of Logic Design Write Register 0 Register number Register 1 Register n 2 Register n 1 Register data FIGURE 389 The write port for a register file is implemented with a decoder that is used with the write signal to generate the c input to the registers All three inputs the regisr ter number the data and the write signal will have setup and holdrtime constraints that ensure that the correct data is written into the register ler reg clock Clock is a register always 1 clock 1 1 clock O FIGURE 3810 A specification of a clock Next we must be able to specify the operation of an edgeetriggered register In Verilog this is done by using the sensitivity list on an always block and specifying as a trigger either the positive or negative edge of a binary variable with the notae tion p O s e d g e or n e g e d g e respectively Hence the following Verilog code causes register A to be written with the value b at the positive edge cloc reg 3101 A Wire 3101 b always posedge clock A lt b 39 Memory Elements SRAMs and DRAMs 357 modu1e registerti1e Read1ReadZNriteRegNriteDataRegNriteData1Data2cock input 50 Read1Read2NriteReg tne registers numbers to read or write input 310 NriteData data to write input RegNrite Tne write contro c10ck tne c10ck to trigger write output 310 Data1 Data2 tne register vaues read reg 310 RF 310 32 registers eacn 32 bits 1ong assign Data1 RRead1 assign DataZ RRead2 aways begin write tne register witn new vaue it Regwrite is nign posedge dock it Regwrite RNriteReg lt NriteData end endmodu1e FIGURE 381 A MIPS regster file written in behavioral Verilog This register le writes on the rising clock edge Throughout this chapter and the Verilog sections of Chapters 5 and 6 we will assume a positive edgeetriggered design Figure B811 shows a Verilog speci cae tion of a MIPS register le that assumes two reads and one write with only the write being clocked In the Verilog for the register le in Figure 3811 the output ports corresponding Check to the registers being read are assigned using a continuous assignment but the Yoursef register being written is assigned in an aways block Which of the following is the reason a There is no special reason It was simply convenient b Because Data1 and Data2 are output ports and WriteData is an input port c Because reading is a combinational event while writing is a sequential event Memory Elements SRAMs and DRAMs static random access mem 01 ySRAM A memory where data is stored statically as in ipe ops rather than dynamie Registers and register les prov1de the basic building blocks for small memories any as in DRAM SRAMS are but larger amounts of memory are built using either SRAMs static random faster than DRAMs but less access memories or DRAMS dynamic random access memories We rst dis dense and more expensive cuss SRAMs which are somewhat simpler and then turn to DRAMs Per bin Appendix B The Basics of Logic Design SRAMs SRAMs are simply integrated circuits that are memory arrays with usually a sin gle access port that can provide either a read or a write SRAMs have a xed access time to any datum though the read and write access characteristics often differ A SRAM chip has a speci c con guration in terms of the number of addressable locations as well as the width of each addressable location For example a 4M x 8 SRAM provides 4M entries each of which is 8 bits wide Thus it will have 22 address lines since 2M 222 an 87bit data output line and an 87bit single data input line As with ROMs the number of addressable locations is often called the height with the number of bits per unit called the width For a variety of technical reasons the newest and fastest SRAMs are typically available in narrow con gurations X 1 and X 4 Figure B91 shows the input and output signals for a 2M x 16 SRAM To initiate a read or write access the Chip select signal must be made active For reads we must also activate the Output enable signal that controls whether or not the datum selected by the address is actually driven on the pins The Output enable is useful for connecting multiple memories to a singleeoutput bus and using Output enable to determine which memory drives the bus The SRAM read access time is usually speci ed as the delay from the time that Output enable is true and the address lines are valid until the time that the data is on the output lines Typical read access times for SRAMs in 2004 vary from about 274 ns for the fastest CMOS parts which tend to be somewhat smaller and narrower to 8720 ns for the typical largest parts which in 2004 have over 32 million bits of data The demand for lowepower SRAMs for consumer products and digital appliances has grown greatly in the past ve years these SRAMs have much lower standby and access power but usually are 5710 times slower Most recently synchronous SRAMsisimilar to the synchronous DRAMs which we discuss in the next sec tionihave also been developed Address A o Chip select 4 15 Output enable 4 23mm gt Dout15 0 Write enable 4 Din15 0 3 FIGURE 391 A 32K x 8 SRAM showing the fifteen address lines 32K 215 and eight data inputs the three control lines and the eiglt data outputs 39 Memory Elements SRAMs and DRAMs For writes we must supply the data to be written and the address as well as sig nals to cause the write to occur When both the Write enable and Chip select are true the data on the data input lines is written into the cell speci ed by the address There are seteupetime and holdetime requirements for the address and data lines just as there were for D ipe ops and latches In addition the Write enable signal is not a clock edge but a pulse with a minimum width requirement The time to complete a write is speci ed by the combination of the setup times the hold times and the Write enable pulse width Large SRAMs cannot be built in the same way we build a register le because unlike a register le where a 327toel multiplexor might be practical the 64Ketoel multiplexor that would be needed for a 64K X 1 SRAM is totally impractical Rather than use a giant multiplexor large memories are implemented with a shared output line called a bit line which multiple memory cells in the memory array can assert To allow multiple sources to drive a single line a three state bu er or tristate buffer is used A threeestate buffer has two inputsia data signal and an Output enableiand a single output which is in one of three states asserted deasserted or high impedance The output of a tristate buffer is equal to the data input signal either asserted or deasserted if the Output enable is asserted and is otherwise in a high impedance state that allows another threeestate buffer whose Output enable is asserted to determine the value of a shared output Figure B92 shows a set of threeestate buffers wired to form a multiplexor with a decoded input It is critical that the Output enable of at most one of the three state buffers be asserted otherwise the threeestate buffers may try to set the out put line differently By using threeestate buffers in the individual cells of the SRAM each cell that corresponds to a particular output can share the same out put line The use of a set of distributed threeestate buffers is a more ef cient implementation than a large centralized multiplexor The threeestate buffers are incorporated into the ipe ops that form the basic cells of the SRAM Figure B93 shows how a small 4 X 2 SRAM might be built using D latches with an input called Enable that controls the threeestate output The design in Figure B93 eliminates the need for an enormous multiplexor however it still requires a very large decoder and a correspondingly large number of word lines For example in a 4M x 8 SRAM we would need a Zetoe4M decoder and 4M word lines which are the lines used to enable the individual ipe ops To circumvent this problem large memories are organized as rectangular arrays and use a twoestep decoding process Figure B94 shows how a 4M x 8 SRAM might be organized internally using a twoestep decode As we will see the two level decoding process is quite important in understanding how DRAMs operate Recently we have seen the development of both synchronous SRAMs SSRAMs and synchronous DRAMs SDRAMs The key capability provided by Appendix B The Basics of Logic Design Select 0 Enable Data 0 out Select 1 Enable Data 1 out Select 2 Enable Data 2 out Select 3 Enable Data 3 out FIGURE 392 Four threestate buffers are used to form a multiplexer Only one ofthe four Select inputs can be asserted A threerstate buffer with a deasserted Output enable has a highrimpedance output that allows at L whose Output L 39 quot t quot 39 L ut line synchronous RAMs is the ability to transfer a burst of data from a series of sequene tial addresses within an array or row The burst is de ned by a starting address supplied in the usual fashion and a burst length The speed advantage of synchroe nous RAMs comes from the ability to transfer the bits in the burst without having to specify additional address bits Instead a clock is used to transfer the successive bits in the burst The elimination of the need to specify the address for the transe fers within the burst signi cantly improves the rate for transferring the block of data Because of this capability synchronous SRAMs and DRAMs are rapidly becoming the RAMs of choice for building memory systems in computers We discuss the use of synchronous DRAMs in a memory system in more detail in the next section and in Chapter 7 DRAMs In a static RAM SRAM the value stored in a cell is kept on a pair of inverting gates and as long as power is applied the value can be kept inde nitely In a dynamic RAM DRAM the value kept in a cell is stored as a charge in a capacitor A single transistor is then used to access this stored charge either to read the value or to overwrite the charge stored there Because DRAMs use only a single transise tor per bit of storage they are much denser and cheaper per bit By comparison SRAMs require four to six transistors per bit In DRAMs the charge is stored on a 39 Memory Elements SRAMs and DRAMs 361 Din1 Din1 17 D D 17 D D 7 C latch Q A 7 C latch Q 7 write enable Enable Enable 0 E f 43 2to4 17 D D I7 D D decoder 4 7 C latch Q 7 C latch Q 7439 Enable Enable 1 E f 43 07 D D 07 D D 71 Address 7 C latch Q 7 C latch Q 7439 Enable Enable 2 T E f u7 D D u7 D D 7 C latch Q A 7 C latch Q 7 Enable Enable 3 T E Dout1 DOUtlol FIGURE 393 The basic structure of a 4 2 SRAM consists of a decoder that selects which pair of cells to activate The actir Vated cells use athreerstate output connected to the Vertical bit lines that supply the requested data The address that selects the cell is sent on one of a set of horizontal address lines called the word lines For simplicity the Output enable and Chip select signals have been omitted but they could easily be added with a few AND gatest Page 62 4K X 1024 SRAM 4K X 1024 SRAM 4K X 1024 SRAM 4K X 1024 SRAM 4K X 1024 SRAM 4K X 1024 SRAM 4K X 1024 SRAM 4KX 1024 SRAM 12 4096 Address 21 10 4096 decoder Address 90 Dout7 Dout6 Dout5 Dout4 Dout3 Dout2 Dout1 DoutO FIGURE 394 1ypieal organization of a 4M 8 SRAM as an array of 4K 1024 arrays The rst decoder generates the addresses for eight 4K X 1024 arrays then a set of multiplexers is used to select 1 bit from each 10244bit4wide array This is amuch easier design than a singleelevel decode that would need either an enormous decoder or a gigantic multiplexorr In practice a modern SRAM of this size would probably use an even larger number of blocks each somewhat smallerr 39 Memory Elements SRAMs and DRAMs capacitor so it cannot be kept inde nitely and must periodically be refreshed That is why this memory structure is called dynamic as opposed to the static storage in an SRAM cell To refresh the cell we merely read its contents and write it back The charge can be kept for several milliseconds which might correspond to close to a million clock cycles Today singleechip memory controllers often handle the refresh funce tion independently of the processor If every bit had to be read out of the DRAM and then be written back individually with large DRAMs containing multiple megabytes we would constantly be refreshing the DRAM leaving no time for accessing it Fortunately DRAMs also use a twoelevel decoding structure and this allows us to refresh an entire row which shares a word line with a read cycle fol lowed immediately by a write cycle Typically refresh operations consume 1 to 2 of the active cycles of the DRAM leaving the remaining 98 to 99 of the cycles available for reading and writing data Elaboration How does a DRAM read and write the signal stored in a cell The tran sistor inside the cell is a switch called a pass transistor that allows the value stored on the capacitor to be accessed for either reading or writing Figure 895 shows how the singletransistor cell looks The pass transistor acts like a switch when the signal on the word line is asserted the switch is closed connecting the capacitor to the bit line If the operation is a write then the value to be written is placed on the bit line If the value is a L the capacitor will be charged If the value is a 0 then the capacitor will be discharged Reading is slightly more complex since the DRAM must detect a very small charge stored in the capacitor Before activating the word line for a read the bit line is charged to the voltage that is halfway between the low and high voltage Then by activating the word line the charge on the capacitor is read out onto the bit line This causes the bit line to move slightly toward the high or low direction and this change is detected with a sense amplifier which can detect small changes in voltage DRAMs use a twoelevel decoder consisting of a row access followed by a column access as shown in Figure 1396 The row access chooses one of a number of rows and activates the corresponding word line The contents of all the columns in the active row are then stored in a set of latches The column access then selects the data from the column latches To save pins and reduce the package cost the same address lines are used for both the row and column address a pair of signals called RAS Row Access Strobe and CAS Column Access Strobe are used to signal the DRAM that either a row or column address is being supplied Refresh is per formed by simply reading the columns into the column latches and then writing the same values back Thus an entire row is refreshed in one cycle The twoelevel addressing scheme combined with the internal circuitry make DRAM access times much longer by a factor of 5710 than SRAM access times In 2004 typical 364 Appendix B The Basics of Logic Design Word line Pass transistor Capacitor V FIGURE 395 A singletransistor DRAM cell contains a capacitor that stores the cell contents and a transistor used to access the cel Bit line Row 2048 X 2048 ecoder array 11to 2048 Address10 0 Column latches Dout FIGURE 396 A 4M 1 DRAM is built with a 2048 2048 array The row access uses 11 bits to e ct a row which is then latched in 2048 17bit latchesr A multiplexor chooses the output bit from these 2048 latchesr The RAS and CAS signals control whether the address lines are sent to the row decoder or colr umn multiplexorr DRAM access times range from 45 to 65 ms 256 Mbit DRAMs are in full produce tion and the rst customer samples ofl GB DRAMs became available in the rst quarter of 2004 The much lower cost per bit makes DRAM the choice for main memory while the faster access time makes SRAM the choice for caches 39 Memory Elements SRAMs and DRAMs You might observe that a 64M X 4 DRAM actually accesses 8K bits on every row access and then throws away all but 4 of those during a column access DRAM designers have used the internal structure of the DRAM as a way to provide higher bandwidth out of a DRAM This is done by allowing the column address to change without changing the row address resulting in an access to other bits in the column latches To make this process faster and more precise the address inputs were clocked leading to the dominant form of DRAM in use today syn chronous DRAM or SDRAM Since about 1999 SDRAMs are the memory chip of choice for most cache based main memory systems SDRAMs provide fast access to a series of bits within a row by sequentially transferring all the bits in a burst under the control of a clock signal In 2004 DDRRAMs Double Data Rate RAMs which are called double data rate because they transfer data on both the rising and falling edge of an externally supplied clock are the most heavily used form of SDRAMs As we discuss in Chapter 7 these highespeed transfers can be used to boost the band width available out of main memory to match the needs of the processor and caches Error Correction Because of the potential for data corruption in large memories most computer systems use some sort of errorechecking code to detect possible corruption of data One simple code that is heavily used is a parity code In a parity code the number of Is in a word is counted the word has odd parity if the number of Is is odd and even otherwise When a word is written into memory the parity bit is also written 1 for odd 0 for even Then when the word is read out the parity bit is read and checked If the parity of the memory word and the stored parity bit do not match an error has occurred A 17bit parity scheme can detect at most 1 bit of error in a data item if there are 2 bits of error then a 17bit parity scheme will not detect any errors since the par ity will match the data with two errors Actually a 17bit parity scheme can detect any odd number of errors however the probability of having three errors is much lower than the probability of having two so in practice a 17bit parity code is lime ited to detecting a single bit of error Of course a parity code cannot tell which bit in a data item is in error A 17bit parity scheme is an error detecting code there are also error correcting codes ECC that will detect and allow correction of an error For large main meme ories many systems use a code that allows the detection of up to 2 bits of error and the correction of a single bit of error These codes work by using more bits to encode the data for example the typical codes used for main memories require 7 or 8 bits for every 128 bits of data erroredetecting code A code that enables the detection of an error in data but not the precise location and hence correction of the error Appendix B The Basics of Logic Design Elaboration A 1bit parity code is a distance2 code which means that if we look at the data plus the parity bit no 1bit change is sufficient to generate another legal com bination of the data plus parity For example if we change a bit in the data the parity will be wrong and vice versa Of course if we change 2 bits any 2 data bits or 1 data bit and the parity bit the parity will match the data and the error cannot be detected Hence there is a distance of two between legal combinations of parity and data To detect more than one error or correct an error we need a distance3 code which has the property that any legal combination of the bits in the error correction code and the data have at least 3 bits differing from any other combination Suppose we have such a code and we have one error in the data In that case the code plus data will be 1 bit away from a legal combination and we can correct the data to that legal combina tion If we have two errors we can recognize that there is an error but we cannot cor rect the errors Let s look at an example Here are the data words and a distance3 error correction code for a 4bit data item To see how this works let s choose a data word say 0110 whose error correction code is 011 Here are the four 1bit error possibilities for this data 1110 0010 0100 and 0111 Now look at the data item with the same code 011 which is the entry with the value 0001 If the error correction decoder received one of the four pos sible data words with an error it would have to choose between correcting to 0110 or 0001 While these four words with error have only 1 bit changed from the correct pat tern of 0110 they each have 2 bits that are different from the alternate correction of 0001 Hence the error correction mechanism can easily choose to correct to 0110 since a single error is much higher probability To see that two errors can be detected simply notice that all the combinations with 2 bits changed have a different code The one reuse of the same code is with 3 bits different but if we correct a 2bit error we will correct to the wrong value since the decoder will assume that only a single error has occurred If we want to correct 1bit errors and detect but not erroneously correct 2bit errors we need a distance4 code Although we distinguished between the code and data in our explanation in truth an error correction code treats the combination of code and data as a single word in a larger code 7 bits in this example Thus it deals with errors in the code bits in the same fashion as errors in the data bits While the above example requires n 1 bits for n bits of data the number of bits required grows slowly so that for a distance3 code a 64bit word needs 7 bits and a 310 Finite State Machines 128bit word needs 8 This type of code is called a Hamming code after R Hamming who described a method for creating such codes Finite State Machines As we saw earlier digital logic systems can be classi ed as combinational or sequential Sequential systems contain state stored in memory elements internal to the system Their behavior depends both on the set of inputs supplied and on the contents of the internal memory or state of the system Thus a sequential sys tem cannot be described with a truth table Instead a sequential system is described as a nite state machine or often just state machine A nite state machine has a set of states and two functions called the next state function and the output function The set of states correspond to all the possible values of the internal storage Thus if there are n bits of storage there are 2 states The next state function is a combinational function that given the inputs and the current state determines the next state of the system The output function produces a set of outputs from the current state and the inputs Figure B101 shows this diae grammatically N ext state Nextstate function Current state Inputs Output function 4 OUtpUts FIGURE 3101 A state machine consists of internal storage that contains the state and two combi tional functions the nextstate function and the output function Oftenthe output function is restricted to take only the current state as its input this does not change the capability of a sequential machine but does affect its internals nite state machine A sequential logic function con sisting of a set of inputs and out puts a nextestate function that maps the current state and the inputs to a new state and an output function that maps the current state and possibly the inputs to a set of asserted outputs nextestate function A combi national function that given the inputs and the current state determines the next state of a finite state machine Appendix B The Basics of Logic Design The state machines we discuss here and in Chapters 5 and 6 are synchronous This means that the state changes together with the clock cycle and a new state is computed once every clock Thus the state elements are updated only on the clock edge We use this methodology in this section and throughout Chapters 5 and 6 and we do not usually show the clock explicitly We use state machines throughout Chapters 5 and 6 to control the execution of the processor and the actions of the datapath To illustrate how a nite state machine operates and is designed let s look at a simple and classic example controlling a traf c light Chapters 5 and 6 contain more detailed examples of using nite state machines to control processor execue tion When a nite state machine is used as a controller the output function is often restricted to depend on just the current state Such a nite state machine is called a M0072 machine This is the type of nite state machine we use throughout this book If the output function can depend on both the current state and the current input the machine is called a Mealy machine These two machines are equivalent in their capabilities and one can be turned into the other mechanically The basic advantage of a Moore machine is that it can be faster while a Mealy machine may be smaller since it may need fewer states than a Moore machine In Chapter 5 we discuss the differences in more detail and show a Verilog version of nite state control using a Mealy machine Our example concerns the control ofa traf c light at an intersection of a north south route and an eastewest route For simplicity we will consider only the green and red lights adding the yellow light is left for an exercise We want the lights to cycle no faster than 30 seconds in each direction so we will use a 0033 Hz clock so that the machine cycles between states at no faster than once every 30 seconds There are two output signals l NSlite When this signal is asserted the light on the northesouth road is green when this signal is deasserted the light on the northesouth road is red I EWZite When this signal is asserted the light on the eastewest road is green when this signal is deasserted the light on the eastewest road is red In addition there are two inputs NScar and EWcar l NSCm Indicates that a car is over the detector placed in the roadbed in front of the light on the northesouth road going north or south I EWCM Indicates that a car is over the detector placed in the roadbed in front of the light on the eastewest road going east or west The traf c light should change from one direction to the other only if a car is wait ing to go in the other direction otherwise the light should continue to show green in the same direction as the last car that crossed the intersection 310 Finite State Machines To implement this simple traf c light we need two states I NSgreen The traf c light is green in the northesouth direction I EWgreen The traf c light is green in the eastewest direction We also need to create the nextestate function which can be speci ed with a table 0 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 Notice that we didn t specify in the algorithm what happens when a car approaches from both directions In this case the nextestate function given above changes the state to ensure that a steady stream of cars from one direction cannot lock out a car in the other direction The nite state machine is completed by specifying the output function w m NSgreen 1 0 l EWgreen l o l 1 l Before we examine how to implement this nite state machine let s look at a graphical representation which is often used for nite state machines In this rep resentation nodes are used to indicate states Inside the node we place a list of the outputs that are active for that state Directed arcs are used to show the nextestate function with labels on the arcs specifying the input condition as logic functions Figure B102 shows the graphical representation for this nite state machine A nite state machine can be implemented with a register to hold the current state and a block of combinational logic that computes the nextestate function and the output function Figure B103 shows how a nite state machine with 4 bits of state and thus up to 16 states might look To implement the nite state machine in this way we must rst assign state numbers to the states This process is called state assignment For example we could assign NSgreen to state 0 and Appendix B The Basics of Logic Design EWcar NSgreen EWgreen EWca NScar FIGURE 3102 The graphical representation of the twostate traffic light controller We simpli ed the logic functions on the sta e transitions For exam le the transition from NSgreen to EWgreen in the nextrstate table is NScar EWcar NScar EWcar which is equivalent to EWcar EWgreen to state 1 The state register would contain a single bit The nextestate function would be given as NeXtState CurrentState EWcar CurrentState NScar where CurrentState is the contents of the state register 0 or 1 and NeXtState is the output of the nextestate function that will be written into the state register at the end of the clock cycle The output function is also simple NSlite CurrentState EWlite CurrentState The combinational logic block is often implemented using structured logic such as a PLA A PLA can be constructed automatically from the nextestate and output function tables In fact there are computereaided design CAD programs that take either a graphical or textual representation of a nite state machine and pro duce an optimized implementation automatically In Chapters 5 and 6 nite state machines were used to control processor execution AppendiXC discusses the detailed implementation of these controllers with both PLAs and ROMs To show how we might write the control in Verilog Figure B104 shows a Vere ilog version designed for synthesis Note that for this simple control function a Mealy machine is not useful but this style of speci cation is used in Chapter 5 to implement a control function that is a Mealy machine and has fewer states than the Moore machine controller 310 Finite State Machines B71 Outputs Combinational logic Next state State register Inputs FIGURE 3103 A finite state machine is implemented with a state register that holds the current state and a combinational logic block to compute the next state and output functions The latter two functions are often split apart and implemented with two separate blocks of logicwhich may require fewer gatesr module lrattiCLite ENCarNSCarENLiteNSLiteClock input ENCar NSCarClock output ENLiteNSLite reg state initial state0 set initial state tollowing two assignments set the output which is based only on the state variable assign NSLite assign ENLite state NSLite on it state 0 state ENLite on it state 1 always posedge Clock all state updates on a positive Clock edge case state 0 state 7 ENCar Cnange state only it ENCar 1 state 7 NSCar Cnange state only it NSCar endcase endmodule FIGURE 3104 A Verilog version of the traffic light controller Appendix B The Basics of Logic Design Check Yourself What is the smallest number of states in a Moore machine for which a Mealy machine could have fewer states a Two since there could be a oneestate Mealy machine that might do the same thing b Three since there could be a simple Moore machine that went to one of two different states and always returned to the original state after that For such a simple machine a twoestate Mealy machine is possible c You need at least four states to exploit the adVantages of a Mealy machine over a Moore machine Timing Methodologies Throughout this appendix and in the rest of the text we use an edgeetriggered timing methodology This timing methodology has the adVantage that it is sime pler to explain and understand than a leveletriggered methodology In this sec tion we explain this timing methodology in a little more detail and also introduce levelesensitive clocking We conclude this section by brie y discussing the issue of asynchronous signals and synchronizers an important problem for digital designers The purpose of this section is to introduce the major concepts in clocking methodology The section makes some important simplifying assumptions if you are interested in understanding timing methodology in more detail consult one of the references listed at the end of this appendix We use an edgeetriggered timing methodology because it is simpler to explain and has fewer rules required for correctness In particular if we assume that all clocks arrive at the same time we are guaranteed that a system with edgeetrige gered registers between blocks of combinational logic can operate correctly with out races if we simply make the clock long enough A race occurs when the contents of a state element depend on the relative speed of different logic ele ments In an edgeetriggered design the clock cycle must be long enough to accommodate the path from one ipe op through the combinational logic to another ipe op where it must satisfy the setup time requirement Figure B111 shows this requirement for a system using rising edgeetriggered ipe ops In such a system the clock period or cycle time must be at least as large as t L prop combinational t setup for the worstecase Values of these three delays which are de ned as follows 31 Timing Methodologies 373 Q Flip op D Q Flip op C Combinational logic block t t combmatlonal J l FIGURE 3111 In an edgetriggered design the clock must be long enough to allow sig nals to be valid for the required setup time before the next clock edge The timefor aflipr op input to propagate to the ipr ip outputs is tpxop the signal then takes tcombmauoml to travel through the combinational logic and must be valid g before the next clock edge setup etup l thP is the time for a signal to propagate through a ip op it is also some times called clocketer tcombim oml is the longest delay for any combinational logic which by de e nition is surrounded by two ipe ops l tsetup is the time before the rising clock edge that the input to a ipe op must be valid We make one simplifying assumption the holdetime requirements are satise ed which is almost never an issue with modern logic One additional complication that must be considered in edgeetriggered designs is clock skew Clock skew is the difference in absolute time between when two state elements see a clock edge Clock skew arises because the clock signal will often use two different paths with slightly different delays to reach two different state elements If the clock skew is large enough it may be possible for a state ele ment to change and cause the input to another ipe op to change before the clock edge is seen by the second ipe op Figure B112 illustrates this problem ignoring seteup time and ipe op propa gation delay To avoid incorrect operation the clock period is increased to allow for the maximum clock skew Thus the clock period must be longer than t L prop combination L a1 t skew setup With this constraint on the clock period the two clocks can also arrive in the opposite order with the second clock arriving tskew earlier and the circuit will work correctly Designers reduce clock skew problems by carefully routing the clock signal to minimize the difference in arrival times In addition smart design ers also provide some margin by making the clock a little longer than the mini mum this allows for variation in components as well in the power supply Since clock skew The difference in absolute time between the times when two state elements see a clock edge Appendix B The Basics of Logic Design Clock arrives at time t D Q D Q Combinational logic block with C delay time of A Flip op Flip op C Clock arrives a er t A FIGURE 3112 Illustration of how clock skew can cause a race leading to incorrect operation Because ofthe difference in when the two ipr ops see the clock the signal that is stored into the rst ipr op can race forward and change the input to the second ipr op before the clock arrives at the second ipr op levelesensitive clocking A timing methodology in which state changes occur at either high or low clock levels but are not instantaneous as such changes are in edgeetriggered designs clock skew can also affect the holdetime requirements minimizing the size of the clock skew is important Edgeetriggered designs have two drawbacks they require extra logic and they may sometimes be slower Just looking at the D ipe op versus the levelesensitive latch that we used to construct the ipe op shows that edgeetriggered design requires more logic An alternative is to use level sensitive clocking Because state changes in a levelesensitive methodology are not instantaneous a levelesensitive scheme is slightly more complex and requires additional care to make it operate correctly LevelSensitive Timing In a levelesensitive timing methodology the state changes occur at either high or low levels but they are not instantaneous as they are in an edgeetriggered method ology Because of the noninstantaneous change in state races can easily occur To ensure that a levelesensitive design will also work correctly if the clock is slow enough designers use two phase Clocking Twoephase clocking is a scheme that makes use of two nonoverlapping clock signals Since the two clocks typically called 11 and 12 are nonoverlapping at most one of the clock signals is high at any given time as Figure B113 shows We can use these two clocks to build a system that contains levelesensitive latches but is free from any race conditions just as the edgeetriggered designs were q H H 132 Nonoverlapping periods FIGURE 3113 A twophase clocking scheme showing the cycle of each clock and the nonoverlapping periods 31 Timing Methodologies 375 Combinational logic block Combinational 11 logic block FIGURE 3114 A twophase timing scheme with alternating latches showing how the system operates on both clock phases The output of a latch is stable on the opposite phase from its C input Thus the rst block of combinational inputs has a stable input during 2 and its output is latched by 2 The second rightmost combinational block operates in just the opposite fashion with smble inputs during 1 Thquot th la th o L39 39 quot 39 minimum time that the respective clocks must be asserted The size of the nonoverr lapping period is determined by the maximum clock skew and the minimum delay of any logic block One simple way to design such a system is to alternate the use of latches that are open on 11 with latches that are open on 12 Because both clocks are not asserted at the same time a race cannot occur If the input to a combinational block is a 11 clock then its output is latched by a IZ clock which is open only during lZ when the input latch is closed and hence has a valid output Figure B114 shows how a system with twoephase timing and alternating latches operates As in an edgeetrige gered design we must pay attention to clock skew particularly between the two clock phases By increasing the amount of nonoverlap between the two phases we can reduce the potential margin of error Thus the system is guaranteed to operate correctly if each phase is long enough and there is large enough nonoverlap between the phases Asynchronous Inputs and Synchronizers By using a single clock or a twoephase clock we can eliminate race conditions if clock skew problems are avoided Unfortunately it is impractical to make an entire system function with a single clock and still keep the clock skew small While the CPU may use a single clock 1 0 devices will probably have their own clock Chapter 8 described how an asynchronous device may communicate with the CPU through a series of handshaking steps To translate the asynchronous input to a synchronous signal that can be used to change the state ofa system we need to use a synchronizer whose inputs are the asynchronous signal and a clock and whose output is a signal synchronous with the input clock Our rst attempt to build a synchronizer uses an edgeetriggered D ipe op whose D input is the asynchronous signal as Figure B115 shows Because we communicate with a handshaking protocol as we will see in Chapter 8 it does metastability Asituation that not matter whether we detect the asserted state of the asynchronous signal on one occurs ifa signal is sampled clock or the next since the signal will be held asserted until it is acknowledged Whal it is DOt Stable for th Thus you might think that this simple structure is enough to sample the signal requ d semlp aid hOId tlmfs accurately which would be the case except for one small problem 5 tg jluii gjet 13 memate The problem is a situation called metastability Suppose the asynchronous sige region between ahigh and low nal is transitioning between high and low when the clock edge arrives Clearly it is value Appendix B The Basics of Logic Design synchronizer failure A situa tion in which a fLipe op enters a metastable state and where some logic blocks reading the output ofthe ip op see a 0 while others see a 1 Asynchronous input D Q Synchronous output Flip op Clock C FIGURE 3115 A synchronizer built from a D flipflop is used to sample an asynchronous signal to produce an output that is synchronous with the clock This synchronizer will nut work properly not possible to know whether the signal will be latched as high or low That prob lem we could live with Unfortunately the situation is worse when the signal that is sampled is not stable for the required setup and hold times the ipe op may go into a metastable state In such a state the output will not have a legitimate high or low value but will be in the indeterminate region between them Further more the ipe op is not guaranteed to exit this state in any bounded amount of time Some logic blocks that look at the output of the ipe op may see its output as 0 While others may see it as 1 This situation is called a synchronizer failure In a purely synchronous system synchronizer failure can be avoided by ensure ing that the setup and hold times for a ipe op or latch are always met but this is impossible when the input is asynchronous Instead the only solution possible is to wait long enough before looking at the output of the ipe op to ensure that its output is stable and that it has exited the metastable state if it ever entered it How long is long enough Well the probability that the ipe op will stay in the metastable state decreases exponentially so after a very short time the probability that the ipe op is in the metastable state is very low however the probability never reaches 0 So designers wait long enough that the probability of a synchroe nizer failure is very low and the time between such failures will be years or even thousands of years For most ipe op designs waiting for a period that is several times longer than the setup time makes the probability of synchronization failure very low If the clock rate is longer than the potential metastability period which is likely then a safe synchronizer can be built with two D ipe ops as Figure B116 shows If you are interested in reading more about these problems look into the references D Q Flip op Asynchronous input Synchronous output Flip op Clock C FIGURE 3116 This synchronizer will work correctly if the period of metastability that we wish to guard against is less than the clock period Although the output of the rst ipr op may be metastable it will not be seen by any other logic element until the second clock when the second D ipe op samples the signalwhich by that time should no longer be in a metastable state 312 Field Programmable Devices B77 Suppose we have a design with very large clock skewilonger than the register propagation time Is it always possible for such a design to slow the clock down enough to guarantee that the logic operates properly a Yes if the clock is slow enough the signals can always propagate and the design will work even if the skew is very large b No since it is possible that two registers see the same clock edge far enough apart that a register is triggered and its outputs propagated and seen by a second register with the same clock edge Field Programmable Devices Within a custom or semicustom chip designers can make use of the exibility of the underlying structure to easily implement combinational or sequential logic How can a designer who does not want to use a custom or semicustom IC imple ment a complex piece of logic taking advantage of the very high levels of integra tion available The most popular components used for sequential and combinational logic design outside of a custom or semicustom IC is a eld pro grammable device FPD An PPD is a integrated circuit containing combinae tional logic and possibly memory devices that is con gurable by the end user PPDs generally fall into two camps programmable logic devices PLDs which are purely combinational and eld programmable gate arrays PPGAs which provide both combinational logic and ipe ops PLDs consist of two forms simple PLDs SPLDs which are usually either a PLA or a programmable array logic PAL and complex PLDs which allow more than one logic block as well as con gurable interconnections among blocks When we speak of a PLA in a PLD we mean a PLA with user programmable andeplane and oreplane A PAL is like a PLA except that the oreplane is xed Before we discuss FPGAs it is useful to talk about how FPDs are con gured Con guration is essentially a question of where to make or break connections Gate and register structures are static but the connections can be con gured Notice that by con guring the connections a user determines what logic funce tions are implemented Consider a con gurable PLA by determining where the connections are in the andeplane and the oreplane the user dictates what logical functions are computed in the PLA Connections in PPDs are either permanent or recon gurable Permanent connections involve the creation or destruction of a connection between two wires Current FPLDs all use an antifuse technology which allows a connection to be built at programming time that is then perma nent The other way to con gure CMOS PPLDs is through an SRAM The SRAM is downloaded at powereon and the contents control the setting of switches that Check Yourself propagation time The time required for an input to a ip op to propagate to the outputs of the ipe op eld programmable devices FPD An integrated circuit containing combinational logic and possibly memory devices that is con gurable by the end user programmable logic device PLD An integrated circuit containing combinational logic whose function is con gured by the end user eld programmable gate array A con gurable integrated circuit containing both combie national logic blocks and ip ops simple programmable logic device SPLD Programmable logic device usually containing either a single PAL or PLA programmable array logic PAL Contains a programmae ble andeplane followed by a fixed oreplane antifuse A structure in an inte grated circuit that when pro grammed makes a permanent connection between two wires Appendix B The Basics of Logic Design lookup tables LUTs In a field programmable device the name given to the cells because they consist of a small amount of logic and RAM in turn determine which metal lines are connected The use of SRAM control has the advantage that the FPD can be recon gured by changing the contents of the SRAM The disadvantages of the SRAMebased control are two the con guration is volatile and must be reloaded on powereon and the use of active transistors for switches increases the resistance of such connections slightl FPGAs include both logic and memory devices usually structured in a two dimensional array with the corridors dividing the rows and columns used for gloe bal interconnect between the cells of the array Each cell is a combination of gates and ipe ops that can be programmed to perform some speci c function Because they are basically small programmable RAMs they are also called lookup tables LUTs Newer FPGAs contain more sophisticated building blocks such as pieces of adders and RAM blocks that can used to build register les A few large FPGAs even contain 327bit RISC cores In addition to programming each cell to perform a speci c function the inter connections between cells are also programmable allowing modern FPGAs with hundreds of blocks and hundreds of thousands of gates to be used for complex logic functions Interconnect is a major challenge in custom chips and this is even more true for FPGAs because cells do not represent natural units of decomposie tion for structured design In many FPGAs 90 of the area is reserved for inter connect and only 10 is logic and memory blocks Just as you cannot design a custom or semicustom chip without CAD tools you also need them for FPDs Logic synthesis tools have been developed that tar get FPGAs allowing the generation of a system using FPGAs from structural and behavioral Verilog Concluding Remarks This appendix introduces the basics of logic design Ifyou have digested the mate rial in this appendix you are ready to tackle the material in Chapters 5 and 6 both of which use the concepts discussed in this appendix extensively Further Reading There are a number of good texts on logic design Here are some you might like to look into 314 Exercises B79 Ciletti M D 2002 Advanced Digital Design with the VerilugHDL Englewood Cliffs NI PrenticerHall A thuruugh bunk an lugic design using Verilag Katz R H 2004 Mudern Lugic Design second edition Reading MA Addison Wesley A general text an lugic design Wakerly I E 2000 Digital Design Principles and Practices third ed Englewood Cliffs NI PrenticerHall A general text an lugic design Exercises 31 10 lt B2gt In More Depth DeMorgan s Theorems 32 15 lt B2gt In More Depth DeMorgan s Theorems 33 10 lt B2gt For More Practice Truth Tables 34 10 lt B2gt For More Practice Truth Tables 35 15 lt B2gt For More Practice Building Logic Gates 36 15 lt B2gt For More Practice Building Logic Gates 37 10 lt B2 B3gt Construct the truth table for a foureinput oddeparity funce tion see page 3765 for a description ofparity 38 10 lt B2 B3gt Implement the foureinput oddeparity function with AND and OR gates using bubbled inputs and outputs 39 10 lt B2 B3gt Implement the foureinput oddeparity function with a PLA 310 15 lt B2 B3gt Prove that a twoeinput multiplexor is also universal by showing how to build the NAND or NOR gate using a multiplexor 311 5 lt 42 B2 B3gt Assume that X consists of3 bits x2 X1 X0 Write four logic functions that are true if and only i l X contains only one 0 l X contains an even number of 0s l X when interpreted as an unsigned binary number is less than 4 l then interpreted as a signed two s complement number is negative 312 5 lt 42 B2 B3gt Implement the four functions described in Exercise B11 using a PLA Appendix B The Basics of Logic Design 313 5 lt 42 B2 B3gt Assume that X consists of3 bits X2 x1 X0 and Y con sists of 3 bits y2 y1 y0 Write logic functions that are true if and only if I X lt Y where X and Y are thought of as unsigned binary numbers I X lt Y where X and Y are thought of as signed two s complement numbers I XY Use a hierarchical approach that can be extended to larger numbers of bits Show how can you extend it to 67bit comparison 314 5 lt B2 B3gt Implement a switching network that has two data inputs A and B two data outputs C and D and a control input 5 IfS equals 1 the network is in passethrough mode and C should equal A and D should equal B If 5 equals 0 the network is in crossing mode and C should equal B and D should equal A 315 15 lt B2 B3gt In More Depth DeMorgan s Theorems 316 30 lt B2 B3gt In More Depth DeMorgan s Theorems 317 5 lt B2 B3gt For More Practice Multiplexors 318 5 lt 59 B3gt What is the function implemented by the following Verilog modules module FUNCI IO II S out input IO II input S output out out S II IO endmodule module FUNCZ outctlclkreset output 70 out input Ctl clk reset reg 70 out always posedge Clk it reset begin out lt 839b0 end else it ctl begin out lt out 1 end else begin out lt out 7 I end endmodule 314 Exercises 319 5 lt B4gt The Verilog code on page 1353 is for a D ipe op Show the Vere ilog code for a D latch 320 10 lt 59 B3 B4gt Write down a Verilog module implementation of a Zetoe4 decoder andor encoder 321 10 lt 59 B3 B4gt Given the following logic diagram for an accumulae tor write down the Verilog module implementation of it Assume a positive edge triggered register and asynchronous Rst 16 Out Clk Rst V Register Load 322 20 lt 45 B3 BA B5gt Section 34 presents basic operation and possible implementations of multipliers A basic unit of such implementations is a shift andeadd unit Show a Verilog implementation for this unit Show how can you use this unit to build a 32bit multiplier 323 20 lt 46 B3 BA B5gt Repeat Exercise B22 but for an unsigned divider rather than a multiplier 324 15 lt B5gt The ALU supported set on less than Sl t using just the sign bit ofthe adder Let s try a set on less than operation using the values 77m and 6m To make it simpler to follow the example let s limit the binary representations to 4 bits 1001two and 0110tw0 1001two r 0110two 1001WO 1010two 0011two This result would suggest that 77 gt 6 which is clearly wrong Hence we must face tor in over ow in the decision Modify the 17bit ALU in Figure 13510 on page Be 33 to handle Sl t correctly Make your changes on a photocopy of this gure to save time Appendix B The Basics of Logic Design 325 20 lt B6gt A simple check for over ow during addition is to see ifthe Care ryIn to the most signi cant bit is not the same as the CarryOut of the most signif icant bit Prove that this check is the same as in Figure 33 on page 172 326 5 lt B6gt Rewrite the equations on page 1343 for a carryelookahead logic for a 16bit adder using a new notation First use the names for the CarryIn signals of the individual bits of the adder That is use c4 c8 c12 instead of C1 C2 C3 In addition let Pig mean a propagate signal for bits 139 to j and Gig mean a generate signal for bits 139 to For example the equation C2 G1P1G0P1P0c0 can be rewritten as C8 G74P74 G30P74 P30 C0 This more general notation is useful in creating wider adders 327 15 lt B6gt Write the equations for the carryelookahead logic for a 64 bit adder using the new notation from Exercise 1326 and using 167bit adders as build ing blocks Include a drawing similar to Figure B63 in your solution 328 10 lt B6gt Now calculate the relative performance of adders Assume that hardware corresponding to any equation containing only OR or AND terms such as the equations for pi and gi on page 1339 takes one time unit T Equations that consist ofthe OR ofseveral AND terms such as the equations for c1 c2 c3 and c4 on page B740 would thus take two time units 2T The reason is it would take T to produce the AND terms and then an additional T to produce the result of the OR Calculate the numbers and performance ratio for 47bit adders for both ripple carry and carry lookahead If the terms in equations are further de ned by other equa tions then add the appropriate delays for those intermediate equations and con tinue recursively until the actual input bits of the adder are used in an equation Include a drawing of each adder labeled with the calculated delays and the path of the worstecase delay highlighted 329 15 lt B6gt This exercise is similar to Exercise B28 but this time calculate the relative speeds of a 16bit adder using ripple carry only ripple carry of 47bit groups that use carry lookahead and the carryelookahead scheme on page B738 330 15 lt B6gt This exercise is similar to Exercises 1328 and B29but this time calculate the relative speeds of a 64bit adder using ripple carry only ripple carry of 47bit groups that use carry lookahead ripple carry of 167bit groups that use carry lookahead and the carryelookahead scheme from Exercise B27 314 Exercises 331 10 lt B6gt In More Depth Carry Save Adders 332 10 lt B6gt In More Depth Carry Save Adders 333 10 lt B6gt There are times when we want to add a collection of numbers together Suppose you wanted to add four 47bit numbers AB E F using 17bit full adders Let s ignore carry lookahead for now You would likely connect the 17bit adders in the organization at the top of Figure 3141 Below the traditional orgai nization is a novel organization of full adders Try adding four numbers using both organizations to conVince yourself that you get the same answer 334 5 lt B6gt First show the block organization ofthe 167bit carry save adders to add these 16 terms as shown in Figure 3141 Assume that the time delay through each 17bit adder is 2T Calculate the time ofadding four 47bit numbers to the organization at the top versus the organization at the bottom ofFigure 3141 Appendix B The Basics of Logic Design a3 b3 a2 b2 a1 b1 a0 b0 Traditional adder 39I39I FIGURE 3141 Traditional ripple carry and carry save addition of four 4bit numbers The details are shown on the left with the individual signals in lowercase and the corresponding higherrlevel blocks are on the right with collective signals in upper case Note that the sum of four nrbit numbers can take 11 2 1 s 335 5 lt B8gt For More Practice FlipeFlops and Latches 336 5 lt B8gt For More Practice FlipeFlops and Latches 337 10 lt B10gt For More Practice Finite State Machines 338 10 lt B10gt For More Practice Finite State Machines 339 15 lt B2 B8 B10gt For More Practice Constructing a Counter 340 20 lt B10gt For More Practice Constructing a Counter Example A computer has a CPU with a 10 ns cycle time and a two level cache The rst level cache is a 64 KB 4 way associative with a block size of 32 bytes The second level cache is a 1 MB direct mapped with a block size of 256 bytes A hit in the rst level cache results in moving the item requested to the CPU in one CPU cycle A miss in the rst level cache causes an access to the second level cache which has a 40 ns access time and a transfer rate of 16 bytes per 10 ns A miss in the second level cache causes an access to the main memory which has a 400 ns access time and a transfer rate of 32 bytes per 20 ns We will assume that the data is always in the memory The miss rate of the rst level cache is 00089 The local miss rate of the second level cache is 0220 a Compute the average memory access time in ns and in CPU cycles b What fraction in of the average access time is due to i second level cache misses and ii rst level cache misses 725 1 O lt 73gt Using the series of references given in Exercise 79 show and misses and final cache contents for a twoway setassociative cache one word blocks and a total size of 16 words Assume LRU replacement Same for 4way 8way and fully associative To capture the fact that the time to access data for both hits and misses affects performance designers often use average memory access time AMAT as a way to examine alternative cache designs Average memory access time is the average time to access memory considering both hits and misses and the frequency of different accesses it is equal to the following AMAT Hit time MR x MP AMAT is useful as a figure of merit for different cache systems 717 5 lt 72gt Find the AMAT for a processor with a 2 ns clock a miss penalty of 20 clock cycles a miss rate of 005 misses per access and a cache access time including hit detection of 1 clock cycle Assume that the read and write miss penalties are the same and ignore other write stalls 718 5 lt 72gt Suppose we can improve the miss rate to 003 misses per reference by doubling the cache size This causes the cache access time to increase to 12 clock cycles Using the AMAT as a metric determine if this is a good tradeoff 719 10 lt 72gt If the cache access time determines the processor s clock cycle time which is often the case AMAT may not correctly indicate whether one cache organization is better than another If the processor s clock cycle time must be changed to match that of a cache is this a good tradeoff Assume the processors are identical except for the clock rate and the number of cache miss cycles assume 15 references per instruction and a CPI without cache misses of 2 The miss penalty is 20 cycles for both processors The following is from Computer Architecture A Quantitative Approach by Hennessy and Patterson Alpha 21264 processor Block Black address citsel lt29 lt93 ltSgt Tag l dex Valid Tag ltlgt 29gt Ste Dioctsr l Vlclllll puller Lowerlevel memory i Figure 57 The organization of the data cache in the Alpha 21264 microprocessor The 64 KB cache is twoway set associative with 64rbyte blocks The 9bit index selects among 512 setsThe four steps of a read hit shown as circled numbers in order of occurrence label this organizationThree bits ofthe block offset join the index to sup ply the RAM address to select the proper 8 bytes Thus the cache holds two groups of 4096 64bit words with each group containing half of the 512 sets Although not exerr cised in this example the line from lowerlevel memory to the cache is used on a miss to load the cacheThe size of address leaving the CPU is 44 bits because it is a physical address and not a virtual address Figure 536 on page 466 explains how the Alpha maps from virtual to physical for a cache access

### BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.

### You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

## Why people love StudySoup

#### "Knowing I can count on the Elite Notetaker in my class allows me to focus on what the professor is saying instead of just scribbling notes the whole time and falling behind."

#### "Selling my MCAT study guides and notes has been a great source of side revenue while I'm in school. Some months I'm making over $500! Plus, it makes me happy knowing that I'm helping future med students with their MCAT."

#### "I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

#### "Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

### Refund Policy

#### STUDYSOUP CANCELLATION POLICY

All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email support@studysoup.com

#### STUDYSOUP REFUND POLICY

StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here: support@studysoup.com

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to support@studysoup.com

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.