This 0 page Class Notes was uploaded by Louisa O'Kon I on Sunday November 1, 2015. The Class Notes belongs to ECEN 6263 at Oklahoma State University taught by Staff in Fall. Since its upload, it has received 7 views. For similar materials see /class/232911/ecen-6263-oklahoma-state-university in ELECTRICAL AND COMPUTER ENGINEERING at Oklahoma State University.

ECEN 6263 Advanced VLSI Design

Register File Design (October 24, 2006)

Register files are an important component for implementing registers in modern computer design. The heart of the computer processor is its number-crunching, or processing, units. We have already covered the basic design of these types of components. The processing units must be connected to the internal storage registers as shown. Note that the internal registers differ from main memory in that many different busses connect to the registers, but usually only one bus connects to memory elements. This greatly reduces the cost of memory elements but makes them unsuitable for direct connection to the processing units.

[Figure: memory units and registers connected to an adder, a Boolean logic unit, a vector floating-point unit, and a shifter unit.]

The diagram clearly shows that there is a data-flow bottleneck (the von Neumann bottleneck) between the registers and memory. The computer processor operates much more efficiently when there are enough registers that only occasionally does data need to be transferred between registers and memory. Typically dozens or even hundreds of registers are desirable. The interconnect area will be much too large if it is necessary to connect each individual register directly to each processing unit. This is why processor registers are usually implemented as register files. Register files are like memory in that only a limited number of connections (ports) are provided to each register, and the same connections are shared by different registers. Register files are unlike memory in that many more connections (ports) are provided than is typical for ordinary memory.

Implementation of the Data Storage Element

The most obvious choice for implementing a bit of data storage in the register file is the clocked DFF. However, routing clock signals to the extremely large number of bits (10^4 to 10^5) in the register file takes up too much area. Asynchronous storage elements take up much less space because they do not require clock signals. Storage cells from
ordinary memory designs can be adapted for use in the register file by providing interfaces to more than one address and data bus, i.e., a multiport memory.

The smallest memory cell is the dynamic RAM cell. The dynamic RAM cell is usually not used for register files, for the following reasons:

1. Dynamic RAM has additional overhead circuitry for refresh. Even though the register file is large, it is still a very small memory, and the size (expense) of the overhead may be too large. Also, there is no time available to insert extra refresh cycles, since the register file is used every clock cycle.

2. Dynamic RAM is relatively slow compared with other memory cells. This is because the storage cell capacitance in dynamic RAM must be made as large as possible in order to produce a significant change on the bit lines. Unfortunately, the settling time of the bit lines increases as the storage capacitance increases.

3. Dynamic RAM is normally implemented in specialized fabrication processes that have been optimized particularly for dynamic RAM. It may be difficult to get dynamic RAM to work well in fabrication processes used for ordinary digital integrated circuits.

For these reasons, a static RAM cell is usually used in register files. Recall from our earlier discussion of memory design that each static RAM cell requires a bit and a bit-bar line to be able to write new data into the cell. With the standard static RAM cell it is possible either to read two data simultaneously (the bit and bit-bar lines can be used to read different data) or to write one datum (the bit and bit-bar lines must carry the same datum). Each of the processing units usually requires an interface to the registers that can quickly read 2 input operands (source operands) and write 1 output operand (destination operand). If pipelining is done so that simultaneous reads and writes are not necessary, then a register file based on the static RAM cell with 2 single-ended reads or 1 double-ended
write is adequate for interfacing to a single processing unit. More advanced computer processor design requires that several processing units be connected to the same register file. One approach has been to partition the register file into pieces that are dedicated to each hardware unit; for example, the floating-point unit often has a special set of registers. The problem is that many machine instructions and much time must be wasted in moving data between registers. The true advantages of advanced processor design are achieved more effectively when many input and output ports are provided to each individual register in the register file, so that many of the hardware units can use the several registers simultaneously. The simplest thing to do is to add additional pairs of bit lines, as in a normal multiport RAM. Fortunately, if we use single-ended reads, then we gain an additional 2 reads or 1 write for each pair of added bit lines, which is exactly what is needed for interfacing to each added processing unit. As an example design, consider a register file cell with 3 pairs of bit lines. This type of register file is usually operated so that reads on all busses are done at the same time and writes are done at the same time, never a mixture of reads and writes. This design allows 6 simultaneous single-ended reads by putting different addresses on the 6 address busses (A1, B1, A2, B2, A3, B3). Also, 3 simultaneous double-ended writes are allowed by putting different addresses on the 3 pairs of address busses (A1/B1, A2/B2, A3/B3).

[Figure: multiport static RAM cell with word lines A1-A3 and B1-B3 crossing bit lines bit1-bit3 and their complements.]

The transistors must be carefully sized to ensure correct operation of the static RAM cell. We have already analyzed the cell with a single pair of bit lines. This analysis is adequate for the multiport static RAM cell as long as the same cell is not accessed at the same time through more than one bus.
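The operating discipline just described (reads on all busses together, writes on all bus pairs together, and different addresses on simultaneously written ports) can be captured in a short behavioral sketch. This is plain Python modeling only the port discipline, not the circuit; the class and method names (RegisterFile, read_all, write_all) and the 32x32 size are illustrative assumptions, not from the notes.

```python
# Behavioral sketch of the 3-bit-line-pair register file cell array:
# on any cycle, either six single-ended reads happen (one per address
# bus A1,B1,...,B3), or three double-ended writes happen (one per bus
# pair) -- never a mixture, and never two writes to the same register.

class RegisterFile:
    def __init__(self, nregs=32, width=32):
        self.regs = [0] * nregs
        self.width = width

    def read_all(self, addrs):
        """Six simultaneous single-ended reads (one per address bus).
        Reading the same address on several busses is acceptable."""
        assert len(addrs) == 6
        return [self.regs[a] for a in addrs]

    def write_all(self, writes):
        """Three simultaneous double-ended writes (one per bus pair).
        Writing one address over two busses gives unpredictable results
        in the real cell, so the model forbids it."""
        assert len(writes) == 3
        addrs = [a for a, _ in writes]
        assert len(set(addrs)) == len(addrs), "two ports writing one register"
        mask = (1 << self.width) - 1
        for a, v in writes:
            self.regs[a] = v & mask

rf = RegisterFile()
rf.write_all([(1, 0xAAAA), (2, 0x5555), (3, 0x1234)])
print(rf.read_all([1, 1, 2, 2, 3, 3]))  # → [43690, 43690, 21845, 21845, 4660, 4660]
```

The assertion in write_all stands in for the hardware mechanism that prevents two busses from writing the same register at once.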
Writing into the same address over two or more different busses will produce unpredictable results. There will usually be some mechanism to prevent this from happening, so we will not consider it further. Reading from the same address over 2 or more busses is perfectly acceptable. The worst case is when all of the A busses or all of the B busses read from the same location: we must prevent the combined precharged-high busses from overwriting a stored low.

[Figure: several precharged bit lines simultaneously pulling up the stored node Vcell through their access transistors.]

Clearly Vcell will be pulled higher than when only one read is done, making it more difficult to keep Vcell below Vinv. The noise margin is reduced from the single-port memory noise margin, which was already small. For this reason this design style is only appropriate for 2 or 3 bit-line pairs; the noise margin becomes impractically small for more bit lines.

High-Noise-Margin Multiport Memory

The area of the memory cell ∝ (the number of word lines) × (the number of bit lines). The number of word lines ∝ the number of ports, and the number of bit lines ∝ the number of ports. Therefore the area of the memory cell ∝ (the number of ports)². The memory cell size for multiport memory can be very large just to lay out the word and bit lines in metal. There is room underneath to increase the transistor sizes and/or increase the number of transistors without increasing the area of the memory cell. This makes it possible to have alternative memory cell designs other than the classic static RAM cell.

(fig. 11.23, p. 729) Multiport static RAM cell with extra transistors added to allow reading and writing on different bit lines. Observe that reading the cell does not alter the stored cell voltage, because of the read-buffering circuitry in the lower right corner. The read noise margin came about because we needed to ensure that the cell voltage did not cross Vinv. In this case the cell voltage does not change at all during a read, which eliminates the need to
provide a read noise margin in our design equations. Thus the sizes of the cell transistors in the cross-coupled inverters only have to satisfy the write noise margin equations derived earlier, which can easily be done to provide large write noise margins. Furthermore, the transistor sizes in the read buffer circuitry can be chosen to minimize read access delay time; the minimum time will be limited by the maximum size of the transistors in the buffers. The extra inverter makes it possible to do a double-ended write from only one write bit line (there is no bit-bar line).

High-Speed Pipeline Implementation, Cont. (November 24, 2006)

Clocking Schemes for High-Speed Domino Logic

(Fig. 7.43a, p. 427) Traditional domino clocking. While the clock is high, the first half of the domino gates evaluate. When the clock goes from high to low, the latch preserves the outputs from the first half, which drive the second half while they evaluate. When the clock goes from low to high, the second latch preserves outputs and the process repeats. Since half of the domino gates evaluate while the other half precharges, no time is wasted by the precharge cycle. The latches must sit between domino gates clocked on different phases to hide the precharge signal from the following domino gates. The evaluation delay of a domino gate cannot straddle the clock phase boundary, which eliminates the possibility of borrowing.

(Fig. 7.43b, p. 427) Clock skew must be accounted for each half cycle, which doubles the time lost to clock skew margin.

(Fig. 7.44, p. 429) Latches can be eliminated provided that the clock signals overlap. This is the opposite of dynamic latches, which require nonoverlapping clocks.

(Fig. 7.45, p. 429) If keepers are used, a full keeper is required on the first domino gate in each clock phase.

(Fig. 7.46, p. 430) Skew-tolerant domino. Skew is tolerable as long as it is less than the overlap. Limited borrowing is also possible during the overlap period.

(Fig. 7.47, p. 431) Local
generation of overlapping clock phases from a single global clock. Since local clocks do not travel far, it is easy to keep small skew between the phases. Clock choppers delay the falling edge of clk and clkb to produce overlapping φ1 and φ2.

(Fig. 7.48, p. 431) OTB domino uses clk or clkb on the first domino gates in each phase to prevent incorrect operation from short contamination delays.

(Fig. 7.49, p. 432) Four-phase domino. Contamination delays are not a problem, since all 4 clock phases are never on at the same time. Since each phase overlaps the adjacent phase by T/4, skew and borrowing can be larger.

(Fig. 7.50, p. 432) Local clock generators for four-phase overlapping clocks.

(Fig. 7.51, p. 434) N-phase domino. Delay chains produce clock phases for each domino gate. In (a), each global clock edge starts an evaluation of half of the domino gates, which works even if the global clock is slow. In (b), the rising edge of the global clock triggers evaluation of all of the domino gates, which requires a flip-flop to make sure that the last phase φ6 overlaps the first phase φ1 for slow global clocks.

Timing for Unfooted Domino Gates

(Fig. 6.24, p. 333) Recall that the series precharge nFET can be eliminated to make unfooted domino gates. The series nFET makes sure that there is no path to ground during precharge. If the inputs can be guaranteed to be low during precharge, there is no need for the series nFET, and a faster gate results.

(Fig. 6.28, p. 335) The outputs of the static gates X and Z are driven low during precharge and can be used as inputs to unfooted domino gates, provided that the precharge of the unfooted gate is delayed until its inputs settle low.

[Waveform sketch: clk1 and clk2, with clk2 shown undelayed and delayed. Undelayed clk2 attempts to precharge the unfooted gate before X goes low (abnormal precharge); delayed clk2 gives normal precharge.] The delayed clk2 waits to precharge the unfooted gate until X goes
low. Observe that a path to ground exists in the unfooted gate until X goes low. Excessive power consumption in the unfooted gate during precharge can be avoided by delaying the falling edge of clk2. Observe that the precharge of the unfooted domino is delayed in both cases. Additional unfooted gates will have additional delays during precharge: unfooted domino precharges sequentially, whereas footed domino precharges in parallel. This requires inserting footed domino gates in between the unfooted gates to avoid lengthening the precharge period.

(Fig. 7.58, p. 440) Clocking for unfooted domino.

Self-Resetting Domino Gates

(Fig. 7.52, p. 435) Self-resetting domino (postcharged domino). The evaluation of the domino gate triggers a delayed reset signal, which postcharges the gate as a precharge for the next evaluation. This eliminates the need for generating complicated clock waveforms, but each individual gate must have its own individual timing chain. The main advantage is reduction in power, since only precharged gates that evaluate low must be charged high again; normal domino gates consume considerable dynamic CV²f power on the clock lines, whereas self-resetting domino gates have no clock lines. Note that the output Y is a pulse rather than a level as in a normal logic gate. If the input A remains high after the gate resets itself, more pulses will occur. If the input A is also a pulse, which goes low before the gate resets, then only a single pulse will occur. This allows a self-resetting domino gate to drive another self-resetting domino gate, provided the timing chain delays are the same. Ensuring that the timing chain delays are the same requires careful layout.

(Fig. 7.53, p. 435) Predicated self-resetting domino operates correctly even when the timing of the input pulses is unknown, or when the inputs are static levels. When input A is high, the precharged gate will evaluate low, but the reset (postcharge)
will not occur until input A goes low again.

(Fig. 7.55, p. 436, and fig. 7.56, p. 437) SRCMOS (IBM) has pulsed inputs and outputs like self-resetting domino, but a common timing chain is used for the reset (postcharge) signal. This reduces the amount of overhead circuitry, since the timing chain can be shared among many logic gates instead of providing a timing chain for each individual gate. However, the timing chain has a high activity factor and correspondingly higher power consumption.

(Fig. 7.57, p. 439) Global STP (Intel) is similar to SRCMOS. Note the multitude of timing interrelationships required for correct operation of both Global STP and SRCMOS.

(Top, p. 438) Postcharge circuits all produce pulses rather than levels. Pulsed circuits require careful timing to avoid the problems listed.

Non-Monotonic Techniques

(Fig. 6.27, p. 335) Recall that precharged gates cannot directly drive other precharged gates using the same precharge clock, because X can fall during evaluate and Y cannot go back high as it should.

(Fig. 7.61, p. 442) Precharged gates can directly drive other precharged gates provided that evaluation of the second gate is delayed so that X falls before the second gate starts evaluation. If φ2 is delayed by the amount of delay in the first gate, then correct operation results. Precharge is delayed as well as evaluate, and the second gate output Y may be degraded just as precharge starts. This glitch does not propagate and is usually not a problem.

(Fig. 7.62, p. 443) Various circuit techniques for producing adjustable delay.

(Fig. 7.63, p. 444) Dummy gates can be used to match the delay of a logic gate more precisely to produce the delayed clock.

(Fig. 7.64a, p. 444) Clock-delayed domino (CD domino). The clock at each level is delayed by the longest delay of any gate at that level.

(Fig. 7.64b, p. 444) Faster CD domino. Each gate is delayed by an amount matching the delay of its slowest input. Observe that the worst-case delay
through the top 3 gates is 260 ps, compared to 280 ps in part (a). The performance increase comes at the cost of inserting more delay elements. Observe that each delay element must be designed individually, which requires extensive design effort.

(Fig. 7.65, p. 445) Annihilation gate, used by Intel. Instead of delaying the clock, a keeper is used to pull W back high when X goes low. This works as long as X falls faster than W. Observe that a multi-input domino NOR, with nFETs in parallel, is much faster than a domino NAND with nFETs in series.

(Fig. 7.67, p. 446) Complementary signal generator, used by Intel. Works the same way as the annihilation gate. The cross-coupled pFETs help slow down W relative to X.

(Fig. 7.68, p. 447) Output prediction logic (OPL), based on CMOS. OPL is precharged with a delayed clock, as is CD domino, but OPL has a static pullup circuit so that the clock delay is not critical to correct operation.

(Fig. 7.69, p. 447) Chain of OPL gates.

(Fig. 7.70, p. 447) Effect of small, large, and moderate clock delays on gate delays in the chain of OPL gates. For moderate clock delays the gate outputs temporarily settle at an intermediate voltage, where they are most sensitive to any changes on the gate inputs. When the inputs settle, the output quickly goes to its final value, much like a domino gate. This greatly reduces the delay of a chain of OPL gates. Currents are flowing in the gates while the outputs are in midrange, and large voltage glitches can occur, which increases power dissipation.

(Fig. 7.71, p. 448) Optimum clock delay to minimize path delay for different-size pFETs in the pullup circuit. Note that the smallest pFETs give the smallest minimum path delay.

(Fig. 7.72a, p. 449) OPL from pseudo-nMOS. (Fig. 7.72b, p. 449) OPL from a precharged gate. (Fig. 7.72c, p. 449) Differential OPL from combining CVSL (fig. 6.20a, p. 332) and dual-rail domino (fig. 6.30a, p. 337). Differential OPL is faster than both: the pulldown does not fight a pFET cross-couple, and no static
inverters are needed.

Interfacing Domino and Static Logic

(Fig. 7.73, p. 450) Latches prevent edges from occurring on static outputs during evaluation of the domino gates. In this example, the level-low-sensitive latch is transparent only when the following domino gates are being precharged by the low clock level. Note that time borrowing allows the latch delay to be moved anywhere in the low clock phase, provided that sufficient clock skew margin is maintained. Latches also prevent domino outputs during precharge from driving subsequent static gates. In the example, a level-high-sensitive latch can be added after the domino gates, transparent only when the domino gates are evaluating; subsequent static gates will then not be driven by the logically incorrect precharge value from the domino gates.

Tree Adders (November 2, 2006)

All of the schemes examined so far have a delay proportional to the number of bits. The delay can be further reduced by using a hierarchy of blocks (any of the schemes can be used for the basic blocks). For example, a multilevel LAC scheme would look like:

[Figure: 3-level look-ahead carry (LAC) tree built from CLA blocks; group G's and P's propagate up the tree and carries propagate back down; delay ∝ log N for N total bits.]

As we keep adding levels, the delay approaches log N. One can easily see that the following relationships hold for the group look-ahead carry functions:

    G[i:j] = G[i:k] + P[i:k]·G[k-1:j]
    P[i:j] = P[i:k]·P[k-1:j],    i > k > j
    C[i] = G[i:j] + P[i:j]·C[j-1]
compared with other tree adders The delay is proportional to logN but two trees must be traversed which doubles the number of levels needed g 1034b p 662 Sklansky tree adder The C s are calculated with grey cells in parallel with G s and P s and more intermediate G s and P s are calculated with black cells The delay is proportional to logN but the numbers of levels have been reduced since only a single tree is traversed Observe the high fanout similar to the carry selectincrement adder The maximum fanout is N2 in the last level This large fanout will slow down the adder unless the transistors are carefully sized to handle the loads g 1034c p 662 KoggeStone tree adder G s P s and C s are calculated in parallel More intermediate G s and P s a calculated to avoid excessive fanout The delay is pro portional to logN with the minimum number of logic levels but the number of crossover wires between levels increases rapidly The number of crossover wires between the last two levels is N2 These wires take up excessive area for large adders and increase the loading of the bottom levels This form is popular with designers because of its potential high speed g 1034def p 663 Various tree adders intermediate to the rst three tree adders g 1035 p 665 Taxonomy of tree adders showing their properties of number of levels maximum fanout and maximum number of wiring tracks between levels High Valency Tree Adders The G s and P s have been combined two groups at a time to make a larger group Accordingly the carry chains in the tree adders have been binary trees The G s and P s can be combined more than two groups at a time to make higher radix trees The authors de ne valency as the number of groups combined together to make larger groups For example the valency3 equations can be derived recursively from the valency2 equations that we have already been using Gij Gik PikGk7 lj JPHG07UkPtWLka7UJ Pij Piik39PUCi lj PilPlilkPki 1 igtlgtkgtj eq 108 p 647 valency4 equations g 102c p651 valency4 
circuit.

(fig. 10.36, p. 666) Valency-3 tree adders. The high-valency trees have fewer levels, but the circuitry in each level is more complex.

Hybrid Tree Adders

For small adders (up to 4 bits) there is no significant speed advantage for tree adders over ripple-carry adders, particularly if the ripple-carry adder has an internal Manchester carry chain. For larger adders the tree need only supply every 4th bit instead of every bit. This results in much smaller, cheaper, faster trees called sparse tree adders.

(fig. 10.39, p. 668) Sklansky valency-2 sparse tree adder. (fig. 10.40, p. 669) Kogge-Stone valency-3 sparse tree adder.

It is also possible to make more complex hybrids that use Brent-Kung structures for some levels and bit positions, Sklansky for other positions, and Kogge-Stone at still others. This is currently an area of active research.

Register File Design, Cont. (October 26, 2006)

Sense Amp for Single-Ended Read

It is quite common in classic SRAM memory design to read information from the memory cells by precharging the bit lines high (fig. 11.16, p. 725), then using an analog differential amplifier to quickly sense any small difference between the bit and bit-bar lines (fig. 11.17, p. 725). We cannot do this for the register file, since we have only bit lines and do not have any bit-bar lines. Similarly to DRAM, folded bit lines (fig. 11.30, p. 737) can provide the two inputs to the differential sense amp (fig. 11.8, p. 718). Alternatively, a specially designed inverter can be used to perform a single-ended read. The transistor sizes in the inverter sense amp must be chosen to minimize the read access time. The circuit and the waveforms on the bit lines are shown below.

[Figure: bit-line precharge pFET and memory cells sharing one bit line, driving an inverter sense amp; waveforms show the slow fall of Vbit and the fast rise of Vout, defining read access time t_rd.]

Note that the large bit-line capacitance C_bit produces a long fall time on the bit line. The sense amp inverter is designed with
a much smaller output capacitance, so the output rise time is much shorter. The read access delay t_rd is decreased by raising Vinv of the inverter used as the sense amp. The decrease in read access time must be traded off against the decreased noise margin when raising Vinv. Raising Vinv is accomplished by using wider-than-normal pFETs in the inverter sense amp.

Write Driver

The write driver holds the bit and bit-bar lines high or low when doing a write, so that the data in the selected memory cell is overwritten. Since we already have a large pFET for precharging the bit lines high, we only need to add an nFET to bring the bit lines low, if necessary, during a write. The circuit for each pair of bit and bit-bar lines is in fig. 11.10a, p. 719 (for a single-ended write, leave out the bit-bar line). A more efficient circuit might look like the following:

[Figure: bit-line precharge pFET plus write-driver nFETs gated by a read/write control line, driving the bit and bit-bar lines into the memory cell.]

A low on the read/write control line enables the nFETs to write the data into the selected memory cell. A high on the read/write control line disables both nFETs so that the stored data in the selected memory cell can be read.

Address Decoder

The address inputs to the register file contain the binary-encoded register number to be used for reading or writing. Each address must be decoded by a binary decoder to produce the word select lines for the memory cells that drive each bit line. A separate decoder is needed for each bit line in order for simultaneous read/write operations to be done on each bit line. Each decoder consists of large multi-input AND gates, one AND gate for each word select line. The fastest implementation of the large AND gates is as a tree of smaller gates, as discussed last semester. Each of the AND gates from each of the decoders should be laid out
following example with 6 bit lines Each col umn of AND gates implements a single complete decoder Addr Addr Addr Addr Addr Addr Bus Bus Bus bit and lines Bus Bus Bus A1 A2 A3 llllllll I mem mem mem memi f word word ines ines cell cell cell cell mem mem mem ce11 cell cell Note that the height of the AND gate cell should be adjusted to match the height of the memory cell so that the layout ts together without wasted area It must be possible to disable the decoders so that all of the word lines are low This will turn off all of the pass transistors in every memory cell so that no memory cell is selected This is necessary because 1 The address input must be stable before any word is selected for reading or writing 2 All of the word lines must be low while the bit lines are precharged high Since precharging the bit lines is done before each read and write operation it is critical that the time delay from the decoder enable to word lines should be as small as possible This can be done by implementing the multiinput AND gates in the decoder with an enable that has a very short path to the decoder output address word inputs I line enable Register File Design Com October 26 2006 page 3 of6 I ECEN 6263 Advanced VLSI Design I Timing Requirements Even though the register le does not require a clock input it is still necessary for the sig nals on the control lines to have the correct timing relationships in order for the register le to work correctly Timing diagrams for the read and write cycle are shown below for the three control lines that have been used in this design pT l t T 39 t read A write T T k k 39 t read cycle write cycle The decoder must remain disabl l until the precharge cycle nishei Thus decoder enable can go high only after pre is high and must go low before pre goes low The readwrite line can go low anytime during the precharge before a write but it must be high during the precharge before a read in order to not interfere with the precharging Register File 
Timing Requirements

Even though the register file does not require a clock input, it is still necessary for the signals on the control lines to have the correct timing relationships in order for the register file to work correctly. Timing diagrams for the read and write cycles are shown below for the three control lines used in this design.

[Timing diagram: precharge, decoder enable, and read/write waveforms over a read cycle and a write cycle.]

The decoder must remain disabled until the precharge cycle finishes. Thus decoder enable can go high only after precharge is high, and must go low before precharge goes low. The read/write line can go low any time during the precharge before a write, but it must be high during the precharge before a read in order not to interfere with the precharging.

Generating Timing Waveforms

Asynchronous design techniques are usually used to generate these waveforms. In asynchronous design, gate delays are used to get the correct timing of the waveforms, rather than using a clock signal. The basic building block is the delay block, which is usually implemented as a chain of inverters. The delay t_d through the inverters can be fixed by varying the number of inverters in the chain and by varying the sizes of the transistors in the inverters. The delay block can be used to implement other useful building blocks, such as the pulse generator and the nonoverlapping square wave generator. The nonoverlapping square wave generator is frequently used to generate nonoverlapping clock phases φ1 and φ2 from a simple square wave input.

[Figure: pulse generator built from a delay block with an odd number of inverters, and a nonoverlapping square wave generator built from delay blocks with an even number of inverters.]

A nonoverlapping generator circuit can be used to make decoder enable and precharge mutually exclusive. Similarly, precharge and read/write should be gated to drive the internal write driver and precharge transistor so that they are never on at the same time.

[Figure: strobe driving decoder enable and precharge through a nonoverlapping generator; read/write gated with precharge to form the internal read/write signal to the write driver.]

In this example, the decoder is enabled when strobe goes high, and precharging takes place when strobe is low. It is still necessary that read/write be stable (not change) during the time that strobe is high. The strobe control line must also be properly synchronized with the address lines, so that the address lines are stable whenever strobe is high and the decoder is enabled. The address lines are usually synchronized with the system clock. However, the address lines do not become stable until a short delay after the clock period begins. Therefore we cannot use the system clock directly as the strobe signal. We can use the
clock to drive pulse generators to create the correctly timed strobe waveform.

[Timing diagram: clk, addr, and strobe, with delays t_d1 and t_d2 setting the strobe edges relative to the clock.]

Carry-Save Adder Implementation (December 11, 2004)

Now that we have seen that carry-save adder trees are most efficiently implemented by putting together the 3:2 blocks, we must still address the issue of how to implement the 3:2-block carry-save adder efficiently. Functionally, the carry-save adder is identical to the full adder. The full adder is usually implemented with a reduced delay from Cin to Cout, because the carry chain is the critical delay path in adders. Unfortunately, there is no single carry chain in the carry-save adder trees in multipliers, so it does not pay to make the delay shorter for one input by sacrificing delay on the other inputs. Instead, carry-save adders are normally implemented by treating the 3 inputs equally and trying to minimize delay from each input to the outputs. We have

    S = A ⊕ B ⊕ C = A·B̄·C̄ + Ā·B·C̄ + Ā·B̄·C + A·B·C
    C = A·B + A·C + B·C

As we can see from the expanded version of the exclusive-or function for the sum S, both the uncomplemented and complemented form of each input is required. (There is a transmission-gate XOR circuit that does not require the complemented inputs, but we won't consider it here.) If we want to avoid putting extra inverters in our carry paths to produce the complemented inputs, the best thing to do is to have each carry-save adder produce both uncomplemented and complemented outputs, which can then be used as inputs by the next stage of carry-save adders. Due to symmetries in the logic functions for C and S, producing C, C̄, S, and S̄ does not take as much circuitry as one might think. The idea is to find common subfunctions for which we may use the same transistors to implement parts of more than one output function:

    S = A·B̄·C̄ + Ā·B·C̄ + Ā·B̄·C + A·B·C
    S̄ = Ā·B̄·C̄ + A·B·C̄ + A·B̄·C + Ā·B·C
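The sum and carry expressions above are small enough to check exhaustively over all eight input combinations. A quick sketch (plain Python, added here for verification, not from the notes) confirms that the minterm expansion of S, its complement, and the majority carry agree with ordinary binary addition:

```python
# Exhaustive check of the carry-save (full) adder equations:
# S = A xor B xor C (the odd-parity minterms) and C = AB + AC + BC.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            na, nb, nc = 1 - a, 1 - b, 1 - c  # complemented inputs
            s = (a & nb & nc) | (na & b & nc) | (na & nb & c) | (a & b & c)
            s_bar = (na & nb & nc) | (a & b & nc) | (a & nb & c) | (na & b & c)
            carry = (a & b) | (a & c) | (b & c)
            # 2*carry + s must equal the arithmetic sum of the three inputs
            assert 2 * carry + s == a + b + c
            assert s_bar == 1 - s
print("carry-save adder equations check out")
```

The same exhaustive style extends to checking the shared-subfunction factorizations used later to fold transistors.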
C = AB + BC + AC = AB + (A + B)C
C' = A'B' + B'C' + A'C' = A'B' + (A' + B')C'

In both cases we see the functions have
1. common subfunctions, and
2. a common part that is gated by a complementary input.

These two properties allow the transistors for the common part to be shared. Consider full CMOS gates for f and f' with a common part C, gated by a pair of complementary signals i and j.

[Figure: full CMOS gates for f and f'; the common part C (and its dual) is shared, and the uncommon parts are gated by i and j.]

Here it is obvious why the gating signals i and j must be complementary: to avoid shorting f to f'.

Cleverly extending this idea gives the folded-transistor full CMOS design for the carry save adder shown below [2].

[Figure: folded-transistor full CMOS carry save adder producing S, S', C, and C' from inputs A, B, C and their complements.]

Note that this technique yields a double benefit: it reduces area and increases speed by reducing transistor loading.

The transistor count may be further reduced by using logic gate design styles that eliminate the pMOS pullup block, which is made possible when synthesizing both f and f'. Common blocks in f and f' may still be shared as above.

[Figure: CVSL gate [2]: cross-coupled pMOSFETs pulling up complementary outputs of nMOS-only f and f' blocks.]

In both cases the f and f' blocks are synthesized with nMOSFETs only (no pMOSFETs). CVSL eliminates having to duplicate f and f' with pMOSFETs by using the cross-coupled pMOSFETs, which force f and f' to opposite values. The problem is that the cross-couple is slow, as we saw last semester. Consider switching f from high to low. At the beginning of the switching transient the pMOS cross-couple has not yet switched, so the nMOSFET that just turned on must fight the pMOSFET that is still turned on in order to bring the f output low enough to turn on the other pMOSFET, which then causes the first pMOSFET to turn off. This can take a considerable amount of time, so that typical CVSL gates are not much faster than full CMOS gates even though the input gate load is 1/2 that of full CMOS. The
Complementary Pass Logic (CPL) method overcomes the speed problem by using inverters as level detectors for the two nMOS pass-transistor blocks. There is no cross-couple circuit and no fighting of logic levels. However, an nMOS pass circuit is notoriously slow at passing high logic levels. This can be compensated by adjusting the inverter crossover voltage V_inv to a lower than usual value, as discussed for partial-swing logic last semester. In fact, CPL is just non-full-swing pass-transistor logic in which both a logic function and its complement are implemented simultaneously. This is very useful for arithmetic circuits such as multipliers and adders.

CPL gates as originally presented in [1] can be improved somewhat. The AND/NAND gate should be changed as follows:

[Figure: original AND/NAND gate versus improved AND/NAND gate; the improved version uses explicit connections to 1 and 0 instead of loading the B input, and produces AB and (AB)'.]

The revised form has a much smaller load on the B input and is much faster. As usual, the inverters do not need to be included in every gate; they are inserted where needed to prevent n^2 delay through n transistors in series. For example, two 2-input XOR gates can be cascaded to make a 3-input XOR gate, and an inverter need not be inserted between the two XOR gates.

[Figure: CPL 3-input XOR made from two cascaded 2-input XOR stages, producing A ⊕ B ⊕ C and its complement.]

The 3-input XOR gate can be used to produce the sum output for the carry save adder. The CPL 3-input XOR gate has the same number of transistors as the folded CVSL 3-input XOR gate [2]. The structure of the circuits is almost the same, which can be seen by redrawing the CPL gate upside down and explicitly putting in the transistors of the inverters.

Sum Circuits

[Figure: CPL sum circuit versus folded CVSL sum circuit.]

The CPL gate has the advantage of being faster than the CVSL gate by about a factor of 2. The CPL gate has the disadvantage of dissipating DC power, whereas
the CVSL gate does not.

The carry output must also be implemented. The CPL carry circuit in [1] has 12 pass FETs in it. We can improve the speed of this circuit by putting in explicit connections to power and ground (1's and 0's), as we did earlier for the AND/NAND gate.

CPL Carry Circuits

[Figure: original and improved CPL carry circuits versus the folded CVSL carry circuit, each producing C_out and C_out'.]

It is interesting to note that the folded CVSL carry circuit from [2], which has only 6 pass FETs in it, cannot be made into a CPL circuit. When A = B = 1 in the CVSL circuit, a parallel combination of pass FETs controlled by C and C' gives a valid logic 0, but in CPL it does not.

[Figure: the parallel pass-FET combination gated by C and C': a valid 0 in CVSL, invalid in CPL.]

The above circuits are optimized implementations for the 3:2 carry save adder building-block cell. It is also possible to optimize other building-block cells, for example the 4:2 compressor. The 4:2 compressor has 4 explicit inputs plus one hidden carry, for a total of 5 inputs. The sum bit output of the 4:2 compressor is the exclusive-or of all 5 inputs. If the 4:2 compressor is made from two 3:2 blocks, then the 5-input XOR gets implemented as four 2-input XOR gates in series; a tree of XOR gates would be faster [3]. Similarly, a tree of gates can be found for the other 4:2 outputs, which would be faster than the circuits obtained from two cascaded 3:2 blocks.

[Figure: 5-input XOR of A, B, C, D, E built from cascaded 3:2 compressors versus an optimized tree.]

The CPL gate for the XOR tree might look like the following. Note that it is necessary to add the inverters before the internal XOR outputs can be used to control the gate of a pass FET.

[Figure: CPL 5-input XOR tree producing S and S'.]

Power Reduction for CPL

As with all forms of non-full-swing pass-transistor logic, CPL suffers from DC power consumption because the inverter
input voltage does not get high enough to completely turn off the pFET in the inverter. The DC power consumption can be eliminated by adding an extra pullup pFET to the inverter inputs. The function of the added pFET is to eventually bring the inverter input voltage high enough to completely turn off the inverter pFET. The extra gate and drain load from the added pFET does slow down the circuit, but it is still faster than CVSL.

Another way to eliminate DC power consumption is to have a special IC fabrication process that has been optimized for CPL [1]. The problem is that the transistor threshold voltages VTn and VTp are chosen to implement full-swing logic in normal IC fab processes. We want to choose VTn and VTp to eliminate the power consumption for non-full-swing logic. A high voltage at the inverter input is

VIH = Vdd − VTn,pass

where the threshold of the nMOS pass FET has been modified by the body effect, so that VTn,pass ≈ 1.5 VTn for most processes. The pFET turns off at

Vp,off = Vdd − |VTp|

so the high-input turn-off margin is

VMH = VIH − Vp,off = |VTp| − VTn,pass

If VTn and VTp are chosen to have the same magnitude, which is the case for most IC fabrication processes, then the high turn-off margin is negative, which is the cause of the DC power consumption. The low input to the inverter goes all the way to zero, so the low-input turn-off margin is

VML = Vn,off − VIL = VTn − 0 = VTn

which is the same as for full-swing logic. To optimize the fabrication process for non-full-swing logic, the high turn-off margin VMH should be increased to a positive number. If VTn is decreased to accomplish this, then the low turn-off margin VML is decreased. Increasing |VTp| does raise VMH without lowering VML. However, if we want VMH to be as big as VML, this would require

|VTp| = VTn + VTn,pass

Such a large |VTp| would make the p-channel devices very slow. There is another way to satisfy |VTp| = VTn + VTn,pass.
That is, to make the nMOS pass FETs differently from the regular nMOSFETs in the inverter. A "native" nMOSFET is easy to fabricate, with a threshold VTn' ≈ 0. If the native nMOSFET is used for the pass transistors, the body effect increases the threshold to only a few tenths of a volt. Thus it is possible to satisfy |VTp| = VTn + VTn',pass without increasing |VTp| very much.

[1] K. Yano et al., "A 3.8-ns CMOS 16×16-b Multiplier Using Complementary Pass-Transistor Logic," IEEE J. Solid-State Circuits, vol. 25, pp. 388-394, Apr. 1990.
[2] P. Song and G. De Micheli, "Circuit and Architecture Trade-offs for High-Speed Multiplication," IEEE J. Solid-State Circuits, vol. 26, pp. 1184-1198, Sep. 1991.
[3] M. Nagamatsu et al., "A 15-ns 32×32-b CMOS Multiplier with an Improved Parallel Structure," IEEE J. Solid-State Circuits, vol. 25, pp. 494-497, Apr. 1990.

Pipelining (November 7, 2006)

The following definitions are necessary to understand pipelining:

throughput = number of usable outputs per unit time
latency = time delay from when valid inputs are provided until the outputs are valid
cost = area, power

Pipelining is a method to get higher throughput without significantly increasing hardware cost or latency.

Objective: increase throughput while holding latency and cost roughly constant.

Solutions:

1. Use bigger transistors.

[Figure: delay td and cost (area) versus transistor width W; delay levels off at td,opt near W = Wopt while cost keeps growing.]

Cost/performance tradeoffs, as discussed previously, lead to an optimum value of the transistor sizes. We can still make the circuit faster, but it is not worth the cost.

throughput = 1/td,opt
latency = td,opt
cost = 1 adder

2. Use multiple copies of the hardware in parallel.

[Figure: two adders operating in parallel on alternating inputs.]

throughput = 2/td,opt
latency = td,opt
cost = 2 adders

The throughput is increased without increasing latency, but the cost is greatly increased.

3. Pipeline.

[Figure: two-stage pipelined adder: inputs, first half of the adder, a pipeline register (clock period T) holding intermediate results, second half of the adder, outputs.]

throughput = 1/T (T = clock period)
latency = 2T
cost = 1 adder + 1 register

The pipeline gets high throughput by applying new inputs to the first stage while the second stage works
on intermediate results from the first stage.

[Figure: pipeline timing with clock period T: input 1 and input 2 are applied in successive clock periods; the register output holds intermediate result 1, then intermediate result 2; the adder output produces output 1, then output 2.]

The pipeline gives maximum throughput when the clock period T is as small as possible. If the delay of the original adder, td,opt, is split between the delay of the first stage, td1, and the delay of the second stage, td2, then

T ≥ max(td1, td2) + Treg    where td1 + td2 = td,opt

The minimum value for T is attained when td1 = td2 = td,opt/2:

T = Tmin = td,opt/2 + Treg

Therefore

max throughput = 1/Tmin = 1/(td,opt/2 + Treg) = 2/(td,opt + 2 Treg)
latency = 2 Tmin = 2(td,opt/2 + Treg) = td,opt + 2 Treg

If Treg << T, then the pipeline's performance is as good as that of multiple parallel hardware; and if the cost of the pipeline register << the cost of the adder, then the cost of the pipeline is as good as that of a single adder.

Generalization to n Stages

The adder example shows that pipeline efficiency is limited by
1. register delay, and
2. the inability to evenly divide the delay between pipeline stages.

If we pipeline a circuit with original delay td, we find that the clock period cannot be shorter than T = td/n + ΔT, where ΔT represents the additional delay from the registers and from the inability to divide up the stages evenly, and n is the number of pipeline stages. Then

throughput = 1/(td/n + ΔT) = n/(td + nΔT)
latency = n(td/n + ΔT) = td + nΔT
cost ≈ unpipelined cost + n × register cost

[Figure: throughput, latency, and cost versus the number of pipeline stages n; throughput saturates while latency and cost grow.]

What this simple model is intended to show is that
1. a moderate amount of pipelining can be used to increase throughput without significantly increasing cost or latency;
2. pipelining cannot be used to achieve arbitrarily high levels of throughput;
3. there is an optimum number of pipeline stages beyond which pipelining is not economical;
4. the optimum value of delay per stage is a few multiples of the register delay ΔT.

Example for Kogge-Stone Look-Ahead Carry Tree Adder Pipeline
[Figure: unpipelined LAC tree adder (N-bit pairs in, log2 N levels of the LAC tree) versus a 2-stage pipelined 32-bit LAC tree adder with a mid register between stage 1 and stage 2, both clocked by CLK.]

Stage 1 of the pipeline has 3 crossover levels with smaller loads. Stage 2 of the pipeline has 2 crossover levels with larger loads. It may also be desirable to size up the transistors in the pipeline register to drive the large loads in stage 2. The stage 1 delay and the stage 2 delay are obviously different, although nearly the same. Since the maximum of the stage delays determines the clock period, the clock period must be slightly longer than td/2, even if the register delay is negligible.

Time Borrowing in Pipelines

When the pipeline registers are implemented as latches, time borrowing between adjacent pipeline stages is possible. Time borrowing can be used to eliminate the mismatch between stage delays. The first step is to replace the positive edge-sensitive register with a level-sensitive low latch in series with a level-sensitive high latch. Then rearrange the latches to eliminate as much delay as possible.

[Figure: the unpipelined adder, clock period T: levels 1 through 5 of the adder tree evaluate in sequence within one period.]

[Figure: the 2-stage pipelined adder, clock period T2: stage 1 (levels 1-3 followed by the low latch), then stage 2 (high latch followed by levels 4-5); time is lost from the stage-delay mismatch.]

[Figure: the 2-stage pipelined adder with borrowing, clock period T2' < T2: level 4 straddles the boundary between clock periods.]

The delay of the low latch at the end of the first stage is moved back to remove the wasted time. This is possible while the clock is low, because the low latch is transparent. Level 4 of the adder tree is moved back across the high latch at the beginning of the second stage and now straddles the time boundary between clock periods.
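The benefit of borrowing on the clock period can be put in numbers. A small sketch (Python; the stage delays are made-up illustrative values, not from the notes):

```python
# Hypothetical stage delays (arbitrary units) for a 5-level adder tree split
# into 3 levels + 2 levels, as in the Kogge-Stone example.
td1, td2 = 3.0, 2.0

# Edge-triggered register between stages: the period is set by the slower
# stage, so stage 2 wastes (td1 - td2) every cycle.
T_rigid = max(td1, td2)

# Transparent latches let stage 1 run past the nominal period boundary,
# borrowing the slack from stage 2; two periods only need to cover the total.
T_borrow = (td1 + td2) / 2

assert T_borrow <= T_rigid
print(T_rigid, T_borrow)   # 3.0 2.5
```

With latches, the period is set by the average stage delay rather than the maximum, which is exactly the mismatch elimination described here.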
The high latch was moved forward in time, which is possible as long as the clock is high. Thus stage 1 of the pipeline has borrowed time from stage 2 to compensate for the late arrival of the output of level 4 of the adder tree. This means that no time is lost because of stage-delay mismatch, and T2' < T2.

The key to understanding borrowing is that the latch delays can be moved anywhere within the clock phase where the latches are transparent. A high latch delay can be moved back until the latch output changes tdq after the rising edge of the clock; if we try to move the latch delay back further, the output will still not change until tdq after the rising edge. A high latch can be moved forward until the latch input changes tsu before the falling edge of the clock; if we try to move the latch delay further forward, the latch will save the wrong data.

[Figure: the usable window of the clock-high phase, bounded by tdq after the rising edge and tsu before the falling edge.]
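To see why the notes say pipelining pays off only up to a moderate depth, the simple n-stage model T = td/n + ΔT can be evaluated numerically (Python; the delay and cost numbers are invented for illustration):

```python
# Hypothetical numbers: total combinational delay, per-stage overhead
# (register delay plus uneven-split penalty), and cost parameters.
td, dT = 10.0, 1.0
base_cost, reg_cost = 100.0, 5.0

for n in (1, 2, 4, 8, 16):
    T = td / n + dT              # minimum clock period with n stages
    throughput = 1.0 / T
    latency = n * T              # = td + n*dT: grows with pipeline depth
    cost = base_cost + n * reg_cost
    print(n, round(throughput, 3), latency, cost)
```

Throughput approaches but never exceeds 1/ΔT as n grows, while latency and cost increase linearly in n, so the best cost/throughput tradeoff occurs at a moderate number of stages with a per-stage delay of a few multiples of ΔT.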

