About this Document

Class Notes
These 74-page class notes were uploaded by Dorris Borer on Monday, September 28, 2015. They are for ESE534 at the University of Pennsylvania, taught by Staff in Fall. Since upload, they have received 24 views. For similar materials see /class/215452/ese534-university-of-pennsylvania in Electrical Engineering at the University of Pennsylvania.

Date Created: 09/28/15
ESE680-002/ESE534: Computer Organization
Day 15: March 12, 2007 -- Rent's Rule and its Implications
Penn ESE680-002 Spring 2007 -- DeHon

Last time: superlinear growth rate of interconnect. For Rent exponent p > 0.5, interconnect richness drives area growth of Theta(N^(2p)).

Today: How rich should interconnect be? There is structure and locality in designs, and Rent's Rule characterizes that locality. This lecture covers the specifics of understanding interconnect and a methodology for attacking these kinds of questions: how rich should interconnect be -- allow full utilization, or be most area efficient? Model requirements and area impact.

Step 1: Build Architecture Model
Tree of Meshes: assume geometric growth. Nature's model is hierarchical; pick parameters and build the architecture, with restricted internal bandwidth that can be matched to the Rent model.
Parameterize C (base channel width), and parameterize the growth rate alpha. [The slide's growth-schedule examples are not recoverable from this extraction.]

Step 2: Area Model
We need to know the effect of architecture parameters on area costs; focus on the dominant components: wires, switches, and logic blocks.
Area parameters: A_logic ~ 40K lambda^2; A_sw ~ 2.5K lambda^2; wire pitch = 8 lambda.
Switchbox population: full population is excessive (next lecture). Hypothesis: linear population is adequate (still to be disproven). The example shown is artificially small for clarity.

Larger cartoon: a 1024-LUT network at p = 0.67.
[Figure: effects of p on capacity -- LUT capacity vs. N for several p values, log-log axes.]
[Figure: effects of p (alpha) on area -- 1024-LUT area comparison at p = 0.5, 0.67, 0.75.]
[Figure: application requirements -- max c = 7 at p = 0.68; average c = 5 at p = 0.72.]
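The Theta(N^(2p)) area-growth claim above can be sketched numerically. The constants `c = 5` and `p = 0.7` below are illustrative assumptions, not the lecture's benchmark values:

```python
# Sketch of Rent's-rule-driven area growth (assumed constants c=5, p=0.7).
def bisection_width(n_gates, c=5.0, p=0.7):
    """Rent's rule: IO = c * N^p wires cross a bisection of N gates."""
    return c * n_gates ** p

def wire_limited_area(n_gates, c=5.0, p=0.7):
    """For p > 0.5 the side length of a 2D layout is set by the wiring
    crossing the bisection, so area grows as Theta(N^(2p)), not Theta(N)."""
    side = bisection_width(n_gates, c, p)
    return side * side  # ~ c^2 * N^(2p)

# Doubling N grows wire-limited area by 2^(2p) ~ 2.64x for p = 0.7,
# i.e. superlinearly -- interconnect, not logic, dominates.
ratio = wire_limited_area(2048) / wire_limited_area(1024)
```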
Step 3: Characterize Application Requirements
Identify representative applications -- today, the IWLS93 logic benchmarks. How much structure is there? How much variation among applications?
[Figure: benchmark-wide results and benchmark parameters.]

Complication: interconnect requirements vary among applications, and interconnect richness has a large effect on area. What is the effect of an architecture/application mismatch -- interconnect too rich? Interconnect too poor?
[Figure: interconnect mismatch in theory.]

Step 4: Assess Resource Impact
Map designs to the parameterized architecture; identify the architectural resources required; compare mappings (LUT count vs. k; area overhead ratio).

Mapping to a fixed wire schedule: easy if the design needs fewer wires than the network provides; better results if we reassociate rather than keeping the original subtrees. If the design needs more wires than the network supplies, we must depopulate to meet the interconnect limitations.

Observation: we don't really want a bisection of LUTs. A subtree is filled to capacity by either its LUTs or its root bandwidth, so it may be profitable to cut somewhere other than the midpoint (no balance condition required); the bisection should account for both LUT and wiring limits.

Challenge: we don't know where to cut the design, because we don't know when wires will limit subtree capacity.

Brute-force solution: explore all cuts -- start with all LUTs in one group, consider all balances, recurse. Too expensive: exponential work, viable only if we repeatedly solve the same subproblems.

Simplification: fix a single linear ordering and let partitions pick a split point on the ordering. This reduces the problem to finding the cost of start/end ranges (subtrees within the linear ordering). There are only O(n^2) such subproblems, so we can solve it with dynamic programming.
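The O(n^2)-subproblem formulation above lends itself to a short dynamic-programming sketch. The cost model here is a hypothetical simplification (cut cost = nets crossing the split); the lecture's real mapper also accounts for LUT and wire-schedule capacity:

```python
# Sketch: gates fixed in one linear order; each subtree is a contiguous
# range [lo, hi); choose split points to minimize total cut cost.
# `nets` is a hypothetical input: (i, j) gate pairs that must be wired.
from functools import lru_cache

def best_tree_cost(n, nets):
    def cut_cost(lo, mid, hi):
        # nets with one endpoint in [lo, mid) and the other in [mid, hi)
        return sum(1 for i, j in nets
                   if lo <= min(i, j) < mid <= max(i, j) < hi)

    @lru_cache(maxsize=None)
    def cost(lo, hi):
        if hi - lo <= 1:
            return 0
        # try every split point; O(n^2) ranges overall, as on the slide
        return min(cut_cost(lo, m, hi) + cost(lo, m) + cost(m, hi)
                   for m in range(lo + 1, hi))

    return cost(0, n)
```

As the slides note, this is a heuristic: it is only optimal with respect to the chosen 1D ordering.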
Dynamic programming: start with the base sets (start = 0..N-1, size 1); compute all splits of size n from solutions to all problems of size n-1 or smaller; done when we have computed where to split [0, N-1].

Dynamic programming here is just one possible heuristic solution: it is not optimal, it depends on the ordering, and it sacrifices the ability to reorder on splits in order to avoid exponential problem size. There is an opportunity to find a better solution here.

Ordering LUTs is another problem: lay the gates out on a 1D line minimizing the sum of squared wire length, which tends to cluster connected gates together. This is solvable mathematically for the optimum (an eigenvector of the connectivity matrix); we use this 1D ordering as our linear ordering.

[Figure: mapping results.]

Step 5: Apply Area Model
Assess the impact of the resource results: resources x area model = area.
[Figure: net area; relative area ratio.]

Picking a network design point: minimize the expected-area objective rather than the area at full utilization. (As recoverable from the slide's table: a point near c = 6, p = 0.6 has relative expected area ~1.23, while the point sized for 100% utilization, near c = 10, p = 0.75, has relative area ~2.98.) Don't optimize for 100% compute utilization -- by analogy to 100% yield -- i.e., don't optimize for the highest peak.

What about a single design?
[Figure: LUT utilization and predicted area vs. p for a single design.]
[Figure: area comparison; LUT usage relative to ideal.]

Methodology:
1. Architecture model (parameterized)
2. Cost model
3. Important task characteristics
4. Mapping algorithm -- map to determine resources
5. Apply cost model
6. Digest results: find the optimum (multiple? understand conflicts; avoidable?)

Admin: wire-requirement assignment due Wednesday; next assignment out today (will require computer use); reading for today and tomorrow is online.

Big Ideas [MSB Ideas]: Interconnect area dominates logic area. Interconnect requirements vary among designs -- and within a single design. To minimize area, focus on using the dominant resource (interconnect); this may mean underusing non-dominant resources (LUTs).
Big Ideas [MSB Ideas], continued: There are two different resources here, compute and interconnect. The balance of resources required varies among designs -- even within designs. We cannot expect full utilization of every resource, and the most area-efficient designs may waste some compute resources (the cheaper resource).

ESE680-002/ESE534: Computer Organization
Day 8: February 5, 2007 -- Computing Requirements and Instruction Space
Penn ESE680-002 Spring 2007 -- DeHon

Previously: fixed and programmable computation; area-time-energy tradeoffs; VLSI scaling.

Today: computing requirements (review) and instructions -- a requirements and instruction taxonomy.

Requirements: in order to build a general-purpose, programmable computing device we absolutely must have:
- primitive compute elements (enough of them)
- interconnect in space
- interconnect in time (retiming)
- interconnect to external I/O
- instructions

Compute and interconnect; sharing interconnect resources; sharing interconnect and compute resources -- what role are the memories playing here?

A memory block (or register file): interconnect moves data from an input to a storage cell, or from a storage cell to an output. What do I need to be able to use this circuit properly -- to reuse it on different data?

Instruction Taxonomy
The distinguishing feature of programmable architectures is instructions: bits which tell the device how to behave (e.g., 0000 0001 0110 ... -> add ...).
Focus on instructions: instruction organization has a large effect on the realm of efficient utilization for an architecture, and on the size (compactness) of an architecture.

Terminology: a primitive instruction (pinst) is the collection of bits that tell a single bit-processing element what to do -- selecting input sources in space and in time (retiming) and the operation.

Computational array model: a collection of computing elements (compute operator plus local storage/retiming), interconnect, instructions, and input/output.

Ideal instruction control: issue a new instruction to every computational bit operator on every cycle.

Ideal instruction distribution -- why don't we do this? Problem: instruction bandwidth and storage area quickly dominate everything else.
- Compute block: ~1M lambda^2 (1K lambda x 1K lambda)
- Instruction: 64 bits
- Wire pitch: 8 lambda
- Memory bit: ~1.2K lambda^2

Instruction distribution: there is room to distribute 2 instructions across a PE per metal layer (1024 = 2 x 8 x 64). Feeding top and bottom / left and right doubles this, and a second complete metal layer doubles it again: 8 instructions per PE side.

Saturation: with a maximum of 8 instructions per PE side, the wire channels saturate at about N = 64 PEs; beyond this, instruction distribution dominates area. Instruction consumption goes with area, while instruction bandwidth goes with perimeter. Beyond 64 PEs, instruction bandwidth dictates PE size, so as we build larger arrays the processing elements become less dense.

Avoiding instruction-bandwidth saturation: how might we avoid this? Idea: put instruction memory in the array. Problem: instruction memory can quickly dominate area too.
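The distribution arithmetic above can be sketched directly. The layer and side counts are the slide's assumptions (64-bit instructions routed at 8-lambda wire pitch, fed from two sides over two metal layers):

```python
# Back-of-envelope sketch of instruction-distribution bandwidth.
INSTR_BITS = 64
WIRE_PITCH = 8                       # lambda
CHANNEL = INSTR_BITS * WIRE_PITCH    # 512 lambda of channel per instruction

def instructions_per_pe_side(side_lambda, metal_layers=2, fed_sides=2):
    """Instruction streams routable across one PE edge: each stream needs
    CHANNEL lambda of wiring; feeding from both sides and adding metal
    layers multiply the available channel capacity."""
    return (metal_layers * fed_sides * side_lambda) // CHANNEL
```

A 1024-lambda tile supports 2 streams per layer from one side, and 2 x 2 x 2 = 8 total, matching the slide's "8 instructions per PE side".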
Instruction memory requirements: each locally stored 64-bit instruction costs about 80K lambda^2 of memory (64 bits x 1.2K lambda^2) against a ~1M lambda^2 compute block, so even a dozen locally stored instructions roughly doubles the PE area.

Instruction pragmatics: instruction requirements could dominate array size. Standard architecture trick: look for structure to exploit in "typical computations." What structure do we usually expect?

Two extremes:
- SIMD arrays: one instruction per cycle, shared across the whole array of PEs -- uniform operation in space, operation variance in time.
- Microprocessors: one instruction per cycle, shared across the bits of a wide word, assuming temporal locality of instructions -- uniform operation across the word, operation variance in time.

Placing architectures: what programmable architectures and organizations are you familiar with?

Hybrids: VLIW and superscalar machines issue a few pinsts per cycle and share each instruction across W bits; a DPGA adds a small instruction store per PE.

Gross parameters:
- instruction sharing width (SIMD width, granularity)
- instruction depth (instructions stored locally per compute element)
- pinsts per control thread (e.g., VLIW width)

What differentiates a VLIW from a multicore -- e.g., a 4-issue VLIW vs. 4 single-issue processors?

Architecture instruction taxonomy: classify designs by control threads (PCs), pinsts per control thread, instruction depth, and granularity. Examples range from hardwired functional units with no instructions (e.g., an ECC/EDC unit or FP multiplier) through FPGAs, reconfigurable ALUs, bitwise SIMD, traditional processors, vector processors, DPGA, and VLIW/PADDI-style machines. [Taxonomy table not recoverable from this extraction.]

Instruction message: architectures fall out of the general model; full generality is too expensive; structure exists in common problems; exploit that structure to reduce resource requirements. Architectures can be
viewed in a unified design space.

ESE534: Computer Organization
Day 3: January 25, 2010 -- Arithmetic
Penn ESE534 Spring 2010 -- DeHon
(Work the preclass exercise.)

Today: addition -- organization, design space, parallel prefix.

Last time: Boolean logic lets us compute any finite function; we saw gates and a few properties of logic.

Why arithmetic? To start getting a handle on complexity (area and time), area-time tradeoffs, parallelism, and regularity. Arithmetic underlies much computation, and it grounds out complexity.

Preclass circuits 1-4: can the delay be reduced? How, and to what? (Tree-reduce the AND; for the brute-force multi-output AND: how big is it -- 38 gates here, and in general? Can we do better?)

Addition, base-2 bit-level example:
  C: 11011010000
  A: 01101101010
  B: 01100101100
  S: 11010010110

A = a_{n-1}*2^{n-1} + a_{n-2}*2^{n-2} + ... + a_1*2^1 + a_0*2^0 = sum_i a_i*2^i, and S = A + B. What are the functions for s_i and carry_i?
  s_i = carry_i xor a_i xor b_i
  carry_i = a_{i-1}*b_{i-1} + a_{i-1}*carry_{i-1} + b_{i-1}*carry_{i-1} = MAJ(a_{i-1}, b_{i-1}, carry_{i-1})

Ripple-carry addition: it is often convenient to define the logic for one bit as a "bit slice" (inputs A_i, B_i, C_i; outputs S_i, C_{i+1}) and then assemble a chain of full adders.

Ripple-carry analysis: what are the area and delay for an N-bit ripple adder with unit-delay gates? Area: O(N), about 6N gates. Delay: O(N), about 2N gate delays.
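The ripple-carry analysis above, together with the predictive (carry-select) recursion the lecture develops next, can be sketched in the same unit-delay model (2 gate delays per full-adder stage, 2 per mux, matching the slide's constants):

```python
# Delay recurrences for an n-bit adder in the slide's unit-delay gate model.
def ripple_delay(n):
    """Delay_RA(n) = 2n: the carry ripples through 2 gate delays per bit."""
    return 2 * n

def predictive_delay(n):
    """Delay_PA(n) = Delay_PA(n/2) + D_mux2, Delay_PA(1) = 2 (a full adder):
    compute both carry-in cases for the upper half in parallel and select.
    n is assumed to be a power of two."""
    d_mux = 2
    if n == 1:
        return 2
    return predictive_delay(n // 2) + d_mux
```

For a 64-bit adder this gives 2*log2(64) + 2 = 14 gate delays versus 128 for the ripple chain.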
Can we do better -- lower delay?

Important observation: do we have to wait for the carry to show up before beginning useful work? We do have to know the carry to get the right answer -- but how many values can it take on? Only two.

Idea: compute both possible values of the upper half (for carry-in = 0 and carry-in = 1) and select the correct result when we know the answer:
  A,B[n-1..n/2] and A,B[n/2-1..0]  ->  S[n-1..n/2], S[n/2-1..0]

Preliminary analysis (k = 2 in this example):
  Delay_RA(n) = k*n for the ripple adder.
  Delay_P2A(n) = Delay_RA(n/2) + D_mux2 -- almost half the delay.

Recurse: if something works once, do it again -- use the predictive adder to implement the first half of the addition (the redundant halves can share hardware):
  Delay_P4A(n) = Delay_RA(n/4) + 2*D_mux2

By now we realize we have been using the wrong recursion; we should use the predictive adder in the recursion itself:
  Delay_PA(n) = Delay_PA(n/2) + D_mux2
Every step cuts n in half. How many times can we cut in half?
  Delay_PA(n) = log2(n)*D_mux2 + C, where C = Delay_PA(1) (C = 2 if we use a full adder for PA(1)).
This is parallel prefix.

Carry functions -- maybe better to show this as a specialization. Think of each adder bit as computing a function on its carry-in: c_i = g(c_{i-1}), where the particular function g = f(a_i, b_i) depends on a_i and b_i (e.g., g(x) = x when a_i xor b_i; g(x) = 0 when a_i = b_i = 0).
two aibi1 functions o What functions will the gx X PrOpagate compose of two of these am xor b391 functions be 0 gx0 Squash Same as before p t y t y a b 0 322321 e genera e Penn ESE534 Spring2010 DeHon 29 Penn ESE534 Spring2010 DeHon 30 Compose Rules LSB MSB GG so GP SP GS ss PG PP PS work on board Penn ESE534 Spring2010 DeHon Combining Do it again Combine gi3i2 and gi1i What do we get Penn ESE534 Spring2010 DeH 83 82 S1 50 Reduce Tree AU151quot AEl El5 5155 4 EH 3 EB N2 BIZ All Ell Am Elm Sq AB GenAB 39 SqoutSq1Gen1sq0 GenomGen1Sq1GenO Penn ESE534 Spring2010 DeHon Compose Rules LSB MSB GGG SGG GPG SPS GSS SSS PGG PPP PSS 32 Penn ESE534 Spring2010 DeHon AUIEP AIGIEI5 AI515l51 AMER AIJJEIJI A113W AMEN AIOIEIDI 34 Penn ESE534 Spring2010 DeHon Amara AIEJEIEJ AISJEIEI AMEN Alslam mam Amam Aluialnl Sq AB GenAB SqutSq1Gen1SqO GenomGen1Sq1GenO Delay and Area Penn ESE534 Spring2010 DeHon Reduce Tree SqAB GenAB AEncode2 DEncode1 ACombine4 39 SqoutSq1IGen1Sq0 DCombine2 Gen0utGen1Sq1Gen0 ACarry2 DCarry1 37 Penn ESE534 Spring2010 DeHon Reduce Tree Area AMEN 5135 Al513l5 AMEN Al3lEI3l 213ml N1 BU AMEN AEncode2 ACombine4 ACarry2 Area 2N4N12 39 Reduce Tree Delay AlVIBUl RISING AI5IBI5 AMEN AMEN Alzl lzl AI JBI AIBIBIUI DEncode1 DCombine2 DCarry1 Delay 12logZN1 38 Penn ESE534 Spring2010 DeHon Reduce Tree Area amp Delay N7 5U A5 BIS N5 35 AM 34 Al3 BIB AlZ BIZ Am EU An 31 AreaN 6N2 DelayN 2log2N2 40 Penn ESE534 SpringZO lO DeHon Intermediates Can we compute intermediates efficiently gt a gt z z 3 3 Penn ESE534 SpringZO lO DeHon Penn ESE534 Spring2010 DeHon Prefix Tree Penn ESE534 SpringZO u HEEL 5quot Pre x Tree Intermediates Share terms Share common terms O V Penn ESE534 SpringZO10 DeHon 5 3 Penn ESE534 SpringZO10 DeHon 44 Prefix Tree Share terms Reduce computes a spans i 03 45 39 Reverse tree 8 a D 3 Combine spans 0L 5 for missing Oi iltN Eg O345O 5 Same size as reduce tree Penn ESE534 Sprin92010 DeHon 5 5 Penn ESE534 SpringZO u HEW Sm Parallel Prefix Area and Delay Parallel 
Parallel prefix area and delay: roughly twice the reduce tree --
  Area(N) = 2N + 4N + 4N ~ 10N
  Delay(N) = 4·log2(N) + 2

Parallel prefix is an important pattern: it applies any time the operation is associative (or can be made associative, as in the MAJ/carry case -- function composition is always associative). Consider which familiar functions are associative and which are not.

Note: constants matter. Asymptotically this carry-lookahead adder (CLA) is great, but small adders can be smaller with a fast ripple carry. Larger combining than a 2-ary tree, and the right mix of techniques, will depend on the technology primitives and cost functions.

Two's complement: positive numbers in binary; for negative numbers, subtract 1 and invert -- or invert and add 1.
  2 = 010, 1 = 001, 0 = 000, -1 = 111, -2 = 110

Addition of negative numbers "just works":
  A=111, B=001 -> S=000
  A=110, B=001 -> S=111
  A=111, B=010 -> S=001

Subtraction: negate the subtracted input and use the adder -- that is, invert the input and add 1. This works for both positive and negative inputs, e.g., 001 - 010 -> 001 + 101 + 1 = 111 (i.e., 1 - 2 = -1).

Add/sub unit: note that you can use the otherwise-unused carry input at the LSB to perform the "add 1".

Overflow: Overflow = As·Bs·/Ss + /As·/Bs·Ss -- the operand signs agree but differ from the sum's sign.

Admin: office hours Wednesday 2pm.

Big Ideas [MSB Ideas]: we can build arithmetic out of logic. Associativity enables parallel prefix: we can perform addition in log time with linear area.
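The add/sub trick above (invert B, inject the "+1" through the LSB carry input) can be sketched as:

```python
# Sketch of the slide's add/sub unit on fixed-width two's complement values.
def addsub(a, b, sub, width=4):
    """Compute (a + b) or (a - b) modulo 2**width.
    In hardware the inversion is an XOR of b against the `sub` control,
    and the '+1' rides in on the otherwise-unused LSB carry input."""
    mask = (1 << width) - 1
    b_eff = (b ^ mask) if sub else b
    return (a + b_eff + (1 if sub else 0)) & mask
```

For example, 1 - 2 in 4 bits is 0b0001 + 0b1101 + 1 = 0b1111, the two's complement encoding of -1.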
ESE680-002/ESE534: Computer Organization
Day 7: January 31, 2007 -- Energy
Penn ESE680-002 Spring 2007 -- DeHon

Today: energy tradeoffs; voltage limits and leakage; thermodynamics meets information; energy and power; adiabatic switching. This is an ambitious lecture.

At issue: many now argue that power -- not lithography or costs -- will be the ultimate scaling limit. The proliferation of portable and handheld devices makes battery size and life the biggest issues, and cooling and energy costs may come to dominate the cost of electronics.

What can we do about it? E ~ (1/2)·C·V^2 per switched node, and I_d = (mu·C_ox/2)·(W/L)·(V_gs - V_TH)^2, so lowering V saves energy but slows the device. To first order, E x t^2 = constant: we can trade speed for energy. [Martin et al., Power-Aware Computing, Kluwer 2001; http://caltechcstr.library.caltech.edu/1308]

What do we do about the slowdown? Parallelism. We already have area-time tradeoffs; compensate for the slowdown with additional parallelism -- trade area for energy. Architectural option, ideal example: 1 nJ per 32b op at a 10 ns cycle. Cut the voltage in half: 0.25 nJ per 32b op at a 20 ns cycle. Run two units in parallel to complete 2 ops per 20 ns: the same throughput at a 75% energy (and power) reduction.

Power-density-constrained logic: suppose logic density of 1 foo-op/mm^2 (100 foo-ops per cm^2), energy cost 10 nJ/foo-op, top speed 10 GHz, and a cooling limit of 100 W/cm^2. The power budget supports 100 W / 10 nJ = 10^10 foo-ops/cm^2/s, so the 100 on-die foo-ops can only run at about 100 MHz -- far below their 10 GHz top speed. Since the power constraint won't let us run at 10 GHz anyway, we might as well lower the voltage and save energy. Pushing through the math: at roughly 5x lower voltage (about 25x less energy per op) the same 100 W/cm^2 supports about 2 GHz per unit -- 2 x 10^11 foo-ops/cm^2/s, a 20x throughput improvement.

How far can this go? Limits: the ability to turn off the transistor; noise; parameter variations; subthreshold conduction. To avoid leakage we want I_off very small, while I_on determines logic speed, so we want I_on/I_off large. Subthreshold swing: S >= 90 mV/decade for a single gate, ~70 mV/decade for a double gate. [Frank, IBM J. R&D v46 n2/3 p235]
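The voltage-scaling arithmetic above can be sketched with the slide's first-order model (E ~ V^2, delay ~ 1/V, so E·t^2 stays constant); the 1 nJ / 10 ns numbers are the slide's illustrative example:

```python
# Sketch of the E*t^2 = constant voltage/energy tradeoff.
def scale_voltage(energy, cycle, v_ratio):
    """First-order scaling: energy goes as V^2, delay as 1/V."""
    return energy * v_ratio ** 2, cycle / v_ratio

# Halve the supply: 1 nJ @ 10 ns  ->  0.25 nJ @ 20 ns.
# Two units in parallel restore throughput at 25% of the energy per op --
# the slide's 75% energy (and power) reduction.
e, t = scale_voltage(1.0, 10.0, 0.5)
```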
Four orders of magnitude of I_on/I_off at ~70 mV/decade therefore requires V_T > ~280 mV, bounding how far the supply voltage can scale. [Frank, IBM J. R&D v46 n2/3 p235; see also the ITRS 2005 high-performance and low-power tables.]

Thermodynamics: lower-bound numbers. kT·ln(2) = 2.87 x 10^-21 J at room temperature (T ~ 300 K). Reducing entropy costs energy: setting a single-bit gate output from an unknown previous value to 0 or 1 reduces the state space by a factor of 2, so Delta_S = k·ln(1/2), and the energy is T·|Delta_S| = kT·ln(2). Naively, setting a bit costs at least kT·ln(2).

Sanity check: at V = 0.5 V with CV^2 ~ 2.5 x 10^-18 J per event, Q = CV is only about 30 electrons per gate node (generously assuming no interconnect capacitance). With billions of transistors per chip and the 100 W/cm^2 cooling limit, the maximum operating frequency works out to a few GHz.

Recycling: thermodynamics only says we have to dissipate energy if we discard information. Can we compute without discarding information -- and can we use this?

Three reversible primitives: these primitives are universal, and all reversible. If we keep all the intermediates they produce, we discard no information and can run the computation in reverse. Cleaning up: we can erase the unwanted intermediates with a reverse circuit. So, in theory at least, thermodynamics does not demand that we dissipate any energy (power) in order to compute.
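The kT·ln(2) bound computed on the slide follows directly from the entropy argument above:

```python
# Sketch of the Landauer bound: erasing one bit costs at least k*T*ln(2).
import math

K_BOLTZMANN = 1.380649e-23  # J/K

def landauer_limit(temp_k=300.0):
    """Minimum energy to erase one bit at temperature temp_k."""
    return K_BOLTZMANN * temp_k * math.log(2)

# At room temperature this is ~2.87e-21 J, the number quoted on the slide.
```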
Adiabatic switching -- two observations about the conventional charge cycle: we dissipate power through the on-transistor while charging the capacitance (half the energy is dissipated in the pullup), and we discard the capacitor's charge at the end of the cycle. [Athas, Koller, Svensson (USC/ISI), 1993]

If instead a current source charges the node slowly (supply ramped over time T):
  E_total = I^2·R·T = (RC/T)·C·V^2
Without reducing V, we can trade energy for time: E x T ~ constant (this ignores leakage, and may require a large T).

The adiabatic discipline: never turn on a device with a large voltage differential across it (P = Delta_V^2 / R).

SCRL inverter [Younis & Knight, 1994]: nodes start at Vdd/2; slowly turn on the pass devices, slowly split the rails to compute the output adiabatically, isolate the output, then slowly bring the rails back together. The same basic idea works for a NAND gate. The logic value has been transferred to the output, but we still need to worry about resetting that output adiabatically.

SCRL cascade and pipeline: cascade like domino logic -- compute phase 1, then compute phase 2 from phase 1. How do we restore an output? We must uncompute the logic: forward gates compute outputs, and inverse gates (F^-1) driven by later phases adiabatically restore each intermediate to Vdd/2.

SCRL therefore requires reversible gates to uncompute every intermediate. All switching except I/O is adiabatic, so in principle we can compute at any energy.

Trickiness: generating the ramped clock rails -- LC circuits need high-Q resonators, and making this efficient is the key to a practical implementation (some claim it is not possible in practice).

Big Ideas: we can trade time for energy, and area for energy. Noise and subthreshold conduction limit voltage scaling. It is thermodynamically admissible to compute without dissipating energy; adiabatic switching is an alternative to voltage scaling, and CMOS logic can be based on these observations.
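The E = (RC/T)·C·V^2 result above can be sketched numerically; the R, C, and ramp-time values below are illustrative assumptions, not the lecture's:

```python
# Sketch of adiabatic vs. conventional charging energy for one node.
def adiabatic_energy(r_on, c_load, v, t_ramp):
    """Slide's result: ramping the supply over t_ramp dissipates
    E = (R*C / T) * C * V^2 in the on-resistance (leakage ignored)."""
    return (r_on * c_load / t_ramp) * c_load * v * v

def conventional_energy(c_load, v):
    """Conventional charge cycle: (1/2) * C * V^2 lost per event."""
    return 0.5 * c_load * v * v
```

Note the E x T ~ constant behavior: doubling the ramp time halves the dissipation, without lowering V.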
ESE680-002/ESE534: Computer Organization
Day 4: January 22, 2007 -- Memories
Penn ESE680-002 Spring 2007 -- DeHon

Last time: arithmetic (addition, subtraction); reuse -- pipelining, bit-serial operation, vectorization, shared datapath elements, FSMDs; area-time tradeoffs; latency and throughput.

Today: memory -- features, design, technology. What's a memory, and what's special about one?

Memory function: a typical interface has a data input bus (Din), a data output bus (Dout), an address (a location, or name), and R/W (read/write) control. A memory is a block for storing data for later retrieval -- a state element. What's different between a memory and the collections of registers we've been discussing?

Memory uniqueness: cost. A memory is a compact state element that packs data very tightly, at the expense of sequentializing access -- an example of an area-time tradeoff, and a key enabler.

Memory organization -- the key idea is sharing: factor out the common components among state elements. We can afford big shared elements if their cost is amortized, leaving the unique per-bit state element small.

Share the interconnect: input bus, output bus, and control routing, with very topology- and wire-cost-aware design (note the local abutment wiring). Input sharing covers wiring and drivers; output sharing covers wiring, sensing, and driving. Addressing and control are the overhead paid to allow this sharing.

Dynamic RAM goes a step further: it shares the refresh/restoration logic as well, and the minimal storage element is a capacitor. The defining feature of a DRAM process is the ability to make capacitors efficiently.

Some numbers (unit of area: lambda^2; more next time):
- register as a standalone element: ~4K lambda^2 (as needed/used in the last two lectures)
- static RAM cell (single-ported SRAM): ~1K lambda^2
- dynamic RAM cell, DRAM process: ~100 lambda^2
- dynamic RAM cell, SRAM process: ~300 lambda^2

[Figure: DRAM latency -- processor performance grows ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50%/yr. From Patterson et al., IRAM, http://iram.cs.berkeley.edu/]

Contemporary DRAM: a 1 GB DDR3 SDRAM from Micron -- 96-pin package, 16b datapath I/O, operating at 500 MHz with 37.5 ns random-access latency. [Configuration and key-timing tables not recoverable from this extraction.]

Memory access timing: RAS/CAS access is an optimization for accesses within a row.
b in tin liara quot r1quot Lari I31 ITH I IllQHIMIqlHllifg 511 lat7 39 quotHuang 5 SEE 15 IE 39 1i quot 39 39 39 niFF Hat Erinquoti it I M I5 Ht 39IrItuln39 1 55 39 it quot it Penn ESE680002 Spring2007 DeHon 17 f uProc 60lyr ProcessorMemory Performance Gap grows 50 I year 16 From Patterson et al lRAM httpIiramcsberkeleyedul Memory Access Timing HIE 1 lt um RASCAS aCCeSS Optimization for access within a row Penn ESE680 002 Spring2007 DeHon ESE534 Computer Organization Day 5 February 1 2010 Previously Arithmetic addition subtraction Reuse pipelining bitserial vectorization shared datapath elements FSMDs AreaTime Tradeoffs Latency and Throughput Penn ESE534 SpringZO39IO DeHon Preolass 1 When is zoomN N lt100N Penn ESE534 SpringZOlO DeHon Memories v P Penn ESESM Sprin92010 DeHon w Today Memory features areadelay intuition and modeling design technology Penn ESE534 Sprin92010 DeHon 3 P reclass 2 Find m to minimize C1N10g2N sz C3N m Penn ESE534 SpringZOlO DeHon Preolass 2 Related Am CNl N 1 0g2lt sz CSN m Penn ESE534 SpringZO39lO Del Ion Preclass 3 4 Delay Scale with bd Scale with bw Penn ESE534 SpringZOlO DeHon Aim 1 bank damn um Memory Typical Data Input Bus Data Output Bus Address location or name RW readwrite control Penn ESE534 SpringZOlO DeHon Din Dout Col Dout Penn ESE534 SpringZOlO De ers Memory What s a memory What s special about a memory Penn ESE534 Spnn92010 DeHon Memory Block for storing data for later retrieval State element What s different between a memory and a collection of registers like we ve been discussing Penn ESE534 Spnn92010 DeHon Memory Uniqueness Cost Compact state element Packs data very tightly At the expense of sequentializing access Example of AreaTime tradeoff and a key enabler Penn ESE534 Spn n92010 DeHon Memory Organization Key idea sharing factor out common components among state elements can have big elements if amortize costs state element unique 9 small Memorvmequot I Data Write Read Penn ESE534 Spring2010 DeHon SRAM Memory bit WL 
SRAM memory bit: the standard six-transistor cell -- cross-coupled inverters (M1-M4) holding Q and /Q, with access transistors (M5, M6) gated by the word line (WL) and connected to the bit lines (BL, /BL). [Source: http://commons.wikimedia.org/wiki/File:6t-SRAM-cell.png]

Memory organization: share the interconnect -- input bus, output bus, control routing -- with very topology- and wire-cost-aware design (note the local abutment wiring). Input sharing: wiring and drivers. Output sharing: wiring, sensing, and driving. Addressing and control are the overhead paid to allow this sharing.

Dynamic RAM goes a step further: it shares the refresh/restoration logic as well, and the minimal storage is a capacitor. The defining feature of a DRAM process is the ability to make capacitors efficiently.

Some numbers (unit of area: lambda^2; more next week):
- register as a standalone element: ~4K lambda^2 (as needed/used last lecture)
- static RAM cell (single-ported SRAM): ~1K lambda^2
- dynamic RAM cell, DRAM process: ~100 lambda^2
- dynamic RAM cell, SRAM process: ~300 lambda^2

[Figure: DRAM memory-cell layout -- the 6F^2 cell layout used by Micron Technology has active regions (used for transistor channels, etc.) formed diagonally; presented at the 2004 Symposium on VLSI Technology. Source: http://techon.nikkeibp.co.jp/article/HONSHI/20071219/144399/]

[Figure: DRAM latency -- processor performance ~60%/yr vs. DRAM ~7%/yr; the processor-memory performance gap grows ~50%/yr. From Patterson et al., IRAM, http://iram.cs.berkeley.edu/]

Contemporary DRAM: a 1 GB DDR3 SDRAM from Micron -- 96-pin package, 16b datapath I/O, configurations from 64 Meg x 16 through 256 Meg x 4 across 8 banks, operating at 500 MHz with 37.5 ns random-access latency. [Key timing-parameter table (tRCD, tRP, CL by speed grade) not recoverable from this extraction.]
- More configurations: 128 Meg x 8 (16 Meg x 8 x 8 banks), 256 Meg x 4
  (32 Meg x 4 x 8 banks); ~37.5 ns random access latency.
- Lead-free FBGA packages: x4/x8 in an 86-ball FBGA, x16 in a 96-ball
  FBGA (9mm x 15.5mm).
- Key timing parameters (cycle time by speed grade):
    DDR3-800   2.5 ns    CL 5-6   5-5-5:  tRCD/tRP/CL = 12.5 ns each
    DDR3-1066  1.875 ns  CL 7-8   7-7-7 (~13.1 ns) or 8-8-8 (15 ns)
    DDR3-1333  1.5 ns    CL 9-10  9-9-9 (13.5 ns) or 10-10-10 (15 ns)

[Figure: 1 Gigabit DDR2 SDRAM die photo.
Source: http://www.elpida.com/en/news/2004/11-18.html]

Contemporary DRAM: 4Gb DDR3 SDRAM (Micron)
- Operates at 800 MHz; ~50 ns random access latency.

Basic Memory Design Space
- Width, depth
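A quick way to internalize the 500 MHz / ~37.5 ns pair quoted for this part: the streaming bandwidth and the random-access bandwidth differ by well over an order of magnitude. A back-of-the-envelope sketch; the one-word-per-random-access model and the function names are illustrative, not from the notes:

```python
# Contrast streaming vs. random-access bandwidth for a x16 DDR3 part
# running a 500 MHz bus with ~37.5 ns random access latency.

def streaming_bw_bits_per_s(width_bits, bus_mhz, ddr=True):
    """Peak streaming bandwidth: width * transfers/second."""
    transfers = bus_mhz * 1e6 * (2 if ddr else 1)   # DDR: 2 beats/clock
    return width_bits * transfers

def random_bw_bits_per_s(width_bits, latency_ns):
    """Bandwidth if every access pays the full random-access latency."""
    return width_bits / (latency_ns * 1e-9)

stream = streaming_bw_bits_per_s(16, 500)   # 1.6e10 b/s = 16 Gb/s peak
random = random_bw_bits_per_s(16, 37.5)     # ~4.3e8 b/s = ~0.43 Gb/s
print(stream / random)                      # -> ~37.5x gap
```

This is the same observation the banking slides build on: only spatial reuse of the row/bank structure (streaming, bursts, multiple banks) recovers the raw array bandwidth.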
- Internal vs. external width
- Banking

What happens as we make the array deeper?
- The internal row stays wide while the external interface stays narrow
  (e.g., a 1K-bit internal row behind a 4- or 16-bit external port).

Banking
- Tile banks (memory blocks); each bank can be accessed independently.

Memory: Key Idea
- Memories hold state compactly.
- They do so by minimizing the per-bit storage and amortizing the rest of
  the structure (decode, sense, drive, control) across a large array.

Yesterday vs. Today: Memory Technology
What's changed?
- Capacity of a single chip.
- Integration of memory and logic: DRAM-and-logic processes, embedded
  memories.
- Room on chip for big memories -- and for many memories.
- We don't have to make a chip crossing to get to memory.

Important Technology Cost
- I/O between chips << I/O on chip: pad spacing; area vs. perimeter
  (perimeter grows as 4s while area grows as s^2); wiring technology.
- This is a BIG factor in multichip system designs.
- Memories are nice here: very efficient in I/O cost relative to internal
  area.

On-Chip vs. Off-Chip Bandwidth
- Use the Micron 1Gb DRAM as the example.
- Off-chip: assume the 8 Meg x 16 configuration -- a 16b interface.
- On-chip: assume 1024 x 1024b banks -- a 1024b row per bank access.

Costs Change
- The design space changes when the whole system goes on a single chip.
- Can afford wider busses, more banks, memory tailored to the
  application/architecture.
- Beware of old, stale answers: their cost model was different.

What is the Importance of Memory?
- Radical hypothesis: memory is simply a very efficient organization that
  allows us to store data compactly -- at least in the technologies we've
  seen to date.
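The on-chip vs. off-chip comparison above can be made concrete. A sketch; the equal-cycle-rate assumption and the helper names are mine, while the 4s-vs-s^2 scaling and the x16 / 8-bank-of-1024b numbers come from the slides:

```python
def perimeter_resources(side):
    """Pads live on the perimeter: scales as 4*s."""
    return 4 * side

def area_resources(side):
    """Wires and cells live in the area: scales as s^2."""
    return side * side

# Doubling the die side doubles the pads but quadruples the wiring,
# so on-chip bandwidth pulls away from pin bandwidth as chips grow.
assert area_resources(20) / area_resources(10) == 4
assert perimeter_resources(20) / perimeter_resources(10) == 2

offchip_bits_per_cycle = 16          # x16 pin interface
onchip_bits_per_cycle = 8 * 1024     # 8 banks, each a 1024b row
print(onchip_bits_per_cycle // offchip_bits_per_cycle)   # -> 512
```

At the same cycle rate, the internal arrays can deliver roughly 512x the bits per cycle that the pins can.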
- A great engineering trick to optimize resources.
- The alternative view treats memory itself as a primary; state is
  primary, but state need not mean "memory".

Admin
- Added a paper on modeling memories as supplemental reading for today.
- Reading for next Monday on the web: the classic paper on MOS scaling.
- Reading for next Wednesday in Blackboard.

Big Ideas [MSB]
- Memory: an efficient way to hold state.
- Resource sharing: a key trick to reduce area.
- Memories are a great example of resource sharing.

Big Ideas [MSB-1]
- Tradeoffs in memory organization.
- The cost of memory organization changes as we go to on-chip, embedded
  memories.

Preclass 2
- Area(N) ~ C1*sqrt(N)*log2(N) + C2*sqrt(N) + C3*N; when we solved for
  the optimal organization, m came out proportional to sqrt(N).
- If we approximate log2(N) as a constant, the overhead terms collapse
  and Area(N) ~ C4*sqrt(N) + C3*N.
- C3 is the area of a single memory cell; the log2 factor and C1 come
  from the gates in the decoder; C2 is for the sense amps at the bottom
  of each column and the drivers at the top.
- Here m = bw (the row width) and bw*bd = N, so the depth is bd = N/m.

Preclass 1
- Area = C1*(N/m)*log2(N/m) + C2*m + C3*N.

ESE680-002 / ESE534: Computer Organization
Day 23, April 9, 2007 -- Control
Penn ESE680-002 Spring 2007 -- DeHon

Previously
- Looked broadly at instruction effects.
- Looked at the structural components of computation: interconnect,
  compute, retiming.
- Looked at time-multiplexing (demo).

Today
- Control: data-dependent operations.
- Different forms: local control, instruction selection.
- Architectural issues.

Control
- Control is the point where the data affects the instruction stream
  (operation selection).
- Typical manifestations:
  - data-dependent branching: if (a op b) then opA else opB
  - data-dependent state transitions: on one input goto s0, else goto s1
  - data-dependent operation selection
- Viewpoint: we can have an instruction-stream sequence without control.
  A static, data-independent progression through a sequence of
  instructions is control-free: C0 -> C1 -> C2 -> C0 -> C1 -> C2 -> ...
- Similarly, an FSM with no data inputs is control-free (e.g., the Day 5
  non-branching datapath -- our first programmable architecture).
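A numerical sketch of the preclass-style area model, Area(N) ~ C1*sqrt(N)*log2(N) + C2*sqrt(N) + C3*N. The constants here are made-up placeholders; the point is only that the decoder and sense-amp overheads get amortized, so per-bit cost approaches the cell area C3 as the array grows:

```python
from math import log2, sqrt

def area(n, c1=10.0, c2=100.0, c3=1.0):
    """Rough N-bit array area: decoder + edge amps/drivers + cells.
    c1, c2, c3 are illustrative placeholders, not measured values."""
    return c1 * sqrt(n) * log2(n) + c2 * sqrt(n) + c3 * n

for n in (2 ** 10, 2 ** 16, 2 ** 20):
    print(n, area(n) / n)   # per-bit cost falls toward c3 as n grows
```

This is the quantitative form of "amortize the rest of the structure across a large array".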
Terminology (reminder)
- Primitive Instruction (pinst): the collection of bits which tell a
  bit-processing element what to do, including: select of the compute
  operation, input sources in space (interconnect), and input sources in
  time (retiming).
- Configuration Context: the collection of all bits (pinsts) which
  describe the machine's behavior on one cycle.

Why?
- Why do we need/want control? Is static interconnect sufficient? Static
  sequencing? Static datapath operations?

Back to Any Computation
- A design must handle all potential inputs (computing scenarios), which
  requires sufficient generality.
- However, the computation for any given input may be much smaller than
  the general case: instantaneous computation << potential computation.

Screwdriver Analogy
- Need the capability to handle slot-head and Phillips -- but only need
  one at a time.

Video Decoder Example
- Frame rate 33 ms:
    if (packet == FRAME)
      if (type == IFRAME)       -> IFRAME computation
      else if (type == BFRAME)  -> BFRAME computation

Packet Processing
- If IPV6 packet ... if IPV4 packet ... if VoIP packet ... if modem
  packet ...

Two Control Options
1. Local control: unify the choices -- build all options into the spatial
   compute structure and select the operation locally.
2. Instruction selection: provide a different instruction (sequence) for
   each option; selection occurs when we choose which instructions to
   issue.

FSM Example (local control)
- A small two-state (S0/S1) FSM (assert myAddr/Read, produce Dout):
  implemented as unified logic it takes ~4 LUTs and 2 LUT delays.
- Counting LUTs does not tell the cycle-by-cycle needs.

FSM Example (instruction selection)
- Split into one context per state: Context 0 implements S0 and Context 1
  implements S1; each context needs only ~3 LUTs and 1 LUT delay, since
  only the active state's logic must be resident.

Local vs. Instruction
- If we can decide early enough, and can afford to schedule/reload,
  instruction selection -> less computation.
- If the load is too expensive, local instruction is faster -- and maybe
  even less capacity (A*T).

How does this show up in a modern uP?

Slow Context Switch
- Xilinx: ms reconfiguration times. HSRA: us reconfiguration times --
  still 1000s of cycles.
- E.g., video decoder at 33 ms/frame:
    if (packet == FRAME)
      if (type == IFRAME)       -> load the IFRAME context
      else if (type == BFRAME)  -> load the BFRAME context

Local vs. Instruction: Optimization
- For a multicontext device (fast, single-cycle switch): factor the
  logic according to the available contexts, minimizing capacity * time.
  Basic components:
    AT(local)  ~ A(general) * T(general compute)
    AT(select) ~ A(select) * (T(config) + T(select compute))
  where A(select) is the case-specific compute area, A(general) the
  generalized compute area, and T(config) the configuration-load time
  consumed.
- T(config) ~ 0 only if we can overlap the load with the previous
  operation -- i.e., we know the case early and have sufficient bandwidth
  to background-load.
- For conventional (slow-reconfiguring) devices: factor only for gross
  differences and early binding time.

FSM Control Factoring
- The FSM is a canonical control structure that captures many of these
  properties; it serves to build intuition.
- Experiment: implement each FSM (a) with deep multicontext (instruction
  selection) and (b) as multilevel logic (unified, local control).

Full Partitioning Experiment
- Give each state its own context; optimize the logic in each state
  separately.
- Tools: mustang, espresso, sis, Chortle.
- Use one-hot encodings for the single-context case (smallest/fastest);
  dense encodings for multicontext (the context select needs dense).
- Assume we stay in a context for the number of LUT delays needed to
  evaluate the logic/next state; pick the delay from the worst case.
- Assume a single LUT delay for context selection (a savings of 1 LUT
  delay -> comparable time); count LUTs in the worst-case state.

[Table: single-context (one-hot) vs. context-per-state implementations
across MCNC FSM benchmarks: states, logic levels, 4-LUTs, and area.]
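The contrast being measured can be seen in miniature on a toy two-state FSM like the S0/S1 address-read example in these slides. State names, signal names, and outputs here are illustrative, not the benchmark FSMs:

```python
# Local control: ONE combinational function handles every state each
# cycle -- all branches of the behavior are resident at once.
def local_step(state, start, done):
    if state == "S0":
        return ("S1", "load_addr") if start else ("S0", "idle")
    else:  # S1
        return ("S0", "out") if done else ("S1", "read")

# Instruction selection: each state is its own "context" -- a small
# piece of logic loaded when we enter the state, so only the active
# state's logic is resident.
contexts = {
    "S0": lambda start, done: ("S1", "load_addr") if start else ("S0", "idle"),
    "S1": lambda start, done: ("S0", "out") if done else ("S1", "read"),
}

def context_step(state, start, done):
    return contexts[state](start, done)

# Both styles compute the same machine; they differ in what must be
# resident (area) and how selection happens (time).
assert local_step("S0", True, False) == context_step("S0", True, False)
```

The experiments that follow quantify exactly this tradeoff over the MCNC benchmark FSMs.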
[Table rows (bbara, bbsse, bbtas, beecount, cse, dk14-dk512, donfile,
ex1-ex6, keyb, mc, modulo12, planet, pma, s1-s1488, s208-s832, sand, sse,
styr, tbk) omitted. Versus one-hot single-context implementations, the
context-per-state versions average 0.64x the area and about 3 fewer logic
levels; versus dense single-context encodings they average 0.56x the area
and 1-2 fewer levels.]

Full Partitioning
- Full partitioning comes out better: ~40% less area.
- Note: the full partition may not be the optimal-area case. E.g., in the
  intro example there is no reduction in area or time beyond the
  2-context implementation; a 4-context full partition is just more area
  from the additional contexts.

Partitioning versus Contexts
[Figure: area (M-lambda^2) and logic levels vs. number of contexts for
the CSE benchmark; area is minimized at an intermediate context count.]

Heuristic
- Start with dense ("mustang") state encodings.
- Greedily pick the state bit which produces the least (greatest) area
  split, or the least (greatest) delay split.
- Repeat until we have the desired number of contexts.

Partition to a Fixed Number of Contexts
- Best area ratio by number of contexts, relative to the single-context
  dense-encoding baseline (1.00): the area-targeted partitions reach
  ~0.63x at 8 contexts and ~0.56x at 16, rising again to ~0.70 at 32 and
  above 1.0 by 64; the delay-targeted partitions show the same shape,
  bottoming out near 0.50x around 16 contexts.
- NB: a more realistic device has a fixed number of contexts.

Extend the Comparison to Memory
- Fully local -> compute with LUTs.
- Fully partitioned -> look up the logic context in memory, then compute
  the logic.
- How does this compare to fully memory? Simply look up the result in a
  table.
Memory-FSM Comparison
[Tables: for each benchmark FSM (states, inputs, outputs): minimum memory
area and address/data organization for a pure table-lookup
implementation, vs. the FPGA (single-context) and 8-context DPGA areas.
Small FSMs (e.g. bbtas, dk15, mc) are competitive as pure memories; for
the large FSMs (e.g. s820, s420, s510) the memory area blows up
exponentially with state+input bits while the FPGA/DPGA areas remain
modest.]

Memory-FSM Comparison: Notes
- The memory selected was optimally sized to each problem; in practice we
  do not get to pick the memory allocation/organization for each FSM.
- No interconnect was charged to the memory case.
- The memory operates in a single cycle, but the cycle slows with more
  inputs.
- Memory is smaller only below roughly 11 state+input bits.
- Memory size is not affected by CAD quality, unlike the FPGA/DPGA
  mappings.
Control Granularity
- What if we want to run multiple of these FSMs on the same component?
  Local vs. instruction control again.
- Consider two network data ports, each with states: idle, first-datum,
  receiving, closing. Data arrival is uncorrelated between the ports.

Local Control Multi-FSM
- Do not rely on instructions: each FSM is wired up independently.
- It is easy to have multiple FSMs (units of control).

Instruction Control
- If the FSMs advance orthogonally (really independent control), the
  context depth -> the product of the states for a full partition with a
  single controller (PC).
- We must create the product FSM, which may lead to state explosion:
  N FSMs with S states each -> S^N product states.
- This example: 4 states, 2 FSMs -> a 16-state composite FSM.

Architectural Questions
- How many pinsts per controller?
- Fixed or configurable assignment of controllers to pinsts?
- What level of granularity?
- Effects of too many controllers? too few? fixed assignment?
  configurable assignment?
  - Too many: wasted space on extra controllers; synchronization.
  - Too few: product state space and/or underused logic.
  - Fixed: underused logic if/when a region is too big.
  - Configurable: costs interconnect; slower distribution.

Control and FPGAs
- Local/single instruction (not relying on a controller) is a potential
  strength of the FPGA: it is easy to break up the capacity and deploy it
  to orthogonal tasks. How does a processor handle orthogonal tasks? How
  efficiently?
- Data-dependent selection is potentially fast with local control
  compared to a uP: we can examine many bits and perform a multiway
  branch / output generation in just a few LUT cycles, where a uP
  requires a sequence of operations.

Architecture Instruction Taxonomy
- Axes: control threads (PCs); pinsts per control thread;
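The S^N blow-up is easy to see concretely for the two-port example. A sketch; the state names follow the slide, the enumeration is mine:

```python
from itertools import product

states = ["idle", "first-datum", "receiving", "closing"]

# One controller per port: each FSM keeps its own 4 states.
# One shared sequencer (a single PC): it must track the cross product.
composite = list(product(states, repeat=2))
print(len(composite))                 # -> 16 composite states

assert len(composite) == len(states) ** 2
# and S**N in general: a third identical port already needs 64 states
assert len(list(product(states, repeat=3))) == 4 ** 3
```

This is why a single instruction stream over independent tasks either explodes in context depth or underuses the logic.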
  instruction depth; granularity.
- Examples: hardwired functional unit (no threads; e.g., an ECC/EDC unit
  or FP multiplier); FPGA (no PC, single resident instruction);
  reconfigurable ALUs; bitwise SIMD; traditional processors; vector
  processors; VLIW; DPGA/HSRA-style multicontext and MSIMD organizations.

Admin

Big Ideas [MSB]
- Control: where data affects the instruction stream (the operation).
- Two forms:
  - local control: all ops resident -> fast selection;
  - instruction selection: may allow us to reduce the instantaneous work
    requirement, but introduces issues of depth, granularity, and
    instruction load time.

Big Ideas [MSB-1]
- Intuition (from the canonical FSM case): a few contexts can reduce LUT
  requirements considerably -- factor out dissimilar logic; similar logic
  is more efficient under local control.
- Overall, a moderate number of contexts (~8) exploits both properties
  better than either extreme (single-context all-local-control, or full
  partition/memory), except for the smallest FSMs.

ESE680-002 / ESE534: Computer Organization
Day 20, March 28, 2007 -- Retiming 2: Structures and Balance

Last Time
- Saw how to formulate and automate retiming: start with the network;
  calculate the minimum achievable c (c = cycle delay / clock cycle);
  make the design c-slow if we want/need c = 1; calculate new register
  placements and move them.
- A systematic transformation for retiming; justified the mandatory
  registers in a design.

Today
- Retiming in the large; retiming requirements; retiming structures.

Retiming in the Large
- Align data / balance paths (recall the Day 3 pipelined adders).
- Serialization: greater serialization -> deeper retiming; the total is
  the same, but the amount per (now smaller) compute is larger.

Data Alignment
- For video / 2D processing we often work on local windows, retiming
  whole scan lines (edge detect, smoothing, motion estimation).
- Image processing: we see data in raster-scan order; adjacent horizontal
  bits are easy, but adjacent vertical bits are a scan line apart.

Retiming in the Large
- Aside from the local retiming for cycle optimization (last time), there
  are many intrinsic needs to retime data for correct use of the compute
  engine -- some very deep, often arising from serialization.
Reminder: Temporal Interconnect
- Retiming <-> temporal interconnect: one function of data memory is to
  perform retiming.

Requirements Not Unique
- Retiming requirements are not unique to the problem; they depend on the
  algorithm/implementation.
- Behavioral transformations can alter them significantly.

Requirements Example
- The same computation written two ways (e.g., two orderings of a
  "for i = 1 to N" loop nest over the inputs) can have very different
  retiming requirements: one formulation needs ~3N registers, the other
  only ~2.

Flop Experiment 1
- Pipeline/C-slow/retime to a single LUT delay per cycle; MCNC benchmarks
  up to 256 4-LUTs; no interconnect accounting.
- Result: average 1.7 registers/LUT; some circuits up to ~2.7.

Flop Experiment 2
- Pipeline and retime to the HSRA cycle (placed on the HSRA; a single LUT
  or interconnect crossing per timing domain); same MCNC benchmarks.
- Result: average 4.7 registers/LUT.

Value Reuse Profiles
- What is the distribution of retiming distances needed? The balance of
  retiming and compute; the fraction which need various depths -- like
  wire-length distributions.
[Figure 3.1: a value's life and its two uses; reuse distance.
Huang & Shen, Micro 1995.]

Interpreting VRPs
- The Huang and Shen data assume small parallelism. What happens if we
  exploit more? Values are reused more frequently and distances shorten.
- Recall serialization: greater serialization -> deeper retiming.
- Idea: retiming requirements may differ from task to task and may vary
  independently of compute requirements -- another balance issue to
  watch, like Rent's rule.
- A task implemented with a given amount of parallelism will have a
  distribution of retiming/compute/interconnect requirements; we may need
  a canonical way to measure it.

Retiming Structures
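A value reuse profile of the Huang-and-Shen sort can be computed from an operation trace: for each use of a value, record how many steps earlier it was produced. The trace format and helper below are illustrative, not the paper's:

```python
from collections import Counter

def reuse_profile(trace):
    """trace: list of (dest, [sources]) in issue order.
    Returns Counter mapping reuse distance -> number of uses."""
    last_seen = {}
    dist = Counter()
    for t, (dst, srcs) in enumerate(trace):
        for s in srcs:
            if s in last_seen:
                dist[t - last_seen[s]] += 1
        last_seen[dst] = t          # value (re)defined here
    return dist

# a = b+c; d = a+b; e = d+a   (defs and uses by variable name)
trace = [("a", ["b", "c"]), ("d", ["a", "b"]), ("e", ["d", "a"])]
print(reuse_profile(trace))     # -> Counter({1: 2, 2: 1})
```

The histogram is exactly the "fraction which need various depths" the slides ask about, and it shifts as the schedule (parallelism, serialization) changes.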
- Concerns: area (lambda^2/bit); throughput (bandwidth, bits/time);
  latency (important when we do not know when we will need the data item
  again).
- How do we implement programmable retiming?

Just Logic Blocks
- Most primitive: build a flip-flop out of logic blocks
  (D + clock -> I; I + clock -> Q).
- Area: 2 LUTs (800K -> 1M lambda^2 per LUT). Bandwidth: 1b/cycle.

Optional Output
- A real flip-flop, optionally selected on the output: flip-flop ~4-5K
  lambda^2, switch to select ~5K lambda^2.
- Area: 1 LUT (800K -> 1M lambda^2). Bandwidth: 1b/cycle.

Separate Flip-Flops
- A network flip-flop with its own interconnect: can deploy it where
  needed, but it requires more interconnect.
- Vary the LUT/FF ratio -- an architecture parameter. Assume routing
  proportional to inputs, 1/4 the size of a LUT.
- Area: ~200K lambda^2 each. Bandwidth: 1b/cycle.

Deeper
- Implication: we don't need the result on every cycle; when the number
  of registers > bits we need to see each cycle, lower bandwidth is
  acceptable -> less interconnect.
- The interconnect per flip-flop is what is expensive. How do we avoid
  it?

Deeper Retiming: Output/Input Chains
- Register chains of depth d behind a single output: ~10K lambda^2 per
  register if we don't need the other timings of the signal; on a 4-LUT,
  ~30-40K lambda^2 per unit of depth.
- Multiple outputs: no more interconnect than the unretimed case; compare
  the savings to the additional register cost.
- Area: 1 LUT (~1M lambda^2) + ~40K lambda^2 per register of depth.
  Bandwidth: 1b/cycle, with only 1/depth of the stored bits visible per
  cycle.

HSRA Input Retiming
- Inputs arrive from the network into a retiming chain ahead of the LUT
  inputs.

Input Depth Optimization
- A real design has a fixed input retiming depth: truncate deeper
  requirements and allocate additional logic blocks.
[Table: extra blocks required with limited input depth, average and worst
case per benchmark.]
- Can use one BLB as a retiming-only chain.
- With chained dual output:
[Table: average and worst-case extra blocks per benchmark.]

HSRA Architecture: Register File
- From MIPS-X: ~1K lambda^2/bit + ~500 lambda^2/port.
- Area(RF) ~ (d+6)(w+6)(1K lambda^2 + ports * 500 lambda^2).
- For w, d >> 6 and 2 ports: ~2K lambda^2/bit; for w = 1, d >> 6 and 4
  ports: ~3.5K lambda^2/bit -- comparable to the input chain, and more
  efficient for wide-word cases.

Xilinx CLB
- The Xilinx 4K CLB used as memory works like an RF.
- Area: 1/2 CLB (~640K lambda^2) for 16 bits (~40K lambda^2/bit), but
  needs ~4 CLBs to control it.
- Bandwidth: 1b per 2 cycles per 1/2 CLB; 1/16th of the capacity visible.

Virtex SRL16
- Use the 4-LUT as a 16b shift register; does not need CLBs to control.
- Area: ~1M lambda^2 / 16 = ~60K lambda^2/bit; bandwidth 1b per 2 cycles;
  1/16th of the capacity visible.

Disk Drive
- Not MOS, so no lambda^2 figure; bandwidth ~150MB/s; cheaper per bit
  than DRAM/Flash.

Memory Blocks
- SRAM bit: ~1200 lambda^2 (large arrays); DRAM bit: ~100 lambda^2 (large
  arrays). For a 1 ns array cycle: 1b/cycle = 1-2 Gb/s per data pin.
- Bandwidth: W bits per 2 cycles (usually a single read/write port) -- a
  small fraction of the capacity per cycle.

Hierarchy/Structure Summary
- The memory hierarchy arises from area-bandwidth tradeoffs.
- Smaller/cheaper to store words/blocks (saves routing and control);
  smaller/cheaper to handle long retiming in larger arrays (reduces
  interconnect); high bandwidth comes out of registers/shallow memories.
- Roughly (lambda^2/bit, and bandwidth as a fraction of capacity):
    DRAM 100 (1/10^7); SRAM 1200 (1/10^5); RF bit 2K (1/10^3);
    FF/RF 5K (1/100); chain 40K (1/100); SRL 40K (1/16);
    input FF 75K (1/4); net FF 200K (1/1); FF = LUT 800K (1/1).

Modern FPGAs
- Output flop (depth 1); use a LUT as a shift register (16 deep);
  embedded RAMs (~16Kb); interface to off-chip DRAM (0.1-1Gb).
- No retiming in the interconnect yet.

Modern Processors
- DSPs have an accumulator (depth 1); interstage pipelines (depth 1);
  lots of pipelining in the memory path.
- Reorder buffer: 4-32; architected RF: 16-128; actual RF: 256-512;
  L1 cache: ~64Kb; L2: ~1Mb; L3: 10-100Mb; main memory (DRAM): 10-100Gb.

Big Ideas [MSB]
- Tasks have a wide variety of retiming distances (depths).
- Retiming requirements are affected by the high-level
  decisions/strategy in solving the task.
- There is a wide variety of retiming costs: 100 lambda^2 -> 1M lambda^2.
- Routing and I/O bandwidth are big factors in those costs.
- This gives rise to the memory (retiming) hierarchy.

ESE680-002: Computer Organization
Day 3, January 17, 2007 -- Arithmetic and Pipelining

Last Time
- Boolean logic: computing any finite function.
- Sequential logic: computing any finite automaton -- including some
  functions of unbounded size.
- Saw gates and registers, and a few properties of logic.

Why Today?
- Start getting a handle on: addition (complexity, organization, design
  space -- area, time); pipelining; temporal reuse and area-time
  tradeoffs; parallelism; regularity.
- Arithmetic underlies much computation and grounds out complexity.

Example: Bit-Level Addition
- Everyone knows how to do addition in base 2, right?
    C: 11011010000
    A: 01101101010
    B: 01100101100
    S: 11110010110

Addition, Base 2
- A = a(n-1)*2^(n-1) + a(n-2)*2^(n-2) + ... + a1*2 + a0 = sum(a(i)*2^i);
  S = A + B.
- What is the function for each bit?
    s(i)       = xor(a(i), b(i), carry(i))
    carry(i+1) = 1 iff at least two of {a(i), b(i), carry(i)} are 1
               = a(i)b(i) + a(i)carry(i) + b(i)carry(i)

Adder Bit / Ripple-Carry Addition
- Shown is the operation of each bit; it is often convenient to define
  the logic for one bit and then assemble the bit slices:
    t = xor2(a, b);  s(i) = xor2(t, carry(i));
    carry(i+1) = a*b + t*carry(i)   (built from nand2 gates).

Ripple-Carry Analysis
- What are the area and delay? Area: O(N) (~6n gates).
  Delay: O(N) (~3n gate delays).

Can We Do Better?
- Important observation: do we have to wait for the carry to show up to
  begin doing useful work? We do have to know the carry to get the right
  answer -- but it can only take on two values.
- Idea: compute both possible values and select the correct result when
  we know the answer.
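The bit-slice equations above can be exercised directly. A sketch, LSB-first, using exactly the t = xor2(a,b), carry-out = ab + t*carry formulation; the helper names are mine:

```python
def ripple_add(a_bits, b_bits, carry=0):
    """LSB-first bit lists; returns (sum_bits, carry_out)."""
    s = []
    for a, b in zip(a_bits, b_bits):
        t = a ^ b                      # the xor2(a, b) of the bit slice
        s.append(t ^ carry)            # s = xor2(t, carry)
        carry = (a & b) | (t & carry)  # carry-out = ab + t*carry
    return s, carry

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, c = ripple_add(to_bits(13, 4), to_bits(10, 4))
print(from_bits(s) + (c << 4))   # -> 23
```

The loop body is the full adder; chaining the loop iterations is the O(n)-delay ripple the analysis above counts.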
Preliminary Analysis
- Delay(RA) = delay of the ripple adder: Delay(RA, n) = k*n, i.e.,
  Delay(RA, n) = 2*k*(n/2) = 2*Delay(RA, n/2).
- Delay(P2A) = delay of the "predictive" (carry-select) adder:
  Delay(P2A, n) = Delay(RA, n/2) + Delay(mux2) -- almost half the delay.

Recurse
- If something works once, do it again: use the predictive adder to
  implement the first half of the addition:
  Delay(P4A, n) = Delay(RA, n/4) + 2*Delay(mux2).
- By now we realize we've been using the wrong recursion -- we should be
  using the predictive adder inside the recursion:
  Delay(PA, n) = Delay(PA, n/2) + Delay(mux2).
- Every step cuts the width in half, and we can cut in half log2(n)
  times, so Delay(PA, n) = log2(n)*Delay(mux2) + C.

Another Way: Parallel Prefix
- Think about each adder bit as computing a function of its carry-in:
  carry(i+1) = g(i,i)(carry(i)); which particular function depends on
  a(i), b(i).

Functions
- What functions can g(i,i) be?
  - Generate (G): carry out regardless of carry in (a = b = 1): g(x) = 1.
  - Propagate (P): carry out equals carry in (a xor b): g(x) = x.
  - Squash (S): no carry out (a = b = 0): g(x) = 0.

Combining
- Compute the composition of adjacent functions: what functions will the
  composition of two of these be?

Compose Rules (LSB symbol, then MSB symbol)
    G,G -> G    S,G -> G    P,G -> G
    G,P -> G    S,P -> S    P,P -> P
    G,S -> S    S,S -> S    P,S -> S
- I.e., the MSB-side symbol wins, except that P passes the LSB-side
  symbol through.
- Combining again: combine g(i+3,i+2) and g(i+1,i) -- a reduce tree, then
  a prefix tree.

Parallel Prefix: Important Pattern
- Applicable any time the operation is associative. (Examples of
  associative functions? Non-associative ones?) Function composition is
  always associative.

Note: Constants Matter
- Watch the constants. Asymptotically this carry-lookahead adder (CLA) is
  great; small adders can be smaller with fast ripple carry.
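The compose rules above are small enough to execute. A sketch; the symbol encoding and helper names are mine, and the sequential scan stands in for the log-depth tree (associativity, checked below, is what licenses the tree):

```python
def combine(lsb, msb):
    """Compose two carry functions: MSB wins unless it is P (pass-through)."""
    return lsb if msb == "P" else msb

def bit_symbol(a, b):
    """Per-bit carry function: Generate, Propagate, or Squash."""
    if a and b:
        return "G"
    if a or b:
        return "P"
    return "S"

def carries(a_bits, b_bits, cin=0):
    """Carry into each position, via a prefix scan (LSB first)."""
    acc = "G" if cin else "S"          # carry-in folded in as a function
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(1 if acc == "G" else 0)
        acc = combine(acc, bit_symbol(a, b))
    return out

# associativity -- the property that permits a log2(n)-depth tree
syms = "GPS"
assert all(combine(combine(x, y), z) == combine(x, combine(y, z))
           for x in syms for y in syms for z in syms)

print(carries([1, 1, 0], [1, 0, 0]))   # 3 + 1, LSB first -> [0, 1, 1]
```

Replacing the scan with a balanced reduce/prefix tree over `combine` gives the CLA's log-depth carry network.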
- Larger combinings than a 2-ary tree are also possible; the right mix of
  techniques will depend on the technology primitives and cost.

Two's Complement
- Everyone seemed to know two's complement: positive numbers in binary;
  to negate, invert the bits and add 1.
- In 3 bits: 2 = 010, 1 = 001, 0 = 000, -1 = 111, -2 = 110.

Addition of Negative Numbers Just Works
    A: 111 (-1)   A: 110 (-2)   A: 111 (-1)   A: 111 (-1)
    B: 001 ( 1)   B: 001 ( 1)   B: 010 ( 2)   B: 110 (-2)
    S: 000 ( 0)   S: 111 (-1)   S: 001 ( 1)   S: 101 (-3)

Subtraction
- Negate the subtracted input and use the adder -- i.e., invert the input
  and add 1. Works for both positive and negative inputs.
- Add/sub: note you can use the otherwise-unused carry input at the LSB
  to perform the "add 1".

Overflow
- Examples: 001 + 001 = 010 (ok); 011 + 001 = 100 (overflow);
  111 + 100 = 011 (overflow).
- Overflow = As*Bs*/Ss + /As*/Bs*Ss: the operand sign bits agree but
  differ from the sum's sign bit.

Reuse
- In general we want to reuse our components in time -- not build
  disposable logic. How do we do that? Wait until the component is done
  being used, then present it new inputs.

Example: 4b Ripple Adder
- Recall 1 gate delay per FA. Latency: 4 gate delays to S3.
  Throughput: 1 result per 4 gate delays.

Stagger Inputs
- Correct if we expect A,B[3:2] to be staggered one cycle behind
  A,B[1:0], and the succeeding stage expects S[3:2] staggered from
  S[1:0].

Reuse: Waiting Discipline
- Use registers and timing (or acknowledgements) for an orderly
  progression of data.

Can We Do Better? Align Data / Balance Paths
- Discipline: line up the pipe stages in the diagrams.

Example: 4b Ripple Adder, Pipelined by 2
- Recall 1 gate delay per FA.
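Back on the two's-complement slides: the invert-and-add-1 subtraction, with the "+1" fed in on the otherwise-unused LSB carry input, can be sketched directly. 3-bit width to match the slide's examples; all helper names are mine, and plain integer arithmetic stands in for the gate-level adder:

```python
BITS = 3

def add(a, b, cin=0):
    """n-bit add, discarding bits above the width."""
    return (a + b + cin) & ((1 << BITS) - 1)

def sub(a, b):
    inv_b = b ^ ((1 << BITS) - 1)    # invert the subtracted input...
    return add(a, inv_b, cin=1)      # ...and add 1 via the LSB carry-in

def signed(v):
    """Interpret an n-bit pattern as two's complement."""
    return v - (1 << BITS) if v & (1 << (BITS - 1)) else v

assert signed(sub(0b001, 0b010)) == -1     # 1 - 2 = -1
assert signed(sub(0b111, 0b110)) == 1      # (-1) - (-2) = 1

def overflow(a, b, s):
    """Overflow = As*Bs*/Ss + /As*/Bs*Ss (sign bits)."""
    sa = (a >> BITS - 1) & 1
    sb = (b >> BITS - 1) & 1
    ss = (s >> BITS - 1) & 1
    return (sa & sb & (1 - ss)) | ((1 - sa) & (1 - sb) & ss)

# 011 + 001 = 100: 3 + 1 overflows in 3-bit two's complement
assert overflow(0b011, 0b001, add(0b011, 0b001)) == 1
```

The same `sub` works for any mix of signs, which is the whole point of the representation.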
- Latency: 4 gate delays to S. Throughput: 1 result per 2 gate delays.

Deeper?
- Can we do it again? What's our limit? Why would we stop?

More Reuse
- We saw we could pipeline and reuse the FA more frequently -- which
  suggests we're wasting the FAs part of the time in the non-pipelined
  version.
- If we're willing to take 4 gate-delay units per result, do we need 4
  FAs? If we don't need the throughput, reuse one FA on the SAME
  addition.

Bit-Serial Addition
- Assumes an LSB-first ordering of the input data.
- Latency: 4 gate delays. Throughput: 1 result per 5 gate delays -- and
  we can squash Cout3 and do 1 result per 4 gate delays.
- Registers do have time overhead: setup/hold time, clock jitter.

Multiplication
- Can be defined in terms of addition; homework 2 (out today; pick it up
  from the syllabus page on the web) asks you to play with
  implementations and tradeoffs.

Compute Function
- Compute y = A*x^2 + B*x + C. Assume D(Mpy) > D(Add) and
  A(Mpy) > A(Add). A(Quad) = 3*A(Mpy) + 2*A(Add).

Spatial Quadratic
- D(Quad) = 2*D(Mpy) + D(Add); throughput 1/(2*D(Mpy) + D(Add)).

Pipelined Spatial Quadratic
- D(Quad) = 3*D(Mpy); throughput 1/D(Mpy);
  A(Quad) = 3*A(Mpy) + 2*A(Add) + 6*A(Reg).

Bit-Serial Quadratic
- Data width w; one bit per cycle: roughly 1/w-th the area of the
  pipelined spatial version and roughly 1/w-th the throughput; latency
  just a little larger than pipelined.

Quadratic with a Single Multiplier and Adder
- We've seen reuse to perform the SAME operation (pipelining, bit-serial:
  a homogeneous datapath). We can also reuse a resource in time to
  perform a DIFFERENT role -- here: x*x, A*(x*x), B*x, and (B*x)+C,
  A*x*x + (B*x + C).

Quadratic Datapath
- Start with one of each operation (alternatives where we build the
  multiply from adds -- see the homework).
- The multiplier serves multiple roles: x*x, A*(x*x), B*x; its inputs
  will need to select among x, x*x, A, and B.
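The bit-serial adder described above -- one full adder reused across cycles, with the carry held in a single one-bit register -- in executable form. LSB first, per the slide; the generator framing is mine:

```python
def bit_serial_add(a_bits, b_bits):
    """One sum bit per 'cycle'; the single variable `carry` plays the
    role of the carry flip-flop between cycles."""
    carry = 0
    for a, b in zip(a_bits, b_bits):
        yield a ^ b ^ carry                          # this cycle's sum bit
        carry = (a & b) | (a & carry) | (b & carry)  # latched for next cycle

# 13 + 2 = 15, LSB first
print(list(bit_serial_add([1, 0, 1, 1], [0, 1, 0, 0])))   # -> [1, 1, 1, 1]
```

One physical full adder, w cycles per result: the same area-time dial the bit-serial quadratic turns.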
Quadratic Datapath
The adder also serves multiple roles:
  B*x + C
  A*x*x + (B*x + C)
One adder input always comes from the multiplier output; the other is selected from C or the BxC register.

Quadratic Datapath
Add an input register for x, plus x*x, Y, and BxC registers (A, B, C are constants).
[Figure: shared multiplier/adder datapath with x, x*x, Y, and BxC registers]

Quadratic Control
Now we just need to control the datapath (FSMD). What control?
  LD_x, LD_x*x, LD_Y, LD_BxC
  MA select, MB select, AB select

FSMD
FSMD = FSM + Datapath: a stylization (a pattern) for building controlled datapaths such as this. Of course, an FSMD is just an FSM; it's often easier to think about as a datapath. Synthesis and place-and-route tools have been notoriously bad about discovering and exploiting datapath structure.

Quadratic FSMD
[Figure: the quadratic datapath with its controlling FSM]

Quadratic FSMD Control
S0: if (go) LD_x, goto S1; else goto S0
S1: MASEL = x, MBSEL = x, LD_x*x, goto S2
S2: MASEL = x, MBSEL = B, goto S3
S3: ABSEL = C, MASEL = x*x, MBSEL = A, goto S4
S4: ABSEL = BxC, LD_Y, goto S0
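As a functional sanity check on the schedule above, here is a behavioral model of the shared-multiplier/adder quadratic. The comments map each step to the states in the notes, but this is only a sketch: register loads and multiplexer timing are abstracted away, and the state-to-step mapping is my reading of the control sequence:

```python
def quadratic_fsmd(x, A, B, C):
    """Behavioral walk through the FSMD schedule: at most one multiply
    and one add are 'in flight' per step, modeling a single shared
    multiplier and a single shared adder."""
    xx = x * x          # S1: MASEL=x, MBSEL=x; result latched into x*x reg
    bxc = B * x + C     # S2-S3: MASEL=x, MBSEL=B; ABSEL=C; latched into BxC reg
    y = A * xx + bxc    # S3-S4: MASEL=x*x, MBSEL=A; ABSEL=BxC; latched into Y
    return y
```

Running it confirms the datapath computes y = A*x^2 + B*x + C while every multiply shares one multiplier and every add shares one adder.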

