New User Special Price Expires in

Let's log you in.

Sign in with Facebook


Don't have a StudySoup account? Create one here!


Create a StudySoup account

Be part of our community, it's free to join!

Sign up with Facebook


Create your account
By creating an account you agree to StudySoup's terms and conditions and privacy policy

Already have a StudySoup account? Login here


by: Ms. Jerad Bernhard


Ms. Jerad Bernhard
Texas A&M
GPA 3.75


Almost Ready


These notes were just uploaded, and will be ready to view shortly.

Purchase these notes here, or revisit this page.

Either way, we'll remind you when they're ready :)

Preview These Notes for FREE

Get a free preview of these Notes, just enter your email below.

Unlock Preview
Unlock Preview

Preview these materials now for free

Why put in your email? Get access to more of this material and other relevant free materials for your school

View Preview

About this Document

Class Notes
25 ?




Popular in Course

Popular in ComputerScienence

This 99 page Class Notes was uploaded by Ms. Jerad Bernhard on Wednesday October 21, 2015. The Class Notes belongs to CPSC 489 at Texas A&M University taught by Staff in Fall. Since its upload, it has received 9 views. For similar materials see /class/226084/cpsc-489-texas-a-m-university in ComputerScienence at Texas A&M University.

Similar to CPSC 489 at Texas A&M




Report this Material


What is Karma?


Karma is the currency of StudySoup.

You can buy or earn more Karma at anytime and redeem it for class notes, study guides, flashcards, and more!

Date Created: 10/21/15
Data ow Networks Data ow Networks Syntax and Semantics 7 actor tokens and ring Static scheduling Other data ow models Data ow networks I Powerful formalism for datadominated system specification I Partiallyordered model no overspeci cation I Deterministic execution independent of scheduling I Used for o simulation 0 scheduling 0 memory allocation 0 code generation for Digital Signal Processors HW and SW slide ASVUCB Data ow network I A data ow networkis a collection of functional nodes which are connected and communicate over unbounded FIFO queues I Nodes are commonly called actors I The bits of information that are communicated over the queues are comlnonly called tokens slide ASVUCB Intuitive Semantics I UnbOunded FIFOs perform conununicatidn via sequences of tokens carrying values 0 integer float fixed point 0 matrix of integer oat xed point 0 image ofpixelsk I State implemented as selfloop I Determinacy 0 unique output sequences given unique input sequences 0 Suf cient condition blocking read process cannot test input queues for emptiness slide ASVUCB Intuitive semantics I At eachrtime one actor is red I When firing actors consume input tokens and produce output tokens I Actors can be fired only if there are enough tokens in the input queues slide ASVUCB Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n c1 in c2 inl slide ASVUCB Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB 9 Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB 10 Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB 11 Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB 12 Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB 13 Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB 14 Intuitive semantics I Exalnple FIR filter 0 single input sequence in 0 single output sequence 0n o 0n Cl in c2 inl slide ASVUCB Questions I Does the order in which actors are red affect the nal result I Does it affect the operation of the network in any way I Go to Radio Shack and ask for an unbounded queue slide ASVUCB Formal semantics sequences I Actors operate from a sequence of input tokens to a sequence of output tokens I Let tokens be noted by x1 x2 x3 etc I A sequence of tokens is defined as X x1 x2 x3 I Over the execution of the network each queue will grow a particular sequence of tokens I In general we consider the actors mathematically as functions from sequences to sequences not from tokens to tokens slide ASVUCB l7 Ordering of sequences I Let X1 and X2 be two sequences of tokens I We say that X1 is less than X2 if and only if by definition X1 is an initial segment of X I Homework prove that the relation so defined is a partial order re exive antisynunetric and transitive I This is also called the prefix order I Exmnplm X13 X2 lt X13 X2 X3 I Exalnple x1 x2 and x1 x3 x4 are incomparable slide ASVUCB 18 Chains of sequences Consider the set S of all finite and infinite sequences of tokens This set is partially ordered by the pre x order A subset C of S is called a chain iff all pairs of elements of C are comparable If C is a chain then it must be a linear order inside S hence the name chain ExaHlple X1 X13 X2 X15 X2 X3 lis a Chain Exalnple x1 x1 x2x1 x3 is not a chain slide ASVUCB 19 Least Upper Bound Given a subset Y of S an upper bound of Y is an element z of S such that z is larger than all elements of Consider now the set Z subset of S of all the upper bounds of Y If Z has a least element u then u is called the least upper bound lub of Y The least upper bound if it exists is unique Note u might not be in Y if it is then it is the largest value of Y slide ASVUCB 20 Complete Partial Order I Every chain in S has a least upper bound I Because of this property S is called a Complete Partial Order I Notation if C is a chain we indicate the least upper bound of C by lub C I Note the least upper bound may be thought of as the limit of the the chain slide ASVUCB 21 Processes I Process function from a ptuple of sequences to a q tuple of sequences F SP gt 8 1 I Tuples havekthe induced pointwise order Y y1 5y Y y l ry P in SP Ylt Y iff yilty i foralll ltiltp I Given a chain C in SP F C may or may not be a chain in S 1 I We are interested in conditions that make that true slide ASVUCB 22 Continuity and Monotonicity Continuity F is continuous iff by definition for all chains C lub F C exists and F lubC lub F C I Similar to continuity in analysis using limits I Monotonicity F is monotoniciff by definition for all pairs X X X ltX gtFXltFX I Continuity implies monotonicity o intuitively outputs cannot be withdrawn once they have been produced 0 timeless causality F transforms chains into chains slide ASVUCB 23 Summary of function class 0 Summary of function class and their relationship for the function F Sp gt Sq Sequential 2 continuous 2 monotonic Mahapatra Texas AampM FallOO 24 N onmonotonic processes I Canonical nonmonotonic process fair merge xv x2 x3 Xv Yia xza ha nyy Yb 3 2 Y3 xv x2 Xv Y1 X2 yza Y3 Yb 3 2 Y3 l xv xza x3 X u Yb xza Yza X3 l In yzl slide ASVUCB 25 Nonmonotonic processes I Inithe previous example we have 78912 ypprynl lt X1X2X3 y1y7y3 but x1 yi X2 yz x3 y3 and x1 y1X2 y2 y3 are incomparable I The process is notmonotonic needs prediction of the future to be really fair slide ASVUCB 26 From Kahn networks to Data ow networks I Each process becomes an actor set of pairs of o ring rule number of required tokens on inputs 0 function including number of consumed and produced tokens I Formally shown to be equivalent but actors with firing are more intuitive I Mutually exclusive firing rules imply monotonicity I Generally simplified to blocking read slide ASVUCB 27 Examples of Data ow actors I SDF Synchronous or better Static Data ow n fixed input and output tokens 1 x l l 1024 1024 10 1 I BDF Boolean Data ow gt control token determines consumed and produced tokens F T F 5 slide ASVUCB 28 Static scheduling of BF I Key property ofDF networks output sequences do not depend on time of ring of actors I SDF networks can be statically scheduled at compile time execute an actbr39when it is known to be fireable o no overhead due to sequencing of concurrency 0 static buffer sizing I Different schedules yield different 0 code size 0 buffer size slide ASVUCB Static scheduling of SDF I Based only on process graph ignores functionality I Network state number of tokens in F IF Os I Objectiye find schedule that is valid ie o admissible only fires actors when fireable o periodic brings network back to initial state by firing each actor at least once I Optimize cost function over admissible schedules slide ASVUCB Balance equations I Number of produced tokens must equal number of consumedtokens39 on every edge I Repetitions or ring vector vS of schedule S number of firings of each actor in S I VSA n V503 110 must be satis ed for each edge slide ASVUCB 31 Balance equatiOns Balance for each edge 0 3 VSA VSB 0 V503 VsC 0 o 2 VSA VSC 0 o 2 VSA VSC 0 slide ASVUCB Balance equations 3 l 0 M 0 l l 2 0 l 2 0 l I M vS 0 iff S is periodic I Full rank as in this case n no nonzero solution gt no periodic schedule slide ASVUCB 33 Tagged Token DFM Arvind and Gostelow 80 s 0 Each token has a tag firing is enabled when tokens have matching tags 7 No need of FIFO discipline in the channel MahapatraTexas AXLMFall DO 34 Boolean Data ow Actors have Boolean control ports which may control input and output ports All other ports obey SDF semantics EX Switch dcmultiplexing Select MuX Boolean control ports are read rst and then other input ports 0 Switch and select can be used to datadependent iteration Allows conditional ow of data MahapatraTexas AXLMFall DO 35 Exa mple of general DF I iMergestf anis tofr multiples of and 3 in Order removing duplicates I Deterministic merge no peeking slide ASVUCB 36 Summary of BF networks I Advantages 0 Easy to use graphical languages 0 Powerful algorithm for n veri cation fast behavioral simulation n synthesis scheduling and allocation 0 Explicit concurrency I Disadvantages 0 Efficient synthesis only for restricted models gt no input or output choice 0 Cannot describe reactive control blocking read slide ASVUCB cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498 MahapatraTexas AampMFallDO Introduction What is cosynthesis 7 It is a problem to synthesize the hardware software and interface for nal implementation Based on a target architecture we consider the development of architecture and present here This step comes after partitioning the nodes and selecting the application beans and obtaining a suitable schedule MahapatraTexas AampMFallDO 2 Architecture Model 0 Target Architecture 4 4 4D 4 4p u u 39 u u 4gt System System Inputs outputs MahapatraTexas AampMFallDO Architecture Model Single programmable processor and multiple hardware modules connectedthrough a single system bus Hardware Module datapath and controller input and output interface IO interfaces between hardware modules and processor 7 Each node mapped to hardware is synthesized as hardware module no hardware reuse between module Processor 7 software component of the architecture the program that runs here Includes device drivers that communicate between hw amp sw 7 Processor is nonpipelined architecture MahapatraTexas AampMFallDO Architecture Model Communication between hardware and software 7 Memory mapped asynchronous and blocked 7 Different type of communication are possible between the components Component Configurations Globally asynchronous and locally synchronous Reasoning schedule generated by partitioning is based on execution time estimates and is not guaranteed to be cycleaccurate One can not determine when processor or hardware module would clock Controller responsible to activate the hardware modules based on the schedule due to partition MahapatraTexas AampMFallDO 5 Hardware Module Architecture Hardware Module A k A A in Out oiEl in EU Kemel on El 0 i in Datapath nutK IEk an 0Ek Controller 1 r It ll V V L i L 7 system In39plitinte39rface Outputmterface Bus MahapatraTexas AampMFallDO Hardware Module Architecture Handshaking Signals ready completion Latch controlled by input enable IE and output enable OE Working kernel begins computation at ready signal indicating the availability ofdata at its input Then raises completionflag after computation The ready signal is a function of input enable signals and the completion signal generated by logic MahapatraTexas AampMFallDO 7 Hardwaresoftware interface Each input and output of hardware module has unique address in the shared address space of the processor memory mapped How to communicate between processor and HW module now 7 Can we use decoders to select appropriate input in the module 7 It is possible but for large number of modules decoder area and delay becomes signi cant The ordered transaction principle is applied 7 Based on the schedule the order of data transfer is examined across hardwaresoftware interface Each data transfer is assigned a unique memory address This way the sequence of addresses to which readwrite takes place is known apriori This address corresponds to unique latch inpquutput address A global controller uses the order information to activate the input and output latches MahapatraTexas AampMFallDO 8 Hardware software interface completion Read Write 39 Global controller Order of data transfers Processor Stall IE OE 1 Processor issues Write request Global controller issues appropriate lE signal for the latch The read cycles can not overlap When all input signals available ready signal is issued 2 Read action controller checks if corresponding module has completion signals set A er completion is set OE signals for the output latch is set MahapatIaTexas AampMFallDO 9 Software Software interface Data transfer between two software nodes are assigned memory addresses in the internal data memory of the processor Communication between the two takes place by writing the results to corresponding locations in the internal data memory There is no need to check semaphore as the software executes sequentially Hardware Hardware interface 7 direct connection and use of ready and completion signals MahapatIaTexas AampMFallDO Cosynthesis Approaches Need to synthesize 7 datapath amp controller for each hardware module 7 program running on processor including codes for memory access and communication with HW modules 7 Global controller 7 10 interface of hardware modules and 7 netlist connecting the controller to various modules and processor MahapatraTexas AampMFallDO 11 Cosynthesis approach Refer the flowgraph in the handout The input to the cosynthesis stage is annonated DAG after partitioning It includes the possible mapping selections implementation bins and a schedule The first step in cosynthesis is Retarget the DAG It is a technology dependent presentation tool 7 Each incoming node in DAG is converted to appropriate state of presentation Example A software node takes either C code or assembly code and hardware node as VHDLVerilog 7 If different implementations are available for a node the target tool selects the necessary bin here 7 In case of quot pp quot nodes implementation is possible and 1 1 MahapatraTexas AampMFallDO 12 Retargeting approaches Library Approach Assumes the existence of library for each technology and implementations Ex if FIR filter is the node one should maintain FIR filter in C verilog assembly codes Also different algorithms of FIR filter with multiple representations Advantages 7 computationally ef cient and retains modularity If a node or its implementation changes one can select the new library element and re synthesize 7 Each module can be individually optimized to improve the quality of implementation Disadvantage 7 It depends on richness of the library If an element is not there one has to develop it MahapatraTexas AampMFallDO 13 Retargeting Approach Compilation Approach Compile the technologyindependent representation of every node into its desired representation The node is described in high level description such as C or Java Then this highlevel description is translated to intermediate representation like controldataflow graph The intermediate representation is compiled to either Verilog or assembly depending on desired synthesis technology Advantage Simplistic and traditional approach Can use available compilers Not to worry about the special library components Disadvantages optimization depends on compiler Recompile complete design that takes more time MahapatraTexas AampMFallDO 14 HW SW and Interface Graph Refer the diagram in the handout These graphs are fed to the hardware software and interface synthesis tools Hardware Graph 7 A separate hardware graph is generated for each node mapped to hardware 7 Hardware graph contains several subnodes These are annotated with the transformation and sample period corresponding to implementation bins 7 The hardware synthesis tool generates a datapath and controller for each hardware graph MahapatraTexas AampMFallDO Software graph All nodes mapped to software are combined into a single software graph Send and receive nodes are added wherever a hardwaresoftware communication occurs Note partitioning algorithm generated a global schedule to execution order of all no es An ordering between the nodes of the software graph is derived from the above schedule The software synthesis tool generates a single program from the software graph The program contains codes for all nodes The code is concatenated in the predetermined ordering MahapatraTexas AampMFallDO Interface graph Interface Graph 7 The order generator determines the order of transfers between nodes mapped to software and hardware 7 The interface synthesis tool generates the global controller using this order of transfers 7 It also interface glue logic latches logic to generate ready signal for the hardware module A netlist that describes the connectivity between hardware modules the processor and the global controller is next generated A standard placement and routing tool can be used to synthesize the layout for the complete system MahapatraTexas AampMFallDO Hardware Synthesis Approach Use VHDL or Verilog may be Silage for Ptolemy to describe the hardware graph to synthesize its datapath and controller Feed this description to hardware synthesis tool for implementation Hardware graph Silage Code generator in Ptolemy or VHDL VHDL code or Silage code High level hardWare synthesis Dalapalh and controller MahapatraTexas AampMFallDO Software Synthesis All the software nodes are to be together and augmented by send receive Each software node is called a cadeblack which is technology dependent and represents the node functionality Software synthesis process needs to stitch together these codeblocks and form a single program to be executed by the processor Use attened ordering that consider the schedule from partitioning and includes sendreceive nodes MahapatraTexas AampMFallDO 19 Software synthesis Software gra h l ordering schedule eachhierarchical dde r expand each hierarchicalnode accordingio schedule update orderingtoget attened ordening i Flattened so ware graph and its ordering ride for attened Sd War e graph dilation cod r a d communication code stitch codeblocks together 39ac cdr39ding to ordering l program MahapatraTexas AampMFallDO 20 Interface synthesis Involves synthesis of global controller and other glue logic FSM can be used to describe the global controller 7 Use standard logic synthesis tool A templatebased approach may be used for glue logic MahapatraTexas AampMFallDO 21 Other synthesis approaches VULCAN system hardwareC and C COSYMA system C like description hardwareC SIERRA system for multi board application MahapatraTexas AampMFallDO 22 CoDesign of Digital Telecommunication System Ref HardwareSoftware CaDesz39gn afDigiIal Telecommunication System I V0 Balsens et aL Proceedings afIEEE March 1997 HardwareSoftware Codesign of Embedded System CPSC489501 Why codesign of Telecom System The rapid breakthrough of consumer electronics CD DCC DAB wireless or wired voice and data networking ISDN GSM Videophone broadband networks ATM ADSL and multim edia is phenom enal these days The digital communication technique is the basis of all these fast growing industrial activities Development of digital communication is possible due to combined growth of VLSI and DSP DSP System Performs realtime mathematical transformation on digitized samples of analog signal with finite bandwidth and signal to noise ratio SNR 7 These 39 canbe 39 either ona processor using software or applicationspeci c hardware and determined by tradeoffs between cost power performance and exibility MahapatraTexas AampM FallDO 2 Codesign of Telecom system DSP based products have a growth rate of more than 35 per year The average time to market window is reduced to few months only Complexity and functional density is on demand Design productivity 7 communication system designer conceive the design at board and executable concurrent programmable paradigm that is not understood by chip architect who works in the RTL domain Gap between system design and implementation 7 The system need to be design at processormemory level by reusing component designs This needs methodology and codesign approach 7 Size of design team does not seem to grow as chip complexity grows 7 Hence there is a need to increase design productivity and seamless transition of design strategy from software centric to implementation reuse hardwaresoftware codesign approach MahapatraTexas AampM FallDO 3 Speci cation View of DSP system Digital Signal results from binary encoding of time and range discretized measurable continuous time continuous range quantities Sampling occurs at or above Nyquist frequency and coding is done with just enough wordlength to maintain SNR Digital signals are usually fixedpoint type to save power and hardware to meet desired performance Digital signals are stream of digital words due to periodic sampling These words are naturally structured into multidimensional arrays which are to be processed frame period Tf that is the duration of the algorithm when it takes a set of input to produce the set of results Thus the elementary DSP algorithm is a dataflow function MahapatraTexas AampM FallDO 4 DSP System Specification Synchronous data ow SDF algorithm are modeled as DAG 7 The data ow DSP algorithms are described ef ciently by many applicative programming languages LUSTER DFL SILAGE 7 The operations can be scheduled at compile time and execution of compiled code can be two orders of magnitude faster than the execution of event driven code VHDL simulation Dynamic data ow CDDF algorithms contain datadependent token production and consumption 7 Allows while and ifthen else constructs 7 Hard periodic constraints are satis ed by inserting exit paths in while loops that guarantee a bounded execution time whereby unassigned signals are assigned default values 7 One can transform DDF s into worst case SDF and schedule statistically 7 Else DDF s can be partitionedinto a set of compiled time scheduled SDF functions red by internal or external Boolean events MahapatraTexas AampM FallDO 5 DSP Specifications CAD systems for DSP DSPStation of Mentor graphic Ptolemy GRAPEll COSSAP from Synopsis SPW from The Alta group allow the specification in SDF and DDF and use static scheduling as much as possible to achieve the speed over event driven simulator Communication system typically consist of one or more signal paths slow control loops and a reactive control system The control loop and reactive control system are slow environments due to some user interfaces and status update in signal paths The signal path is usually a concatenation of data ow functional blocks DFFB and could operate at different speed and data set The difference in data rate is due to digital communication system it self MahapatraTexas AampM FallDO 6 Heterogeneous functional specs Signal path Example h1 h2 matched filter correlator m1 low rate demo ul t L1 error correction in base band control L2 speech decompression MahapatraTexas AampM FallDO 7 Specification of DSP The DFFB s in the signal path are strongly connected DFG s and rarely partitioned if timing constraints allow it Usually fast Control loop and mode control by parameter setting are common to most communication systems tracking and acquisition loop These loops are orders of magnitude slower that the data rate Hence reactive semantics are used here They run concurrent to data flow Program State Machine PSM can be used for such environment It is a hierarchical program state where each state may be a distinct mode of computation SIaleCharls or SpecCharts are suitable that allow exception handling and interprocess communication But VHDL is too low level to describe it Co simulate data ow and event driven models MahapatraTexas AampM FallDO 8 DSP specs In case of video and image processing there is virtually no other way for validation than emulation Which requires retargetable synthesis technique The retargetable synthesis technique allows to map on to FPGA architecture programmable DSP and video signal processor Also can map on to the final onchip architecture to be discussed MahapatraTexas AampM FallDO 9 System architecture and design process Implementation of the signal path Signal paths concatenated DFFG s that consist of loops that operate on multidimensional arrays This is executed periodically every time frame The nodes in DFG are arithmetic register transfer operations on Boolean arrays Low throughput systems Tf lt 10 ms few hundred scalar samples Base band voice audio system and backend image processing High throughput system 100000 scalar samples Tf lt 10 ms Frontend and intermediate video image and graphic processing These algorithms execute kernels on massive data set an characterized by deeply nested loop structure of which the inner loops execute a restricted set of operation on large set of pixels MahapatraTexas AampM Fall 00 Processor Architectures Highly MultiplexData Path HMDP processor 7 Executes low throughput algorithms 20 to 500000 clock cycles of irregular flow graphs less than 10 operations per cycle 7 Few concurrently operators with rich instruction set controlled by a single thread sequencer with instruction and status pipelines 7 Three types of processors 1 Commercial DSP processor fixed point core with separate program memory and data memory and fixed IO peripherals Core has encoded instruction set tuned to arithmetic sum ofproduct operation and regular memory access for convolution type algorithms Has very heterogeneous register architecture C compiler for regular register architecture has poor performance and hence DSP engineers use assembly language MahapatraTexas AampM FallDO 11 Processor Architectures HMDP Processor types 2 VLIW processor Has global single thread hardwired controller and data memory only one DSP algorithm is executed due to customization Higher degree parallelism is possible using multiple functional units and memory blocks Examples Cathedral2 MISTRALZ 3 ASIPs Uses application specific instructions for programmability 7 Better area efficient than commercial DSP processor and more flexible compared to VLIW 7 Combine reuse of hardware with performance low power MahapatraTexas AampM Fall 00 Processor Architectures Low Multiplexed Data Path LMDP Processor 7 High throughput algorithms are characterized by a set of repetitive computation intensive kernels for which only few cycles are available to execute each iteration Hence low multiplexing is used to save cycles 7 Decision making must be done within single clock cycle by inserting local control in the datapaths 7 You need applicationspeci c pipelined datapath that is tuned to a time folding of the kernels to execute 7 Example applications video processing CDMA transceiver etc Known as hardware accelerators 7 Distributed Memory architecture is used to meet the data throughput 7 Synthesis tools Cathedral3 PHIDEO and Hyper 7 Low power design favors this central memory fetching consumes one order of magnitude more than the local memory MahapatraTexas AampM FallDO 13 Control Loops and UI Control loops consist of interactive state machines and hard to keep it time constrained unlike arithmetic operations Micro controller based implementation is mostly used today This helps for redefinition later and maintain product legacy Synthesis of hardware FSMs may be cheaper in area and power RT level synthesis is useful MahapatraTexas AampM FallDO 14 Typical Hardware components of a Telecommunication System and their role in the implementation Component Class HMDP DSP Processor LMDP DSP Processors Micro Controller dware FSM Peripherals Communications Blocks amp memory Table from Bolsens etal paper Implementations Com 391 Spec Low datarate DSP Retargetable Assembly Slow control Loops Code generators C Appl Spec Algorithm High level Synthesis FL High datarate DSP High level synthesis C DFL RT Level Synthesis VHDL User interface C compiler Multithread C Slow control loops FSM T Synthesis Usually FSDMs RT level Synthesis VHDL clock generators Asynchr PSM DMA blocks synthesis Internal amp External Memory mgmt synth Datasheets Communication Asynchronous STGagm Gm Storage amp buffering CSP of Shared variables Interface synthesis MahapatraTexas AampM Fall 00 IMEC Architecture Design Environment VHDL c DFL Emulator IoSpeci cation oces39sor Corim enmu39un Mla Os Kernel MahapatraTexas AampM Fall 00 System Design Process coware environment System design process must bridge the gap between heterogeneous functional specifications Functional specification is based on the concept of communicating co ware processes Processes are concurrent programs in a host language In coware there is no difference between cosimulation and co implementation Both are based on refining the specs to implement First step in implementation is to allocate the processors optimally The next step is to merge functional specs and assign that to allocated processor This assignment step is Partitioning and hard to automate Most essential step is to implement abstract communication between HW and SW Interface synthesis is required Then the process follows as usual synthesis step MahapatraTexas AampM FallDO 17 CoWare Data Model Process container for a number of host language encapsulations of a component different implementations for the same component 7 presently the languages are C DFL VHDL Verilog and CoWare In CoWare language incapsulation one can instantiate processes and connect their ports with channels The other host language encapsulation consists of a context and threads The context contain code that is common to all threads useful for interthread comm Ports Objects through which processes communicate The port is characterized by a protocol and data type parameter Construct port is the implicit port to which remote procedure call is performed Protocols Define communication semantics of a port master inmaster outmaster inoutmaster slave inslave outslave inoutslave MahapatraTexas AampM FallDO 18 CoWare Model Thread A single flow of control within a process It contains code in host language of encapsulation 7 Slave threads associated with slave port activated due to RPC 7 Autonomous thread not associated with any port Channel A point to point connection of master port and slave port Can be uni or bidirectional Data exchanges between connected channels In HW implement it using wire In SW use function call Assignment Identify various type of ports channels threads in the Fig 4 MahapatraTexas AampM FallDO 19 Communication in CoWare Communication always happens between two threads 7 Ifthreads are part ofthe same process intraprocess comm Uses shared variable for communication that has been declared in that process Two threads access same variable or protection of critical section is provided by the host language 7 Else interprocess communication lnterprocess communication with a primitive protocol is RPC based On a master port RPC function can be used to initiate a thread in a remote process A master port can be used from anywhere host language The RPC function returns when the slave thread is completed Read and Write functions can be used in the slave thread to access data from the slave port The Index function access indices of the protocol of the port The RW function finds the direction of inoutslave port MahapatraTexas AampM FallDO 20 Communication Refinement Once the designer is convinced on the correctness of functionality the communication behavior is refined Communication behavior in CoWare is refined by making the communication obj ects channel port protocol hierarchical Hierarchical Channels are processes that assign a given communication behavior to a primitive channel The behavioral interface of a hierarchical channel is fixed by the ports connected by the primitive channels One can parallelize or pipeline the processes by adding buffers The only prpoerty that is preserved by making a channel hierarchical is the direction of data transfer see Fig5 7 The channel between P1 and P23 is re ned to be FIFO behavior Now the rate of RFC issued by P1 is no more controlledby RPC serviced by P23 MahapatraTexas AampM FallDO 21 Comm refinement Hierarchical Parts are processes that assign a given communication behavior to a primitive port The hierarchical port process has one primitive port return that is connected to the primitive port that is made hierarchical 7 In Fig5 say we want to impose a data formatting over the data transported between P1 and FIFO process This is achieved by making the port pl and left hierarchical 7 The format process that re ne port p might add a CRC to the data through them The unformat process that re ne port le ofthe FIFO process then uses this CRC to check the data for validity 7 The actual data and CRC are sent sequentially over the same primitive channel MahapatraTexas AampM FallDO 22 Comm refinement Hierarchical protocols refine primitive protocols with a timing diagram and the associated lO terminals These are high level models for alternative implementation of a primitive protocols 7 To access the terminals of the hierarchical protocols a hierarchical port is introduced at the same time The terminal can be accessed from Within the thread code by using functions Put Sample and Wait 7 In Fig5 the primitive protocol of op port and ip port of the format and unformat process are re ned into an R8232 protocol 7 In the R5232 hierarchical port an RPC issued in the format process on the op port is converted into manipulations of the terminals according to a timing diagram MahapatraTexas AampM FallDO 23 Communication in CoWare Hierarchical channels and ports being processes can be removed from a system description by expansion or flattening The result is a coware description that contain only process instances of which the primitive ports possibly with hierarchical protocols are connected by primitive channels Thus we have three basic communication mechanism as follows MahapatraTexas AampM FallDO 24 Communication mechanism in CoWare Intr aiprooess communication context shared Variables Interiprocess communication Primitive protocol Remote procedure call Master portRPC Slave port ReadO Write IndexO RWO MahapatIaTexas AampM Fall 00 Table 2 of the reference Interiprocess communication communica on Terminals Timing diagrams InputSample Wait Output PutO MahapatIaTexas AampM Fall 00 Cosimulation II Cosimulation Approaches MahapatraTexas AampMFa1100 How to cosimulate How to simulate hardware components of a mixed hardware software system within a uni ed environment 7 This includes simulation of the hardware module the processor and the software that the processor executes How to simulate hardware and software at same time What are various challenges 7 Software runs faster than hardware simulator How to run the system simulation fast keeping the above synchronized 7 Slow models provide detailed and accurate results than fast models How to balance these effects 7 Use of different platforms for simulations MahapatraTexas AampMFa1100 Some basic approaches Detailed Processor Model 7 processor components memory datapath bus instruction decoder etc are discrete event models as they execute the embedded software Interaction between processor and other components is captured using native eventdriven simulation capability of hardware simulator Gate level simulation is extremely slow 7tens of clock cyclessec behavioral model is 7hundred times faster Most accurate and simple model Backplane MahapatraTexas AampMFallDO Some basic approaches Bus Model Cycle based simulator 7 Discreteevent shells that only simulate activities of bus interface without executing the software associated with the processor Useful for low level interactions such as bus and memory interaction Software executed on ISA model and provide timing information in clock cycles for given sequence of instructions between pairs of IO operation 7 Less accurate but faster simulation model Program runn 39 mg 071 Hart B ackplane MahapatraTexas AampMFallDO 4 Some basic approaches Instruction Set Architecture Model 7 ISA can be simulated ef ciently by a C program Cprograrn is an interpreter for the embedded software 7 No hardware mode Software executed on ISA model Execution on ISA model provides timing clock details of the cosimulation 7 Can be more ef cient than detailed processor modeling because internals of the processor do not suffer the expense of discreteevent scheduling Program 071 Hart B ackplane MahapatraTexas AampMFallDO Some basic approaches Compiled Model v ry fast processor models are achievable in principle by translating the executable embedded software speci cation into native code for processor doing simulation Ex Code for programmable DSP can be translated into Sparc assembly code for execution on a workstation 7 No hardware software execution provides timing details on interface to cosimulation 7 Fastest alternative accuracy depends on interface information Program running on host Backplane MahapatraTexas AampMFallDO Some basic approaches Hardware Model 7 If processor exists in hardware form the physical hardware can often be used to model the processor in simulation Alternatively processor could be modeled using FPGA prototype say using Quickturn 7 Advantage simulation speed 7 Disadvantage Physical processor available Backplane MahapatraTexas AampMFallDO A New Approach This is a combined HWSW approach The host is responsible of having OS some applications and might have superset simulating environment RSIM SIMICS SIMOID Use of fast backplane PCI for communication Real processor or processor core in FPGA as hardware model and ASlCFPGA for interface and interconnection for hardware modeler Good for fast complex architecture simulations including multiprocessor MahapatraTexas AampMFallDO Domain coupling In four out of siX approaches we have host that run software and required to interact with hardware model or simulator Difficulties 7 providing timing information across the boundaries 7 coupling two domains with proper synchronization MahapatraTexas AampMFallDO 9 Migration across cosimulation Consider the system simulation at different levels of abstraction throughout the design process 7 In the beginning of design process hardware synthesis is not available Hence use functional model to study the interaction between HW and SW 7 As design progress with more implementations replace functional model of hardware by netlist level 7 Once detail operation of hardware is veri ed swap back the high level description of HW design to gain simulation speed The cosimulation environment should have this migration support across the levels of abstraction Offtheshelf Components design is not a part of the current design process Functional model is enough no need to know internal details MahapatraTexas AampMFallDO 10 Master slave cosimulation One master simulator and one or more slave simulators slave is invoked from master by procedure cal The language must have provision for interface with different language Difficulties 7 No concurrent simulation possible 7 C procedures are reorganized as C functions to accommodate calls Master HDL InterfaCe Slave MahapatraTexas AampMFallDO Distributed cosimulation Software bus transfers data between simulators using a procedure calls based on some protocol Implementation of System Bus is based on system facilities Unix IPC or socket It is only a component of the simulation tool Allows concurrency between simulators MahapatraTexas AampMFallDO Synchronization and Time in cosimulation In case of a single simulator say Verilog there is no problem for timing as single event queue is managed for simulation If there are several simulators and software programs in the domain 7 hardware and software domain are using a handshaking protocol to keep their time clock synchronized Signals events transferred from one side to the other should have attached a time stamp 7 It is possible to use a loosely coupled strategy which allows the two domain to proceed more independently If a signal is received with a time stamp lower than the current clock in the respective domain the respective simulator have to be back up MahapatraTexas AampMFallDO 13 Aspects of cosimulation A frame work of cosimulation consists of variety of components levels of abstractions and different models MahapatraTexas AampMFallDO 14 Heterogeneous Environment Ptolemy Ptolemy supports different design styles encapsulated in objects called Domain 7 Domain realizes a computational model appropriate for a particular sub system EX SDF Dynamic Data 0WDDF Discrete EventDE and Digital hardware modeling environment Thor Domain consists of a set of Blocks and Schedulers that confirm to a odesic common computatlonalmodel 39mmallzeo Block numlnit semp set50urcePort0 setDesLPort Block objects send and receive data encapsulated in particles to outside world through PortHoles isen Bu ering by Geodesic MahapatraTexas AampMFallDO Garbage Collection Plasma 39pn39n 0 39operator 0 39clone 1 5 Heterogeneous cosimulation Ptolemy Any model can be used at the top of hierarchy Within each level of hierarchy it is possible to have Blocks containing foreign domain Hierarchy heterogeneity is quite different from the concept of a simulation backplane that imposes a toplevel models of computation through which all subsystem interact Active objects in a Domain are Stars They perform computation and communications with other obj ects through PortHoles MahapatraTexas AampMFall3900 Heterogeneous cosimulation Ptolemy Domain XXX can contain foreign Domain YYY in it Call it XXXWormhole To the stars of Domain XXX the XXXWormhole appears as another star Though inside the Wormhole it is entirely another domain MahapatraTexas AampMFallDO Cosimulation Ptolemy EventHorizon provides an interface between the external and internal Dom ains W ormholes There exists an EventHorizon for all Domains Thus each domain needs only an interface to this EventHorizon 39SDFWtif mlinle DEtuUniverszl DEmUniverszl gt SDFfmmUniverszl mmltmH SDFtnUniverszl MahapatraTexas AampMFallDO Introduction to cosirnulation CPSC489501 HardwareSoftware Codesign of Embedded Systems MahapatraTexasAampMFa1100 1 What is HWSW cosirnulation A basic de nition Manipulating simulated hardware with software Goal of cosimulation To verify as much of the product functionality hardware and software as possible before fabricating the ASIC Assume that the ASIC contain some significant amount of software In this lecture Learn the methods and motivations of hardwaresoftware cosimulation MahapatraTexasAampMFa1100 2 Overview In the Past cosimulation was adopted late in the process after hardware is deemed to be working and stable Software developers were often left to develop code for months with severely limited ability to test 7 Painful integration process design aw and could respin the silicon Now 7 Behavioral model simulation has matured and simulation tools in general have improved to allow better simulation throughout the development cyc e 7 Co design tools provide project architects with a simulation environment at a very high level of abstraction M ahapatraTexasAampMFa1100 overview 7 Possible to convert the English description of functionality into a formal specification language and allow designers to work out the fmetional split between the hardware and the software in a simulation environm ent Debug implementation can proceed with event and cycle driven simulators With codesign tool can simulate the product before actual implementation Testing can be real cheap Component integration is done at cosimulation environment with con dence before actual fabrication M ahapatraTexasAampMFa1100 Simulation components Hardware design Memory CPU or many ASICs each with one or more CPUs Simulation platform N 7 PC or workstation Everything exists as process 7 Hybrid platforms with coprocessors offload part of the load to coprocessor peripheral and test benches remain in software M ahapatraTexasAampMFa1100 3 Emulation Special simulation environment with hardware 7 runs whole design 7 expensive 7 10 of real time 7 FPGA arrays may be the hardware 7 allow designers of large products to find a class of problem that cannot be found in simulation 7 can attach to real devices router using Quickturn39s Ethernet SpeedBridge could route real network traffic M ahapatraTexasAampMFa1100 4 Algorithms Event driven simulationgate level simulation 7 Most accurate as every active signal is calculated for every device during the clock cycle as it propagates 7 Each signal is simulated for its value and its time of occurrence 7 Excellent for timing analysis and verify race conditions 7 computation intensive and hence very slow Cyclebased simulation 7 Calculate the state of the signals at clock edge0 or 1 7 suitable for complex design that needs large number of tests 7 10 times faster than event driven simulation 20 area efficient M ahapatraTexasAampMFa1100 Algorithm contd DataFlow Simulator 7 Signals are represented as stream of values Without notion of time Functional blocks are linked by signals Blocks are executed when signals present at the input 7 Scheduler in the simulator determines the order of block executions 7 High level abstraction simulation used in the early stages of verification typically to check the correctness of the algorithms M ahapatraTexasAampMFa1100 Simulation s Requirement on Hardware Most simulators can handel behavioral models except the emulators that require synthesizable codes Some simulators may not handel HDLs Cyclebased simulators can handel asynchronous designs at severe performance penality Choose simulator s in the beginning of design stage You may use multiple simulators concurrently in a project Example You may use emulator for a problem that requires timing until few hundred microseconds of failure point then export the state of the system into a simulator MahapatraTexasAampMFa1100 9 Simulator s requirement on Software Simulation environment has effects on application software 7 Programmers certainly need alternate version of application that do not have user interface code or any references to chips that is not part of the simulation environment 7 Reduce size of functionality and tables for speed 7 Example consider simulating a 500 MHZ processor that initializes a 8KB table that might take few minutes Such trivial tasks together consume time but do not add to the quality of the simulators MahapatraTexasAampMFa1100 10 Simulation Methods codesign is a way to simulate at a very high level of abstraction prior to the actual implementation These simulations follow the theme of trading details for runtime speed By creating a functional model which can be tested system designers can make sure the requirements are clear making a single model of both hardware and software functionality the design boundary between the two is effectively removed Having a running model also allows engineers to test different hardwaresoftware functionality splits for performance and get some rough timing estimates for various ideas Running a functional model also allows engineers to find fundamental bugs in the design before implementing them Can reuse for performance updates MahapatraTexasAampMFa1100 11 Cosimulation methods POLIS from UC Berkeley 7 Cadence39s Cierto VCC is based on ideas from POLIS Synopsy s COSSAP and Eaglei tools promise a way to check the implementation against the original algorithmic specification for function equivalence But the standard method of cosimulation is to run software directly on simulated hardware It is implied that the CPU is part of the ASIC Thus CPU is simulated at the same level as other hardware 7 Good if your purpose is to design the CPU However if you use a core from the vendor you are wasting valuable simulation resources MahapatraTexasAampMFa1100 12 Cosimulation methods contd Heterogeneous cosimulation Network different type of simulators together to attain better speed Claims to be actual cosimulation strategy as it affords better ability to match the task with the tool simulates at the level of details 7 Synopsis s Eaglei let hW run inmany simulators sw on native PCWorkstation or in instructionset simulator ISS Eaglie tool interfaces all these MahapatraTexasAampMFa1100 13 Heterogeneous cosimulation 1 orli m g Evan Gym Databe Sirmilation Engine Emulator PC MahapatraTexasAampMFa1100 14 Heterogeneous cosimulation How about perform ance 7 Complex enough to describe any situation 7 Proponents since software not running hardware simulation speed in actual the performance will be more 39 How fast the software running when not doing hardware related task If target CPU is not PC you may use cross compiler 7 When software runs directly on PCW S runs at the speed of WS 7 When software can not run directly as processes on WS you need instruction set simulator ISS interprets assembly language at instruction level as long as CPU details are not an issue ISS usually runs at 20 ofthe speed ofactual or native processes MahapatraTexasAampMFa1100 15 Hardware density of Heterogeneous simulation How much time software accesses hardware Hardware density depends on applications and with in an application In loosely coupled CPU system the block responsible for hardware initializations has 30 instructions to access the hardware In tightly coupled system every memory reference could go through simulated hardware In general hardware density is important for simulation speed The base hardware and tools that communicate between the heterogenous environment can attribute to the speed too If simulation is distributive most often it happens these days the network bandwidth reliability and speed matters too MahapatraTexasAampMFa1100 16 Cosimulation strategies What you simulate is what you get Simulation is important for bug free test of the product The product schedule forces suitable strategies Due to decrease in feature size and increase in die size more functionality are pushed to hardware could never happened in the past Creates challenges for testing due to increased functionality 7 Formal design methods code reviews and code reuse have help Emulation engine is also of help but expensive For typical strategies we need to know the thoroughness to test and system s environment If it involves with health and safety then detail testing strategy is sought M ahapatraTexasAampMFa1100 Strategy Multipronged functional test strategy to build levels of assurance 7 Basic initial tests prove functionality and complex tests are built upon working 7 Any single test method has some coverage hole Event driven tests are closest to the real hardware but its slowness is coverage hole 7 Make balance between required test coverage and what might be avoided A simulation strategy might call for the functional specification to be written as a functional model codesign 7 Hardware designer could use event driven tests for hardware blocks 7 Software designer could do basic debug using ISS or cross compiler and with fake hardware calsFor detailed functional blocks software could interface A er completion of blocks these canbe dropped into the functional model for regression tests M ahapatraTexasAampMFa1100 Strategy Simulation speed Degrades when real components replace the functional blocks The simulation speed depends on simulation engine the simulation algorithm the number of gates in the design an whether the design is primarily synchronous or asynchronous Low cost cycle based simulation is a good compromise Since it can not test physical characteristic of a design event driven simulator may be used in conjunction Cycle based simulators and emulators may have long compilation Hence not suitable for initial tests that needs many changes Event driven and cycle based simulators have fairly equal debugging environments all signals are available at all times Emulators on the other hand require the list of signals to be traced to be declared at compilation time MahapatraTexasAampMFa1100 19 Strategy If the next problem can be found in a few microseconds of simulated time then slower simulators with faster compilation times are appropriate If the current batch of problems all take a couple hundred milliseconds or even seconds of simulated time then the startup overhead of cycle based simulation or even an emulator is worth the gain in run time speed How about the portability of test benches Test after fabrication 7 Fast simulators are useful Track down the hardware fault is dif cult May patch the problem so as to make the problem reappear easily unless regression tests MahapatraTexasAampMFa1100 20 Strategy To determining which parts of the system software to run and how much software debug can be done without the hardware Software engineer need to go through the code and disable functionality which is too costly for simulation or if the sequence is important find ways to reduce its execution time The degree of fidelity between the simulated environment and the real world is both a requirement of simulation and a constantly shifting target throughout the simulation effort MahapatraTexasAampMFa1100 21 Summary Issues and trade off discussed The use of HDLs by programmers and logic designers is good sign of convergence Cosimulation is crucial for tightly coupled hardware and software ASICs Every project is different with varying objectives Hence choose the strategy as required MahapatraTexasAampMFa1100 22 Partitioning 11 Functional partitioning MahapatraTexas AampMFallDO Earlier partitioning Partition large number of processes among processors Partitioning after synthesis Synthesis used to be more time consuming due to nonlinear characteristics of its tool heuristics More power consumption MahapatraTexas AampMFallDO Partitioning trend Many applications eons ofone or small number of very large processes Partitioning before synthesis or compilation has advantages 7 order of magnitude reduction in logic synthesis runtime 7 Improved system performance as smaller processes canbe synthesized With shorter clock period than one large processor 7 Improved satisfaction of IO and size capacity constraints on a package reducing interpackage signals compared to structural partitioning MahapatraTexas AampMFallDO Partitioning approaches Functional partitioning synthesis l l l l l l l l l l Data mtm path um Structural MahapatraTexas AampMFallDO Functional Partitioning Divides a system s functional specification into multiple subspecification Each subspecification represents the functionality of a system component such as a custom hardware or software processor Then the components are synthesized down to gates or compiled to machine codes MahapatraTexas AampMFallDO 5 Advantages of FF Power reduction due to mutual exclusive components smaller board size lower cost increase software speed concurrent synthesis and debugging less physical design problems MahapatraTexas AampMFallDO 6 Problem description Model Input process x C program or VHDL process A view ofthe process set of procedures F fl f2fn with one as main procedure Variable simple processor with read and write being the procedure calls Execution of F procedures executing sequentially staring with main and that calls other procedures only one is active at a time MahapatraTexas AampMFallDO 7 Problem description Model Functional partitioning creates a partition P consisting of a set ofparts p1 p2 pm such that every procedurefiis assigned to exactly one part pj ie plu p2 U pmF and pi m pj 0 for all ij i j Each pj represents the function to be implemented on a single processor The processors are mutually exclusive Each part pj is converted to a single process before synthesis this process consists of a loop that detects a request for one of the part s procedures receive input parameters calls the procedure and sends back output parameters MahapatraTexas AampMFallDO 8 Model contd Function Bus single bus carries parameter passing between processors Protocol putting destination procedure s address pulsing address request putting parameter pulsing the data request Process custom processor component Ci For application we target Ci nontrivial data path and a complex controller with hundreds of states Procedure on Ci may be implemented either as a control subroutine or datapath component Synthesis may implement process s procedures in parallel if data dependencies are not violated 7 While procedures are not mutually exclusive after partitioning processors are still mutually exclusive MahapatraTexas AampMFallDO 9 Five tasks for good partitioning Model creation 7 converts input to an intemal model call graph model Allocation 7 lnstantiating processors of varying type done before Partitioning 7 Dividing input process among allocated processors Transformation 7 modifies the input process into one with different organization but same overall functionality leading to better partition Estimation 7 provides data used to create values for design metrics Pre estimatian and anlineesn39malian MahapatraTexas AampMFallDO 10 steps proposed by Vahid Partitioning Methodology Access Graph Threestep method Sequence of partitioning MahapatIaTexas AampMFallDO Step1 Granularity Selection Goal Extract procedure from specification which are to be assigned to processors during N Way assignment Granularity is a measure of complexity 7 Fine many procedures of low complexity 39 Little preiestimation and onlineiestjmation less accurate Make onlinei estimation more complex to build higher acc r cy an be more time consuming and may prohibit the use of assignment heuristics that need many estimatio 7 Course few procedures of high complexity 39 many behaviors are grouped together into inseparable unit so that any possible solution that separate those behavior is exc MahapatIaTexas AampMFallDO Granularity Procedures are selected very carefully to balance the above effects Each statement is treated as atomic unit Granularity Selection Problem Partitioning statements into procedures such that 1 procedures are as course grained as possible to enable maximum pre estimation and application of powerful N way heuristics and 2 statements are grouped into a procedure only if their separation would yield inferior solution MahapatraTexas AampMFallDO 13 Granularity A straight forward heuristic choose a specification construct to represent a procedureIe each statement or block Also user defined procedure for partitioning Transformations can be used to improve the above strategy 7 Procedure Inlining replace procedure call by procedure s contents making granularity coarser lnline procedure disappears 7 Procedure cloning makes a copy of a procedure for exclusive use by a particular caller Ex Multiplycalled procedure if inlined might grow excess and if notinlined might needs more communication Cloning is a compromise MahapatraTexas AampMFallDO 14 Illustration MWL bytelevel LcdS endltbytegt ModelO Lcdcle Input Spem catlon ModeZO LedUpdateltbytebytegt Lcdlmto XmiLLeveKbyte With many procedures WWW begin quotsequmce throgh modes quotwhich the 1 quotother procedures M req 1 b LTOCDclear Freqlbit8 CDSmd Access graph d MahapatraTexas AampMFallDO Transformation contd Procedure Exltntng Replaces a subsequences of a procedure s statements by a call to a new procedure containg only that subsequences opposite to inlining This technique moves towards ner granularity 7 Redundancy exlt39nt39ng replaces two or more nearidentical sequences of statem ents by one procedure use string matching method statem ents are encoded characters 7 Distinct computation exlt39nt39ng Divide a large sequence of statements into several smaller procedures such that statem ents within a procedure are tightly related and would not be separated during N way assignment solution MahapatraTexas AampMFallDO Illustration of eXlining Freq1bit8 P LcdSmd cdlmt 39 chpd 65qu od 1 Level Mo laltIW XmiLLevel th odeZ MahapatraTexas AampMFallDO Step2 Proclustering Goal Reduce the number of procedures for subsequent N way assignment by merging procedures whose separation among parts would never represent good solution Different from granularity step procedures being clustered here may not be such that they could eXlined into single new procedure Ie calls to theses procedure are non adjacent Different from Nway assignment each cluster does not represent a processor and therefore can not be guided by direct design metrics estimates MahapatraTexas AampMFallDO Preclustering method Uses hierarchical clustering procedures after granularity selection are converted to a graph node and edges are created between every pair weighed by the closeness of the nodes closest pair of nodes are merged to a new node This is repeated until no nodes are exceeding the threshold weight 10 MahapatIaTexas AampMFallDO 19 Illustration of preclustering I quotHindnu and T 39 heavily 48 times per call These two should never be separated Since LcdSeml appears 48 times inside Lchpdate inlining during granularity selection was not reasonable option q1 mm th L LcdSmd odel 8 MahapatIaTexas AampMFallDO 20 More on preclustering 0 Can reduce runtime of NWay assignment by 30 or more 0 May look at Ethernet example in the reference MahapatraTexas AampMFallDO 21 Step3 N Way assignment Goal Distribute the procedure among given set of processors Procedures are created after granularity selection and preclustering constructive heuristics are used to create initial solution and can include random distribution and clustering There is an additional metric Balanced size Size of an implementation of both sets of node divided by the size of all nodes This favors merging small sets over large ones Heuristics applied Greedy Simulated Annealing Hill climbing MahapatraTexas AampMFallDO 22 N Way assignments 7 Greedy algorithm linear time heuristic that moves nodes that reduce the value of cost function 7 Simulated annealing randomized hill climbing to avoid local m inima with long runtim e 7 Extended hill climbing with some restrictions and tightly coupled data structure On logn runtime cloning transformation can be applied selectively here port calling another transform for IO balance and ease access to shared ports IO procedures are used in place of external port access that take care of sendreceive etc MahapatraTexas AampMFallDO 23 Illustration of N Way assignments Freq1bit8 MahapatraTexas AampMFallDO 24 Other partitions of operations Aparty among datapath modules using multistage clustering Vulcanamong packages using iterative improvement heuristics Chop among packages focusing on providing suite of feasible solutions for each package that would satisfy overall constraints Multipar among packages simultaneous with scheduling and allocation using linear programming SpecPart partitioned procedures among packages using clustering and iterative improvements MahapatraTexas AampMFallDO 25 Limitation of threestep approach Total hardware increase may be large for examples with small controllers and large datapaths Problems that has large number of small processes much like a scheduling problem parallel execution on processors Reference Frank Vahid A three step approach to the functional partitioning of large behavioral processes MahapatraTexas AampMFallDO 26 Codesign Framework Parts ofthis lecture are borrowed from lectures of Johan Lilius of TUCS and ASVLL of UC Berkeley available in their web Mahapatra Texas AampMFall3900 1 System Design What is System Design Is the process of implementing a desired functionality using a set of components Mahapatra Texas AampMFall3900 2 System Design contd Step1 Design must begin with specifying the desired functionality Mahapatra Texas AampMFall3900 3 Specification For precise specification we need to think the system to be a collection of simpler subsystems a method rules to compose these pieces Mahapatra Texas AampMFall3900 4 Functions vs Computation Functions specify only a relation between two sets of variables input and output Computation describe how the output variables can be derived from the value of input variable ASVLL UCB 5 Essential Issues Modeling System description System behavior Validation verification Performance estimation Partitioning and mapping Synthesis Software synthesis Hardware synthesis Interface synthesis MahapatraTexas AampM Fall39OO 6 A Codesign framework compuLllon r System de39scnptlon System b el laylor Valldatlon Performance estlmallon HW Synthesls Partllloned speclflcallon SWSynlhesls lnlerface synthesls Lilius of TUCS 7 Model The method or rules to compose the pieces of subsystems to create a system functionality is usually called a Model Mahapatra Texas AampMFall3900 8 Model Model should be formal no ambiguity complete with sets of properties performance indices and constraints comprehensive and easy to modify natural to understand Mahapatra Texas AampMFall3900 9 Model What is a model Model is a formal system consisting of objects and composition rules and is used for describing a system s characteristics Modeling Modeling digital systems is often complicated by the heterogeneity of its components Distinguish between models of computation hardwaresoftware specification languages A language may imply many models UML State Machines synchronous behavior within a state machine asynchronous behavior between state machines Lilius TUCS Model of Computation A M00 is a framework in which to express what sequence of actions must be taken to complete a computation An instance of a model of computation is a representation of a function under a particular interpretation of its constituents Not necessarily a bijection almost never ASVLL UCB Models of Computation Often existing models belong to several categories Finite state machines totally ordered discrete events Petri nets partially ordered events Synchronous dataflow multirate discrete time partiallyordered events why different Modes Various models have certain strong properties that might be useful for some applications Some problems might be undecidable Specification languages Finitestate based synchronous communication Statecharts Esterel asynchronous communication SDL Partial orders oftasks VHDL at behavioral level Discrete time cycle based VHDL at RTL level Synchronous dataflow Silage DFL Lustre Lilius TUCS 15 Specification languages No perfect language exists Control software asynchronous fsm DSP dataflow Hardware synchronous gtForce user to think in terms of one model of computation system POLIS many models of computation PTOLEMY Lilius TUCS 16 A Codesign framework compuLllon System description System behavior Validation Partitioned Specification Performance estimation 3 Evaluation 3 SWSynthesls interface Synthesis HW Synthesis 17 O Systemlevel validation covalidation Methods for gaining reasonable certainty that the design is free from errors Methods Verification CoSimulation Emulation Lilius TUCS 18 Verification Specification verification Is the specification consistent Does it have the required properties Implementation verification Have we implemented the speci cation Checking of safety nothing bad ever happens liveness something good eventually happens Weaky heterogeneous systems One or more processors some dedicated hardware and several software layers Desired features of simulator adequate timing accuracy fast execution visibility of internal registers for debugging purposes Problems One step in program is equivalent to many steps in hardware i long running times iAvailability of hardware models with the required abstraction level Highy heterogeneous systems Specialized simulators PTOLEMY UC Berkeley A Co design framework compuleon sygtem d System behawor Vahdahon Pamuoned specmcauon SWSynthesws Performance estwmahon i Evamanon 1 nterface synthesws HW synthesws Partitioning Input functional specification Output an architecture composed of hardware black boxes software black boxes and interconnection media and mechanisms a mapping function that assigns functional units to architectural units Lilius TUCS Partitioning Construction of mapping is an optimization problem mapping function optimizes a cost time area communication What is the architecture Automated synthesis of custom architectures difficult common restrictions limited to a library of predefined choices communication mechanisms are standardized Lilius TUCS A Codesign framework CompuLllOl l System descnptlon System behavlor Valldatlon Partltloned Speclflcatlol i SWSvnthesls lnterface synthegls Performance estlmatlon Hw Synthesis Hardware synthesis Well established research field Several commercial tools exist Levels of abstraction Behavioral synthesis algorithmic synthesis RegisterTransfer level synthesis VHDL Verilog Logic level synthesis netlist Research issue reuse of hardware Lilius TUCS 26 Software synthesis jI Dif cult problem for general purpose computing For embedded system much more constrained no swapping devices no stacks only polling and static variables mpbabomhms Translating FSMs to programs especially simple Speci cation consists of concurrent tasks Problem How do we find a linear execution order that satisfies the timing constraints Use scheduling theory Interface synthesis Interface between processor and ASIC synthesis of software synthesis of glue logic Automatic generation of bus interfaces PC VME Interfacing of sensors and actuators Lilius TUCS 28 Existing Toos Academic Commercial POLIS UC Arexys SDL VHDL Berkeley C PTOLEMY UC CoWare CC Berkeley LavalLogic Java to VULCAN Stanford Verilog U Hardware 0 Cynlib c to CHINOOK U of Verilog Washington VHDL Art Algorithm to RT COSYMA U of C to RTL Braunschweig C MIIZI IE IMDIA antl SUPERLOG Systqu Inllnl 39 POLS Speci cation FSMbased languages Esterel Internal Representation CFSM network Validation highlevel cosimulation FSMbased formal verification Partitioning by hand based on cosimulation estimates Scheduling classical realtime algorithms Synthesis Sgraphbased code synthesis for software logic synthesis for hardware Main emphasis on unbiased veri able speci cation PTOLEMY Specification Data flow graph lnternal representation DFG Validation multiparadigm cosimulation DF discrete events Partitioning greedy based on scheduling Scheduling linear sorting blocks by criticality Synthesis DSP code custom DSP synthesis for hardware Projects MiniEsterel Cosimulation using C and Verilog Implementation of a DSP algorithm in PTOLEMY Implementing a small ES in POLIS Codesign using FPGA Your own project Partitioning I Introduction to Partitioning M nhnpnlmrTcxns MM anll 00 Synthesis Why synthesis Synthesis converts behavioral process into customized digitalhardware processor 0 Problems with large behavioral processes synthesis runtime lasts tens of hours or more resulting processor may have high power consumption package size may exceed the constraints M nhnpnlmrTcxns MM anll 00 Possible remedies 4 synthesis runtime lasts tens of hours or more use heuristic and trade of the quality 4 resulting processor may have high power consumption isolation of processor sub circuit to avoid unnecessary signal switching A package size may exceed the constraints structural partitioning of processor M nhnpntmrTcxns MM anll 00 System partitioning System level partitioning problem hardware 0 1 software Assignment of an operation to HW or SW determines the delay of the operation Assifgmnent of operations to Assignment of operation to a processor and to more application specific HW circuits involve additional delays due to communication overhead Good partitioning scheme gt Minimize this communication M nhnpntmrTcxns MM anll 00 System partitioning contd Increasing operations in software on a single processor gt increases processor utilization System performance depends on hw sw partition on utilization of processor and bandwidth of bus between processor and application specific hardware Characteristic of Partitioning scheme capture and make use of partition s effect on system performance in making trade off between hw and sw implementation of an operation 7 Devise a partition conjunction MnhnpmrmTcxns MMFnl 00 5 Partitioning Cost function 7 Directs the partitioning algorithm towards desired solution optimum solution is minimum cost function Need to capture 7 effects of size of hwsw parts 7 effects on timing behavior of these portions on cost function contrast optimized areapinout difficult due to the problem being global in nature approximation is used to account the effect on total latency MnhnpmrmTcxns MMFnl 00 6 Partitioning Partitioning in software extensive use of statistical timing properties to drive partitioning algorithm 7 Dynamic or rimtime excess time exible Partitioning in hardware attempts to divide circuits that implement schedule operations 7 Static less time non exible An intermediate approach is advised incrementally computable of cost function f 7 partial deterministic bound on timing properties M nhnpntrmTcan MM 31100 Timing properties in partition cost function Statisti cal D etermini stic None Timing proportics In partition cost function 5 Static Partial Dynamic Scheduling exibility M nhnpntrmTcan MM 31100 A Partitioning cost function Consider software model in terms of set of program threads and cost function f M 7 W3 i m l HS 7 where 9 per second is thread latency execution delay 7 p1 per second thread reaction rate invocation rate of the program thread 11 processor utilization P is calculated by P 2 7 p m 11 BUS utilization B per second I eri um variables to be transferred rj inverse of minimum time interval betwden consecutive samplm for variable r Partition cost function Software characterization using 11 p P and B Static bound 7 can be used to select appropriate partition of system functionality between hardware and software over estimation of processor and bus bandwidth is possible since actual distribution of data communication is not captured above Include S hardware size bottom up Characterize interface using set of communication ports one per variable 7 overhead due to communication between hw and sw is manifested by the utilization ol bus bandwidth MnhnpntrmTcan MMFnll 00 10 Partitioning with cost function From a given set of sequencing graph models and timing constraints create two sets of sequencing graph models such that one can be implemented in hw and the other in sw and the following is true 7 timing constraints are satisfied for the two sets of graph models 7 processor utilization P S l 7 bus utilization B S B 7 A partition cost function f f SH B I m is minimized MnhnpntrmTcxns MMFnl 00 11 Partitioning using heuristics Minimum cost function can be achieved by trying very large number of solutions exponential relation to number of operations 7 heuristics are usedfor good solution that may show minimum costfunctionfor some local properties Start with constructive initial solution on which iterative procedure can be applied to improve the solution 7 exchange operations or paths between partitions apply procedure A good heuristic is relatively insensitive to initial solution 7 exchange of large number of operation makes it more insensitive to starting solution MnhnpntrmTcxns MMFnl 00 12 More to continue More on functional partitioning 39 Algorithms used for partitioning Assignments 7 partitioning some benchmark problems in Ptolemy environment Current reference 7 Rajesh Gupta eta HardwareSoftware Cosynthesis for Digital Systemsquot IEEE Design amp Test of computers Sept 1993 M nhnpnlmrToxns MM anll 00


Buy Material

Are you sure you want to buy this material for

25 Karma

Buy Material

BOOM! Enjoy Your Free Notes!

We've added these Notes to your profile, click here to view them now.


You're already Subscribed!

Looks like you've already subscribed to StudySoup, you won't need to purchase another subscription to get this material. To access this material simply click 'View Full Document'

Why people love StudySoup

Bentley McCaw University of Florida

"I was shooting for a perfect 4.0 GPA this semester. Having StudySoup as a study aid was critical to helping me achieve my goal...and I nailed it!"

Amaris Trozzo George Washington University

"I made $350 in just two days after posting my first study guide."

Steve Martinelli UC Los Angeles

"There's no way I would have passed my Organic Chemistry class this semester without the notes and study guides I got from StudySoup."


"Their 'Elite Notetakers' are making over $1,200/month in sales by creating high quality content that helps their classmates in a time of need."

Become an Elite Notetaker and start selling your notes online!

Refund Policy


All subscriptions to StudySoup are paid in full at the time of subscribing. To change your credit card information or to cancel your subscription, go to "Edit Settings". All credit card information will be available there. If you should decide to cancel your subscription, it will continue to be valid until the next payment period, as all payments for the current period were made in advance. For special circumstances, please email


StudySoup has more than 1 million course-specific study resources to help students study smarter. If you’re having trouble finding what you’re looking for, our customer support team can help you find what you need! Feel free to contact them here:

Recurring Subscriptions: If you have canceled your recurring subscription on the day of renewal and have not downloaded any documents, you may request a refund by submitting an email to

Satisfaction Guarantee: If you’re not satisfied with your subscription, you can contact us for further help. Contact must be made within 3 business days of your subscription purchase and your refund request will be subject for review.

Please Note: Refunds can never be provided more than 30 days after the initial purchase date regardless of your activity on the site.