Class Note for EECS 700 at KU
Optimized Generation of Datapath from C Codes for FPGAs

Zhi Guo, Betul Buyukkurt, Walid Najjar
University of California, Riverside
{zguo, abuyukku, najjar}@cs.ucr.edu

Kees Vissers
Xilinx Corp.
kees.vissers@xilinx.com

Abstract

FPGAs, as computing devices, offer significant speedup over microprocessors. Furthermore, their configurability offers an advantage over traditional ASICs. However, they do not yet enjoy the high-level language programmability that microprocessors do. This has become the main obstacle to their wider acceptance by application designers. ROCCC is a compiler designed to generate circuits from C source code to execute on FPGAs, more specifically on CSoCs. It generates RTL-level HDL from frequently executing kernels in an application. In this paper we describe ROCCC's system overview and focus on its data path generation. We compare the performance of ROCCC-generated VHDL code with that of Xilinx IPs. The synthesis results show that the ROCCC-generated circuit takes around 2x~3x the area and runs at a comparable clock rate.

1. Introduction

Continued increases in integrated circuit chip capacity have led to the recent introduction of the Configurable System-on-a-Chip (CSoC), which has one or more microprocessors integrated with a field-programmable gate array (FPGA), as well as memory blocks, on a single chip. In these platforms, both the FPGA fabric and the embedded microprocessors are essentially programmed using software. The earliest example is the Triscend E5, followed by the Triscend A7 [1], the Altera Excalibur [2] and the Xilinx Virtex-II Pro [3]. The capabilities of these platforms span a wide range, with the Triscend A7 at the low end and the Xilinx Virtex-II Pro 2VP125 at the high end. These computing devices have the flexibility of software and have been shown to achieve very large speedups, ranging from 10x to 100x over microprocessors, for a variety of applications including image and signal processing [4,5,6]. Such speedups come from large-scale parallelism made possible by high-capacity FPGAs, as well as from customized circuit design. The main problem
standing in the way of wider acceptance of CSoC platforms is their programmability. Application developers must have extensive hardware expertise, in addition to their application-area expertise, to develop efficient designs. Presently, most CSoCs are programmed manually; the main drawback of this approach is that it is very labor intensive and requires long design times. Some commercial efforts in programming FPGAs have been proposed by companies such as Synopsys [7] and Tensilica [8]. Their focus is on moving simple loops to hardware or on instruction-set extension.

Optimizing compilers for traditional processors have benefited from several decades of extensive research that has led to extremely powerful tools. Similarly, electronic design automation (EDA) tools have also benefited from several decades of research and development, leading to powerful tools that can translate VHDL and Verilog code, and recently SystemC [9] code, into relatively efficient circuits. However, little work has been done to combine these two approaches; in other words, work is still needed to compile a high-level language program, based on C/Java, with software-level optimizations, with the intent of generating a hardware circuit. Obviously, it is neither practical nor desirable to translate the whole program into hardware. It is therefore imperative to provide the programmer with tools that would help in identifying which code segments ought to be mapped to hardware, as well as the cost and benefit trade-offs implied.

Compiling to CSoCs, and to FPGAs in general, is challenging. Traditional CPUs, including VLIW, have a fixed hardware platform; their architectural features may or may not be exposed to the compiler. FPGAs, on the other hand, are completely amorphous. The task of an FPGA compiler is to generate both the hardware data path and the sequence of operations (control flow). This lack of architectural structure, however, presents a number of advantages: (1) The parallelism is very
high and limited only by the size of the FPGA device or by the data memory bandwidth. (2) On-chip storage can be configured at will: registers are created by the compiler and distributed throughout the data path where needed, thereby increasing data reuse and reducing recomputations or accesses to memory. (3) Circuit customization: the data path and sequence controller are tailored to the specific computation being mapped to hardware; examples include customized data bit-widths and pipelining.

The objective of the ROCCC (Riverside Optimizing Configurable Computing Compiler) project is to design a high-level language compiler targeting CSoCs. It takes high-level code such as C or FORTRAN as input and generates RTL VHDL code for the FPGA and C code for the CPU. In this paper we describe the overall structure of the compiler and emphasize the data path generation component. We compare the clock speed and area of automatically generated circuits to a number of IP codes available on the Xilinx web site. The results show that the speed is within 10% while the area is larger by a factor of 2 to 3. The work in [25,26] compared generated code with hand-written VHDL; both showed a factor-of-2 decrease in the performance of the generated code, in area and clock rate. ROCCC is built upon the knowledge acquired from SA-C and Streams-C. We experimentally show that the resultant VHDL is much closer to the hand-written one.

The rest of this paper is organized as follows. The ROCCC compiler is introduced in section 2. Related work is discussed in section 3. Section 4 presents the ROCCC compiler's RTL code generation for the controller, the buffer and the data path. Experimental results are reported in section 5. Section 6 concludes the paper.

2. ROCCC System Overview

Figure 1 shows the overview of the ROCCC compiler. The profiling tool set has been described in a prior publication [10]. It identifies the frequently executing code kernels in a given application. ROCCC's objective is to compile these kernels to HDL code, which is
synthesized using commercial tools. The ROCCC system is built upon the SUIF [11] and Machine-SUIF [12] platforms. The SUIF intermediate representation provides abundant information about loop statements and array accesses, and ROCCC performs loop-level optimizations on the SUIF IR. Loop unrolling for FPGAs requires compile-time area estimation; the work reported in [13] shows that compile-time area estimation can be achieved in less than one millisecond and within 5% accuracy. Information to generate high-level units such as controllers and buffers is also extracted from the SUIF IR. Machine-SUIF analysis and optimization passes, such as the Control Flow Graph (CFG) library [14], the Data Flow Analysis library [15] and the Static Single Assignment library [16], are used to generate the data path.

[Figure 1: ROCCC System Overview]

ROCCC's conventional optimizations include constant folding, loop unrolling, etc. Full loop unrolling converts a for-loop with constant bounds into a non-iterative block of code and therefore eliminates the loop controller. In addition to these conventional loop-level optimizations, ROCCC performs FPGA-specific optimizations such as loop strip-mining, loop fusion, etc. At the storage level and circuit level, ROCCC's optimizations are closely related to HDL code generation and are discussed in section 4. The restrictions on the C code that can be accepted by the ROCCC compiler for mapping onto an FPGA fabric include no recursion and no usage of pointers that cannot be statically unaliased. Function calls are either inlined or, whenever feasible, made into lookup tables.

3. Related Work

Many projects employing various approaches have worked on translating high-level languages into hardware. SystemC [20] is designed to provide roughly the same expressive functionality as VHDL or Verilog and is suitable for designing software/hardware synchronized systems. Handel-C [21], as a low-level hardware/software construction language
with C syntax, supports behavioral descriptions and uses a CSP-style (Communicating Sequential Processes) communication model. SA-C [22] is a single-assignment, high-level, synthesizable language. Because of special constructs specific to SA-C, such as window constructs, and because of its functional nature, its compiler can easily exploit data reuse for window operations. SA-C uses pre-existing parameterized VHDL library routines to perform code generation, in a way that requires a number of control signals between components and thereby involves extra clock cycles and delay. Our compiler avoids spending clock cycles on handshaking by relying more on compile-time analysis. It takes a subset of C as input and does not involve any non-C syntax.

Streams-C [23] relies on the CSP model for communication between processes, both hardware and software. Streams-C can meet relatively high-density control requirements. However, it does not support accesses to two-dimensional arrays, and therefore image processing applications, including video processing, must be mapped manually. This makes it very awkward to efficiently support algorithms that rely on sliding windows. Even for one-dimensional input data vectors, such as a one-dimensional FIR filter, Streams-C programmers need to write the data reuse manually in the input C code in order to make sure that a data value is retrieved only once from external memory. SPARK [24] is another C-to-VHDL compiler. Its transformations include loop unrolling, common sub-expression elimination, copy propagation, dead code elimination, loop-invariant code motion, etc. SPARK does not support multi-dimensional array accesses.

4. The ROCCC Compiler

ROCCC targets high computational density, low control density applications. Figure 2 shows the execution model. An engine moves the data from off-chip memory to BRAM storage. The compiler-generated circuit accesses the arrays in BRAM and stores the output data into another BRAM, from which an engine retrieves the data into off-chip memory. Inside the compiler-generated circuit, the data path is fully
pipelined. The controllers and buffers are in charge of feeding input data to, and retrieving output data from, the data path.

[Figure 2: The Execution Model (off-chip memory, controller, block RAMs, smart buffers and data path)]

ROCCC's scalar replacement transformation converts, for instance, the segment in Figure 3(a) into the segment in Figure 3(b). We can see that scalar replacement isolates memory access from calculation. The highlighted region of code is exported in the form of Figure 3(c) and goes to the data path generator. At the same time, the loop statement and the memory load/store code are used to generate the controllers and buffers. The controllers include address generators, which export a series of memory addresses according to the memory access pattern, and a higher-level controller, which controls the address generators. They are all implemented as pre-existing parameterized FSMs (finite state machines) in a VHDL library.

    (a) A 5-tap FIR in original C code:

        for (i = 0; i < N; i++)
            C[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2] + 9*A[i+3] + A[i+4];

    (b) The FIR after scalar replacement:

        for (i = 0; i < N; i++) {
            A0 = A[i]; A1 = A[i+1]; A2 = A[i+2]; A3 = A[i+3]; A4 = A[i+4];
            Tmp0 = 3*A0 + 5*A1 + 7*A2 + 9*A3 + A4;
            C[i] = Tmp0;
        }

    (c) The FIR C code fed into the data path generator:

        void main_dp(int A0, int A1, int A2, int A3, int A4, int* Tmp0) {
            *Tmp0 = 3*A0 + 5*A1 + 7*A2 + 9*A3 + A4;
            return;
        }

    Figure 3: A 5-tap FIR in C

One of the major reasons that account for an FPGA's speedup over a general-purpose processor is that the FPGA is capable of providing an optimized I/O interface between the data path and the memory units [17]. For example, each iteration of the for-loop in Figure 3(a) is essentially an operator on a window of five consecutive array elements. The window slides over the array; two adjacent windows have four input data in common and only one new input datum per window iteration.
ROCCC, as a high-level synthesis compiler, uses its knowledge of the memory access pattern in the input code, such as the code shown in Figure 3(b), to automatically generate an intelligent buffer, called a smart buffer, based on the bus size, window size, data size and sliding-window stride. This buffer unit is able to reuse live input data, discard unused data, and export the current valid input data set (the 5-data window in Figure 3(b)) to the data path [18].

4.2 Data Path Generation

Before building the data path, a few preparation passes are run at both the front-end and the back-end. Then ROCCC's back-end passes perform the analysis, optimization and data path generation.

4.2.1 Preparation Passes

ROCCC uses the Machine-SUIF virtual machine (SUIFvm) [19] intermediate representation as the back-end IR. The original SUIFvm assembly-like instructions by themselves cannot completely cover HDLs' hardware description functionality. On the other hand, front-end analysis may assist and simplify the data path generation at the back-end. Besides back-end data flow analysis, ROCCC performs high-level data flow analysis at the front-end, and the analysis information is transferred through predefined macros to assist back-end hardware generation. Figure 4(b) shows an accumulator after applying scalar replacement in C. The variable sum is detected as a feedback signal. Figure 4(c) shows the resultant segment in C, in which the macros ROCCC_load_prev() and ROCCC_store2next() annotate the signal feedback.

After applying scalar replacement and front-end data flow analysis, the function that describes the scalar computation, like the code shown in Figure 3(c) or Figure 4(c), is fed into Machine-SUIF. ROCCC performs circuit-level optimizations and eventually generates the data path on a modified version of the Machine-SUIF virtual machine (SUIFvm) [19] intermediate representation. Before being fed to ROCCC's passes, the virtual machine IR first undergoes the Machine-SUIF Static Single Assignment and Control Flow Graph transformations. At this point, control flow graph information is visible and every virtual register is assigned only once. The preserved macros are converted into ROCCC-specific opcodes.
For example, ROCCC_load_prev() and ROCCC_store2next() in Figure 4(c) are converted into instructions with the opcodes LPR (load previous) and SAM (store next), respectively. We are working on supporting bit-manipulation macros, which high-level languages lack.

    (a) An accumulator in original C code:

        int sum = 0;
        for (i = 0; i < 32; i++)
            sum = sum + A[i];

    (b) The accumulator after scalar replacement:

        int sum = 0;
        for (i = 0; i < 32; i++) {
            main_Tmp0 = A[i];
            sum = sum + main_Tmp0;
        }

    (c) The C code fed into the data path generator after detecting the
        feedback variable and adding preserved macros:

        void main_dp(int main_Tmp0, int* main_Tmp1) {
            int main_dp_Tmp2;
            main_dp_Tmp2 = ROCCC_load_prev(sum) + main_Tmp0;
            ROCCC_store2next(sum, main_dp_Tmp2);
            *main_Tmp1 = sum;
        }

    Figure 4: An Accumulator in C

4.2.2 Data Path Building

Each instruction that goes to hardware is assigned a location in the data path. We add new fields into the Machine-SUIF IR to record the location of each arithmetic, logic or register-copying instruction. For example, Figure 6 shows the data path for the C code listed in Figure 5. We maximize instruction-level parallelism. All the input and output operands are copied to the entry or exit of the data flow, respectively. A virtual register's definition and reference should be adjoining in the data flow; if not, extra register-copying instructions are added to satisfy this. The compiler first builds a data path for each non-null node in the CFG, as nodes 1 through 4 shown in Figure 6. To parallelize alternative branches, the compiler adds a new mux node between alternative branch nodes and their common successor node, for instance node 7 in Figure 6. A new pipe node (node 6 in Figure 6, for instance) is added to copy live variables from the alternative branches' parent node to their common successor node.

        void if_else(int x1, int x2, int* x3, int* x4) {
            int a, c;
            c = x1 + x2;          // node 1
            if (c < x2)
                a = x1 * x1;      // node 2
            else
                a = x1 + x2 + 3;  // node 3
            c = c + a;            // node 4
            *x3 = c;
            *x4 = a;
            return;
        }

    Figure 5: An Alternative Branch in C. The pointers are only used to
    indicate multiple return values; ROCCC does not support pointers.

[Figure 6: The Alternative Branch Data Path (soft nodes, hard nodes, node numbers and latches)]
In Figure 6, nodes 6 and 7 are called hard nodes, since they only appear in hardware and have no equivalent in software. Nodes 1 through 4 are correspondingly called soft nodes. Notice that if we only consider the soft nodes, vr11 in node 4 is vr11 in node 1, and the same holds for vr13. Therefore, the soft nodes by themselves have the same behavior on a CPU as the whole data path has on an FPGA.

4.2.3 Data Path Pipelining

ROCCC automatically places latches in a data path to pipeline it. The latch location in a node is decided based on the delay estimation of the instructions, which is beyond this paper's scope. The latch location also satisfies special opcodes' requirements; for example, a SAM instruction must have a latch to store the feedback signal for the corresponding LPR instruction. Figure 7 shows the data path of Figure 4(c). After data path pipelining, each pipeline stage is an instance of a single iteration of the for-loop body.

[Figure 7: The Accumulator Data Path]

4.2.4 VHDL Code Generation

ROCCC generates one VHDL component for each CFG node that goes to hardware. In a node, every virtual register is singly assigned and is converted into wires in hardware. All arithmetic opcodes in SUIFvm have corresponding functionality in IEEE 1076.3 VHDL, with the exception of division. Arithmetic, logic and copying instructions become combinational or sequential VHDL statements according to whether the instruction is latched or not.

    Table 1: A comparison of hardware performance of Xilinx IPs and
    ROCCC-generated VHDL code. (The wavelet engine is not from the
    Xilinx IP library; its reference version is hand-written VHDL.)

                      Xilinx IP               ROCCC
    Example        Clock(MHz) Area(slices)  Clock(MHz) Area(slices)  Clock(%) Area(%)
    bit_correlator    212         9            144         19          67.9     211
    mul_acc           238        18            238         59          100      328
    udiv              216       144            272        495          126      344
    square_root       167       585            220       1199          132      205
    cos               170       150            170        150          100      100
    arbitrary LUT     170       549            170        549          100      100
    FIR               185       270            194        293          105      109
    DCT               181       412            133        724          73.5     176
    wavelet           104      1464            101       2415          97.1     165
A LUT instruction invokes an instantiation of a lookup table component. If the lookup table is a pre-existing one, such as a cos lookup table, the compiler automatically includes the existing component. Otherwise, for example if a user wants a probability-distribution-function lookup table, the compiler instantiates the lookup table as a regular ROM IP core unit in the VHDL code; the only thing the user needs to do is edit a plain-text initialization file that defines the lookup table's contents. By adding more data types into Machine-SUIF, ROCCC supports any signed and unsigned integer type up to 32 bits. The compiler infers the inner signals' bit sizes automatically.

5. Experimental Results

We compare the hardware performance of Xilinx IP cores and ROCCC-generated VHDL code. We use Xilinx ISE 5.1i and IP core 5.1i. All the Xilinx IP cores and ROCCC-generated VHDL code are synthesized targeting a Xilinx Virtex-II xc2v2000-5 FPGA. All the benchmarks in Table 1 are picked from the Xilinx IP core library except the wavelet engine. The input and output variables of the ROCCC equivalents have the same bit sizes as those of the IP cores. Bit_correlator counts the number of bits of an 8-bit input datum that are the same as those of a constant mask. Mul_acc is a multiplier-accumulator whose input variables are a pair of 12-bit data. Udiv is an 8-bit unsigned divider. Square_root calculates a 24-bit datum's square root. Cos's input is 10-bit and its output 16-bit. The arbitrary LUT has the same port sizes as the cos core. FIR is two 5-tap 8-bit constant-coefficient finite impulse response filters whose bus sizes are 16-bit. DCT is a one-dimensional 8-point discrete cosine transform; the input data size and output data size are 8-bit and 19-bit, respectively. For the Xilinx IP FIR and DCT, the multiplications with constants are implemented using the distributed arithmetic technique, which performs multiplication with lookup-table-based schemes. Therefore we set the synthesis option "multiplier style" to LUT for the ROCCC-generated DCT and FIR.
The second and third columns of Table 1 show the Xilinx IP cores' clock rate and device utilization, and the fourth and fifth columns show ROCCC's corresponding performance. Clock(%) is the clock rate of the ROCCC-generated VHDL as a percentage of the Xilinx IP's clock rate; Area(%) is the corresponding percentage for area. Bit_correlator, udiv and square_root consist of a number of bit manipulations. C input, as high-level code, is not good at describing bit operations, and this is one of the major causes of the performance difference. The Xilinx mul_acc IP has a control input signal nd (new data), whose Boolean value true indicates that the present data is valid. In C code we describe the equivalent behavior using an if-else statement whose condition evaluates the Boolean input nd; thus, extra nodes and latches are added to support the alternative branch, and they take extra area. We previously converted this C code by multiplying nd with the new input data instead of using the if-else statement; though one more multiplier was used, the overall area and clock rate were better than those listed in Table 1. Obviously this is not a compiler-level optimization, but it shows one of high-level synthesis's advantages: the ease of doing algorithm-level optimizations.

In terms of lookup tables, ROCCC-generated VHDL code instantiates Xilinx IP cores; therefore they have exactly the same performance. In Virtex-II, the Xilinx 10-bit-input, 16-bit-output cos/sin lookup table stores only a half wave, which is one of the reasons this cos/sin lookup table utilizes less area than the arbitrary ROM lookup table with the same port sizes.

FIR operates on an array: basically, a 5-data window slides over the one-dimensional array, and ROCCC generates a smart buffer to reuse the previously fetched input data. The FIR's data path consists of multipliers and adders/subtracters, and has no branch. ROCCC fits this type of algorithm and gets performance comparable to the IP cores.
Like FIR, DCT has high computational density and no branch. The throughput of the Xilinx DCT IP is one output datum per clock cycle, while ROCCC's throughput is eight output data per clock cycle. Therefore, though the ROCCC-generated DCT runs at a lower clock rate (73.5%), the overall throughput of the ROCCC-generated circuit is higher. Both the ROCCC DCT and the Xilinx IP DCT exploit the symmetry of the cosine coefficients. The last row in Table 1 shows an implementation of a two-dimensional 5/3 wavelet transform engine, which is the standard lossless JPEG2000 compression transform. This wavelet transform engine includes the address generators, smart buffer and data path. The ROCCC-generated circuit is compared with a hand-written one. We derive bit widths only based on port sizes and opcodes; more aggressive bit narrowing performed by users and/or the compiler may reduce device utilization.

6. Conclusion

The reconfigurable computing paradigm is a powerful computing model that has a lot of potential for long-running or streaming applications that are somewhat regular in nature. The main obstacle to its use is its programmability. Hand-writing HDL code for large-scale applications is not the most desirable approach, yet automatic compiler generation of HDL code from high-level languages is very challenging. The ROCCC compiler generates VHDL for reconfigurable computing from high-level languages such as C or Fortran. ROCCC performs loop-level, storage-level and circuit-level optimizations. In this paper we have mainly presented its data path generation. At the front-end, the compiler performs high-level data flow analysis and transfers the analysis information through preserved macros. At the back-end, the compiler exploits low-level parallelism, pipelines the data path and narrows inner signals' bit sizes. ROCCC supports lookup tables by automatically instantiating pre-existing lookup table IPs or ROM IPs. We compared the performance of ROCCC-generated VHDL code with that of Xilinx IPs. The synthesis results show that the ROCCC-generated circuit takes around 2x~3x the area and runs at a
comparable clock rate. ROCCC performs better on high-computational-density examples than on high-control-density ones.

7. References

[1] Triscend Corporation, "Triscend A7 Configurable System-on-Chip Family," http://www.triscend.com/products/a7.htm, 2004.
[2] Altera Corp., "Excalibur System-on-a-Programmable-Chip," http://www.altera.com, 2004.
[3] Xilinx Corp., "IBM and Xilinx Team," http://www.xilinx.com/prs_rls/ibmpartner.htm, 2004.
[4] W. Chen, P. Kosmas, M. Leeser, C. Rappaport, "An FPGA Implementation of the Two-Dimensional Finite-Difference Time-Domain (FDTD) Algorithm," Int. Symp. on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 2004.
[5] J. Keane, C. Bradley, C. Ebeling, "A Compiled Accelerator for Biological Cell Signaling Simulations," Int. Symp. on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 2004.
[6] Berkeley Design Technology, Inc., http://www.bdti.com/articles/info_eet0207fpga.htm, 2004.
[7] Synopsys Inc., http://www.synopsys.com, 2004.
[8] Tensilica, http://www.tensilica.com, 2004.
[9] SystemC Consortium, http://www.systemc.org, 2004.
[10] D. C. Suresh, W. A. Najjar, J. Villareal, G. Stitt and F. Vahid, "Profiling Tools for Hardware/Software Partitioning of Embedded Applications," Proc. ACM Symp. on Languages, Compilers and Tools for Embedded Systems (LCTES 2003), San Diego, CA, June 2003.
[11] SUIF Compiler System, http://suif.stanford.edu, 2004.
[12] Machine-SUIF, http://www.eecs.harvard.edu/hube/research/machsuif.html, 2004.
[13] D. Kulkarni, W. Najjar, R. Rinker and F. Kurdahi, "Fast Area Estimation to Support Compiler Optimizations in FPGA-based Reconfigurable Systems," IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), Napa, CA, April 2002.
[14] G. Holloway and M. D. Smith, "Machine SUIF Control Flow Graph Library," Division of Engineering and Applied Sciences, Harvard University, 2002.
[15] G. Holloway and A. Dimock, "The Machine SUIF Bit-Vector Data-Flow-Analysis Library," Division of Engineering and Applied Sciences, Harvard University, 2002.
[16] G. Holloway, "The Machine-SUIF Static Single Assignment Library," Division of Engineering and Applied Sciences, Harvard University, 2002.
[17] Z. Guo, W. Najjar, F. Vahid and K. Vissers, "A Quantitative Analysis of the Speedup Factors of FPGAs over Processors," Int. Symp. on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 2004.
[18] Z. Guo, B. Buyukkurt, W. Najjar, "Input Data Reuse in Compiling Window Operations onto Reconfigurable Hardware," Proc. ACM Symp. on Languages, Compilers and Tools for Embedded Systems (LCTES), Washington, DC, June 2004.
[19] G. Holloway and M. D. Smith, "Machine-SUIF SUIFvm Library," Division of Engineering and Applied Sciences, Harvard University, 2002.
[20] SystemC Consortium, http://www.systemc.org, 2004.
[21] "Handel-C Language Overview," Celoxica Inc., http://www.celoxica.com, 2004.
[22] W. Najjar, W. Böhm, B. Draper, J. Hammes, R. Rinker, R. Beveridge, M. Chawathe and C. Ross, "From Algorithms to Hardware: A High-Level Language Abstraction for Reconfigurable Computing," IEEE Computer, August 2003.
[23] M. B. Gokhale, J. M. Stone, J. Arnold and M. Kalinowski, "Stream-Oriented FPGA Computing in the Streams-C High-Level Language," IEEE Symp. on FPGAs for Custom Computing Machines (FCCM), 2000.
[24] SPARK project, http://mesl.ucsd.edu/spark, 2004.
[25] J. Frigo, M. Gokhale and D. Lavenier, "Evaluation of the Streams-C C-to-FPGA Compiler: An Applications Perspective," Ninth ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays (FPGA), Monterey, CA, 2001.
[26] Z. Guo, D. C. Suresh, W. A. Najjar, "Programmability and Efficiency in Reconfigurable Computer Systems," Workshop on Software Support for Reconfigurable Systems, held in conjunction with the Int. Conf. on High-Performance Computer Architecture (HPCA), Anaheim, CA, February 2003.