Hands-on Computer Architecture: Teaching Processor and Integrated Systems Design with FPGAs

Jan Gray
Gray Research LLC, P.O. Box 6156, Bellevue, WA 98008
jsgray@acm.org

Abstract: Field programmable gate arrays are an ideal substrate for computer architecture project courses. FPGA-based processor development offers some learning opportunities that pure simulation approaches cannot rival. This paper first introduces the XSOC Project, a free kit that includes the xr16 RISC CPU core, system-on-a-chip infrastructure, peripheral cores, C compiler, and simulator. Such kits have numerous applications in a first architecture course. It then suggests that FPGA-based processors can also be applied to the study of advanced architecture topics, including memory systems, multithreading, LIWs, chip multiprocessors, and architectural support for programming languages and networking.

I. Introduction

FPGAs are ideal for teaching hands-on digital design and computer design. It is now possible to achieve an affordable FPGA-based computer design kit that is simple enough to be understood end-to-end and rich enough to demonstrate whole computer systems. This paper presents one such kit and then examines how FPGA processors and systems could be applied in teaching undergraduate computer architecture. The paper concludes with speculation on how FPGA CPUs might also be applied in graduate-level advanced architecture studies and research.

II. The XSOC Project

To promote do-it-yourself processor development, the author posted numerous Usenet postings [1] and a web site [2] on FPGA processor design, but surprisingly few FPGA-based CPUs emerged over the years [3]. Perhaps there were too many barriers to success: lack of published reference designs, expensive FPGA tools, and lack of compiler support. To help remove these barriers, the author prepared how-to articles [4] and the XSOC Project Kit [5] to provide a concrete end-to-end example of a practical FPGA-based processor SOC. The goals of this kit are to promote processor and integrated systems
design in FPGAs; to show that FPGA-based SOCs can be cost-effective alternatives to ASICs; to help establish a community of designers and a library of reusable cores; and especially to make it practical for students and hobbyists to design and build their own custom processors and systems.

The kit depends upon the recent emergence of low-cost FPGA development tools. XSOC uses Xilinx Student Edition 1.5 (XSE, approximately $100), which includes the Xilinx Foundation Express tools, including schematic capture, HDL synthesis, simulator, and FPGA place-and-route development tools. XSE also includes a textbook [6] that gently introduces digital design concepts through FPGA lab exercises, culminating in the design of simple 4- and 8-bit processors. The exercises can be performed on the XESS XS40 prototyping board, which includes an XC4005XL or XC4010XL FPGA, a 100 MHz programmable oscillator, 32/128 KB of RAM, an 8031 MCU, parallel port, VGA port, and keyboard/mouse port, as well as full documentation, tools to download FPGA designs and memory images to the board, and a thriving support mailing list.

Building on this foundation, the XSOC Kit includes design files (both schematics and synthesizable Verilog) of the XSOC system-on-a-chip: the xr16 pipelined RISC processor core, on-chip bus and off-chip memory controller, and peripheral cores including on-chip RAM, parallel port, and a bilevel VGA controller. It also includes a port of the lcc retargetable C compiler, an assembler and instruction set simulator, and documentation and specifications. Full sources are included, and XSOC may be used without charge for non-commercial educational and research purposes.

Most soft CPU cores are synthesized implementations of legacy instruction sets that fill large, expensive FPGAs and have slow cycle times. In contrast, the xr16 processor demonstrates that a simple, thrifty CPU design can achieve a cost-effective integrated computer system in a tiny FPGA. The XSOC system-on-a-chip, excluding RAM/ROM, fits in a 392 logic cell XC4005XL; the processor itself is about
270 logic cells. (A logic cell is an FPGA unit of area that includes one 4-input lookup table (LUT) and one D flip-flop.) Put another way, the entire xr16 core occupies less than 1/2 of 1% of the area of the largest Xilinx Virtex-E device (XCV3200E, 73008 logic cells). It has a cycle time of approximately 30 ns in Xilinx Spartan-XL devices, which should approach 10 ns in newer Xilinx Virtex FPGA devices.

III. xr16 Processor

The goal of the xr16 processor design was to demonstrate a full-featured pipelined RISC processor that runs integer C code and fits in an XC4005XL. Area and performance were important, but so were simplicity and ease of understanding.

The xr16 processor is a classic RISC processor with a 16-bit instruction word, sixteen 16-bit registers, byte and word load/store, and add-with-carry support to form long integers from register pairs. A stretch version, xr32, with 32-bit registers, is forthcoming.

A. Instruction set design

The xr16 instruction set was designed as follows. First, the lcc retargetable C compiler was ported to a generic 16-bit RISC. Then a handful of sample C programs were compiled and histograms of instruction frequencies were studied. Next, it was determined which instructions could be synthesized from others. Finally, a strategy for encoding word-wide immediate constants was selected. The result:

• only add, sub, and addi are 3-operand;
• r0 always reads as 0;
• 4-bit immediate fields;
• for 16-bit constants, a prefix imm establishes the most-significant 12 bits of the immediate operand of the instruction that follows;
• an interlocked compare and conditional branch sequence instead of architected condition codes;
• jal (jump-and-link) jumps to an effective address, saving the return address in a register;
• call func compactly encodes jal r15,func;
• mul, div, and variable-bit shifts are performed in software.

The xr16 processor has six instruction formats and 43 instructions. (xr32 adds 2 additional instructions: load/store longword.)

[Table 1: XR Instruction Formats, with bit-field columns 15-12, 11-8, 7-4, 3-0]

slli, slxi, srai, srli, srxi rd,imm; blt, bge, ble, bgt, bltu, bgeu,
bleu, bgtu label; reserved

[Table 2: XR Instruction Set (surviving entries: the shift and branch mnemonics listed above)]

Some instructions are synthesized from others.

[Table 3: Synthesized Instructions (surviving fragments: xori rd,0x80; subi 0x80; load byte)]

To keep things simple, there are no branch delay slots. The architecture also reflects a streamlined implementation. One shared memory port means load/store instructions take two cycles. To save an adder and a mux, jumps and taken branches take three cycles. And to save another mux, result forwarding is performed on only the first register operand. The assembler handles other cases.

B. Implementation

The FPGA: XSOC/xr16 is implemented in a Xilinx XC4005XL-PC84C-3. This device has a 14x14 array of configurable logic blocks (CLBs) and 61 I/O blocks (IOBs) in a sea of programmable interconnect.

Every CLB has two 4-input lookup tables (LUTs) and two flip-flops. Each LUT can implement any logic function of 4 inputs, or a 16x1-bit synchronous static RAM or ROM. Each CLB also has carry logic to help build fast, compact ripple-carry adders.

Each IOB offers input and output buffers and flip-flops. The output buffer can be 3-stated for bidirectional I/O.

The programmable interconnect routes CLB/IOB output signals to other CLB/IOB inputs. It also provides wide-fanout, low-skew clock lines and horizontal long lines, which can be driven by 3-state buffers at each CLB.

The XC4000XL is ideal for implementing processors. Just 8 CLBs can build a single-port 16x16-bit register file (using LUTs as SRAM), a 16-bit adder/subtractor (using carry logic), or logic unit. Since each LUT has a flip-flop, the device is register rich, enabling a pipelined implementation style; and as each flip-flop has a dedicated clock enable input, it is easy to stall the pipeline. Long lines and 3-state drivers form efficient word-wide result multiplexers and on-chip buses.

Pipeline design: The xr16 has a 3-stage pipeline:

• IF: instruction fetch;
• DC: decode and operand fetch;
• EX: execute and write back results.

Each pipeline stage incurs an instruction fetch memory access, an optional load/store access, and optional DMA
accesses. Each memory access can take one or more memory cycles: if the memory RDY signal is not asserted, the pipeline does not advance.

For pipeline data hazards, the xr16 includes a result forwarding mux on the A operand only. Conditional branches and jumps take place during the EX pipeline stage; the IF and DC stage instructions in progress are annulled.

Control unit design: The control unit inputs are the next instruction (INSN) and RDY signals from memory, and the Z, N, CO, and V outputs from the datapath. The control unit outputs include the next memory access control signals and the datapath control signals.

As instructions flow through the instruction register (IR) pipeline, they are decoded in the DC stage, and the control unit drives the register file and operand selection control signals. If the DC stage instruction is a conditional branch, the EX stage instruction must be an add/sub, and its condition code outputs are evaluated against the branch condition. If the branch is taken, the branch displacement is added to PC in the EX stage. On jumps and taken branches, the control unit FSM annuls the two instructions in the branch shadow.

In the EX stage, the control unit drives ALU, result mux, and address/PC unit control outputs.

On interrupts, the control unit replaces the fetched instruction with jal r14,10(r0), which calls the interrupt handler. Interrupt return is jal r0,0(r14).

Datapath design: The datapath executes instructions at up to 1 IPC. It consists of a register file, operand selection multiplexers, ALU, result multiplexer, and an address/PC unit.

The 2R-1W register file is implemented as two copies of a 1R-1W file, each a 16x16 SRAM (16 LUTs). The two register operands are read in the first half of each cycle, and the EX stage result is written back into both copies in the second half.

The A operand is either the register file port A output or the forwarded result value, selected by a multiplexer. The B operand is either the register file port B output or a sign/zero-extended immediate value formed from the IR's
imm and/or imm12 fields. The A and B operand muxes and registers are each a column of 16 LUTs and flip-flops.

The ALU consists of a 16-bit adder/subtractor (20 LUTs) and a 16-bit logic unit (16 LUTs). The shifter requires no logic: it merely skews the A operand register left or right by one bit.

The result multiplexer selects an EX stage result value from the adder, logic unit, shifters, return address, or a word or zero-extended byte load result. It implements a 7-input, 16-bit-wide mux using long lines and 3-state buffers, conserving precious logic.

The address/PC unit adds either 2 or the branch displacement to PC (8+16 LUTs). The next address is either the next PC or the effective address computed in the ALU, as selected by ADDRMUX (16 LUTs). The processor also acts as the DMA engine. Instead of a simple register, PC is a 16x16 register file, with PC0 storing the program counter and PC1 through PC15 storing DMA address counters.

Figure 1 is the XSOC system and xr16 processor top-level schematic, with processor P, memory and on-chip bus controller MEMCTRL, on-chip bus, and peripherals PARIN, PAROUT, IRAM, and the VGA controller. The on-chip data bus uses an abstract peripheral bus to provide glue-logic-free interfacing to peripheral cores. This abstraction also makes it possible to evolve the on-chip bus protocol without impacting existing systems and peripheral cores.

[Figure 1: XSOC/xr16 Schematic]

The processor schematic (not shown) simply interconnects an instance of the control unit (Figure 2) and the datapath (Figure 3).

The design is synchronous, and it is safe to stop the clock. During system bring-up, the XS40 prototype board was attached to a PC parallel port and was driven at 1
Hz using a shell script.

[Figure 2: XR16 CPU Control Unit Schematic. Copyright 2000 Gray Research.]

To minimize area and cycle time, the datapath is hand floorplanned using RLOC attributes (Figure 4). Datapath CLBs are white; other placed CLBs are light gray. A few critical paths are manually technology mapped using FMAPs. The design is placed and routed with timing constraints to further optimize critical paths.

Figure 5 shows the XSOC FPGA in the context of the XS40 board. The 8031 is not used and is held in reset.

[Figure 3: XR16 CPU Datapath Schematic. Copyright 2000 Gray Research.]

[Figure 4: XSOC/xr16 Floorplan]

[Figure 5: XSOC In Situ]

Figure 6 shows the VGA display while running a demo.

[Figure 6: XSOC Graphics Demo Display]

The 576x455 bilevel bitmapped VGA controller displays all 32 KB of RAM. The top lines of the screen are the demo program binary, followed by the display font tables. Below that are some two dozen lines of text output and some XOR line graphics. And below that lies the stack. It's fun to watch all of memory at 60 Hz. One can observe the top of the call stack moving up and down, the stack variables changing, and counters counting, and one is left with a visceral impression of the speed of the machine and where its program spends its time.

C. Development tools

The XSOC Project Kit includes source code (or references to same) and Win32 binaries for lcc-xr16, a port of the lcc retargetable C compiler [7] for xr16, and xr16, the xr16 assembler and instruction set simulator.

The first port of lcc to target xr16's 16-bit int and pointer model, initially based upon the MIPS machine description, took the author (a compiler developer) only one day. Further modifications to also target the
32-bit xr32 processor required just a few hours to revise the approximately 200 lines of xr32-specific instruction templates. These pleasing results are a testament to lcc's retargetability and reflect that the XR processors were designed as targets of this compiler.

The xr16 assembler/instruction set simulator is also straightforward. The assembler, about 1300 lines of code, can be run for the side effect of emitting a listing file and image file, or to initialize memory for the instruction set simulator. The latter is a simple switch-based interpreter, some 400 lines of code, which runs a perfectly adequate 3,000,000 instructions per second on a 266 MHz PC.

As mentioned, the kit includes full design sources in both schematics and Verilog source. The latter are compact enough to run the entire system test bench using the free (1000-line limit) Veriwell Verilog simulator. It is quite instructive to compare and contrast the instruction set simulator output to the Verilog simulator output, the latter highlighting pipeline stalls, annulled instructions, and so forth.

IV. Teaching Applications of FPGA CPUs

XSOC serves as a proof by example that an entire system-on-a-chip (sans RAM) can be built in a modest FPGA using inexpensive tools. This section explores the teaching value of such FPGA computer systems.

Why build an undergraduate architecture course around FPGA-based processors and systems? Because there is such value in the experience of building real hardware. Besides the emotional appeal of booting a computer made of your own ideas and your own hands (and how many educators have had that pleasure?), FPGA CPUs can impart a realism to the learning experience that is probably not available in more textbook- or simulator-based approaches.

So much of computer architecture is about making tradeoffs, such as performance versus area versus cycle time versus power. While there is much value in a course project to develop a processor model in an HDL and then study its behavior in a simulator, it doesn't go far
enough. It's like teaching how to balance a home budget, but with a bottomless checking account. By not closing the loop with some kind of realistic cycle time, area, and resource-usage data, the design tradeoffs aspect suffers. Of course, student projects could close the loop through the use of real EDA tools, but that would be unnecessarily expensive and complicated. In practice, student editions of FPGA EDA tools suffice, producing the desired timing analysis and resource usage reports.

In an implementation-oriented course, the tradeoffs are so much more quantifiable. Students can now experience, as did the author, that there is an area and delay cost to everything, even a multiplexer; how to find a critical path, and how retiming moves them about; how to trade off area for new functionality vs. area for reduced cycle times; the importance of floorplanning; and may even discover that adding something to a design can make it slower.

Of course, the lessons of FPGA implementation will not directly apply to custom silicon implementations. A student of FPGA CPUs might be surprised to learn that a 16x32-bit register file, 32-bit adder, or 32-bit 2:1 multiplexer, each 16 CLBs in an FPGA, vary widely in area in a full custom design. But the method of systematically evaluating design alternatives and tradeoffs is the same regardless of the implementation technology.

Since, compared to an FPGA, a full custom design offers perhaps 20 times more gates, each 10 times faster, it follows that designing processors in programmable logic is not unlike taking a time machine back through the last seven years of Moore's Law. 100 MHz pipelined scalar and 2-issue RISC processors are feasible, but 800 MHz associatively indexed out-of-order issue buffers are not.

Here are some other benefits of the realism imposed by a hardware-based course project:

• There is less hand-waving allowed: designs must be more thorough and complete, or the implementation tools will fail to compile them.
• Students learn the testing imperative: to the extent students
produce an untestable design or skimp on writing testbenches, they will learn their lesson in the lab, hunched over a hot oscilloscope.
• Students live the system bring-up experience, working through the adversity of a design that does not start or that fails intermittently, and ending with the sublime glow felt when the darn thing finally works.

Realism aside, FPGA CPUs have other benefits. The use of a custom teaching-oriented CPU, FPGA or otherwise, should permit course material on architecture and implementation to be streamlined as compared to a legacy instruction set architecture or even a subset. For example, the xr16 core is so simple that a student should be able to understand the purpose and function of every last gate.

Studying computer design with FPGAs also confers vocational training benefits. As FPGAs get faster, larger, and cheaper, and as the minimum volume for gate array starts continues to rise, FPGAs will increasingly displace gate arrays and even full custom implementations from many application areas. In time, the majority of digital systems designs could well be in programmable logic, and FPGA CPU cores could become as commonplace as are discrete embedded processors today. FPGA system design expertise should be a quite marketable skill.

V. FPGA CPU Project Ideas

Is it realistic to expect undergraduates, perhaps working in teams, to produce working FPGA CPUs and systems? Perhaps: it would not seem to be too difficult a stretch, at least for EE students who have already had a first course in digital design and exposure to HDLs. Of course, the specific course project can be tailored to the class level, class prerequisites, and to available teaching and lab resources.

Assume students receive a kit with infrastructure software, instruction set architecture and core interface specifications, compiler, assembler, instruction set simulator, test suites, system-on-chip cores, and in some cases the processor core itself. Here are some projects they might tackle:

• Implement a processor core for the
given ISA.
• Double its performance: evolve a non-pipelined core into a pipelined core.
• Add a cache, MMU, or exceptions.
• Given C code with a critical inner loop, build an on-chip coprocessor to speed it up. Or add new custom instructions to speed it up. (Don't forget to enhance the compiler, assembler, simulator, and test suite.)
• Port the design to a new FPGA device architecture and retime the pipeline.
• Build a system-on-a-chip for a particular embedded application.
• Add a new on-chip peripheral core: design the core, add it to the SOC, write the interrupt handler or device driver, add testing support to the test bench.
• Develop a test suite or test bench for the system or processor.
• Reimplement a subset of a famous legacy ISA.

Our favorite project idea simulates the competitive processor design industry. Student teams are issued a CPU design kit, including compiler tools, a working non-pipelined processor core, a benchmark suite, and an FPGA board, which runs out of the box, and are instructed to evolve and optimize their processor, including its instruction set architecture and tools, in order to run the benchmark suite as fast as possible, or in as little total energy as possible. At end of term, teams submit their designs and vie for the coveted fastest CPU design trophy. This sort of project could uniquely motivate students to practice all manner of quantitative analysis and design activities.

VI. FPGA CPUs for Advanced Computer Architecture Studies and Research

FPGA devices improve at a rapid pace. Comparing the Xilinx XC4013 (1152 logic cells) in 1993 with the XCV3200E (73008 logic cells) in 2000 reveals an improvement of 2^6 (about 64 times) in just seven years. Recent FPGAs from Xilinx and Altera, including Virtex-E and APEX, now provide vast quantities of programmable logic and some dozens of large (e.g. 256x16b) embedded RAM blocks. This section considers how these new devices enable direct prototyping of some more advanced areas of computer architecture research.

The xr16 core consumes 270 logic cells. The 32-bit xr32
core will be approximately 430 logic cells. Redesigning xr32 for Virtex and trading off some logic for speed (using dual-port RAM for the register files and replacing the result multiplexer TBUFs with actual LUT-based muxes), this could rise to 600-700 logic cells. Even so, such a streamlined 32-bit RISC would still use only 1% of the largest Virtex-E device and less than 5% of a midrange (15552 logic cell) XCV600E. Therefore FPGAs now have adequate capacity to prototype 8-32 way 32-bit chip multiprocessors.

Then there are the new embedded RAM blocks. The XCV600E provides 72 256x16 (or 512x8, etc.) dual-port synchronous SRAMs with sub-5 ns cycle times. Block RAM has many architectural applications [8], including:

• registers: vector register files, windowed register files, and multi-context register files;
• stacks: operands, activation records, control;
• on-chip RAM, ROM, and microcode control stores;
• caches: data, tags, write-accumulation structures, victim buffers;
• branch prediction: branch history tables and branch target address/instruction caches;
• MMUs: segmentation registers and translation lookaside buffers (no associative lookup, though);
• debug and tracing support: breakpoint code and data address, count, or value registers; branch traces, PC traces, memory access traces;

and systems applications, including:

• interconnects: on-chip packet/cell buffers, queues, and virtual channel buffers;
• graphics: video line input or output buffers or delay lines, texture caches, sprites, pattern generators, display lists, span buffers (color, Z), color mapping LUTs;
• garbage collection: read/write barriers via page table attribute bits or region table address checks; card marking bit array;
• multimedia: DCT and IDCT support (8x8 pixel blocks, coefficient tables), compression tables.

Indeed, there are enough embedded RAM blocks in that midrange XCV600E to prototype an 8 or 16 CPU chip multiprocessor, where each processor is multithreaded with an 8-context 32x32 register file (2 block RAMs) and with a 256x32b I-cache (2 block RAMs), that share one 1024x32b L2 cache (8
block RAMs).

FPGA I/O capabilities have also made great strides, with the newer devices supporting a smorgasbord of signaling standards, supporting interfaces such as 200 MHz ZBT SSRAM and 266 MHz DDR SDRAM, and providing inter-FPGA signaling rates of up to 300 Mbps/pin. It should be possible to prototype fast multi-banked memory systems and interconnect fabrics.

Returning to the subject of advanced architecture studies, again it would seem that there is some value in the additional grounding in reality inherent in a project-oriented course. Assume once again that students are issued an FPGA CPU and system-on-a-chip kit, including software tools and a toolkit of processor, system, interface, and peripheral core designs. Here are some of the topics students might investigate, analyze, simulate, and then prototype:

• Build a 2-3 operation long-instruction-word machine, including tools support. (Challenges here are mostly in the register file design and the code generator.)
• Build a 2-issue superscalar processor.
• Build a chip multiprocessor, including a suitable memory system.
• Build a multithreaded processor for a given workload.
• Build a fault-tolerant processor from several lockstep self-checking processor pairs.
• Add architectural support for non-pausing multithreaded garbage collection via hardware read and write barriers.
• Add architectural support for network routing, packet inspection, etc., possibly including integrated memory streaming or DMA instructions.
• Add architectural support for message passing between two processors on one or separate devices.
• Add architectural support for debugging or tracing.
• For a particular signal processing problem, add a fixed FPGA DSP datapath to the FPGA computer system.

Put another way, modern FPGAs make it possible to study-by-prototype just about anything that appears in a modern computer system. Indeed, except for large or content-associative or many-ported memories, there are few structures in the computer architect's toolbox that are not easily implemented in an FPGA.

VII.
Related Work

Several schools have used FPGA processor design projects to help teach computer design. At Virginia Tech in 1995, students of EE6504, Rapid Prototyping of Computing Machinery, designed 16-bit HOKIE RISC processors for an XC4010 FPGA [9]. The 1998 Cornell EE475 architecture class labs included VHDL design and FPGA verification of a simple CPU and a subsequent pipelined version [10]. At Hiroshima City University, more than 75% (about 40) of the students in a class succeed in creating their own original FPGA computers within 15 weeks in the first term of their junior year, every year since 1996. Students work in pairs, design RISC or CISC CPUs, write HDL, and target FPGAs [11]. At Georgia Tech, students model pipelined RISC processors in HDLs and synthesize and run them in low-cost Altera and Xilinx prototyping cards [12].

VIII. Conclusion

Simulation is good, but it does not model the elation felt when one's computer design boots in real hardware. All that material on architectural tradeoffs, retiming, and floorplanning seems so much more relevant as applied to the device on your laboratory workbench.

"I hear and forget. I see and remember. I do and understand." (old Chinese saying, courtesy Prof. Philip Leong)

IX. References

[1] J. Gray, FPGA CPU Usenet Posting Archives, www.fpgacpu.org/usenet/, March 2000.
[2] J. Gray, Homebrewing RISC Microprocessors In FPGAs, www3.sympatico.ca/jsgray/homebrew.htm, August 1996.
[3] J. Gray, FPGA CPU Links, www.fpgacpu.org/links.html, March 2000.
[4] J. Gray, Building a RISC System in an FPGA: Part 1, Tools, Instruction Set, and Datapath; Part 2, Pipeline and Control Unit Design; Part 3, System-on-a-Chip Design, Circuit Cellar Magazine, #116-118, March-May 2000.
[5] J. Gray, The XSOC Project Kit, www.fpgacpu.org/xsoc, March 2000.
[6] D. Vanden Bout, The Practical Xilinx Designer Lab Book, Prentice Hall, 1998.
[7] C. Fraser and D. Hanson, A Retargetable C Compiler: Design and Implementation, Benjamin/Cummings, 1995, www.cs.princeton.edu/lcc.
[8] J. Gray, The Myriad Uses of Block RAM, www.fpgacpu.org/usenet/bb.html, Oct. 1998.
[9] P. Athanas, The Hokie Instant RISC
Microprocessor, www.ee.vt.edu/courses/ee6504, 1995.
[10] B. Land, Elect. Eng. 475 Microprocessor Architectures, instruct1.cit.cornell.edu/Courses/ee475.
[11] R. Takahashi and N. Yoshida, Diagonal Examples for Design Space Exploration in an Educational Environment CITY-1, Proc. 1999 Int'l Conf. on Microelectronic Systems Education, pp. 71-73, 1999. Also Microcomputer Design Educational Environment City-1, www.lcl.ce.hiroshima-cu.ac.jp/~activity/City-1.
[12] J. Hamblen, Using Large CPLDs and FPGAs for Prototyping and VGA Video Display Generation in Computer Architecture Design Laboratories, IEEE Technical Committee on Computer Architecture Newsletter, Feb. 1999, pp. 12-14.
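
As an illustration of the immediate-operand scheme described in Section III, the following Python sketch models how a 16-bit immediate could be formed from an instruction's 4-bit imm field and an optional 12-bit imm prefix. The function names and the signed flag are hypothetical, and the choice of which instructions sign-extend and which zero-extend is an assumption; the kit's specifications and Verilog sources remain the authority.

```python
from typing import Optional


def sign_extend4(imm4: int) -> int:
    """Sign-extend a 4-bit field to a 16-bit value (hypothetical helper)."""
    imm4 &= 0xF
    # If bit 3 is set, fill the upper 12 bits with ones.
    return (imm4 | 0xFFF0) if (imm4 & 0x8) else imm4


def form_immediate(imm4: int, imm12: Optional[int] = None, signed: bool = True) -> int:
    """Form the 16-bit immediate operand of one instruction.

    imm4   -- the 4-bit immediate field of the instruction itself
    imm12  -- the 12 bits supplied by a preceding imm prefix, if any
    signed -- assumption: without a prefix, the 4-bit field is either
              sign-extended or zero-extended depending on the instruction
    """
    imm4 &= 0xF
    if imm12 is not None:
        # The prefix establishes the most-significant 12 bits;
        # the following instruction supplies the low 4 bits.
        return ((imm12 & 0xFFF) << 4) | imm4
    return sign_extend4(imm4) if signed else imm4


if __name__ == "__main__":
    print(hex(form_immediate(0x4, imm12=0x123)))
    print(hex(form_immediate(0xF)))
    print(hex(form_immediate(0xF, signed=False)))
```

Under these assumptions, a full 16-bit constant costs two instruction words (the prefix plus the instruction it modifies), while small constants fit in a single word.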