THE RAW MICROPROCESSOR: A COMPUTATIONAL FABRIC FOR SOFTWARE CIRCUITS AND GENERAL-PURPOSE PROGRAMS

Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, Anant Agarwal
Massachusetts Institute of Technology

0272-1732/02/$17.00 © 2002 IEEE

WIRE DELAY IS EMERGING AS THE NATURAL LIMITER TO MICROPROCESSOR SCALABILITY. A NEW ARCHITECTURAL APPROACH COULD SOLVE THIS PROBLEM, AS WELL AS DELIVER UNPRECEDENTED PERFORMANCE, ENERGY EFFICIENCY, AND COST EFFECTIVENESS.

The Raw microprocessor consumes 122 million transistors, executes 16 different load, store, integer, or floating-point instructions every cycle, controls 25 Gbytes/s of input/output (I/O) bandwidth, and has 2 Mbytes of on-chip, distributed L1 static RAM, providing on-chip memory bandwidth of 57 Gbytes/s. Is this the latest billion-dollar, 3,000-man-year processor effort? In fact, it took only a handful of graduate students at the Laboratory for Computer Science at MIT to design and implement Raw.

Our research addresses a key technological problem for microprocessor architects: how to leverage growing quantities of chip resources even as wire delays become substantial. The Raw research prototype uses a scalable instruction set architecture (ISA) to attack the emerging wire-delay problem by providing a parallel software interface to the gate, wire, and pin resources of the chip. An architecture that has direct, first-class analogs to all of these physical resources will ultimately let programmers achieve the maximum amount of performance and energy efficiency in the face of wire delay. Existing architectural abstractions such as interrupts, caches, context switches, and virtualization can continue to be supported in this environment, even as a new low-latency communication mechanism, the static network, enables new application domains.

Technology trends
Until recently, the abstraction
of a wire as an instantaneous connection between transistors has shaped assumptions and architectural designs. In an interesting twist, just as the clock frequency of processors has risen exponentially over the years, the fraction of the chip that is reachable by a signal in a single clock cycle has decreased exponentially.1 Thus, the idealized wire abstraction is becoming less and less representative of reality. Architects now need to explicitly account for wire delay in their designs. Today, it takes on the order of two clock cycles for a signal to travel from edge to edge (roughly fifteen mm) of a 2-GHz processor.

Processor manufacturers have strived to maintain high clock rates in spite of the increased impact of wire delay. Their innovation has been at many levels. The transition from aluminum to copper wires has reduced the resistance, and thus the resistance-capacitance delay, of the wires. Process engineers have also altered the aspect ratio of the wires to reduce resistance. Finally, the introduction of low-k dielectrics will provide a one-time improvement in wire delay. Unfortunately, materials and process changes have not been sufficient to solve the problem.

Forced to worry about wire delays, designers place the logic elements of their processors very carefully, minimizing the distance between communicating transistors on critical paths. In the past, precious silicon area dictated logic reuse, but designers now freely duplicate logic to reduce wire lengths: for example, the adders in the load/store unit and in the arithmetic logic unit.

Even microarchitects are making concessions to wire delay. The architects of Compaq's Alpha 21264 were forced to split the integer unit into two physically dispersed clusters, with a one-cycle penalty for communication of results between clusters. More recently, the Intel Pentium 4 architects had to allocate two pipeline stages solely for the traversal of long wires. The wire delay problem will only get worse. In the arena of 10-GHz processors, designers will experience latencies of 10 cycles or more across a processor die. It will become increasingly more challenging for existing architectures to turn chip resources into higher performance. See the "An evolutionary response for current instruction set architectures" sidebar for how we think existing architectures will attempt to overcome this challenge.

Sidebar: An evolutionary response for current instruction set architectures
Figure A shows how we believe today's microarchitectures will adapt as effective silicon area and pin resources increase, and as wire delay worsens. Designers will want to utilize the increased silicon resources while maintaining high clock rates. We envision a high-frequency execution core containing several nearby clustered ALUs, with speculative control guessing to which ALU cluster to issue. Around this core is a host of pipelined floating-point units, which are less latency-sensitive, along with speculative control and other out-of-band logic focused on making the tiny execution core run as efficiently and as fast as possible. In addition, since most conventional instruction set architectures (ISAs) do not have an architectural analog to pins, most pins will be allocated to wide external-memory interfaces that are opaque to the software. The ISA gap between software-usable processing resources and the actual amount of underlying physical resources is going to steadily increase in these future designs. Even today, it is easy to tell that the percentage of silicon performing actual computation has dropped quadratically over time. The Compaq Alpha 21464 design, an eight-way-issue superscalar, is in excess of 27 times as large as the original two-way-issue 21064. The area of the management logic dwarfs the area occupied by the ALUs, and system performance is getting increasingly nondeterministic and sensitive to the particulars of the program implementation. Intel, for instance, has produced hundreds of pages that suggest methods of improving system performance by avoiding stall conditions in the Pentium 4 microarchitecture. Furthermore, the power consumption, as well as the design and verification costs, of these increasingly complex architectures is skyrocketing.
[Figure A. How today's microarchitectures might adapt to worsening wire delay as effective silicon area and pin resources increase. Labels: execution core (one-cycle wire), arithmetic logic units, floating-point units, speculative and out-of-band control.]

The Raw microprocessor
We designed Raw2 to use a scalable ISA that provides a parallel software interface to the gate, wire, and pin resources of a chip. An architecture with direct, first-class analogs to all of these physical resources lets programmers extract the maximum amount of performance and energy efficiency in the face of wire delay. In effect, we try to minimize the ISA gap by exposing the underlying physical resources as architectural entities.

The Raw processor design divides the usable silicon area into 16 identical, programmable tiles. Each tile contains
- one static communication router,
- two dynamic communication routers,
- an eight-stage, in-order, single-issue, MIPS-style processor,
- a four-stage, pipelined floating-point unit,
- a 32-Kbyte data cache, and
- 96 Kbytes of software-managed instruction cache.

[Figure 1. On-chip interconnects in Raw. The Raw microprocessor comprises 16 tiles (a). Each tile (b) has compute resources and programmable routers for four networks, each with eight point-to-point, 32-bit, full-duplex buses to neighbor tiles (c).]

We sized each tile so that the time for a signal to travel through a small amount of logic and across the tile is one clock cycle. Future Raw processors will have hundreds or perhaps thousands of tiles.

The tiles interconnect using four 32-bit, full-duplex, on-chip networks consisting of over 12,500 wires, as Figure 1 shows. Two networks are static (routes are specified at compile time) and two are dynamic (routes are specified at runtime). Each tile only connects to its four neighbors. Every wire is registered
at the input to its destination tile. This means that the length of the longest wire in the system is no greater than the length or width of a tile. This property ensures high clock speeds and the continued scalability of the architecture.

The Raw ISA exposes these on-chip networks to the software, enabling the programmer or compiler to directly program the wiring resources of the processor and to carefully orchestrate the transfer of data values between the computational portions of the tiles, much like the routing in a full-custom application-specific integrated circuit (ASIC). Effectively, the wire delay manifests itself to the user as network hops. It takes six hops for a data value to travel from corner to corner of the processor, corresponding to approximately six cycles of wire delay.

On the edges of the network, the network buses are multiplexed in hardware down onto the pins of the chip using logical channels,5 as Figure 2 shows. To toggle a pin, the user programs the on-chip network to route a data value off the side of the array. Our 1,657-pin ceramic-column grid-array package provides 14 full-duplex, 32-bit, 7.5-Gbps I/O ports at 225 MHz, for a total of 25 Gbytes/s of bandwidth. The design does not require this many pins; rather, we use this number to illustrate that no matter how many pins a package has, 100 or 10,000, Raw's scalable architectural mechanism lets the programmer put them to good use. Fewer pins merely require more multiplexing. Alternately, a small pin-count package can be supported by bonding out only a subset of the ports.

The Raw I/O port is a high-speed, simple (a three-way-multiplexed I/O port has 32 data and five control pins for each direction), and flexible word-oriented abstraction that lets system designers proportion the quantities of I/O devices according to the application domain's needs. Memory-intensive domains can have up to 14 dedicated interfaces to DRAM. Other applications may not have external memory: a single ROM hooked up to any I/O port is sufficient to boot Raw so that it can execute out of the on-chip memory. In addition to transferring data directly to the tiles, off-chip devices connected to Raw I/O ports can route through the on-chip networks to other devices to perform direct memory accesses. We plan to hook up arrays of high-speed data input devices, including wide-word analog-to-digital converters, to experiment with Raw in domains that are I/O, communication, and compute intensive.

[Figure 2. Pin multiplexing and device usage in Raw. Networks are multiplexed down onto pins via logical channels; high-speed input devices, such as a wide-word analog-to-digital converter, attach to Raw I/O ports (7.5-Gbps channels, 14 total).]

Table 1. How Raw converts physical resources into architectural entities.

  Physical entity   Raw ISA analog            Conventional ISA analog
  Gates             Tiles, new instructions   New instructions
  Wire delay        Network hops              None
  Pins              I/O ports                 None

Conventional ISAs attempt to utilize increasing gate quantities through the addition of new instructions, like parallel SIMD instructions, and through dynamic mapping of operations to a small number of architecturally invisible ALUs. Wire delay is typically hidden through pipelining and speculation, and is reflected to the user in the form of dynamic stalls for non-fast-path and mispredicted code. Pin bandwidth is hidden behind speculative cache-miss hardware, prefetching, and large line sizes.

Architectural entities
The conventional ISA abstraction has enjoyed enormous success because it hides the details of the underlying implementation behind a well-defined compatibility layer that matches the underlying implementation substrate fairly well. Much as the existence of a physical multiplier in a processor merits the addition of a corresponding architectural entity, the multiply instruction, the prominence of gate resources, wire delay, and pins will soon merit the addition of corresponding architectural entities. Furthermore, we expose these entities in a way that will allow subsequent generations of Raw
processors to be backward compatible. Table 1 contrasts how the Raw ISA and conventional ISAs expose gates, wire delay, and pins to the programmer. Because the Raw ISA has interfaces that are more direct, Raw processors will have more functional units, as well as more flexible and more efficient pin utilization. High-end Raw processors are likely to have more pins, because the architecture is better at turning pin count into performance and functionality. Finally, Raw processors will be more predictable and have higher clock frequencies because of the explicit exposure of wire delay.

This exposure makes Raw scalable. Creating subsequent, more powerful generations of the processor is straightforward: we simply stamp out as many tiles and I/O ports as the silicon die and package allow. The design has no centralized resources, global buses, or structures that get larger as the tile or pin count increases. Finally, wire length, design complexity, and verification complexity are all independent of transistor count.

Application domains
The Raw microprocessor runs computations that form a superset of those run on today's general-purpose processors. Our goal is to run not just conventional scalar codes (for example, SPECint and SPECfp), but word-level computations that require so much performance that they have been consigned to custom ASIC implementations. If an application can take advantage of the customized placement and routing, the ample gates, and the programmable pin resources available in an ASIC process, it should also benefit from the architectural versions of those same resources in the Raw microprocessor. For instance, our first-cut, untuned implementation of a software Gigabit Internet protocol router on a 225-MHz, 16-tile Raw processor runs more than five times faster than a hand-tuned implementation on a 700-MHz Pentium III processor. Additionally, an implementation of video median filter on 128 tiles attained a 57-fold speedup over a single Raw tile. Unlike an ASIC, however,
applications for Raw can be written in a high-level language such as C or Java, or in new languages such as StreamIt,4 and the compilation process takes minutes, not months. We call applications that leverage the Raw static networks' ASIC-like place-and-route facility "software circuits."

Applications we plan to experiment with include gigabit routers; video and audio processing filters and modems; I/O protocols (RAID, SCSI, FireWire); and communications protocols for cellular phones, multiple-channel cellular base stations, high-definition TV, Bluetooth, and IEEE 802.11a and b. These protocols could run on dedicated embedded hosts or as a process on a general-purpose machine.

[Figure 3. Application mapping onto a Raw microprocessor. Labels: memory and functional units; frame buffer and screen; video data stream; two-way threaded Java program; custom data path; four-way parallelized scalar code; pipeline; httpd.]

Application mapping
The Raw operating system allows both space and time multiplexing of processes. Thus, not only can a Raw processor run multiple independent processes simultaneously, it can context-switch them in and out as on a conventional processor. The operating system allocates a rectangular-shaped number of tiles, corresponding to physical threads (which can themselves be virtualized), proportional to the amount of computation that the process requires. When the operating system context-switches in a given process, it finds a contiguous region of tiles that corresponds to the dimensions of the process and resumes the execution of the physical threads. We employ this gang-scheduling policy because the physical threads of the process are likely to communicate with each other. Continuous or real-time applications can be locked down and will not be context-switched out.

Figure 3 shows a Raw processor running multiple processes simultaneously. Traditional applications, including threaded Java programs, message-passing interface (MPI) codes, and server applications, use Raw as a high-density multiprocessor. The corner tile is in sleep mode to save power. The four-tile
block is running an automatically parallelized5 scalar code. The top eight tiles in Figure 3 illustrate the more novel software-circuit usage of the tiles. Video data is streamed in over the pins, transformed by a filtering operation, and then streamed out to a frame buffer and screen. In this situation, the tiles work together, parallelizing the computation. We assign operations to tiles in a manner that minimizes congestion, and then configure the network routes between these operations. This is very much like the process of designing a customized hardware circuit. In Raw, the compiler performs this customization.4,5 This customization can also be done manually, using Raw's assembly code.

To make both parallelized scalar codes and software circuits work, we need a very low-latency network: the faster the network, the greater the range of applications that can be parallelized. Multiprocessor networks designed for MPI programs have latencies on the order of 1,000 cycles; state-of-the-art networks have latencies of 30 cycles.6 The Raw network can route the output of the ALU on one tile to the input of the ALU of a neighboring tile in three cycles.

Design decisions
The Raw tile design crystallized around the need to provide low-latency communication for efficient execution of software circuits and parallel scalar codes. At the same time, we wanted to provide scalable versions of the standard toolbox of useful architectural constructs, such as data and instruction virtualization, caching, interrupts, context switches, address spaces, latency tolerance, and event counting. To achieve these goals, a Raw tile employs its compute processor, static router, and dynamic routers. The static router manages the two static networks, which provide the low-latency communication required for software circuits and other applications with compile-time-predictable communication. The dynamic routers manage the dynamic networks, which transport unpredictable operations like interrupts, cache misses, and compile-time-unpredictable communication (for example, messages between tiles).

Compute processor
Our focus in designing the compute processor was to tightly integrate the network interfaces into the processor pipeline. We wanted to make the network first class in every sense, to maximize its utility. The most common network interfaces are memory mapped; other networks use special instructions for sending and receiving.6,7 The most aggressive processor networks are register mapped and do not require a special send or receive command: instructions can target the networks just as easily as registers.5

Our design takes network integration one step further. The networks are not only register mapped but also integrated directly into the bypass paths of the processor pipeline. This makes the network ports truly first-class citizens in the architecture. Figure 4 shows how this works. Registers 24 through 27 are mapped to the four on-chip physical networks. For example, a read from register 24 will pull an element from an input FIFO buffer, while a write to register 24 will send the data word out onto that network. If data is not available on an input FIFO buffer, or if an output FIFO buffer does not have enough room to hold a result, the processor will stall in the register-fetch stage. The instruction format also provides a single bit that allows an instruction to specify two output destinations: one network or register, and the network implied by register 24 (the first static network). This gives the tile the option of keeping local copies of transmitted values.

[Figure 4. Raw compute processor pipeline. Stages: IF (instruction fetch), D (instruction decode), RF (register fetch), A (address), TL (tag lookup), TV (tag verify), M1 and M2 (multiply 1 and 2), F4 (four stages of the floating-point unit), WB (writeback); network input and output FIFO buffers feed the bypass paths.]

Each output FIFO buffer connects to each pipeline stage. The FIFO buffers pull the oldest value out of the pipeline as soon as it is ready, rather than just at the writeback stage or through the register file.5 This decreases the latency of an ALU-to-network instruction by as much as four cycles for our eight-stage pipeline. This logic is exactly like the standard bypass logic of a processor pipeline, except that it gives priority to older instructions rather than newer ones.

In early processor designs, the register file was the central communication mechanism between functional units. Starting with the first pipelined processors, the bypass network became largely responsible for the communication of active values, and the register file evolved into a dumping ground, or checkpointing facility, for inactive values. The Raw networks, the static networks in particular, extend this trend and are in essence 2D bypass networks, serving as bridges between the bypass networks of separate tiles. The low latency of in-order, intertile, ALU-to-ALU operand delivery distinguishes Raw from previous systolic or message-passing systems.5,6,7 The integration of networks into the pipelined bypass path of the compute processor reflects our view that scalar data transport among functional units on adjacent tiles is as important as that among functional units within a single tile. The resulting low latency of intertile communication allows the Raw static network to perform customized scalar data routing with ASIC-like latencies.

The early bypassing of values to the network has some challenging effects on pipeline operation. Perhaps most importantly, it changes the location of the processor's commit point. An instruction that has had a value bypassed out early has created a side effect, which makes it difficult to squash the instruction in a later stage. The simplest solution we have found is to place the commit point at the execute stage.

Static router
For software circuits and parallel scalar codes, we use the two static networks to route values between tiles. The static networks provide ordered, flow-controlled, and
reliable transfer of single-word operands and data streams between the tiles' functional units. The operands need to be delivered in order so that the instructions issued by the tiles are operating on the correct data. Flow control of operands allows the program to remain correct in the face of unpredictable architectural events such as cache misses and interrupts.

The static router is a five-stage pipeline that controls two routing crossbars, and thus two physical networks. Each crossbar routes values between seven entities: the static router pipeline; the north, east, south, and west neighbor tiles; the compute processor; and the other crossbar. The static router uses the same fetch unit design as the compute processor, except that it fetches a 64-bit instruction word from the 8,096-entry instruction memory. This instruction encodes a command (conditional branches with or without decrement, and accesses to a small register file) and 13 routes, one for each unique crossbar output, for a total of 14 operations per cycle per tile.

For each word sent between tiles on the static network, there must exist a corresponding instruction in the instruction memory of each router on the word's path. These instructions are typically programmed at compile time and are cached just like the instructions of the compute processor. Thus, the static routers collectively reconfigure the entire communication pattern of the network on a cycle-by-cycle basis. Further, because the router program memory is large, and also cached, there is no practical architectural limit on the number of communication patterns supported in a computation. In this manner, the communication and compute operations are treated with equal importance. This mechanism distinguishes Raw's network from previous systems such as iWarp or NuMesh.5,8

Because the static router knows what route will be performed long before the word arrives, route preparations can be pipelined. This allows a data word to be routed immediately upon its arrival. This low latency is critical for exploiting
instruction-level parallelism in scalar codes. Dynamic routers have more latency than the static router, despite the same bandwidth, because they have to wait for a header word to arrive before they can initiate preparations for a route. Thus, dynamic networks are better suited for long data streams, and not for scalar transport.

The static router is flow-controlled, and it proceeds to the next instruction only after all of the routes in a particular instruction are completed. This ensures that destination tiles receive incoming words in a known order, even when tiles suffer branch mispredicts, cache misses, interrupts, or other unpredictable events. The static router provides single-cycle-per-hop latencies and can route two values in each direction per cycle.

Figure 5 shows two tiles communicating across the static network. The word computed by the floating-point multiply instruction spends one cycle in each static router and one cycle in the decode stage of the target tile. This gives a total latency of three cycles for a word to travel from the output of the ALU of one tile to the input of the ALU of the next tile. Unlike this example, actual Raw programs have more random and dense communication patterns, which reflect the internal flow of data between operations in the original high-level language source.

[Figure 5. Two tiles communicating over the static network through the software-controlled crossbars.]

The dynamic networks
Early on in the Raw project, we realized the need to support dynamic as well as static events. We thus added a pair of dimension-ordered, wormhole-routed dynamic networks to the architecture.9 To send a message on one of these networks, the user injects a single header word that specifies the destination tile (or I/O port), a user field, and the length of the message. The user then sends up to 31 data words. While this is happening, the message worms its way through the network to the destination tile. Our implementation of this network
takes one cycle per hop, plus an extra cycle for every hop that turns. (We use an optimization that exploits the fact that most routes are straight.) On an uncongested network, the header reaches the destination in 2 + X + 1 + Y + 2 cycles: that is, two cycles of latency to leave the compute processor (this counts as a hop and a turn), a number of X (that is, horizontal) hops, one hop if a turn is required, a number of Y hops, and then a hop and a turn to enter the compute processor.

One major concern with dynamic networks is deadlock, caused by the overcommitment of buffering resources. Classically, there are two solutions: deadlock avoidance and deadlock recovery. Deadlock avoidance requires users to limit their usage to a set of disciplines that have been proven not to deadlock. Unfortunately, this limits the network's usefulness. Deadlock recovery places no restrictions on network use, but requires that the network be drained to some source of copious memory when the network appears to be deadlocked. Unfortunately, if that out-of-band memory system itself deadlocks during recovery, the system could fail.

Our solution uses a pair of otherwise identical networks. The memory network has a restricted usage model that uses deadlock avoidance, and the general network is unrestricted and uses deadlock recovery. If the general network deadlocks, an interrupt routine is activated that uses the memory network to recover.

Trusted clients (the operating system, data cache, interrupts, hardware devices, DMA, and I/O) use the memory network. Each client is not allowed to block on a network send unless it can guarantee that its input buffers can sink all messages that it might be receiving. Alternatively, a client can guarantee that it will not block on a network send if it has enough private buffering resources on the path to its destination. Clients are each allocated a number of buffers to accomplish their basic functionality. For large, high-performance transfers, the clients can negotiate permission with the operating system for temporary
use of a pool of additional buffers associated with each I/O port.

Untrusted clients use the general network and rely on the hardware deadlock-recovery mechanism to maintain forward progress when deadlock occurs. The operating system programs a configurable counter on the compute processor to detect if words have waited too long on the input.6 This counter causes an interrupt, so that the network can be drained into DRAM. A separate interrupt then allows the general network input port to be virtualized, substituting in the data from the DRAM.

Because it is a user-level construct, the general network is virtualized for each group of tiles that corresponds to a process. The upper-left tile is considered tile 0 for the purposes of this process. On a context switch, the contents of the general and static networks are saved off, and the process and its network data can be restored at any time to a new offset on the Raw grid.

Both the hardware caches and the software share the memory network. On a cache miss, the hardware cache consults a configurable hash function to map addresses to destination ports or tiles. The cache then issues a header word, like every other client of the network, and then a sequence of words that is interpreted by the DRAM controller. On a cache fill, the cache state machine will also wait for the result and then transfer words into the cache memory. Just like any other I/O device on the network, the DRAMs are equally accessible by hardware and software. It is often useful to write codes that operate on large data blocks that stream directly in and out of the DRAMs.

The Raw processor supports parallel implementation of external device I/O interrupts: each tile can process an interrupt independently of the others. The interrupt controller, implemented by a dedicated tile or as part of the support chipset, signals an interrupt to a particular tile by sending a special one-word message through an I/O port. The tile's network hardware checks for that message and transparently
pulls it off the network and sets the compute processor's external interrupt bit. When the compute processor services the interrupt, it queries the interrupt controller for the cause of the interrupt and then contacts the appropriate device or DRAM.

Implementation
We implemented the Raw chip 16-tile prototype in IBM's SA-27E, 0.15-micron, six-level copper ASIC process. Although the Raw array is only 16 x 16 mm, we used an 18.2 x 18.2-mm die to allow a high-pin-count package. The 1,657-pin ceramic-column grid-array package provides 1,080 high-speed transceiver logic I/O pins. We estimate that the chip consumes 25 W, mostly in memory accesses and pins. We quiesce unused functional units and memories and tristate unused data I/O pins.

We targeted a 225-MHz worst-case frequency (average-case frequency is typically 25 percent higher), which is competitive with other 0.15-micron ASIC processors, like IRAM from the University of California, Berkeley, and the customizable processors from Tensilica. We pipelined our processor aggressively and treated control paths very conservatively to avoid spending significant periods closing timing in the back end. Despite this, we found that wire delay inside the tile was large enough that placement could not be ignored as an issue. We created a library of routines (7,000 lines of code) that automatically places the majority of the structured logic in the tile and around the perimeter of the chip. This structured logic is visible in Figure 6, a screen capture of the Raw tile placement. Figure 7 shows a screen capture of the placement for the entire Raw chip. The replicated 4 x 4 pattern in the middle is the array of Raw tiles; the semistructured logic around the edges is the I/O multiplexing logic. Figure 8 gives a detailed floorplan of the Raw tile. The placement routines dropped cycle time from 8 to 4 ns, which matches the Synopsys synthesis tool's timing estimates. The synthesis back-end processing and our placement infrastructure can turn our Verilog source into a
fully placed chip in approximately 6 hours on one machine. We used a logic emulator donated by IKOS Technologies, coupled with the Raw motherboard, to boot the gate-level Verilog and run test simulations.

[Figure 8. Raw tile floor plan, showing the compute-processor fetch unit, data cache, ALUs, floating-point unit, register file, bypass network, event counters, the static-router control, instruction memory, and crossbars, the two dynamic-network controls and crossbars, and the BIST and fuse blocks.]

Applications

Applications with a very small (two- or three-way) amount of instruction-level parallelism generally do not benefit much from running on Raw. This is because intertile latency is great enough that it is cheaper to compute locally than to distribute the computation to a neighboring tile. A two-way-issue compute processor would have helped fill out our parallelism profile for these applications, especially the SPECint benchmarks.

For scalar codes with a moderate degree of instruction-level parallelism, we found that our C and Fortran compiler, RawCC,5 is effective at exploiting parallelism by automatically partitioning the program graph, placing the operations, and programming the routes for the static router. We attain speedups ranging from 6× to 11× versus a single tile on SPECfp applications for a 16-tile Raw processor, and 9× to 19× for 32 tiles. When parallelism is limited by the application, we find that RawCC comes close to the hand-parallelized speedup but tends to use up to two times as many tiles.

For structured applications with a lot of pipelined parallelism or heavy data movement, like that found in
software circuits, careful orchestration and layout of operations and network routes provided us with maximal performance, because it maximizes per-tile performance. For these applications, and for the operating-system code, we developed a version of the GNU C compiler that lets the programmer specify the code and communication on a per-tile basis. Although this seems laborious, the alternative for these sorts of performance-oriented applications is an ASIC implementation, which is considerably more work than programming Raw. We are currently working on a new compiler that automates this mode of programming.4

The replicated tile design saved us considerable time in all phases of the project: design, RTL Verilog coding, resynthesis, verification, placement, and back-end flow.

Our design supports the glueless connection of up to 64 Raw chips in any rectangular mesh pattern, creating virtual Raw systems with up to 1,024 tiles. We intend to use this ability to investigate Raw processors with hundreds of tiles. We think that reaching the point at which a Raw tile is a relatively small portion of total computation could change the way we compute. We can imagine computation becoming inexpensive enough to dedicate entire tiles to prefetching, gathering profile data from neighboring tiles, translating (say, for x86 emulation) and dynamically optimizing instructions, or even to simulating traditional hardware structures like video RAMDACs.

The idea of creating architectural analogs to pins, gates, and wires will ultimately lead to a class of chips that can address a great range of applications. It takes a lot of imagination to envision a 128-tile Raw processor, how fast a full-custom version would clock, or how a more sophisticated compute-processor design could affect the overall system. It is our hope that the Raw research will provide insight for architects who are looking for new ways to build processors that leverage the vast resources and mitigate the considerable wire delays that loom on the horizon.
Acknowledgments
Raw is funded by DARPA, the National Science Foundation, and MIT's Project Oxygen.

References
1. R. Ho, K. Mai, and M. Horowitz, "The Future of Wires," Proc. IEEE, IEEE CS Press, Los Alamitos, Calif., Apr. 2001, pp. 490-504.
2. E. Waingold et al., "Baring It All to Software: Raw Machines," Computer, vol. 30, no. 9, Sept. 1997, pp. 86-93.
3. T. Gross and D.R. O'Halloron, iWarp: Anatomy of a Parallel Computing System, MIT Press, Cambridge, Mass., 1998.
4. B. Thies et al., "StreamIt: A Compiler for Streaming Applications," tech. memo MIT-LCS-TM-620, Massachusetts Inst. of Technology Lab. for Computer Science, Cambridge, Mass., 2001; http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-620.pdf (current Feb. 2002).
5. W. Lee et al., "Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine," Proc. 8th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), ACM Press, New York, 1998, pp. 46-57.
6. J. Kubiatowicz, Integrated Shared-Memory and Message-Passing Communication in the Alewife Multiprocessor, doctoral dissertation, EECS Dept., Massachusetts Inst. of Technology, Cambridge, Mass., 1998.
7. M. Annaratone et al., "The Warp Computer: Architecture, Implementation, and Performance," IEEE Trans. Computers, vol. 36, no. 12, 1987, pp. 1523-1538.
8. D. Shoemaker et al., "NuMesh: An Architecture Optimized for Scheduled Communication," J. Supercomputing, vol. 10, no. 3, 1996, pp. 285-302.
9. W. Dally, A VLSI Architecture for Concurrent Data Structures, Kluwer Academic Publishers, Boston, 1987.

Direct questions and comments about this article to Michael Taylor, MIT Laboratory for Computer Science, 200 Technology Square, Bldg. NE43, Cambridge, MA 02139; mtaylor@cag.lcs.mit.edu.

For further information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

IEEE Micro, March-April 2002