The Microarchitecture of the Pentium 4 Processor

Glenn Hinton, Desktop Platforms Group, Intel Corp.
Dave Sager, Desktop Platforms Group, Intel Corp.
Mike Upton, Desktop Platforms Group, Intel Corp.
Darrell Boggs, Desktop Platforms Group, Intel Corp.
Doug Carmean, Desktop Platforms Group, Intel Corp.
Alan Kyker, Desktop Platforms Group, Intel Corp.
Patrice Roussel, Desktop Platforms Group, Intel Corp.

Index words: Pentium 4 processor, NetBurst™ microarchitecture, Trace Cache, double-pumped ALU, deep pipelining

ABSTRACT

This paper describes the Intel NetBurst™ microarchitecture of Intel's new flagship Pentium 4 processor. This microarchitecture is the basis of a new family of processors from Intel starting with the Pentium 4 processor. The Pentium 4 processor provides a substantial performance gain for many key application areas where the end user can truly appreciate the difference.

In this paper we describe the main features and functions of the NetBurst microarchitecture. We present the front end of the machine, including its new form of instruction cache called the Execution Trace Cache. We also describe the out-of-order execution engine, including the extremely low latency double-pumped Arithmetic Logic Unit (ALU) that runs at 3GHz. We also discuss the memory subsystem, including the very low latency Level 1 data cache that is accessed in just two clock cycles. We then touch on some of the key features that allow the Pentium 4 processor to have outstanding floating-point and multimedia performance. We provide some key performance numbers for this processor, comparing it to the Pentium III processor.

INTRODUCTION

The Pentium 4 processor is Intel's new flagship microprocessor that was introduced at 1.5GHz in November of 2000. It implements the new Intel NetBurst microarchitecture that features significantly higher clock rates and world-class performance. It includes several important new features and innovations that will allow the Intel Pentium 4 processor to deliver industry-leading performance for the next several
years. This paper provides an in-depth examination of the features and functions of the Intel NetBurst microarchitecture.

The Pentium 4 processor is designed to deliver performance across applications where end users can truly appreciate and experience its performance. For example, it allows a much better user experience in areas such as Internet audio and streaming video, image processing, video content creation, speech recognition, 3D applications and games, multimedia, and multitasking user environments. The Pentium 4 processor enables real-time MPEG2 video encoding and near real-time MPEG4 encoding, allowing efficient video editing and video conferencing. It delivers world-class performance on 3D applications and games, such as Quake 3, enabling a new level of realism and visual quality in 3D applications.

The Pentium 4 processor has 42 million transistors implemented on Intel's 0.18µ CMOS process, with six levels of aluminum interconnect. It has a die size of 217 mm² and it consumes 55 watts of power at 1.5GHz. Its 3.2 GB/second system bus helps provide the high data bandwidths needed to supply data to today's and tomorrow's demanding applications. It adds 144 new 128-bit Single Instruction Multiple Data (SIMD) instructions called SSE2 (Streaming SIMD Extension 2) that improve performance for multimedia, content creation, scientific, and engineering applications.

Other brands and names are the property of their respective owners.
The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal Q1, 2001

OVERVIEW OF THE NETBURST™ MICROARCHITECTURE

A fast processor requires balancing and tuning of many microarchitectural features that compete for processor die cost and for design and validation efforts. Figure 1 shows the basic Intel NetBurst microarchitecture of the Pentium 4 processor. As you can see, there are four main sections: the in-order front end, the out-of-order execution engine, the integer and floating-point execution units, and the memory subsystem.
[Figure 1: Basic block diagram (blocks shown: system bus, Level 2 cache, memory subsystem, front end with fetch/decode, Trace Cache, microcode ROM and BTB/branch prediction, out-of-order engine, integer and FP execution units)]

In-Order Front End

The in-order front end is the part of the machine that fetches the instructions to be executed next in the program and prepares them to be used later in the machine pipeline. Its job is to supply a high-bandwidth stream of decoded instructions to the out-of-order execution core, which will do the actual completion of the instructions. The front end has highly accurate branch prediction logic that uses the past history of program execution to speculate where the program is going to execute next. The predicted instruction address, from this front-end branch prediction logic, is used to fetch instruction bytes from the Level 2 (L2) cache. These IA-32 instruction bytes are then decoded into basic operations called uops (micro-operations) that the execution core is able to execute.

The NetBurst microarchitecture has an advanced form of a Level 1 (L1) instruction cache called the Execution Trace Cache. Unlike conventional instruction caches, the Trace Cache sits between the instruction decode logic and the execution core, as shown in Figure 1. In this location the Trace Cache is able to store the already decoded IA-32 instructions, or uops. Storing already decoded instructions removes the IA-32 decoding from the main execution loop. Typically the instructions are decoded once and placed in the Trace Cache and then used repeatedly from there, like a normal instruction cache on previous machines. The IA-32 instruction decoder is only used when the machine misses the Trace Cache and needs to go to the L2 cache to get and decode new IA-32 instruction bytes.

Out-of-Order Execution Logic

The out-of-order execution engine is where the instructions are prepared for execution. The out-of-order execution logic has several buffers that it uses to smooth and reorder the flow of instructions to optimize performance as they go down the pipeline and get scheduled for execution. Instructions are aggressively reordered to allow them to execute as quickly as their input operands are ready. This out-of-order execution allows instructions in the program following delayed instructions to proceed around them as long as they do not depend on those delayed instructions. Out-of-order execution allows the execution resources, such as the ALUs and the cache, to be kept as busy as possible executing independent instructions that are ready to execute.

The retirement logic is what reorders the instructions, executed in an out-of-order manner, back to the original program order. This retirement logic receives the completion status of the executed instructions from the execution units and processes the results so that the proper architectural state is committed (or retired) according to the program order. The Pentium 4 processor can retire up to three uops per clock cycle. This retirement logic ensures that exceptions occur only if the operation causing the exception is the oldest, non-retired operation in the machine. This logic also reports branch history information to the branch predictors at the front end of the machine so they can train with the latest known-good branch-history information.

Integer and Floating-Point Execution Units

The execution units are where the instructions are actually executed. This section includes the register files that store the integer and floating-point data operand values that the instructions need to execute. The execution units include several types of integer and floating-point execution units that compute the results, and also the L1 data cache that is used for most load and store operations.

Memory Subsystem

Figure 1 also shows the memory subsystem. This includes the L2 cache and the system bus. The L2 cache stores both instructions and data that cannot fit in the Execution Trace Cache and the L1 data cache. The external system bus is connected to the backside of the second-level cache and is used to access main
memory when the L2 cache has a cache miss, and to access the system I/O resources.

CLOCK RATES

Processor microarchitectures can be pipelined to different degrees. The degree of pipelining is a microarchitectural decision. The final frequency of a specific processor pipeline on a given silicon process technology depends heavily on how deeply the processor is pipelined. When designing a new processor, a key design decision is the target design frequency of operation. The frequency target determines how many gates of logic can be included per pipeline stage in the design. This then helps determine how many pipeline stages there are in the machine.

There are tradeoffs when designing for higher clock rates. Higher clock rates need deeper pipelines, so the efficiency at the same clock rate goes down. Deeper pipelines make many things take more clock cycles, such as mispredicted branches and cache misses, but usually more than make up for the lower per-clock efficiency by allowing the design to run at a much higher clock rate. For example, a 50% increase in frequency might buy only a 30% increase in net performance, but this frequency increase still provides a significant overall performance increase. High-frequency design also depends heavily on circuit design techniques, design methodology, design tools, silicon process technology, power and thermal constraints, etc. At higher frequencies, clock skew and jitter and latch delay become a much bigger percentage of the clock cycle, reducing the percentage of the clock cycle usable by actual logic. The deeper pipelines make the machine more complicated and require it to have deeper buffering to cover the longer pipelines.

Historical Trend of Processor Frequencies

Figure 2 shows the relative clock frequency of Intel's last six processor cores. The vertical axis shows the relative clock frequency, and the horizontal axis shows the various processors relative to each other.
[Figure 2: Relative frequencies of Intel's processors (vertical axis: relative frequency; horizontal axis: 286, 386, 486, P5, P6, P4P)]

Figure 2 shows that the 286, Intel386™, Intel486™ and Pentium (P5) processors had similar pipeline depths: they would run at similar clock rates if they were all implemented on the same silicon process technology. They all have a similar number of gates of logic per clock cycle. The P6 microarchitecture lengthened the processor pipelines, allowing fewer gates of logic per pipeline stage, which delivered significantly higher frequency and performance. The P6 microarchitecture approximately doubled the number of pipeline stages compared to the earlier processors and was able to achieve about a 1.5 times higher frequency on the same process technology.

The NetBurst microarchitecture was designed to have an even deeper pipeline (about two times the P6 microarchitecture) with even fewer gates of logic per clock cycle to allow an industry-leading clock rate. Compared to the P6 family of processors, the Pentium 4 processor was designed with a greater than 1.6 times higher frequency target for its main clock rate, on the same process technology. This allows it to operate at a much higher frequency than the P6 family of processors on the same silicon process technology. At its introduction in November 2000, the Pentium 4 processor was at 1.5 times the frequency of the Pentium III processor. Over time this frequency delta will increase as the Pentium 4 processor design matures.

Different parts of the Pentium 4 processor run at different clock frequencies. The frequency of each section of logic is set to be appropriate for the performance it needs to achieve. The highest frequency section (fast clock) was set equal to the speed of the critical ALU-bypass execution loop that is used for most instructions in integer programs. Most other parts of the chip run at half of the 3GHz fast clock, since this makes these parts much easier to design. A few sections of the chip run at a quarter of this fast-clock frequency, making them
also easier to design. The bus logic runs at 100MHz, to match the system bus needs.

As an example of the pipelining differences, Figure 3 shows a key pipeline in both the P6 and the Pentium 4 processors: the mispredicted branch pipeline. This pipeline covers the cycles it takes a processor to recover from a branch that went a different direction than the early fetch hardware predicted at the beginning of the machine pipeline. As shown, the Pentium 4 processor has a 20-stage misprediction pipeline, while the P6 microarchitecture has a 10-stage misprediction pipeline. By dividing the pipeline into smaller pieces, doing less work during each pipeline stage (fewer gates of logic), the clock rate can be a lot higher.

[Figure 3: Misprediction pipeline
  Basic Pentium III processor misprediction pipeline (10 stages): Fetch, Fetch, Decode, Decode, Decode, Rename, ROB Rd, Rdy/Sch, Dispatch, Exec
  Basic Pentium 4 processor misprediction pipeline (20 stages): TC Nxt IP (1-2), TC Fetch (3-4), Drive (5), Alloc (6), Rename (7-8), Que (9), Sch (10-12), Disp (13-14), RF (15-16), Ex (17), Flgs (18), Br Ck (19), Drive (20)]

[Figure 4: Pentium 4 processor microarchitecture (blocks shown: front-end BTB with 4K entries, instruction TLB/prefetcher, instruction decoder, microcode ROM, Trace Cache, allocator/register renamer, memory uop queue and integer/floating-point uop queue, schedulers, integer and FP register files with bypass networks, two AGUs for load and store addresses, two 2x ALUs, slow ALU, FP/MMX/SSE/SSE2 and FP move units, 8-Kbyte 4-way L1 data cache, 256-Kbyte 8-way L2 cache with a 256-bit, 48 GB/s interface, bus interface unit, 3.2 GB/s system bus)]

NETBURST MICROARCHITECTURE

Figure 4 shows a more detailed block diagram of the NetBurst microarchitecture of the Pentium 4 processor. The top-left portion of the diagram shows the front end of the machine. The middle of the diagram illustrates the out-of-order
buffering logic, and the bottom of the diagram shows the integer and floating-point execution units and the L1 data cache. On the right of the diagram is the memory subsystem.

Front End

The front end of the Pentium 4 processor consists of several units, as shown in the upper part of Figure 4. It has the Instruction TLB (ITLB), the front-end branch predictor (labeled here Front-End BTB), the IA-32 Instruction Decoder, the Trace Cache, and the Microcode ROM.

Trace Cache

The Trace Cache is the primary or Level 1 (L1) instruction cache of the Pentium 4 processor and delivers up to three uops per clock to the out-of-order execution logic. Most instructions in a program are fetched and executed from the Trace Cache. Only when there is a Trace Cache miss does the NetBurst microarchitecture fetch and decode instructions from the Level 2 (L2) cache. This occurs about as often as previous processors miss their L1 instruction cache. The Trace Cache has a capacity to hold up to 12K uops. It has a similar hit rate to an 8K to 16K byte conventional instruction cache.

IA-32 instructions are cumbersome to decode. The instructions have a variable number of bytes and have many different options. The instruction decoding logic needs to sort this all out and convert these complex instructions into simple uops that the machine knows how to execute. This decoding is especially difficult when trying to decode several IA-32 instructions each clock cycle when running at the high clock frequency of the Pentium 4 processor. A high-bandwidth IA-32 decoder that is capable of decoding several instructions per clock cycle takes several pipeline stages to do its work. When a branch is mispredicted, the recovery time is much shorter if the machine does not have to re-decode the IA-32 instructions needed to resume execution at the corrected branch target location. By caching the uops of the previously decoded instructions in the Trace Cache, the NetBurst
microarchitecture bypasses the instruction decoder most of the time, thereby reducing misprediction latency and allowing the decoder to be simplified: it only needs to decode one IA-32 instruction per clock cycle.

The Execution Trace Cache takes the already-decoded uops from the IA-32 Instruction Decoder and assembles or builds them into program-ordered sequences of uops called traces. It packs the uops into groups of six uops per trace line. There can be many trace lines in a single trace. These traces consist of uops running sequentially down the predicted path of the IA-32 program execution. This allows the target of a branch to be included in the same trace cache line as the branch itself, even if the branch and its target instructions are thousands of bytes apart in the program.

Conventional instruction caches typically provide instructions up to and including a taken branch instruction, but none after it during that clock cycle. If the branch is the first instruction in a cache line, only the single branch instruction is delivered that clock cycle. Conventional instruction caches also often add a clock delay getting to the target of the taken branch, due to delays getting through the branch predictor and then accessing the new location in the instruction cache. The Trace Cache avoids both aspects of this instruction delivery delay for programs that fit well in the Trace Cache.

The Trace Cache has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache. This Trace Cache predictor (labeled Trace BTB in Figure 4) is smaller than the front-end predictor, since its main purpose is to predict the branches in the subset of the program that is currently in the Trace Cache. The branch prediction logic includes a 16-entry return address stack to efficiently predict return addresses, because often the same procedure is called from several different call sites. The Trace-Cache BTB, together with the front-end BTB, use a highly advanced branch prediction algorithm
that reduces the branch misprediction rate by about 1/3 compared to the predictor in the P6 microarchitecture.

Microcode ROM

Near the Trace Cache is the microcode ROM. This ROM is used for complex IA-32 instructions, such as string move, and for fault and interrupt handling. When a complex instruction is encountered, the Trace Cache jumps into the microcode ROM, which then issues the uops needed to complete the operation. After the microcode ROM finishes sequencing uops for the current IA-32 instruction, the front end of the machine resumes fetching uops from the Trace Cache.

The uops that come from the Trace Cache and the microcode ROM are buffered in a simple, in-order uop queue that helps smooth the flow of uops going to the out-of-order execution engine.

ITLB and Front-End BTB

The IA-32 Instruction TLB and front-end BTB, shown at the top of Figure 4, steer the front end when the machine misses the Trace Cache. The ITLB translates the linear instruction pointer addresses given to it into physical addresses needed to access the L2 cache. The ITLB also performs page-level protection checking.

Hardware instruction prefetching logic associated with the front-end BTB fetches IA-32 instruction bytes from the L2 cache that are predicted to be executed next. The fetch logic attempts to keep the instruction decoder fed with the next IA-32 instructions the program needs to execute. This instruction prefetcher is guided by the branch prediction logic (branch history table and branch target buffer, listed here as the front-end BTB) to know what to fetch next. Branch prediction allows the processor to begin fetching and executing instructions long before the previous branch outcomes are certain.

The front-end branch predictor is quite large (4K branch target entries) to capture most of the branch history information for the program. If a branch is not found in the BTB, the branch prediction hardware statically predicts the outcome of the branch based on the direction of the branch displacement (forward or backward).
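
The lookup-then-fallback behavior described here can be sketched in Python. This is only an illustration under stated assumptions: the dictionary `btb`, the function name `predict_taken`, and the convention that a negative displacement means a backward branch are hypothetical and are not details of the Pentium 4 hardware. The static rule it encodes (backward taken, forward not taken) is the one this section gives for branches that miss the BTB.

```python
from typing import Dict


def predict_taken(btb: Dict[int, bool], branch_addr: int, displacement: int) -> bool:
    """Predict whether a branch is taken.

    btb          -- hypothetical stand-in for the front-end BTB, mapping a
                    branch address to its dynamically predicted direction
    branch_addr  -- address of the branch instruction
    displacement -- signed branch displacement (assumed: negative = backward)
    """
    if branch_addr in btb:
        # BTB hit: use the history-based (dynamic) prediction.
        return btb[branch_addr]
    # BTB miss: static prediction from the displacement direction.
    # Backward branches (typically loop back-edges) are predicted taken,
    # forward branches are predicted not taken.
    return displacement < 0


if __name__ == "__main__":
    history = {0x1000: False}  # one branch with recorded history
    print(predict_taken(history, 0x1000, -32))  # BTB hit, uses history
    print(predict_taken(history, 0x2000, -16))  # miss, backward branch
    print(predict_taken(history, 0x3000, 64))   # miss, forward branch
```

The sketch keeps the dynamic and static paths separate so the fallback is visible; a real predictor also tracks targets and history bits, which are omitted here.
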
Backward branches are assumed to be taken, and forward branches are assumed not to be taken.

IA-32 Instruction Decoder

The instruction decoder receives IA-32 instruction bytes from the L2 cache 64 bits at a time and decodes them into primitives, called uops, that the machine knows how to execute. This single instruction decoder can decode at a maximum rate of one IA-32 instruction per clock cycle. Many IA-32 instructions are converted into a single uop, and others need several uops to complete the full operation. If more than four uops are needed to complete an IA-32 instruction, the decoder sends the machine into the microcode ROM to do the instruction. Most instructions do not need to jump to the microcode ROM to complete. An example of a many-uop instruction is string move, which could have thousands of uops.

Out-of-Order Execution Logic

The out-of-order execution engine consists of the allocation, renaming, and scheduling functions. This part of the machine reorders instructions to allow them to execute as quickly as their input operands are ready.

The processor attempts to find as many instructions as possible to execute each clock cycle. The out-of-order execution engine will execute as many ready instructions as possible each clock cycle, even if they are not in the original program order. By looking at a larger number of instructions from the program at once, the out-of-order execution engine can usually find more ready-to-execute, independent instructions to begin. The NetBurst microarchitecture has much deeper buffering than the P6 microarchitecture to allow this. It can have up to 126 instructions in flight at a time and have up to 48 loads and 24 stores allocated in the machine at a time.

The Allocator

The out-of-order execution engine has several buffers to perform its reordering, tracking, and sequencing operations. The Allocator logic allocates many of the key machine buffers needed by each uop to execute. If a needed
resource, such as a register file entry, is unavailable for one of the three uops coming to the Allocator this clock cycle, the Allocator will stall this part of the machine. When the resources become available, the Allocator assigns them to the requesting uops and allows these satisfied uops to flow down the pipeline to be executed. The Allocator allocates a Reorder Buffer (ROB) entry, which tracks the completion status of one of the 126 uops that could be in flight simultaneously in the machine. The Allocator also allocates one of the 128 integer or floating-point register entries for the result data value of the uop, and possibly a load or store buffer used to track one of the 48 loads or 24 stores in the machine pipeline. In addition, the Allocator allocates an entry in one of the two uop queues in front of the instruction schedulers.

Register Renaming

The register renaming logic renames the logical IA-32 registers, such as EAX, onto the processor's 128-entry physical register file. This allows the small, 8-entry, architecturally defined IA-32 register file to be dynamically expanded to use the 128 physical registers in the Pentium 4 processor. This renaming process removes false conflicts caused by multiple instructions creating their simultaneous but unique versions of a register such as EAX. There could be dozens of unique instances of EAX in the machine pipeline at one time. The renaming logic remembers the most current version of each register, such as EAX, in the Register Alias Table (RAT) so that a new instruction coming down the pipeline can know where to get the correct current instance of each of its input operand registers.

As shown in Figure 5, the NetBurst microarchitecture allocates and renames the registers somewhat differently than the P6 microarchitecture. On the left of Figure 5, the P6 scheme is shown. It allocates the data result registers and the ROB entries as a single, wide entity with a data and a status field. The ROB data field is used to store the data result value of the uop, and the
ROB status field is used to track the status of the uop as it is executing in the machine. These ROB entries are allocated and deallocated sequentially and are pointed to by a sequence number that indicates the relative age of these entries. Upon retirement, the result data is physically copied from the ROB data result field into the separate Retirement Register File (RRF). The RAT points to the current version of each of the architectural registers, such as EAX. This current register could be in the ROB or in the RRF.

The NetBurst microarchitecture allocation scheme is shown on the right of Figure 5. It allocates the ROB entries and the result data Register File (RF) entries separately.

[Figure 5: Pentium III vs. Pentium 4 processor register allocation]

The ROB entries, which track uop status, consist only of the status field and are allocated and deallocated sequentially. A sequence number assigned to each uop indicates its relative age. The sequence number points to the uop's entry in the ROB array, which is similar to the P6 microarchitecture. The Register File entry is allocated from a list of available registers in the 128-entry RF, not sequentially like the ROB entries. Upon retirement, no result data values are actually moved from one physical structure to another.

Uop Scheduling

The uop schedulers determine when a uop is ready to execute by tracking its input register operands. This is the heart of the out-of-order execution engine. The uop schedulers are what allow the instructions to be reordered to execute as soon as they are ready, while still maintaining the correct dependencies from the original program. The NetBurst microarchitecture has two sets of structures to aid in uop scheduling: the uop queues and the actual uop schedulers.

There are two uop queues: one for memory operations (loads and stores) and one for non-memory operations. Each of these
queues stores the uops in strict FIFO (first-in, first-out) order with respect to the uops in its own queue, but each queue is allowed to be read out-of-order with respect to the other queue. This allows the dynamic out-of-order scheduling window to be larger than just having the uop schedulers do all the reordering work.

There are several individual uop schedulers that are used to schedule different types of uops for the various execution units on the Pentium 4 processor, as shown in Figure 6. These schedulers determine when uops are ready to execute based on the readiness of their dependent input register operand sources and the availability of the execution resources the uops need to complete their operation.

These schedulers are tied to four different dispatch ports. There are two execution unit dispatch ports, labeled port 0 and port 1 in Figure 6. These ports are fast: they can dispatch up to two operations each main processor clock cycle. Multiple schedulers share each of these two dispatch ports. The fast ALU schedulers can schedule on each half of the main clock cycle, while the other schedulers can only schedule once per main processor clock cycle. They arbitrate for the dispatch port when multiple schedulers have ready operations at once. There is also a load and a store dispatch port that can dispatch a ready load and store each clock cycle. Collectively, these uop dispatch ports can dispatch up to six uops each clock cycle. This dispatch bandwidth exceeds the front-end and retirement bandwidth of three uops per clock, to allow for peak bursts of greater than 3 uops per clock and to allow higher flexibility in issuing uops to different dispatch ports. Figure 6 also shows the types of operations that can be dispatched to each port each clock cycle.
[Figure 6: Dispatch ports in the Pentium 4 processor
  Exec Port 0: ALU (double speed): Add/Sub, Logic, Store Data, Branches; FP Move: FP/SSE Move, FP/SSE Store, FXCH
  Exec Port 1: ALU (double speed): Add/Sub; Integer Operation: Shift/rotate; FP execute: FP/SSE-Add, FP/SSE-Mul, FP/SSE-Div, MMX
  Load Port: Memory Load: All loads, LEA, SW prefetch
  Store Port: Memory Store: Store Address]

Integer and Floating-Point Execution Units

The execution units are where the instructions are actually executed. The execution units are designed to optimize overall performance by handling the most common cases as fast as possible. There are several different execution units in the NetBurst microarchitecture. The units used to execute integer operations include the low-latency integer ALUs, the complex integer instruction unit, the load and store address generation units, and the L1 data cache.

Floating-point (x87), MMX, SSE (Streaming SIMD Extension), and SSE2 (Streaming SIMD Extension 2) operations are executed by the two floating-point execution blocks. MMX instructions are 64-bit packed integer SIMD operations that operate on 8-, 16-, or 32-bit operands. The SSE instructions are 128-bit packed IEEE single-precision floating-point operations. The Pentium 4 processor adds new forms of 128-bit SIMD instructions called SSE2. The SSE2 instructions support 128-bit packed IEEE double-precision SIMD floating-point operations and 128-bit packed integer SIMD operations. The packed integer operations support 8-, 16-, 32-, and 64-bit operands. See IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture [3] for more detail on these SIMD operations.

The integer and floating-point register files sit between the schedulers and the execution units. There is a separate 128-entry register file for both the integer and the floating-point/SSE operations. Each register file also has a multi-clock bypass network that bypasses or forwards just-completed results, which have not yet been written into the register file, to the new dependent uops. This multi-clock bypass network is needed because of the very high frequency of the design.

Low Latency Integer ALU

The Pentium 4 processor execution units are designed to optimize overall performance by handling the most common cases as fast
as possible. The Pentium 4 processor can do fully dependent ALU operations at twice the main clock rate. The ALU-bypass loop is a key closed loop in the processor pipeline. Approximately 60-70% of all uops in typical integer programs use this key integer ALU loop. Executing these operations at 1/2 the latency of the main clock helps speed up program execution for most programs. Doing the ALU operations in one half a clock cycle does not buy a 2x performance increase, but it does improve the performance for most integer applications.

This high-speed ALU core is kept as small as possible to minimize the metal length and loading. Only the essential hardware necessary to perform the frequent ALU operations is included in this high-speed ALU execution loop. Functions that are not used very frequently, for most integer programs, are not put in this key low-latency ALU loop but are put elsewhere. Some examples of integer execution hardware put elsewhere are the multiplier, shifts, flag logic, and branch processing.

The processor does ALU operations with an effective latency of one-half of a clock cycle. It does this operation in a sequence of three fast clock cycles (the fast clock runs at 2x the main clock rate), as shown in Figure 7. In the first fast clock cycle, the low-order 16 bits are computed and are immediately available to feed the low 16 bits of a dependent operation the very next fast clock cycle. The high-order 16 bits are processed in the next fast cycle, using the carry out just generated by the low 16-bit operation. This upper 16-bit result will be available to the next dependent operation exactly when needed. This is called a staggered add. The ALU flags are processed in the third fast cycle. This staggered add means that only a 16-bit adder and its input muxes need to be completed in a fast clock cycle. The low-order 16 bits are needed at one time in order to begin the access of the L1 data cache when used as an address
input.

Figure 7: Staggered ALU add (Flags, Bits <31:16>, Bits <15:0>)

Complex Integer Operations

The simple, very frequent ALU operations go to the high-speed integer ALU execution units described above. Integer operations that are more complex go to separate hardware for completion. Most integer shift or rotate operations go to the complex integer dispatch port. These shift operations have a latency of four clocks. Integer multiply and divide operations also have a long latency. Typical forms of multiply and divide have a latency of about 14 and 60 clocks, respectively.

Low Latency Level 1 (L1) Data Cache

The Level 1 (L1) data cache is an 8-Kbyte cache that is used for both integer and floating-point/SSE loads and stores. It is organized as a 4-way set-associative cache that has 64 bytes per cache line. It is a write-through cache, which means that writes to it are always copied into the L2 cache. It can do one load and one store per clock cycle.

The latency of load operations is a key aspect of processor performance. This is especially true for IA-32 programs that have a lot of loads and stores because of the limited number of registers in the instruction set. The NetBurst microarchitecture optimizes for the lowest overall load-access latency with a small, very low latency 8-Kbyte cache backed up by a large, high-bandwidth second-level cache with medium latency. For most IA-32 programs this configuration of a small, but very low latency, L1 data cache followed by a large medium-latency L2 cache gives lower net load-access latency and therefore higher performance than a bigger, slower L1 cache. The L1 data cache operates with a 2-clock load-use latency for integer loads and a 6-clock load-use latency for floating-point/SSE loads.

This 2-clock load latency is hard to achieve with the very high clock rates of the Pentium 4 processor. This cache uses new access algorithms to enable this very low load-access latency. The new algorithm leverages the fact that almost all accesses hit the first-level data cache and the data TLB (DTLB).
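As an illustration of the cache organization described above, the following Python sketch works out the geometry implied by the stated parameters (8 Kbytes, 4-way set-associative, 64-byte lines). The split of an address into tag, set index, and byte offset is the conventional textbook scheme and is only assumed here; the paper does not say which address bits the hardware actually uses, and the name split_address is hypothetical.

```python
# Parameters stated in the text for the L1 data cache.
CACHE_BYTES = 8 * 1024   # 8-Kbyte cache
WAYS = 4                 # 4-way set-associative
LINE_BYTES = 64          # 64 bytes per cache line

# Derived geometry.
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # number of sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # log2(64)
INDEX_BITS = NUM_SETS.bit_length() - 1          # log2(NUM_SETS)


def split_address(addr):
    """Split an address into (tag, set_index, byte_offset).

    Assumption: the set index is taken from the bits directly above the
    line offset. This is the usual textbook layout, not a documented
    detail of the Pentium 4 design.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print("sets:", NUM_SETS)
    print("offset bits:", OFFSET_BITS, "index bits:", INDEX_BITS)
    print("tag, index, offset for 0x12345:", split_address(0x12345))
```

Under these assumptions the byte offset and set index together occupy 11 address bits, which lie within the low-order 16 bits that the staggered ALU add produces first.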
At this high frequency and with this deep machine pipeline, the distance in clocks from the load scheduler to execution is longer than the load execution latency itself. The uop schedulers dispatch dependent operations before the parent load has finished executing. In most cases, the scheduler assumes that the load will hit the L1 data cache. If the load misses the L1 data cache, there will be dependent operations in flight in the pipeline. These dependent operations that have left the scheduler will get temporarily incorrect data. This is a form of data speculation. Using a mechanism known as replay, logic tracks and re-executes instructions that use incorrect data. Only the dependent operations are replayed; the independent ones are allowed to complete.

There can be up to four outstanding load misses from the L1 data cache pending at any one time in the memory subsystem.

Store-to-Load Forwarding

In an out-of-order-execution processor, stores are not allowed to be committed to permanent machine state (the L1 data cache, etc.) until after the store has retired. Waiting until retirement means that all other preceding operations have completely finished. All faults, interrupts, mispredicted branches, etc. must have been signaled beforehand to make sure this store is safe to perform. With the very deep pipeline of the Pentium 4 processor, it takes many clock cycles for a store to make it to retirement. Also, stores that are at retirement often have to wait for previous stores to complete their update of the data cache. This machine can have up to 24 stores in the pipeline at a time. Sometimes many of them have retired but have not yet committed their state into the L1 data cache. Other stores may have completed, but have not yet retired, so their results are also not yet in the L1 data cache. Often loads must use the result of one of these pending stores, especially for IA-32 programs, due to the limited number of registers available. To enable this use of pending stores, modern out-of-order execution processors
have a pending store buffer that allows loads to use the pending store results before the stores have been written into the L1 data cache. This process is called store-to-load forwarding.

To make this store-to-load-forwarding process efficient, this pending store buffer is optimized to allow efficient and quick forwarding of data to dependent loads from the pending stores. The Pentium 4 processor has a 24-entry store-forwarding buffer to match the number of stores that can be in flight at once. This forwarding is allowed if a load hits the same address as a preceding, completed, pending store that is still in the store-forwarding buffer. The load must also be the same size or smaller than the pending store and have the same beginning physical address as the store, for the forwarding to take place. This is by far the most common forwarding case. If the bytes requested by a load only partially overlap a pending store or need to have some bytes come simultaneously from more than one pending store, this store-to-load forwarding is not allowed. The load must get its data from the cache and cannot complete until the store has committed its state to the cache.

This disallowed store-to-load forwarding case can be quite costly, in terms of performance loss, if it happens very often. When it occurs, it tends to happen on older P5-core optimized applications that have not been optimized for modern, out-of-order execution microarchitectures. The newer versions of the IA-32 compilers remove most or all of these bad store-to-load forwarding cases, but they are still found in many old legacy P5-optimized applications and benchmarks. This bad store-forwarding case is a big performance issue for P6-based processors and other modern processors, but due to the even deeper pipeline of the Pentium 4 processor, these cases are even more costly in performance.

FP/SSE Execution Units

The Floating-Point (FP) execution cluster of the Pentium 4 processor is
where the floating-point, MMX, SSE, and SSE2 instructions are executed. These instructions typically have operands from 64 to 128 bits in width. The FP/SSE register file has 128 entries and each register is 128 bits wide. This execution cluster has two 128-bit execution ports that can each begin a new operation every clock cycle. One execution port is for 128-bit general execution and one is for 128-bit register-to-register moves and memory stores. The FP/SSE engine can also complete a full 128-bit load each clock cycle.

Early in the development cycle of the Pentium 4 processor, we had two full FP/SSE execution units, but this cost a lot of hardware and did not buy very much performance for most FP/SSE applications. Instead, we optimized the cost/performance tradeoff with a simple second port that does FP/SSE moves and FP/SSE store data primitives. This tradeoff was shown to buy most of the performance of a second full-featured port with much less die size and power cost.

Many FP/multimedia applications have a fairly balanced set of multiplies and adds. The machine can usually keep busy interleaving a multiply and an add every two clock cycles at much less cost than fully pipelining all the FP/SSE execution hardware. In the Pentium 4 processor, the FP adder can execute one Extended-Precision (EP) addition, one Double-Precision (DP) addition, or two Single-Precision (SP) additions every clock cycle. This allows it to complete a 128-bit SSE/SSE2 packed SP or DP add uop every two clock cycles. The FP multiplier can execute either one EP multiply every two clocks, or it can execute one DP multiply or two SP multiplies every clock. This allows it to complete a 128-bit IEEE SSE/SSE2 packed SP or DP multiply uop every two clock cycles, giving a peak 6 GFLOPS for single precision or 3 GFLOPS for double precision floating-point at 1.5GHz.

Many multimedia applications interleave adds, multiplies, and pack/unpack/shuffle operations. For integer SIMD operations, which are the 64-bit wide MMX or 128-bit wide SSE2 instructions, there are three
execution units that can run in parallel. The SIMD integer ALU execution hardware can process 64 SIMD integer bits per clock cycle. This allows the unit to do a new 128-bit SSE2 packed integer add uop every two clock cycles. A separate shuffle/unpack execution unit can also process 64 SIMD integer bits per clock cycle, allowing it to do a full 128-bit shuffle/unpack uop operation each two clock cycles. MMX/SSE2 SIMD integer multiply instructions use the FP multiply hardware mentioned above to also do a 128-bit packed integer multiply uop every two clock cycles.

The FP divider executes all divide, square root, and remainder uops. It is based on a double-pumped SRT radix-2 algorithm, producing two bits of quotient (or square root) every clock cycle.

Achieving significantly higher floating-point and multimedia performance requires much more than just fast execution units. It requires a balanced set of capabilities that work together. These programs often have many long-latency operations in their inner loops. The very deep buffering of the Pentium 4 processor (126 uops and 48 loads in flight) allows the machine to examine a large section of the program at once. The out-of-order-execution hardware often unrolls the inner execution loop of these programs numerous times in its execution window. This dynamic unrolling allows the Pentium 4 processor to overlap the long-latency FP/SSE and memory instructions by finding many independent instructions to work on simultaneously. This deep window buys a lot more performance for most FP/multimedia applications than more execution units would.

FP/multimedia applications usually need a very high bandwidth memory subsystem. Sometimes FP and multimedia applications do not fit well in the L1 data cache but do fit in the L2 cache. To optimize these applications, the Pentium 4 processor has a high bandwidth path from the L2 data cache to the L1 data cache. Some FP/multimedia applications stream data from
memory; no practical cache size will hold the data. They need a high bandwidth path to main memory to perform well. The long 128-byte L2 cache lines, together with the hardware prefetcher described below, help to prefetch the data that the application will soon need, effectively hiding the long memory latency. The high bandwidth system bus of the Pentium 4 processor allows this prefetching to help keep the execution engine well fed with streaming data.

Memory Subsystem

The Pentium 4 processor has a highly capable memory subsystem to enable the new, emerging, high-bandwidth stream-oriented applications such as 3D, video, and content creation. The memory subsystem includes the Level 2 (L2) cache and the system bus. The L2 cache stores data that cannot fit in the Level 1 (L1) caches. The external system bus is used to access main memory when the L2 cache has a cache miss and also to access the system I/O devices.

Level 2 Instruction and Data Cache

The L2 cache is a 256-Kbyte cache that holds both instructions that miss the Trace Cache and data that miss the L1 data cache. The L2 cache is organized as an 8-way set-associative cache with 128 bytes per cache line. These 128-byte cache lines consist of two 64-byte sectors. A miss in the L2 cache typically initiates two 64-byte access requests to the system bus to fill both halves of the cache line. The L2 cache is a write-back cache that allocates new cache lines on load or store misses. It has a net load-use access latency of seven clock cycles. A new cache operation can begin every two processor clock cycles for a peak bandwidth of 48 Gbytes per second, when running at 1.5GHz.

Associated with the L2 cache is a hardware prefetcher that monitors data access patterns and prefetches data automatically into the L2 cache. It attempts to stay 256 bytes ahead of the current data access locations. This prefetcher remembers the history of cache misses to detect concurrent, independent streams of data that it tries to prefetch ahead of use in the program. The prefetcher also
tries to minimize prefetching unwanted data that can cause over-utilization of the memory system and delay the real accesses the program needs.

400MHz System Bus

The Pentium 4 processor has a system bus with 3.2 Gbytes per second of bandwidth. This high bandwidth is a key enabler for applications that stream data from memory. This bandwidth is achieved with a 64-bit wide bus capable of transferring data at a rate of 400MHz. It uses a source-synchronous protocol that quad-pumps the 100MHz bus to give 400 million data transfers per second. It has a split-transaction, deeply pipelined protocol to allow the memory subsystem to overlap many simultaneous requests to actually deliver high memory bandwidths in a real system. The bus protocol has a 64-byte access length.

PERFORMANCE

The Pentium 4 processor delivers the highest SPECint_base performance of any processor in the world. It also delivers world-class SPECfp2000 performance. These are industry standard benchmarks that evaluate general integer and floating-point application performance.

Figure 8 shows the performance comparison of a Pentium 4 processor at 1.5GHz compared to a Pentium III processor at 1GHz for various applications. The integer applications are in the 15-20% performance gain range, while the FP and multimedia applications are in the 30-70% performance advantage range. For FSPEC 2000, the new SSE/SSE2 instructions buy about 5% performance gain compared to an x87-only version. As the compiler improves over time, the gain from these new instructions will increase. Also, as the relative frequency of the Pentium 4 processor increases over time (as its design matures), all these performance deltas will increase.

Figure 8: Performance comparison

For a more complete performance brief covering many application performance areas on the Pentium 4 processor, go to http://www.intel.com/procs/perf/pentium4

CONCLUSION

The Pentium 4 processor is a new, state-of-the-art processor microarchitecture and design. It is the beginning of a new family of processors that utilize the new Intel NetBurst microarchitecture. Its deeply pipelined design delivers world-leading frequencies and performance. It uses many novel microarchitectural ideas including a Trace Cache, double-clocked ALU, new low-latency L1 data cache algorithms, and a new high bandwidth system bus. It delivers world-class performance in the areas where added performance makes a difference, including media rich environments (video, sound, and speech), 3D applications, workstation applications, and content creation.

ACKNOWLEDGMENTS

The authors thank all the architects, designers, and validators who contributed to making this processor into a real product.

REFERENCES

1. D. Sager, G. Hinton, M. Upton, T. Chappell, T. Fletcher, S. Samaan, and R. Murray, "A 0.18um CMOS IA32 Microprocessor with a 4GHz Integer Execution Unit," International Solid State Circuits Conference, Feb 2001.

2. Doug Carmean, "Inside the High-Performance Intel Pentium 4 Processor Micro-architecture," Intel Developer Forum, Fall 2000, at ftp://download.intel.com/design/idf/fall2000/presentations/pda/pda_s01_cd.pdf

3. IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture, at intel [...] 245470.htm

4. Intel Pentium 4 Processor Optimization Reference Manual, at intel [...] 248966.htm

AUTHORS' BIOGRAPHIES

Glenn Hinton is an Intel Fellow and Director of IA-32 Microarchitecture Development in the Intel Architecture Group. Hinton joined Intel in 1983. He was one of three senior architects in 1990 responsible for the P6 processor microarchitecture, which became the Pentium Pro, Pentium II, Pentium III, and Celeron processors. He was responsible for the microarchitecture development of the Pentium 4 processor. Hinton received a master's degree in Electrical Engineering from Brigham Young University in 1983. His e-mail address is glenn.hinton@intel.com.

Dave Sager is a Principal Engineer/Architect in Intel's Desktop Platforms Group and is one of the overall architects of the Intel Pentium 4 processor. He joined Intel in 1995. Dave also worked for 17 years at Digital Equipment Corporation in their processor research labs. He graduated from Princeton University with a Ph.D. in Physics in 1973. His e-mail address is dave.sager@intel.com.

Michael Upton is a Principal Engineer/Architect in Intel's Desktop Platforms Group and is one of the architects of the Intel Pentium 4 processor. He completed B.S. and M.S. degrees in Electrical Engineering from the University of Washington in 1985 and 1990. After a number of years in IC design and CAD tool development, he entered the University of Michigan to study computer architecture. Upon completion of his Ph.D. degree in 1994, he joined Intel to work on the Pentium Pro and Pentium 4 processors. His e-mail address is mike.upton@intel.com.

Darrell Boggs is a Principal Engineer/Architect with Intel Corporation and has been working as a microarchitect for nearly 10 years. He graduated from Brigham Young University with a M.S. in Electrical Engineering. Darrell played a key role on the Pentium Pro processor design and was one of the key architects of the Pentium 4 processor. Darrell holds many patents in the areas of register renaming, instruction decoding, events, and state recovery mechanisms. His e-mail address is darrell.boggs@intel.com.

Douglas M. Carmean is a Principal Engineer/Architect with Intel's Desktop Products Group in Oregon. Doug was one of the key architects responsible for definition of the Intel Pentium 4 processor. He has been with Intel for 12 years, working on IA-32 processors from the 80486 to the Intel Pentium 4 processor and beyond. Prior to joining Intel, Doug worked at ROSS Technology, Sun, Cypress Semiconductor, and Lattice. Doug enjoys fast cars and scary Italian [...]. His e-mail address is douglas.m.carmean@intel.com.

Patrice Roussel graduated from the University of Rennes in 1980 and L'Ecole Superieure d'Electricite in 1982 with a M.S. degree in signal processing and VLSI design. Upon graduation, he worked at Cimatel, an Intel/Matra Harris joint design center. He moved to the USA in 1988 to join Intel in Arizona and worked on the 960CA chip. In late 1991, he moved to Intel in Oregon to work on the P6 processors. Since 1995, he has been the floating-point architect of the Pentium 4 processor. His e-mail address is patrice.roussel@intel.com.

Copyright Intel Corporation 2001. This publication was downloaded from http://developer.intel.com/. Legal notices at intel [...]