Computer Architecture Concepts
Computer Architecture Concepts CGS 3269
University of Central Florida
This 60-page set of class notes was uploaded by Carmelo Connelly on Thursday, October 22, 2015. The notes belong to CGS 3269 at the University of Central Florida, taught by staff in Fall. Since its upload, it has received 18 views. For similar materials see /class/227613/cgs-3269-university-of-central-florida in Computer General Studies at the University of Central Florida.
The Tiny Von Neumann Architecture (CGS 3269, Spring 2007)

[Diagram: memory with addresses 000 through 100, and a CPU containing an ACCUMULATOR and an INSTRUCTION REGISTER split into OP and ADDRESS fields]

The Instruction Cycle

FETCH: Fetch an instruction from the memory address pointed to by the PC (program counter), place it in the instruction register (IR), and increment the PC by one.

EXECUTE: The OP field tells the computer which instruction must be executed.

NOTE: The PC always points to the instruction that will be fetched next. When the program is loaded into memory, the PC points to the first instruction of the program.

Instruction Set Architecture

LOAD <X>      Loads the contents of memory location X into the AC (AC stands for Accumulator).
ADD <X>       The data value stored at address X is added to the AC and the result is stored back in the AC.
STORE <X>     Stores the contents of the AC into memory location X.
SUB <X>       Subtracts the value located at address X from the AC and stores the result back in the AC.
IN <Device>   A value from the input device is transferred into the AC.
OUT <Device>  Prints the contents of the AC on the output device.
END           The machine stops execution of the program.
JMP <X>       Causes an unconditional branch to address X (PC <- X).
SKIPZ         If the contents of the Accumulator = 0, the next instruction is skipped.

Devices: 5 = keyboard, 7 = printer, 9 = screen. For instance, you can write "003 IN <5> 23", where 23 is the value you are typing in.

Instruction Set Architecture Codes

01 -> LOAD
02 -> ADD
03 -> STORE
04 -> SUB
05 -> IN
06 -> OUT
07 -> END
08 -> JMP
09 -> SKIPZ

Instruction format (assuming 6 digits): 2 digits are used for the op code and 4 for the address. For example, LOAD <0004> is represented as 010004 using decimal numbers.

Overview

At this point we are considering only von Neumann machines, which were introduced in the previous set of notes. The von Neumann
architecture is based upon three key concepts:

1. Data and instructions are stored in a single read/write memory.
2. The contents of this memory are addressable by location, without regard to the type of data contained in the location.
3. Execution occurs in a sequential fashion (unless explicitly modified) from one instruction to the next.

The Central Processing Unit (CPU) is the computer's brains. All the other components within the computer (I/O system, monitors, etc.) are there basically to bridge the gap between the user and the CPU. The CPU itself consists of three major components: a register set, an arithmetic-logic unit, and a control unit. These three components typically communicate (exchange data) amongst themselves using local buses. The CPU communicates with the I/O system and memory system using one or more system-level buses, local buses, or expanded local buses. The typical CPU configuration for a simple computer is shown in Figure 1.

[Figure 1: High-level view of the CPU, showing the registers, the arithmetic and logic unit with its status flags and shifter, the control unit, and the internal CPU bus paths]

CGS 3269: CPUs and Microprocessors

When the CPU exchanges data with the memory module, it typically makes use of two internal registers: a memory address register (MAR), which specifies the address in memory at which the next read or write operation will occur, and a memory buffer register (MBR), which either holds the data that will be written into the memory location or receives the data read from the memory location. The CPU has other such registers as well, such as an I/O address register (IOAR), which specifies a particular I/O device, and an I/O buffer register (IOBR), which is used for the exchange of data between an I/O device and the CPU. We'll examine each of the three components of a CPU briefly and then go back and look at each in more detail once some background information is covered.

The ALU (Arithmetic-Logic Unit)

All computers have functional units that perform the arithmetic, logical, and shift operations required by the instruction set; we'll use
the terms functional unit and ALU interchangeably. All the other components of the system are there mainly to bring data into the functional unit for it to process and then to take it back out. In a sense, the functional unit is the core of the computer. Some computers have a single functional unit (the ALU), but more advanced CPUs have several independent functional units, and functional units may themselves contain functional units. Early microprocessors that contained a single ALU were often augmented with a special floating-point unit (FPU) on another chip called a math coprocessor. Intel's early math coprocessor family, the x87 line, is shown in Appendix C. Current technology incorporates these coprocessors into the main chip.

Typically, the ALUs communicate with the control unit via dedicated control and status buses, while local data buses carry data to and from the CPU's registers. Some computers have special floating-point coprocessors to implement more complex arithmetic operations. Floating-point units generally arrange their registers in stacks; stack-oriented floating-point units are considered "lean and mean". RISC machines (Reduced Instruction Set Computers; we'll see these later) commonly have several independent functional units that implement different types of instructions, such as branch processing. It is quite common for systems with more than one functional unit to employ pipelining techniques to supply the functional units with instructions and data.

The overall speed of a von Neumann machine depends largely on the speed of the computational circuitry in its functional units, so design emphasis is placed on efficient functional units. Data is sent to the functional unit via registers, and the results of an operation performed by the functional unit are returned via storage into a set of registers. These registers are located within the CPU and are connected via signal paths to the ALU. The ALU may also set various flags as the result of an operation. For example, if an
addition operation produces a result too large to fit into the result register, the ALU sets an overflow flag. The current values of all of the flags that can be set by the ALU are maintained in CPU registers. The control unit provides the signals that control the operations of the ALU and directs the movement of data into and out of the ALU. Figure 2 illustrates the ALU of a simple computer (i.e., no branch prediction units, no pipelining, no floating-point processors, etc.).

[Figure 2: High-level view of the ALU]

The Control Unit

Figure 3 illustrates a von Neumann machine with a program in main memory. While the CPU executes the current instruction, the PC register (program-counter register) holds the address of the next instruction to be executed. The job of the control unit is to control the von Neumann machine cycle, which is:

1. Fetch from the memory the next instruction to be executed and place it in the IR (instruction register), shown as step 1 in part (b) of Figure 3, and increment the PC so that it contains the address of the next instruction in memory, shown as step 2 in part (b) of Figure 3.

2. Decode and execute the instruction just fetched (now in the IR).

In reality, only the simplest of computers actually operate in this fashion. We will use this model to understand the basic processes that occur within the CPU. Later we will examine some of the features of modern systems that make this sequence of operations much more complex, but also more powerful.

[Figure 3a: Simple von Neumann machine with a program in main memory. Figure 3b: Fetching the next instruction into the IR over the data bus and incrementing the PC.]

The control unit is responsible for generating the signals that regulate the computer. For the simple system that we are
currently considering, the control unit will typically send micro-orders: individual signals, sent over dedicated control lines, that control individual components and devices. An example might be a control signal that sets or clears a particular status flag; a specific example of this would be a "clear carry" signal that tells the ALU to clear the addition-carry status bit (flag) (see the ALU discussion above).

In modern systems it is much more common for the control unit to generate sets of micro-orders which operate concurrently, rather than individual micro-orders. A set of micro-orders issued by the control unit at one time is called a microinstruction. Whenever a computer executes a machine instruction from its instruction set, the control unit issues a sequence of microinstructions; this sequence is called a microprogram. Although it is possible for a microprogram to consist of a single microinstruction, typically it will consist of a sequence of them. For example, when an accumulator-based machine executes an ADD instruction, the control unit issues microinstructions for computing the address of the first operand in memory, for reading that memory location and transferring the operand found at that address into the ALU, for transferring the second operand from the accumulator to the ALU, for adding the two values, and for transferring the result computed by the ALU back into the accumulator. The exact number and type of microinstruction sequences generated by the control unit depends upon many factors, including the complexity of the addressing calculations required and the availability of different types of buses within the machine.

There are basically two different types of control units: microprogrammed and conventional (hardwired). Most of the computers built during the 1970s and 1980s had microprogrammed control units, whereas high-speed systems and RISC processors use the conventional, hardwired form. Microprogrammed control
units are relatively easy to design and enable the implementation of complex instruction sets at relatively little cost. They are, however, slower than conventional control units, which makes them difficult to use in generating the control signals needed for high-performance or RISC machines. Appendices A and B give an overview of how RISC and CISC processors differ.

For many years the general trend in computer architecture and organization was toward increasing processor complexity: more instructions, more addressing modes, more specialized registers, and so forth. The RISC systems represent a complete departure from that trend. In the mid-1980s in particular, RISC versus CISC was a strongly debated topic in the computer architecture world, with proponents on both sides extolling the virtues of their architecture. In recent years this debate has for the most part died away, because there has been a gradual convergence of the two architectures. As chip densities and hardware speeds have increased, RISC systems have become more complex (the trend is to become more CISC-like). At the same time, in order to gain maximum performance, CISC systems have adopted strategies such as increasing the number of general-purpose registers and emphasizing and refining instruction pipelines (the trend is to become more RISC-like). A case in point is Intel's Pentium family of processors, whose instruction set is very much CISC-oriented, yet the processor employs many RISC-like strategies such as out-of-order execution, instruction pipelining, multiple functional units, and many RISC-style commands.

CPU Operations and Instruction Sets

Instructions are the basic units for telling the microprocessor what to do. To execute a single instruction, the microprocessor must carry out hundreds, thousands, or even millions of logic operations; the instruction, in effect, triggers a cascade of logical operations. How this cascade is controlled marks the great divide in microprocessor and
computer design.

In the hardwired design, an instruction simply activates the circuits that carry out all of the steps required to execute it. The primary advantage of this design is that it provides very fast execution, since the hardwired direct connections add no overhead to the execution of the instruction. The primary disadvantage is that the hardware and the software (the set of instructions that can be executed on the machine) become irrevocably tied together: changes in the hardware of the machine require changes in the code that will execute on the machine. The hardwired approach to instruction sets is completely inflexible, because the instructions directly control the underlying hardware.

The need for more flexible instruction sets led IBM to define the world's first computer architecture, in which the set of instructions that a computer based on this architecture would be able to execute was defined, but the circuitry that would carry out each instruction was not. IBM adopted an idea called microcode to handle this design. With microcode, an instruction causes the chip to execute a small program that carries out the logic operations required by the instruction. The collection of small programs for all of the instructions that the computer understands is its microcode. Although the additional layer of microcode causes the machines to become more complex, it adds a great deal of flexibility to the design: new technologies can be incorporated into the hardware and yet still run the same microcode. This provides backward compatibility of newer machines with older machines. In effect, the microcode inside a microprocessor is a secondary set of instructions that runs invisibly inside the chip on a "nanoprocessor", essentially a microprocessor within a microprocessor.

The primary advantage of the microcode technique is that it makes creating a complex microprocessor easier than the hardwired
approach: the powerful data-processing circuitry of the chip can be designed independently of the instructions that it must carry out. The primary disadvantage of the microcode technique is that the microprocessor, and the computers that use it, become more complicated. In a microcoded microprocessor the nanoprocessor must step through several of its own microcode instructions to carry out every instruction sent to the microprocessor. More steps mean more processing time taken for each instruction, and more processing time means slower operation.

To compensate for this performance penalty, the microcode technique allows very complex instructions to be formulated. Very elaborate functions can be designed into the instruction set of a microprocessor using microcode: a single instruction from the instruction set might do the work of half a dozen or more simpler instructions. Although each instruction takes longer to execute because of the microcode, programs need fewer instructions overall to accomplish the same task. Moreover, adding more instructions could further boost this speed gain. One result of this is that most typical PC microprocessors have 7 different subtraction commands.

The Register Set

The register set is the third of the three major components of a CPU that we will examine individually. A computer system employs a memory hierarchy: at higher levels in the hierarchy, the memory is faster, smaller in total number of bytes, and more expensive per bit. Within the CPU itself there is also a memory hierarchy. The register set is at the highest level, followed by the first of possibly several levels of cache memory, followed by main memory. The registers in the CPU serve two primary functions:

1. User-visible registers. These enable the machine-language or assembly-language programmer to minimize main memory references by optimizing the use of the registers. User-visible registers fall into four general categories:

i. General-purpose registers: may be
assigned a variety of functions by the programmer. Some may be dedicated to floating-point or stack operations. In some machines the general-purpose registers may be used for addressing functions (e.g., register indirect, displacement values).

ii. Data registers: can be used only to hold data values and cannot be used in the calculation of an operand address.

iii. Address registers: may be somewhat general-purpose as far as addressing modes are concerned, or they may be devoted to a particular addressing mode. For example, some address registers may be index registers for indexed addressing modes; others may hold the addresses of the tops of stacks, for machines which support user-visible stack addressing. On machines that support segmented addressing, segment registers will hold the base address of the segment.

iv. Condition code registers (also referred to as flags): hold the condition code bits set by the CPU to indicate the status of operations it has performed. For example, an arithmetic operation may produce a negative, zero, positive, or overflow result; a condition code will be set by the CPU to indicate which result was produced, and this can be tested by examining the condition code for the operation.

2. Control and status registers. These are used by the control unit to control the operation of the CPU, and also by privileged operating-system programs to control the execution of programs. The exact number and use of these registers varies from machine to machine, but the following list is a fairly complete set of the control and status registers that would be found on an average machine:

i. Program counter (PC): contains the address of the next instruction to be fetched for execution.

ii. Instruction register (IR): contains the instruction most recently fetched (most likely the instruction currently in execution).

iii. Memory address register (MAR): contains the address of some location in memory.

iv. Memory buffer register (MBR): contains the word of data to be written to memory, usually
to the address contained in the MAR, or the word most recently read from memory.

v. Program status word (PSW): all CPUs include a register, or a set of registers, that contains status information about the processor. Usually this register contains various condition codes as well as status information, such as an indication of whether the processor is executing in supervisor or user mode (certain instructions from the instruction set can only be executed when the processor is in supervisor mode; similarly, certain memory areas can only be addressed while in supervisor mode). An indication of the interrupt status (enabled/disabled), as well as arithmetic operation conditions, will also be included in this register.

Depending upon the machine, several other control registers may be present, such as a register which holds pointers to PCBs (process control blocks). Since the CPU is expected to work closely with an operating system, much of the design of the control and status registers is focused on providing proper support for the operating system. There is often not a clear distinction between these two sets of registers; for example, on many machines the PC register (the program-counter register) is user-visible, while on other machines it is not.

The I/O System

Even though the I/O system is not part of the CPU, it plays an important role in the overall performance of a computer system. We will briefly discuss the I/O system and examine its role in the interaction with the CPU.

The set of all physical I/O devices and I/O interface devices makes up the I/O system. Physical I/O devices are those which actually perform I/O, such as printers, video displays, operator consoles, etc. I/O interface devices communicate with the CPU on one side and with the physical I/O device on the other side; these interface devices isolate the CPU from the specific characteristics of the physical device. The various I/O devices that can be connected to a modern computer vary widely in the rate at which data
can be input from the device or output to the device. When computer systems began to evolve in the early 1950s, the primary forms of input were the operator's console, card readers, and magnetic or paper tape. The operator consoles were essentially electromechanical typewriters which sent a particular electrical signal to the computer depending upon which key the operator pressed. These devices were too costly for the typical user to use as an input device, so punched cards or tapes were typically employed. Today there are literally hundreds of different types of input devices, ranging from terminals and PC keyboards to mice, trackballs, scanners, digital cameras, etc.

The amount of data that a user can produce via a terminal is limited to around a few hundred characters/second. Even with tens of thousands of terminals connected to a single computer, the aggregate data rates are only comparable to the rates from a single disk drive. Video cameras, by contrast, can digitize entire images in a fraction of a second and produce millions of bytes of information per second; few computers currently have either the computational power or enough memory to perform any significant processing of digitized images at these rates. On the output side, the fastest of the early printers could produce about 1000 lines per minute (approximately 1300 characters/second). High-speed laser printers today can print around 100 pages/minute (approximately 8000 characters/second).

The I/O unit matches the signal levels and timing of the CPU's internal solid-state circuitry to the requirements of the other components inside the computer. The internal circuits of the CPU are designed to be very stingy with electricity so that they can operate faster and cooler. These delicate internal circuits cannot handle the higher currents needed to link to external circuits; consequently, each signal leaving the microprocessor goes through a signal buffer in the I/O unit that boosts its current capacity.

As computer systems developed, three different
ways of handling I/O developed that still exist in modern computer systems: CPU-controlled I/O, memory-mapped I/O, and direct-memory-access (DMA) I/O. We'll look very briefly at each of these.

CPU-controlled I/O

The designers of early computer systems paid little attention to I/O processing, and the CPU directly controlled the I/O devices using very simple I/O instructions (remember, the devices weren't very sophisticated either, so the instructions didn't need to be too complex). Instructions were of the form "Write A to Device N", where A was typically a register address and N designated an I/O device address. The instructions typically transferred one byte or one word at a time. The computers ran one program at a time, and each program executed its own instructions for I/O.

As computer systems evolved, an ever-widening gap between the speed of the CPU and the speed of the I/O devices began to appear, and it has become even wider today. The challenge, then, has been to find ways to keep CPU utilization high even though the I/O devices are very slow. Three different solutions have emerged, all based upon other developments whose potential performance gains have been exploited. These three areas are:

1. Multiprogrammed operating systems. The operating system loads several different programs into the memory at the same time. The CPU can then execute instructions from one of the programs while another waits for an I/O operation to complete.

2. Multi-ported memory systems. This type of memory, which we will examine more closely when we look at memory systems, either allows several processors to access the memory simultaneously or arbitrates requests for memory cycles amongst the competing processors and I/O devices, which allows memory sharing amongst the processors and I/O devices.

3. I/O processors. These special I/O interfaces, which include devices called DMA channels and peripheral processing units (PPUs), can control the I/O devices without CPU intervention.
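The cost of CPU-controlled I/O can be made concrete with a toy sketch. The following Python fragment is not from the notes; the `Device` class and its `ready`/`read_byte` interface are invented for illustration. It shows the defining features of programmed I/O: the CPU itself moves one byte per I/O instruction and busy-waits on the device, which is why a slow device leaves a fast CPU doing no useful work.

```python
# A toy model of CPU-controlled (programmed) I/O. The Device class is
# hypothetical and not modeled on any real hardware.

class Device:
    """A character device that hands out one byte at a time."""
    def __init__(self, data):
        self.data = list(data)
    def ready(self):
        return bool(self.data)       # status flag the CPU must poll
    def read_byte(self):
        return self.data.pop(0)

def cpu_controlled_read(device, count):
    buffer = []
    for _ in range(count):
        while not device.ready():    # CPU busy-waits: no useful work done
            pass
        buffer.append(device.read_byte())  # one byte per "IN" instruction
    return buffer

print(cpu_controlled_read(Device(b"hi"), 2))   # -> [104, 105]
```

The multiprogramming and DMA approaches described above exist precisely to eliminate that polling loop: either the CPU switches to another program while the device is not ready, or a DMA controller moves the whole block without CPU involvement.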
Memory-mapped I/O

Rather than have special instructions for handling I/O, some computers (typically micros and minis) use memory-mapped I/O. In memory-mapped I/O, certain specific addresses within the memory space are reserved for controlling the I/O interface devices. Essentially, the I/O interface devices take the place of a portion of the physical main memory, and the CPU controls them using standard instructions that read and write to memory. For example, if the physical address space of a computer is 64K, the designer may reserve the upper 16K for I/O interface devices. The I/O interface devices decode the addresses that appear on the address bus; each I/O interface device responds to one or more specific I/O port addresses (typically there are control ports, status ports, input ports, and output ports). I/O port addresses are simply main-memory addresses, and they are mutually exclusive in that no other I/O interface device or memory will respond to the same port addresses.

A processor that uses memory-mapped I/O requires no special I/O instructions. A store operation to an output port sends data to the attached I/O interface device, and a load operation from an input port receives data from an I/O interface device; in fact, the CPU cannot distinguish between a memory access and an access to an I/O interface device. A store operation to a control port sends an I/O command to the interface device, and a load operation from a status port gets status information from the device. To output a value, the CPU simply stores the output value in the correct output-port address, and to input a value, the CPU loads it from the correct input-port address. CPU-controlled I/O and memory-mapped I/O are not exclusive concepts: computers may have both types of I/O control (the IBM PC was an example). There are a large number of I/O interface devices, including DMA controllers, programmable parallel interfaces (PPIs), and universal asynchronous receiver-transmitters (UARTs).

DMA I/O

Hardware devices that directly
control the transfer of data to and from main memory are called direct-memory-access (DMA) controllers. Some systems have very simple DMA controllers, while other systems may have quite powerful ones. With the simple type of DMA controller, for each transfer of data the CPU sends the DMA controller the memory address for the block of data, the number of bytes to transfer, and the direction of transfer (input or output). The DMA controller then performs the transfer without CPU intervention and interrupts the CPU when the transfer is completed, signaling the CPU that the operation has finished.

In simple single-bus systems, the DMA controller becomes the bus master during the transfer operation (we'll see more about this when we discuss the memory system; for now it means that the DMA controller is the only device that can use the bus while it is the master). While the DMA controller is the bus master, the CPU (which is working on some other problem now) may have to wait to use the bus; the DMA controller is said to "steal" bus cycles from the CPU. Cycle stealing refers to any situation in which an I/O device causes the CPU to wait because the I/O device currently has exclusive access to a shared resource.

IBM machines are typical of machines that employ relatively simple DMA controllers, in the form of channels. In IBM nomenclature, a channel is an I/O processor that executes DMA I/O under the supervision of a channel program. Current channels employ large amounts of cache memory for data buffering, acting as a speed-matching device between the CPU and the I/O device. A channel is essentially a small processor with a very limited instruction set, primarily capable only of transfer operations. Selector channels control multiple high-speed devices and, at any specific point in time, are dedicated to the transfer of data with only one of the devices they control. Multiplexor channels can handle I/O with several devices simultaneously. For slow-speed
devices, a byte multiplexor accepts or transmits bytes of data as fast as possible to multiple devices. For high-speed devices, a block multiplexor interleaves blocks of data from several devices simultaneously. At the other end of the spectrum are machines that employ very sophisticated I/O processors called peripheral processing units (PPUs). The mainframe systems built by Control Data Corp. are examples of this style: the PPUs are complete (though simple) computers with their own memory and, in addition to data transfer operations, are capable of computations such as data formatting, character translation, buffering, etc.

A More Detailed Look At The CPU

Now that we have a better idea of what the CPU generally looks like and what is going on inside it at any point in time, we will focus more on the details. In the remainder of this section of notes we'll concentrate on current systems, without too much concern for machines of the past (although we won't exclude them entirely). Once some background is presented describing the architecture of current state-of-the-art processors, we'll look closer at the design techniques employed in these processors that give them such high performance.

The current production standard for high-performance processors is Intel's Pentium III processor. This will very soon be supplanted by both the Pentium IV and the Itanium (based on the IA-64 architecture) microprocessors; see Appendix E for a brief description of the Itanium processor. The Pentium III (hereafter referred to as the P3) comes in both a "standard" configuration and a Xeon configuration. Xeon machines support up to 2MB of L2 cache. The L1 cache can be up to 16K. Data moves between the L1 cache and the CPU at the clock speed of the processor (400 MHz). The P3 chip contains approximately 9.5 x 10^6 transistors in the core logic and another approximately 18.5 x 10^6 transistors in the onboard L2 cache, for a total of about 28 x 10^6 transistors. In the Xeon configuration the L2 cache can be expanded up
to 2 MB. The Pentium II (hereafter referred to as the P2) was the immediate predecessor of the P3. The P2 chip contains about 7.5 x 10^6 transistors. The L1 cache in a P2 is double the size of that in the Pentium Pro (the Pro was the immediate predecessor of the Pentium II and had an 8K L1 data cache and an 8K L1 instruction cache); both L1 caches in the P2 are 16K.

A major architectural difference of the Pentium architecture compared to the earlier x86 architecture is that the Pentium, Pentium Pro, and P2 have two ALUs for integer calculations and a single FPU (floating-point unit). The P3 has two FPUs, one dedicated to the MMX unit and the other shared by the CPU and the Internet Streaming SIMD Extensions unit. These additional execution units allow the Pentium family architecture to show superscalar speedup over the 486 architecture. P2 and P3 processors contain the MMX unit, which is a special-purpose unit (processor) designed to improve the performance of graphics and multimedia software. In the P2 this is done via a set of 57 different instructions tailored to perform the small repetitious operations that are commonly needed in multimedia; the FPU is shared by the main processor and the MMX unit. In the P3 the MMX unit gets its own FPU, and the number of MMX instructions has been increased to around 70.

The CPU isn't the only microprocessor in most modern systems. There are several coprocessors which handle specific types of tasks, such as graphics and video (AGP 3D accelerators), DSPs (digital signal processors) on sound cards, etc.

Clocked Logic

Microprocessors do not carry out instructions as soon as the instruction code signals reach the pins that connect the microprocessor to the computer's circuitry. Electrical signals do not change state instantaneously; instead, they always go through a brief but measurable transition period during which the states of all signals (voltage levels) stabilize to their final values. The microprocessor must wait for an indication that all signals are
valid, i.e., that it has a valid command to execute. This indication is the ticking of the system clock. At each "tick" of the clock, the microprocessor checks the instructions given to it, if it is not already processing an instruction. Early microprocessors could not execute 1 instruction/clock cycle; many instructions required as many as 100 clock cycles. In current systems, which employ many RISC-like features, many instructions take less than 1 clock cycle to execute, via multiple ALUs, pipelined ALUs, and SIMD techniques inside the MMX units on Pentiums.

Clock multipliers allow the microprocessor circuitry to run faster than the system clock. The system clock is multiplied by the clock multiplier (typically 1.5, 2.0, 2.5, 3.0, etc.). In this fashion the CPU is allowed to operate internally at a rate which is faster than the system clock would allow. The lack of correspondence between clock cycles and instruction execution speed means that the clock speed of the system alone is not a good metric for comparing the performance of two different processors; see the example below. Clock speed only gives a reliable indication of relative performance when two identical microprocessors operating at different frequencies are compared.

Example: Suppose that processor P1 requires an average of six clock cycles per instruction and its system clock runs at 400 MHz. Processor P2 requires an average of two clock cycles per instruction and its system clock runs at 200 MHz.

For P1: 400 MHz / 6 cycles per instruction, which is about 67 million instructions per second
For P2: 200 MHz / 2 cycles per instruction = 100 million instructions per second

Thus (100 - 67) / 67 is about 50%, so P1 is 50% slower than P2 even though its clock speed is twice as fast as P2's.

The Advanced Technologies of Modern CPUs

Overview: Today's higher clock speeds make circuit boards and integrated circuits more difficult to design and manufacture. Designers have a strong incentive to get their microprocessors to process more instructions at a given speed. Most modern microprocessor designs intend to do just that.
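The P1/P2 arithmetic above can be sketched as a short calculation (a throwaway helper for this example only, not part of any real toolchain):

```python
# Throughput from clock rate and average cycles per instruction (CPI),
# mirroring the P1/P2 example above.
def mips(clock_mhz: float, avg_cpi: float) -> float:
    """Millions of instructions per second = clock (MHz) / cycles per instruction."""
    return clock_mhz / avg_cpi

p1 = mips(400, 6)          # about 66.7 MIPS
p2 = mips(200, 2)          # 100.0 MIPS
slowdown = (p2 - p1) / p1  # P1 is ~50% slower than P2
print(f"P1: {p1:.1f} MIPS, P2: {p2:.1f} MIPS, P1 slower by {slowdown:.0%}")
```

Note that the ratio comes out to exactly 50%: halving the clock while tripling the instruction rate per cycle is a net 1.5x win for P2.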
One way to speed up execution of instructions is to reduce the number of internal steps the microprocessor must take to execute an instruction. Reducing the number of steps can be accomplished with two basic techniques: make the processor more complex so that internal steps can be combined, or make each instruction simpler so that fewer steps are needed. The former technique is the one taken by CISC designers, while the latter technique is employed by RISC designers. Another way of reducing the number of clock cycles required to execute a program is to operate on more than one instruction simultaneously. Two basic approaches to operating on more than one instruction simultaneously are pipelining and superscalar architecture. Both CISC and RISC designs take advantage of these techniques, as well as others.

UCF CGS 3269 COMPUTER SYSTEMS ARCHITECTURE, SPRING 2002

Basic Components of the PC Architecture

There are five key parts to a computer:
- The Processor, also called the Central Processing Unit or CPU
- The Memory, of which there are several types
- The Input/Output circuitry
- Disk storage
- Programs

Figure 1: The major components of a PC

There are also other components that form part of the packaging and support for these basics, such as the power supply, the motherboard, and the peripheral cards.

The Processor

The purpose of the processor is to carry out a series of steps called a program. To carry out this job, the processor has certain capabilities. The first capability is the ability to read and write information in the computer's memory. This is critical because both the program instructions that the processor carries out and the data on which the processor works are stored in the computer's memory. The next capability is to recognize and execute a series of commands or instructions provided by the programs. The last is the capability to tell the other parts of the computer what to do, so that the
processor can orchestrate the operation of the computer.

Memory

Memory is where the computer's processor finds programs and data when it is doing its assigned task. The computer's memory is just a temporary space, like a scratch pad or a chalkboard, where the computer scribbles while work is being done. Unlike our memories, the computer's memory is not a permanent repository. Instead, the computer's memory simply provides a place where computing can happen. While the computer's processor makes a vital distinction between programs and data, the computer's main memory does not. To the computer's memory (and to many other parts of the computer) there is no difference between programs and data: both are information to be recorded temporarily.

Note: Most of today's systems come with dedicated processor cache memory which, for reasons of speed optimization, does distinguish between data that is program code and data that is the user's content. However, your computer's main RAM, which is what "memory" refers to in general, makes no such distinction.

Input/Output Devices

The processor and the memory by themselves make up a closed world. I/O devices open that world and enable it to communicate with us. An I/O device is anything other than memory with which the computer communicates. These devices include the keyboard, the display screen, the mouse, the printer, a telephone line connected to the computer, and any other channel of communication into or out of the computer. It also includes the circuitry that manages the video images on your monitor or plays sounds on your computer speakers, even if that circuitry is built onto the motherboard. Taken together, I/O is the computer's window on the world, the thing that keeps the processor and memory from being a closed and useless circle.

Disk Storage

Disk storage refers to nonvolatile memory that does not change or disappear when the power goes off. The processor can write to it and read from it at will, but nonvolatile memory will keep whatever data is stored in it
for months or even longer without any external power whatsoever.

Note: Memory in general can be classified into two categories:
- Volatile memory
- Nonvolatile memory

In volatile memory, the computer system must use power and dedicated circuitry to constantly rewrite, or refresh, every piece of data that is stored in memory. If this refreshing did not occur, the data in memory would simply fade away. This susceptibility to losing data in a power loss, or even a power drop such as a brownout, is called volatility. An example of volatile memory would be your system's RAM. In nonvolatile memory, the data stored does not change or disappear when the power goes off; nonvolatile memory will keep whatever data is stored in it for months or even longer without any external power whatsoever. An example of nonvolatile memory would be your computer's hard drive.

Programs

Programs tell the computer what to do. There are two categories of programs:
- Systems programs
- Application programs

All programs accomplish some kind of work. Systems programs help operate the computer itself; in fact, the inner systems of a computer are so complex that you can't get them to work without the help of systems programs. An application program carries out a task which you, the user, want done, whether it's composing a document or surfing the Internet. A few of the systems programs that the PC needs to manage its operations are permanently built into it. These can be called the ROM programs, or firmware, because they are permanently stored in read-only memory, unlike re-writeable memories like RAM or hard drives. These kinds of system programs do the most fundamental kind of supervisory and support work, such as providing essential services that all the application programs use. These service programs are called the Basic Input/Output System, also referred to as the BIOS. Other systems programs build on the foundation created by the BIOS program and provide a higher level of support services. Operating systems such as
Linux and Microsoft Windows are examples of these higher-level systems programs that are not built into the computer.

Appendix A: How a CISC Processor Works

The numbers below refer to the numbers in the diagram above.

1. Built into a CISC microprocessor's ROM is a large set of microprograms, each of which contains the microinstructions (or microoperations) that must be carried out to perform a single instruction, such as the adding of two numbers or the moving of a string of text from one location in memory to another.

2. Whenever the operating system or application program requires the CPU to perform an instruction, the details of what the CPU will need, including the name of the instruction to be executed, are sent to the CPU as a CISC instruction. These commands are of varying size. Because CISC instructions are not all the same size, the control unit must examine the instruction, determine the number of bytes of memory the command will require, and set aside that much memory. CISC instructions also provide many different ways that an instruction can be loaded and stored, and the control unit must determine the correct technique for each instruction. Both of these tasks require time that ultimately slows down CISC processors.

3. The control unit sends the instruction received from the OS or the application to a decode unit, which translates the complex command into microcode to be executed by the nanoprocessor.

4. The nanoprocessor is like a processor within a processor, and is specifically designed to handle microinstructions. Since an instruction may depend upon the results of another instruction, the instructions are performed one at a time. All other instructions stack up until the current instruction is completed.

5. The nanoprocessor executes each of the microinstructions from the microprogram through circuitry that can be quite complex. A particular instruction may need to pass through several different steps before it is completely executed.
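The CISC flow just described can be sketched in a few lines. The micro-op names and the one-cycle-per-micro-op assumption below are invented for illustration; real Intel microcode is not public and is far more involved:

```python
# Toy model of a CISC microprogram ROM: one complex instruction maps to a
# sequence of micro-operations that must execute strictly one at a time.
MICROPROGRAM_ROM = {
    # a memory-to-memory add broken into sequential micro-ops (hypothetical)
    "ADD_MEM": ["fetch_operand_address", "load_from_memory",
                "alu_add", "store_to_memory"],
    # a simple register move needs far fewer steps (hypothetical)
    "MOV_REG": ["read_source_register", "write_dest_register"],
}

def execute(instruction: str) -> int:
    """Step through each micro-op in order; return clock cycles consumed,
    assuming (for illustration) one cycle per micro-op."""
    cycles = 0
    for micro_op in MICROPROGRAM_ROM[instruction]:
        # each micro-op must finish before the next one begins
        cycles += 1
    return cycles

print(execute("ADD_MEM"), execute("MOV_REG"))  # 4 2
```

This is why CISC instruction timing varies so widely: the cycle count is set by the length of each instruction's microprogram, not by a fixed pipeline depth.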
Moving through complex digital circuitry also requires time (remember our earlier discussion about gate delays). CISC processors typically require between four and ten clock cycles to execute a single instruction from the machine's instruction set. As an extreme case, the Intel 80386 microprocessor contains several mathematical operation instructions that require 43 clock cycles to execute.

Appendix B: How RISC Processors Work

The numbers below refer to the numbers in the diagram above.

1. Command functions built in to a RISC processor consist of several small, discrete operations that perform only a single task. Application software, which must be recompiled especially for a RISC processor, performs the task of telling the control unit which combination of its smaller operations to execute in order to complete the execution of an instruction. All RISC instructions are the same size, and there is only one way in which they can be loaded and stored. In addition, each operation is already in the form of microcode (each operation does only a single, simple task), so RISC processors don't require the extra step of passing the instructions through a decoder to translate complex instructions into simpler microcode. As a result of these three differences, RISC instructions are loaded for execution far more quickly than CISC instructions can be.

2. During the compilation of software specifically designed for a RISC chip, the compiler determines which instructions will not depend on the results of other instructions. Because these instructions will not have to wait on other instructions to complete, the control unit can execute many instructions in parallel. Current implementations allow as many as 20 instructions to be executed in parallel.

3. Since the RISC processor is dealing with instructions which are simpler than those found in its CISC counterpart, its circuitry can also be kept simple.
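The compile-time independence check in point 2 above amounts to asking whether two instructions share any registers in a conflicting way. A minimal sketch, using invented (dest, sources) tuples rather than any real compiler's internal representation:

```python
# Two instructions may issue in parallel only if neither writes a register
# the other reads or writes (no RAW, WAR, or WAW conflict).
def independent(a, b) -> bool:
    """a and b are (dest_register, source_register_set) tuples."""
    a_dest, a_srcs = a
    b_dest, b_srcs = b
    return (a_dest != b_dest and
            a_dest not in b_srcs and
            b_dest not in a_srcs)

add = ("R3", {"R1", "R2"})   # R3 = R1 + R2
mul = ("R6", {"R4", "R5"})   # R6 = R4 * R5  (no registers shared with add)
use = ("R7", {"R3"})         # R7 depends on R3 (reads the add's result)

print(independent(add, mul))  # True  -> may execute in parallel
print(independent(add, use))  # False -> must wait for the add
```

A real compiler applies this kind of test across whole basic blocks, then schedules the independent instructions into parallel issue slots.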
RISC instructions pass through fewer total transistors, on shorter circuits, and therefore are executed more quickly than CISC instructions. The result is that RISC processors usually require only a single clock cycle for each operation. The total number of clock cycles required to execute an instruction depends upon the number of the smaller operations that make up the execution sequence for that instruction. However, for a comparable operation (e.g., comparing an ADD instruction on each), the time required to interpret and execute RISC instructions is far less than the time required to load and decode a CISC command and then execute each of its components.

Appendix C: Early Intel Math Coprocessors

    microprocessor    math coprocessor
    8086              8087
    8088              8087
    80286             80287
    386DX             387DX
    386SX             387SX
    486DX             none required
    486SX             487

Appendix D: Intel's Family of Microprocessors

This appendix briefly describes the main design characteristics of the Intel X86 line of microprocessors, on which the CPUs of virtually all PCs are based. More specific details of each processor are available from Intel (www.intel.com).

8088 CPU: First CPU for IBM PC and XT models (1981); 8-bit data bus, 16-bit internal calculations; could address only 1 MB of memory, in 64 KB segments; operated at 4.77 MHz (some clones operated at 10 MHz); optional 8087 math coprocessor was available.

80286 CPU: First used in the IBM AT (Advanced Technology, 1982); 16-bit data bus, 16-bit internal calculations; could access up to 16 MBytes of memory; various operating speeds possible (6, 8, 10, 16, 20 MHz); protected mode allows multitasking, but little software supported it (DOS is based on the 8088 architecture and can't run in protected mode); optional 80287 math coprocessor was available.

80386 CPU: The 386DX processes 32 bits at a time internally and has a 32-bit data bus; the 386SX processes 32 bits at a time internally but has a 16-bit data bus; could access up to 4 GBytes of memory without using 64-KByte segments; various
operating speeds (16, 20, 25, 33, 40 MHz); supported both real and protected modes; added virtual 8086 mode (can multitask DOS); 16-byte prefetch cache ("scratchpad memory" for the CPU); optional 80387 math coprocessor was available.

80486 CPU: Includes all features of the 386DX (32-bit data bus and 32-bit internal calculations); includes built-in cache for instructions and data (8 KBytes for each); 486DX operating speeds 25, 33, 40, 50 MHz; the 486DX2 CPU runs at twice the clock speed (50, 66, 80 MHz); the 486DX4 CPU runs at three times the clock speed (75, 100, 120 MHz); the 486DX has a built-in math coprocessor, the 486SX does not; the 486 contains 1.2 million transistors.

80586 CPU (Pentium): Provides both real and protected modes; 64-bit data bus; faster system bus (50, 60, 66 MHz); CPU operating speeds 60, 66, 75, 90, 100, 120, 133 MHz; pipeline allows overlapping execution of instructions (superscalar); System Management Mode (SMM) provides more hardware control; contains 3.1 million transistors, with 273 connections to the system board; built-in 8 KB data and 8 KB instruction caches; built-in math coprocessor; built-in errors on old Pentium chips.

80686 CPU (Pentium Pro): All of the features of the Pentium, with a better pipeline; 5.5 million transistors; operating speeds 150-200 MHz; internal cache as large as 512 KBytes; built-in math coprocessor; actually runs slower than the Pentium with 16-bit software (DOS, Windows 3.1).

Appendix E: Intel Itanium (IA-64-based) Microprocessor

Overview: Intel has essentially reached the end of development of the IA-32 ISA, and hence of the Pentium family of processors. Newer Pentiums will benefit from advances in manufacturing technology (smaller transistors and hence faster clock speeds); however, finding new ways to achieve significant speedup will prove harder and harder under the restrictions of the IA-32 ISA. To achieve truly significant speedup will require an entirely new ISA. This is where IA-64 enters the picture. Jointly developed by Intel and Hewlett-Packard, this new
architecture is a full 64-bit machine from beginning to end. The earliest implementations of IA-64 were dubbed Merced, but Intel has seemingly decided on Itanium for the first real implementations of this new architecture. While we are focusing here on Intel's version of a 64-bit chip, keep in mind that the architecture is not proprietary to Intel. The basic ideas behind IA-64 are well known amongst microprocessor designers and will undoubtedly surface in other manufacturers' designs, several of which already have experimental versions running.

Problems with the IA-32 ISA that the IA-64 ISA will not face

The basic problem with the IA-32 is the now-ancient ISA upon which it is based, which has all the wrong properties for modern processors geared toward high speed. The IA-32 ISA is a CISC ISA with variable-length instructions and a wide variety of instruction formats that are difficult to decode quickly. Current technology works best with RISC ISAs that have one instruction length and an easily decoded, fixed-length opcode. The IA-32 instructions can be broken down into RISC-like microoperations (uops) at execution time, but doing this requires hardware (chip area), takes time (slower), and adds complexity to the overall design.

The IA-32 is also a two-address, memory-oriented ISA. Most of the instructions in the instruction set reference memory, and most programmers and compilers think nothing of referencing memory all the time. Current technology favors load/store ISAs that only reference memory to fetch operands into registers, but otherwise perform all their calculations as three-address register operations. As CPU clock speeds continue to increase at a much faster rate than memory speeds, this problem will only get worse for the IA-32.

The IA-32 also has a fairly small and irregular register set. This causes great problems for compilers, particularly optimizing compilers, but even worse is that the small number of general-purpose registers (between four and
six) requires intermediate results to be spilled into memory all the time, which generates extra memory references even when they are not logically needed. The small number of registers also causes many dependencies, particularly WAR (write-after-read) dependencies, because results have to go somewhere after they are produced and there are no extra registers available. Getting around this problem has been handled in the Pentium design via register renaming (see the earlier section of the notes). This basically means that versions of the register contents are kept in the reorder buffer (this is a hack if there ever was one). To avoid blocking on cache misses too often, instructions have to be executed out of order. However, the IA-32's semantics specify precise interrupts, so the out-of-order instructions must be retired in order. All of this requires some very complex hardware, further complicating the chip design and, of course, occupying space.

To do all of this work quickly requires a very deep (12-stage) pipeline. This means that instructions are entered into the pipeline 11 clock cycles before they will be finished. Consequently, very accurate branch prediction is required to make sure that the right instructions are actually being entered into the pipeline. An inaccurate prediction can prove very costly to overall performance, as the pipeline will need to be flushed and refilled. To alleviate the problems that an inaccurate branch prediction can cause, the processor must do speculative execution, with all of its ensuing problems; this can be particularly troublesome when a memory reference along an incorrect path causes an exception.

On top of all the problems mentioned above, the IA-32 has fundamental problems that, when it was introduced, did not seem to be problems at all. For example, the 32-bit addresses limit individual programs to 4 GB of memory, which is a very real concern on high-end servers. The problem with IA-32 is analogous to the problems that faced celestial mechanics just prior
to the arrival of Copernicus. At the time, the main theory that dominated astronomy was that the Earth was fixed and motionless in space, and that the planets moved in circles (with epicycles) around it. However, as observations got better and more deviations from the model could be clearly observed, epicycles were added to epicycles until the whole model just collapsed from its internal complexity. This is essentially what Intel, AMD, and other microprocessor designers are facing today. A huge fraction of the transistors on the Pentium II and III are devoted to decomposing CISC instructions, figuring out what can be done in parallel, resolving conflicts, making predictions, repairing the consequences of incorrect predictions, and other bookkeeping operations, leaving surprisingly few transistors for doing the real work the user asked for in the first place. The ultimate conclusion that microprocessor designers have been faced with is: junk the IA-32 and start all over with a clean slate, which is the IA-64.

Features of the IA-64

The starting point for the IA-64 was a high-end 64-bit RISC processor, of which the UltraSPARC II is one of many current examples. Since Hewlett-Packard has contributed heavily to IA-64, their own PA-RISC architecture also influenced the ultimate design of IA-64. The Itanium version of IA-64 will be a dual-mode processor, capable of running both IA-32 and IA-64 programs (manufacturers couldn't convince themselves to sever all ties to 32-bit architectures). The IA-64 is a load/store architecture with 64-bit addresses and 64-bit-wide registers. There are 64 general-purpose registers available to IA-64 programs. All instructions have the same fixed format: an opcode, two 6-bit source register fields, a 6-bit destination register field, and an additional 6-bit field mentioned later. Most instructions take two register operands, perform some computation on them, and place the result into the destination register. Many functional units are available for doing different
operations in parallel.

One of the novel ideas originating in the IA-64 design is the concept of a bundle of related instructions. Instructions come in groups of three, called a bundle. Each 128-bit bundle consists of three 40-bit instructions and an 8-bit template (see figure below).

    | INSTRUCTION 1 | INSTRUCTION 2 | INSTRUCTION 3 | TEMPLATE |

    (Each instruction includes a predicate register field.)

Bundles can be chained together using an end-of-bundle bit, so more than three instructions can be present in one group. The template contains information about which instructions can be executed in parallel. This technique, along with the presence of a large number of general registers, allows the compiler to isolate blocks of instructions and tell the processor that they can be executed in parallel. Thus it is the compiler that reorders the instructions, checks for dependencies, and makes sure there are functional units available, instead of the hardware. The basic idea is that by exposing the internal workings of the processor and telling the compiler writers to make sure that each bundle consists of compatible instructions, the job of scheduling the RISC instructions is moved from the hardware (a run-time environment) to the compiler (a compile-time environment). For this reason the model is called EPIC (Explicitly Parallel Instruction Computing).

There are several reasons why instruction scheduling at compile time provides a performance improvement compared to run-time scheduling. First, since the compiler is now doing all the work, the hardware can be much simpler; this alone can save millions of transistors that can be shifted to other, more important functions, such as larger Level 1 caches. Second, for any particular program the scheduling has to be done only once, at compile time, rather than every time the program is executed.
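The 128-bit bundle layout described above (three 40-bit slots plus an 8-bit template) can be sketched with a little bit-packing. The slot values and template below are arbitrary placeholders; real IA-64 slot encodings are considerably more involved:

```python
# Pack three 40-bit instruction slots and an 8-bit template into one
# 128-bit bundle: 40 + 40 + 40 + 8 = 128 bits.
def pack_bundle(instr1: int, instr2: int, instr3: int, template: int) -> int:
    for ins in (instr1, instr2, instr3):
        assert ins < (1 << 40), "each instruction slot is 40 bits"
    assert template < (1 << 8), "template is 8 bits"
    # slot 1 occupies bits 127..88, slot 2 bits 87..48, slot 3 bits 47..8,
    # and the template the low 8 bits
    return (instr1 << 88) | (instr2 << 48) | (instr3 << 8) | template

bundle = pack_bundle(0xAAAAAAAAAA, 0xBBBBBBBBBB, 0xCCCCCCCCCC, 0x1F)
assert bundle.bit_length() <= 128   # the whole bundle fits in 128 bits
assert bundle & 0xFF == 0x1F        # template sits in the low 8 bits
```

The template byte is what lets the hardware dispatch the three slots without re-analyzing them: the compiler has already encoded which slots are independent.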
Third, since the compiler is doing all the work, it will be possible for a software vendor to use a compiler that may spend hours optimizing the program, and have every user who purchases the software benefit every time the program is run. Fourth, the processor can begin scheduling instructions from a new bundle before all the instructions from the previous bundle have completed. While it will still need to be sure that there are sufficient registers and functional units to do so, it will not need to check whether any of the instructions in the new bundle conflict with instructions from the already-executing bundle, because the compiler has already guaranteed that this is not the case.

Another important feature of IA-64 is how it deals with conditional branches. If you could do away with conditional branches altogether, then CPUs could be simpler and faster. At first glance it would seem impossible to get rid of them altogether, since programs are often full of if statements (a conditional branch instruction). However, IA-64 makes use of a technique called predication that can greatly reduce the number of conditional branches that appear in a given program. In current machines, all instructions are unconditional, in the sense that when the CPU hits an instruction it simply executes that instruction. In contrast, in a predicated architecture, instructions contain conditions (predicates) which indicate when the instruction should be executed and when it should not. This shift from unconditional instructions to predicated instructions makes it possible to eliminate many conditional branch instructions from code. Instead of needing to make a choice between one sequence of unconditional instructions and another sequence of unconditional instructions, all the instructions are merged into a single sequence of predicated instructions, using different predicates for different instructions. A simple example should clarify the concept of predicated instructions.

Example: Consider the
following if statement:

    if (R1 == 0) R2 = R3;

Converted to generic assembly language, this if statement becomes:

        CMP R1, 0
        BNE L1
        MOV R2, R3
    L1:

Notice that this code contains a comparison, a conditional branch, and a move instruction. The conditional form of this code is:

    CMOVZ R2, R3, R1

In the conditional form, the conditional branch is removed and replaced by a new instruction, CMOVZ, which is a conditional move. It checks whether the third register, R1, is equal to 0. If so, it copies R3 to R2; if not, it does nothing. Once a conditional instruction such as this has been developed, it becomes easy to develop similar ones, such as CMOVN, a conditional move that fires when the compared register is NOT equal to zero.

Consider the following more complex example:

    if (R1 == 0) {          CMP R1, 0           CMOVZ R2, R3, R1
        R2 = R3;            BNE L1              CMOVZ R4, R5, R1
        R4 = R5;            MOV R2, R3          CMOVN R6, R7, R1
    } else {                MOV R4, R5          CMOVN R8, R9, R1
        R6 = R7;            BR L2
        R8 = R9;        L1: MOV R6, R7
    }                       MOV R8, R9
                        L2:

    high-level code     generic assembly        conditional code

In the conditional-execution version of this code there are no conditional branch instructions. The instructions can even be reordered; the only catch is that the condition must be known by the time the conditional instructions need to be retired, near the end of the instruction execution pipeline. In the IA-64, all instructions are predicated. This means that the execution of every instruction can be made conditional. The template field that was shown earlier selects one of 64 one-bit predicate registers. Thus an if statement will be compiled into code that sets one of the predicate registers to 1 if the condition is true and to 0 if it is false. Simultaneously and automatically, it sets another predicate register to the inverse value. Using predication, the machine instructions forming the then and else clauses are merged into a single stream of instructions, the former ones using the predicate and the latter ones using its inverse.
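The branch-free behavior of CMOVZ/CMOVN above can be simulated in a few lines. The register-file dictionary and helper names are invented for illustration and are not IA-64 encodings:

```python
# Minimal simulation of conditional moves: every instruction "executes",
# but a move's effect is gated by a predicate test on a register, so no
# branch (and no branch misprediction) is involved.
regs = {"R1": 0, "R2": 11, "R3": 33, "R6": 66, "R7": 77}

def cmovz(dest, src, cond):
    """Copy src into dest only if register `cond` is zero (like CMOVZ)."""
    if regs[cond] == 0:
        regs[dest] = regs[src]

def cmovn(dest, src, cond):
    """Copy src into dest only if register `cond` is nonzero (like CMOVN)."""
    if regs[cond] != 0:
        regs[dest] = regs[src]

# Branch-free equivalent of: if (R1 == 0) R2 = R3; else R6 = R7;
cmovz("R2", "R3", "R1")
cmovn("R6", "R7", "R1")
print(regs["R2"], regs["R6"])  # 33 66  (R1 is 0: then-clause fired, else did not)
```

Both moves are issued unconditionally; only their effects are conditional, which is exactly what lets the merged then/else stream flow through the pipeline without stalls.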
Now consider the final if statement and its generic assembly language equivalent:

    if (R1 == R2)           CMP R1, R2
        R3 = R4 + R5;       BNE L1
    else                    MOV R3, R4
        R6 = R4 - R5;       ADD R3, R5
                            BR L2
                        L1: MOV R6, R4
                            SUB R6, R5
                        L2:

    high-level code     generic assembly code

Converted to predicated instructions, the code above becomes:

         CMPEQ R1, R2, P4
    <P4> ADD R3, R4, R5
    <P5> SUB R6, R4, R5

Here the CMPEQ instruction compares two registers and sets the predicate register P4 to 1 if they are equal and to 0 if they are not. It also sets its paired register, P5, to the inverse condition. Now the instructions for the then and else parts can be placed one after the other, each one predicated on some predicate register. In the IA-64 architecture the idea of predication is taken to the extreme, with comparison instructions for setting the predicate registers as well as arithmetic and other instructions whose execution is dependent on some predicate register. Predicated instructions can be pushed into the pipeline in sequence, with no stalls and no problems. The way the IA-64 actually implements this technique is to execute every instruction. At the end of the pipeline, when an instruction is about to retire, a check is made to see whether its predicate is true. If so, the instruction is retired normally and its results are written back to the destination register. If the predicate is false, no writeback is done, so the instruction has no effect.

Appendix F: Other Microprocessor Manufacturers

AMD

Advanced Micro Devices was founded in 1969 and first began to produce microprocessors in 1975, when it released a reverse-engineered version of the Intel 8080 chip. Although this put the two companies into direct competition, they entered into a patent cross-licensing agreement in 1977 to take advantage of each other's designs. The two companies became even closer when IBM demanded a second source for Intel's 8088 microprocessor before they would agree to put it into their PCs; IBM was not sure that Intel would stay in business, and would not build a
product without a chip supplier. Thus Intel agreed to let AMD second-source the 8088. Intel also granted AMD the right to produce the 80286 chip. By the time the 80386 was unveiled, Intel had secured its place in the semiconductor industry and never granted AMD the right to produce the 386 chips. Nevertheless, AMD developed its own version of the 386 chip using its own hardware design and Intel's microcode. The circuit design that AMD developed was slightly more efficient and used less power than Intel's chips. Intel quickly sued. According to AMD, contracts between the two enabled AMD to use all of Intel's designs and patents through 1995, a right for which AMD paid Intel roughly $350,000, or 3.5% of its profits in 1975 when the contract was signed. Intel claimed that the contract gave AMD the right only to copy Intel's microcode, not to distribute it, and further claimed a trademark on the designation "386". On March 1, 1991, the court ruled that the numbers were generic, enabling AMD and other companies to call their clone chips 386s. On February 24, 1992, an arbitrator awarded AMD the right to use Intel's 386 microcode without royalty or dispute, but AMD got no rights to Intel's technology.

The AMD K5 chip, a 586-generation chip, unlike the Pentium adopts a RISC core and uses translation logic to convert Intel instructions into RISC operations. The K5 is socket-compatible with the Pentium, and consequently the control for L2 cache is left to external circuitry. The FPU is integrated into the K5 silicon and uses the same core logic that was developed for the highly regarded AMD 29000 RISC processor. The K5 contains a six-stage pipeline that can simultaneously process four instructions. The six-stage pipe contains two ALUs, one FPU, two load/store units, and a branch unit. Although the K5 was supposed to outperform the Pentium at the same clock speed, the commercial chips have proven somewhat disappointing; the K5 chips have only been able to match the 133 MHz Pentiums. Current models of the
K5 operate with processor speeds of 1.5 or 2.0 times that of the external bus speed.

The next-generation chip from AMD is the K6, which actually introduced Intel's MMX technology before Intel did, under the various agreements that had previously been reached between the two companies. The K6 fits into Intel's Socket 7. The logic core of the K6 was actually developed by NexGen (AMD acquired NexGen). The K6 contains two six-stage pipelines, both of which support branch prediction and speculative execution, fed by a group of four instruction decoders, which again translate Intel CISC instructions into RISC instructions to match the chip's core logic. Although the K6 cannot process 32-bit instructions quite as quickly as the Pentium Pro, it suffers none of that chip's slowdown when processing in 16-bit mode. The K6 quadruples the cache size of the Pentium, going to a full 32 KB for both the I and D caches. The D cache is a dual-ported, set-associative, write-back design, which allows it to load and store data simultaneously on a single clock cycle.

Introduced in 1998, the K6-2 is an enhanced version of the K6 which primarily supports an expanded instruction set designed to speed up the execution of three-dimensional graphics routines. AMD called this extended set of graphics instructions 3DNow!. The K6-2 also supports the full MMX instruction set. The K6-2 uses AMD's P-rating system, which specifies not the actual operating speed of the chip but the speed of a Pentium chip that would deliver the same performance. The actual clock speed of a K6-2 is substantially lower than its P-rating. Given the form "rated clock speed in MHz / actual clock speed in MHz", some typical P-ratings are 300/90, 333/100, 350/117.

The AMD K6-3, code-named Sharptooth and introduced on February 22, 1999, extends the K6 architecture line primarily through improvements in semiconductor technology. The most significant changes are a tri-level cache, consisting of a 64 KB L1
cache, a 256 KB L2 cache integrated into the chip silicon, and integral support for up to a 1 MB external L3 cache. Both the L1 and L2 caches operate at the full speed of the core logic, while the optional L3 cache couples through a 100 MHz frontside bus. The caches feed the K6-3's two seven-stage pipelines. AMD states that performance is equivalent to a Pentium III rated one speed grade higher; thus an AMD K6-3 running at 400 MHz is equivalent to a Pentium III running at 450 MHz. The large on-chip caches require 21.3 x 10^6 transistors on the chip, and AMD uses standard 0.25-micron design rules. Both 400 MHz and 450 MHz options have been offered.

AMD announced its first seventh-generation microprocessor, named the K7, in October 1998. Based upon a new logic core, the chip promises a quantum boost in performance through a nine-instruction superscalar design, meaning that when operating in optimal conditions the chip can simultaneously process nine instructions. The nine instructions are divided into three classes. The chip contains three parallel decoders which recognize the standard x86 instruction set. These three feed three superscalar out-of-order integer pipelines and three superscalar out-of-order multimedia pipelines (essentially FPUs) which recognize the MMX and 3DNow! instruction sets. The L1 cache is 128 KB, split 64/64. Through a dedicated 64-bit backside bus, L2 cache is supported in the range from 512 KB up to 8 MB. The frontside bus can operate at speeds up to 200 MHz through the Alpha EV6 bus slot. Initial K7s used 0.25-micron technology and ran at 500 MHz; subsequent production shifted to 0.18-micron technology in later chip steps to increase speeds to between 700 and 800 MHz. Current production puts K7 processor speeds over 1 GHz as of late summer 2000. AMD's response to IA-64 is the Sledgehammer processor, which will be produced initially only at their new fabrication plant in Dresden, Germany, called Fab 30.

Cyrix Corporation

Cyrix was founded with the intention of
developing compatible chips (clones). It entered the market with several series of math coprocessors compatible with Intel's 386 line. When Intel shifted to integral floating-point units, Cyrix shifted to designing Intel clones using its own core logic. Cyrix was a design and marketing company; it did not have its own fabrication facilities and contracted primarily with IBM and Texas Instruments to manufacture chips to its specifications. In November 1997 the company was acquired by National Semiconductor as a wholly owned subsidiary.

The Cyrix flagship processor is the M-II. This chip recognizes all of Intel's MMX commands; however, it uses Cyrix's own version of these, based upon public disclosures that Intel has made. The "MMX" pipe is 10 stages long and capable of handling only one such instruction at a time. The Jalapeno, announced by Cyrix in late 1998, adds 3D graphics and a new floating-point unit. This chip is being manufactured for Cyrix at National Semiconductor's facilities.

IDT/Centaur

Centaur Technology was founded in 1995 by IBM fellow Glenn Henry in Austin, Texas, and introduced its first Intel-compatible microprocessor, the WinChip C6, in May 1997. Centaur is now a wholly owned subsidiary of Integrated Device Technology (IDT) of Santa Clara, California.

Appendix G: Pentium II Logic Level Diagram

Before examining CISC and RISC architectures in detail, we'll first examine the details of how Pentium II and III processors work and then look at some of the techniques they utilize in isolation.

How a Pentium II Microprocessor Works

In order to follow the next twelve numbered paragraphs better, look at the picture of the Pentium II microprocessor in Appendix G. The Pentium III processor operates in essentially the same fashion; the primary difference is the addition of a second FPU, which is dedicated to the MMX commands and Streaming SIMD Extensions.

1. Information enters the CPU through the
Bus Interface Unit (BIU). The Pentium II microprocessor contains approximately 7.5 x 10^6 transistors in the CPU and approximately 15 x 10^6 transistors in the separate L2 cache. This L2 cache contains 512 KB of storage built from off-the-shelf components. The L2 cache is not part of the CPU but is built onto the same circuit board as the CPU. This circuit board, called a Multi-Chip Module or MCM, then plugs directly into the motherboard of the computer. While this design is cheaper to manufacture, the tradeoff is that data flows between the L2 cache and the CPU at only 1/2 the speed of the CPU. Thus, if the CPU is clocked at 400 MHz, data travels to and from the L2 cache at only 200 MHz. This speed disadvantage is somewhat compensated for by doubling the L1 cache from 8 KB to 16 KB in each of the I-cache and D-cache. This larger cache cuts roughly in half the time it takes to access memory and provides faster access to the most recently used data and instructions. Since the CPU is limited to moving data in and out at the speed of the main data bus (the frontside bus, in Intel nomenclature), the Pentium II extends the design philosophy begun with the Pentium Pro: the L1 and L2 caches are designed to alleviate the effects of the bus bottleneck by minimizing the number of instances in which a clock cycle passes without the processor being able to complete an operation, blocked by the slowness of the bus. The Pentium II extends this philosophy primarily by doubling the size of the L1 cache.

2. The BIU duplicates the information, sending one copy to the pair of L1 caches and one copy to the L2 cache. If the incoming data is an instruction, it is sent to the I-cache; if it is data for an instruction, it is sent to the D-cache.

3. While the Fetch/Decode Unit is pulling instructions from the I-cache, another component, called the Branch Target Buffer (BTB), determines if a particular instruction has been used before by comparing the incoming code with a
record maintained in the separate lookaside buffer. The BTB is looking in particular for instructions that involve branching, where the program being executed can take different possible paths. If the BTB finds a branch instruction, it predicts which path the program will take based upon what the program has done at similar branches. Intel's branch predictor units maintain a successful prediction rate of better than 90%.

4. The fetch portion of the Fetch/Decode Unit continues to pull instructions, 16 bytes at a time, from the cache in the order predicted by the BTB.

5. Then 3 decoders working in parallel break up the more complex instructions into uops (micro-operations) that the Dispatch/Execution Unit can process faster than the more complex instructions resident in the I-cache. Note that the three decoders are not identical units: two are called restricted decoders and can only decode CISC instructions that each translate into a single uop; the other unit is called a general decoder (or complex decoder) and can handle CISC instructions that translate into four or fewer uops. All CISC instructions which reference memory must be decoded by the general decoder. If a CISC instruction involves more than four uops, it is sent to a special microcode instruction sequencer (MIS) unit, which is not shown on our diagram. Programs that make many memory references tend to frustrate the multipart decoder scheme of the Pentium II processor. Operating at maximum speed, with code optimized for them, the three decoders can generate 6 uops/clock cycle (one from each restricted decoder and four from the general decoder), with an average for all code of about three uops/clock cycle. All uops in this architecture are 118 bits long.

6. The decode unit sends all uops to the Re-Order Buffer (ROB), also called the Instruction Pool. This is a circular buffer, with a head and a tail, that contains the uops in the order in which the BTB predicted that they would be needed. The ROB can store up to 40 entries, each 254 bits long. Each
entry in the ROB contains the 118-bit uop plus two operands, the result, and processor information that the uop might affect (status bits). The ROB can prepare up to three uops/clock cycle for processing. All register renaming is handled by the ROB; there are 40 such registers in the Pentium II (not shown on the diagram).

7. As the decode unit passes uops to the ROB, it also sends them to a special unit called the Reservation Station (RS), not shown on the diagram, but think of it as inside the Fetch/Decode Unit. The RS serves two purposes: (1) it is the conduit that passes uops to a suitable execution unit as one becomes available, and (2) it acts as another buffer, storing up to 20 uops and their data. This buffering effect prevents slowdowns in the decoders from starving the processors, and also prevents the decoders from stalling when the processors are fully engaged. The RS connects five ports linking to six execution stations that actually carry out the manipulations.

8. The Dispatch/Execute Unit checks each uop in the ROB to see if it has all of the information necessary to process that uop. If a uop still needs data from memory which hasn't arrived yet, the execute unit skips that uop, and the processor looks for the missing information first in the L1 D-cache. If the data isn't there, it tries the L2 cache (recall that the L2 cache is two to four times faster than going to RAM for the information). Instead of remaining idle while missing information for a uop is loaded, the execute unit continues inspecting each uop in the ROB. When it finds a uop that has all of the information needed to process it, the unit executes it in one of the two ALUs, the MMX unit, or the FPU, stores the results in the uop itself, marks the code as completed, and moves on to the next uop in the ROB. This is called speculative execution, because the order in which the uops appear in the ROB is dependent upon the prediction made by the BTB.

9. When the execution unit reaches the tail of the
buffer, it starts at the head again, rechecking all of the uops to see if any has received the data it needs to be executed.

10. When a uop that has been delayed finally receives its data and is processed, the execute unit compares the results with those predicted by the BTB. If the BTB has failed to correctly predict the proper execution order, a component called the Jump Execution Unit (JEU) moves the tail marker from the last uop in the ROB to the uop that was predicted incorrectly. This signals that all uops behind it in the ROB are invalid, should be ignored, and may be overwritten by new uops. The BTB is told that its prediction was incorrect, and that information becomes part of its future predictions in order to enhance the probability that the next prediction will be correct.

11. Meanwhile, the ROB is also being inspected by the Retirement Unit (RU). The RU first checks to see if the uop at the head of the ROB has been executed; if it hasn't, the RU keeps checking until it has been. Once the uop at the head of the ROB has been executed, the RU then checks the second and third uops in the ROB, and if they have all been executed, it simultaneously sends all three results to the store buffer. Three is the maximum number of uop results that can be sent to the store buffer simultaneously.

12. While in the store buffer, the results are checked one more time before they are sent to the L2 cache to await their trip to RAM.

Pipelining

In older microprocessor designs the processor chip worked single-mindedly: it read an instruction from memory, carried it out step by step, and then advanced to the next instruction. Each step required at least one clock cycle, so execution of the entire instruction took many clock cycles in total. Pipelining allows several instructions to be in progress at once, so that, once the pipe is full, one instruction can complete every clock cycle. The paragraphs and example below illustrate how pipelining works.

Pipelining is a technique for decomposing a sequential process into a
series of subprocesses, with each subprocess being executed in a special dedicated segment that operates concurrently with all other segments. A pipeline can be visualized as a collection of processing segments through which binary information flows. Each segment performs partial processing, dictated by the fashion in which the original process was partitioned. The results (output) produced by one segment are passed as input to the following segment in the pipeline; the final result is produced by the last segment of the pipeline. The easiest way to view a pipeline structure is to think of each segment as consisting of an input register followed by a combinational logic circuit which performs a specific subprocess, the results of which are passed to the input register of the following segment. A clock line is connected to all registers and is pulsed after enough time has elapsed to perform all segment operations. The following example will clarify how the pipeline operates.

Example: Suppose that we want to perform a combined multiply-and-add operation with a stream of numbers, such as A(i) x B(i) + C(i), where i = 1..7. Each suboperation can be implemented as a segment within a pipeline. Each segment has one or two registers as well as a combinational logic circuit devoted to performing a specific function. The figure below illustrates the basic setup required for the pipeline. Registers R1 through R5 receive new data on every clock pulse; the multiplier and the adder are combinational circuits. The suboperations performed in each segment of the pipeline are:

Segment 1:  R1 <- A(i), R2 <- B(i)       Input A(i) and B(i)
Segment 2:  R3 <- R1 x R2, R4 <- C(i)    Multiply and input C(i)
Segment 3:  R5 <- R3 + R4                Add C(i) to product

[Figure: example 3-stage pipeline. A(i) and B(i) feed R1 and R2; the multiplier feeds R3, with C(i) latched in R4; the adder feeds R5.]

Clock   Segment 1       Segment 2          Segment 3
pulse   R1     R2       R3         R4      R5
1       A1     B1       --         --      --
2       A2     B2       A1 x B1    C1      --
3       A3     B3       A2 x B2    C2      A1 x B1 + C1
4       A4     B4       A3 x B3    C3      A2 x B2 + C2
5       A5     B5       A4 x B4    C4      A3 x B3 + C3
6       A6     B6       A5 x B5    C5      A4 x B4 + C4
7       A7     B7       A6 x B6    C6      A5 x B5 + C5
8       --     --       A7 x B7    C7      A6 x B6 + C6
9       --     --       --         --      A7 x B7 + C7

Table 1: Contents of pipeline registers for the 3-stage pipeline

Since the five registers are loaded with new data on every clock pulse, the effect of each clock pulse is shown in Table 1. The first clock pulse transfers the operands A1 and B1 into R1 and R2 (segment 1). The second clock pulse transfers the product A1 x B1 into R3 and C1 into R4 (segment 2), and A2 and B2 into R1 and R2 (segment 1). The third clock pulse operates on all three segments of the pipeline simultaneously: it transfers the sum of R3 and R4 into R5 (this value represents A1 x B1 + C1, segment 3), transfers the product A2 x B2 into R3 and C2 into R4 (segment 2), and finally transfers A3 and B3 into R1 and R2 (segment 1). At this point the pipeline is filled and the first output from the pipeline has been produced. From this point forward, a result will be produced by the pipeline with each clock pulse. As long as new data continues to flow into the pipe on each clock pulse, the pipe will remain full. When no more input data is available, the clock must continue to pulse until the last output has emerged from the pipeline. This may result in an empty pipeline; however, it would be very common for the pipeline to already be accepting input from another stream and thus continue to produce output, albeit for another instruction. Synchronization of the output streams is left to the control unit logic.

The Pentium II and III microprocessors use 12-stage pipelines in the integer ALUs. In order to achieve maximum effect from pipelining techniques, microprocessor designers strive to make all of the instructions in the microprocessor's instruction set execute in the same number of clock cycles. This prevents any one segment from becoming a bottleneck to the overall speed of the pipeline. Similarly, the length of the pipe should be kept small, to prevent long delay times to fill and flush the pipeline. The 12-stage pipelined ALUs
of the Pentium II and III are actually a liability to their performance; see Appendix E for a discussion of the inherent problems with the IA-32 ISA compared with the IA-64 ISA.

Branch Prediction

Most current microprocessor architectures are heavily pipelined. Pipelining works best on linear code, where the fetch unit can simply read consecutive addresses from memory (cache) and send them off to the decode unit in advance of their being needed by the execution units. The only problem with this model is that it is not the slightest bit realistic. Programs are not linear code sequences; they are full of branch instructions. Pipelining fails to work efficiently if the program steps require branching operations. For example, the pipeline can be loaded up with instructions from one program branch before it has been determined that the other branch is the one which will actually be executed; in this case the pipeline contents must be dumped and the pipe reloaded with the instructions from the correct branch. As an example, consider the following program fragment, shown beside the same fragment in a generic assembly language:

    if (i == 0)            CMP i, 0        ; compare i to 0
        k = 1;             BNE else        ; branch to else if not equal
    else             then: MOV k, 1        ; move 1 to k
        k = 2;             BR next         ; unconditional branch
                     else: MOV k, 2        ; move 2 to k
                     next: ...

Notice in the assembly language version of the code fragment above that two of the five instructions are branches. Furthermore, one of the branches (BNE) is a conditional branch, which means that the branch is only taken if some condition is met; in this case, that the two operands in the previous CMP instruction are not equal. The longest linear code sequence in this example is two instructions. As a consequence, fetching instructions at a high rate to feed the pipeline is very difficult to do.

The reason this is a problem lies in the very nature of pipelining. In a typical instruction pipeline, the fetch unit is the first segment of the pipe, followed by the decode unit as the second segment. This means
that the fetch unit has to decide where to fetch the next instruction from before it even knows what kind of instruction it just fetched. Only one cycle later can it learn that it just picked up an unconditional branch instruction, and by then it has already started to fetch the instruction which follows that branch. There are several techniques that have been employed to get around this problem, including simply executing the instruction that immediately follows an unconditional branch, even though logically it should not be executed. The slot following the branch is called a delay slot. Optimizing compilers will sometimes attempt to find a useful instruction to put into this slot, but very often there will not be one, so a simple NOP (no-operation) instruction is inserted into the delay slot.

While unconditional branches cause headaches, they are nothing compared to the problems that conditional branch instructions cause, since now the fetch unit does not even know where to fetch from until much later in the pipeline. Early pipelined machines just stalled until it was known whether or not the branch would actually be taken. Stalling for three or four cycles on every conditional branch, especially if 20% of the instructions are conditional branches, deteriorates performance quite rapidly. Modern processors take another approach: they predict whether or not the conditional branch will be taken. It would be nice to stick a crystal ball into a free PCI slot to help out with the prediction, but so far this has not proven very effective.

One simple predictive technique that is commonly applied is to assume that all backward conditional branches will be taken and that all forward conditional branches will not be taken. The first half of this rule works pretty well, because backward branches are typically found at the end of a loop; most loops are executed multiple times, so guessing that a branch will take you back to the top of the loop is a pretty reasonable bet. The
second part of this technique isn't quite as good, since it is very common to find forward branches occurring when an error is detected in the software (calling an error handler). Errors are supposed to be rare, so most of the branches associated with them should not be taken. The problems occur when there are forward branches that are not related to errors; this model will not give a very good prediction rate for such branches.

If a branch is correctly predicted, there isn't anything special to do: execution continues at the new target address without interruption. The trouble comes when the prediction is incorrect. Figuring out where to go, and how to get there, is the easy part; the hard part is undoing the instructions that have already been executed and shouldn't have been. There are two basic ways of handling this problem. The first is to allow instructions fetched after a predicted conditional branch to execute until they attempt to change the machine's state, i.e., attempt to store something in a register. Instead of overwriting the register, the value is placed into a secret scratch register and is only copied into the real register after it is known that the branch prediction was correct. The second method is to record the value of any register about to be overwritten, putting this value into a secret scratch register, so that the machine can be rolled back to the state it had at the time the branch was mispredicted. Both of these solutions are complex and require heavy-duty bookkeeping to get them right. If a second conditional branch is encountered before it is known whether the first one was predicted correctly, things can get really messy.

Clearly, having the branch predictions be as accurate as possible is extremely important in allowing the CPU to proceed at full speed. As a consequence, a great deal of current research is devoted to improving branch prediction algorithms. Intel's branch prediction units have a better than 90% accuracy rate. Due to the very long pipelines in the
Pentium ALUs, an accuracy of 90% or better is required to prevent overall performance from degrading too far.

Superscalar Architectures

Program steps are usually sequential in nature for most common imperative programming languages; however, execution of these instructions does not have to be carried out in the same sequential order. For example, if you wish to compare the average speed of two cars over a certain distance and time, you need to calculate the average speed for both of them and then make your comparison. Which average speed you calculate first makes no difference. If you had two brains, you could calculate the two average speeds simultaneously. Superscalar architectures do just that: they provide two or more execution paths for instructions to be processed simultaneously. The world's first superscalar computer design was the Control Data Corporation (CDC) 6600 mainframe in 1964. The CDC 6600 was designed for intense scientific calculations and contained ten functional units. Every 100 nsec an instruction was fetched and passed to one of the functional units for parallel execution while the CPU went off to fetch the next instruction. A typical dual-pipeline CPU configuration is shown in the figure below.

[Figure: dual five-stage pipeline. A single S1 instruction fetch unit feeds two parallel pipes, each with an S2 instruction decode unit, S3 operand fetch unit, S4 instruction execution unit, and S5 write-back unit.]

In this model, a single instruction fetch unit fetches pairs of instructions together and puts each one into its own pipeline, complete with its own ALU, for parallel operation. To be able to run in parallel, the two instructions must not conflict over resource usage (e.g., registers), and neither must depend on the result of the other (no dependencies). As with a single pipeline, either the compiler must guarantee this in advance of execution, or conflicts must be detected and resolved in hardware. Pipelines such as this are common on RISC processors, but did
not appear on Intel processors until the 486, which had one pipeline. The Pentium architecture has two five-stage pipelines similar to this model; one is called the u-pipeline, the other the v-pipeline. Rather complex rules determine whether a pair of instructions is compatible for parallel execution. Pentium-specific compilers producing compatible pairs were capable of producing faster-running code than older compilers; typically, a Pentium running integer-based code optimized for it was twice as fast as a 486 running at the same clock speed. This gain was entirely attributable to the second pipeline.

You might think that if two pipelines are twice as good as a single pipeline, then four pipelines would be twice as good as two. Unfortunately, the progression is sublinear, since too much hardware must be duplicated and control becomes too complex. Indeed, the Pentium II and III processors, as well as most high-end processors, use a different approach, which reverts to a single pipeline but with multiple functional units within the pipeline. This is shown in the next figure.

[Figure: single pipeline with multiple functional units. S1 instruction fetch unit and S2 instruction decode unit feed an S3 issue stage, which dispatches to several parallel functional units in stage S4, followed by write back.]

Implicit in the idea of a superscalar processor is that the S3 stage (see figure above) can issue instructions considerably faster than the S4 stage is able to execute them. For example, if the S3 stage issued an instruction every 10 nsec and all the functional units could do their work within 10 nsec, no more than one would ever be busy at once, which would negate the whole concept. In reality, most of the functional units of stage S4 take appreciably longer than 1 clock cycle (10 nsec) to execute, certainly the ones that access memory or do floating-point calculations. By placing more than one functional unit in the S4 stage, they can theoretically all be busy at one time. The Pentium II, for example, has two integer ALUs and an FPU in stage S4; the Pentium III has two integer ALUs and two FPUs in the S4 stage.

Out of Order
Execution

It is a very difficult problem to dynamically ensure that a program is divided up in such a way that the pipelines of a superscalar processor share equal amounts of work. Typically, one pipeline will still be working while another has finished and stands idle. The chip logic can initiate another instruction for the idle pipeline if one is ready. However, if the next instruction depends on the results of the instruction before it (a dependency), and that happens to be the one still stuck in another pipeline, then the free pipeline stalls, and potential processor power is wasted: the idle pipeline is free but can do no work.

Obviously, the simplest scenario is to execute every instruction in the order in which it is fetched, sequentially, based upon the program code (assuming that branch prediction is never wrong). However, this in-order execution does not typically give optimal performance, due to dependencies that exist between various instructions. If an instruction needs a value that is produced by a preceding instruction, then the second instruction cannot begin executing until the first one has produced the required result. This is a RAW (read-after-write) dependency. In order to boost performance, many modern processors allow dependent instructions to be skipped over to get to future instructions that are not dependent. However, the internal instruction scheduling algorithm must guarantee that the end result is exactly the same as would have resulted if the program had been executed in the exact order it was written. Since the microprocessor is no longer executing the code in the order it was written, anomalies may result, and therefore the results of out-of-order execution are not written to the internal registers as soon as they are available. Instead, these results are held in an internal buffer, and when the other instructions (the ones prior to the out-of-order ones) finish, the microprocessor puts the results
into the proper order and checks for any anomalies that might have resulted; only then are the results posted to the registers.

Register Renaming

Out-of-order execution has the potential for executing, essentially simultaneously, two different instructions that refer to the same register. Execution of the program in the normal order would not encounter this problem, as first one instruction would execute and then the other. The conflict over register access, and the values that would result, would force the superscalar architecture to resort to sequential processing of such instructions, slowing it dramatically and losing the advantage of the superscalar design. To avoid this problem, advanced microprocessors use a technique called register renaming. Instead of a small number of registers with fixed names, a large bank of dynamically named registers is used. These dynamically allocated and named registers are not visible to programmers and are sometimes called secret registers. The logic of the chip converts a register reference made by an instruction into a reference to one of the dynamically named registers. The chip must then remember which register is used for which instruction as the program executes, so that the proper results are loaded when called for.

RISC (Reduced Instruction Set Computer)

RISC was developed by John Cocke at IBM in 1974. Cocke did research on mainframe computers (there weren't any PCs then, remember) and discovered that in a computer with a set of 200 instructions, roughly 2/3 of the processing involved as few as 10 of the instructions. Cocke therefore designed a computer that ran very fast but consisted of only a very few instructions: the RISC computer. Cocke's research also showed cases where a few simple instructions could perform a complex task faster than a single complex instruction. This is commonly known as the 80/20 rule: about 20% of the computer's instructions do about 80% of the work. The primary objective of the RISC
architecture is to optimize the computer's performance for those 20% of the instructions, by speeding up their execution as much as possible. The remaining 80% of the instructions can be duplicated where necessary by combinations of the quick 20%. Analysis and practical experience since its inception have shown that the 20% can be made so much faster that the overhead required to emulate the remaining 80% was no handicap at all.

Important characteristics of RISC:

1. Single-cycle (or better) execution of instructions. Most instructions on a RISC computer can be carried out in a single clock cycle, if not faster, through the use of pipelining techniques. The chip doesn't process a single instruction in a fraction of a clock cycle, but rather processes several instructions simultaneously as they move down the pipeline. For example, a chip may work on 4 instructions simultaneously, each of which requires 3 clock cycles to execute; the net result is that the chip requires 3/4 of a clock cycle per instruction.

2. Uniformity of instructions. The RISC pipeline operates best if all of the instructions are of the same length (number of bits), require the same syntax, and execute in the same number of clock cycles. Most RISC computers are exclusively 32-bit machines. In contrast, the CISC command set used by Intel microprocessors uses 8-, 16-, or 32-bit instructions.

3. Lack of microcode. RISC computers either entirely lack microcode or have very little of it, relying instead on hardwired logic. Operations handled by microcode in CISC computers require sequences of simple RISC instructions. Note that if these complex operations are performed repeatedly, the series of RISC instructions will be loaded into the high-speed cache and then act like microcode that is automatically customized for the running of that program.

4. Load/store design. Accessing memory during the execution of an instruction often causes a delay, because the RAM cannot be accessed as quickly as the microprocessor can run.
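The cost of touching RAM mid-instruction can be sketched with a toy cycle count. The latencies, mnemonics, and the five-iteration loop below are illustrative assumptions for this sketch, not real chip figures:

```python
# Toy comparison of a memory-operand style vs. a load/store style.
# Latencies below are made-up assumptions, not measurements of any real chip.
CPU_OP = 1        # cycles for a register-only operation (assumed)
RAM_ACCESS = 10   # cycles for one trip to RAM (assumed)

def total_cycles(instructions):
    """Sum cycles for a list of (mnemonic, number_of_memory_accesses) pairs."""
    return sum(CPU_OP + n_mem * RAM_ACCESS for _, n_mem in instructions)

N = 5  # add a register into x five times

# Memory-operand (CISC) style: each add reads and rewrites location x in RAM.
cisc = [("ADD [x], r1", 2)] * N

# Load/store (RISC) style: load x once, work in registers, store once.
risc = ([("LOAD r2, x", 1)]
        + [("ADD r2, r1", 0)] * N
        + [("STORE x, r2", 1)])

print(total_cycles(cisc), total_cycles(risc))  # 105 vs. 27 under these assumptions
```

Under these assumed latencies the register-only loop body pays for RAM only twice in total, which is the payoff an optimizing compiler aims for when it keeps operands in registers between explicit loads and stores.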
Therefore, most RISC machines do not have immediate instructions (those that work on data in memory rather than in registers), and they attempt to minimize the number of instructions that affect memory. Data must be explicitly loaded into a register before it can be accessed by the program. The optimizing compiler can then organize the sequence of instructions so that the delay on the pipeline is minimized.

5. The hard work is in the software. The RISC design shifts most of the work in achieving top performance to the software. The optimizing compiler must examine and modify the code, rearranging the order of instruction execution in order to keep the pipeline running optimally.

6. Design simplicity. Simplicity is the overall design criterion for RISC machines. For example, the Intel 80486 microprocessor contains about 1.2 x 10^6 transistors; the RISC-based MIPS M/2000 contains only about 120,000 transistors, yet the two are comparable in performance.

To better understand how a RISC processor works, see Appendix B.

Micro-Ops and CISC/RISC Computers

Many microprocessors that look like CISC chips and execute the classic Intel CISC instruction set are actually RISC chips inside. Chip makers seeking to clone Intel's microprocessors were the first to use such designs, but Intel adopted the same technique beginning with the Pentium Pro. The basic technique involves converting the classic Intel instructions into RISC-like instructions to be processed by the chip's internal circuitry. Intel calls the internal RISC-like instructions micro-ops, commonly abbreviated uops. Also note that NexGen (now a part of AMD) used the term RISC86 instructions, and AMD itself used the term ROPs. Uops sidestep the primary shortcomings of the Intel instruction set by encoding all of the instructions more uniformly, converting all instructions to the same length for processing, and eliminating arithmetic operations that directly change memory by requiring the loading of memory data into
registers before processing. This translation to RISC-like instructions allows the microprocessor to function internally like a RISC machine. The code conversion occurs in hardware and is completely invisible to the applications.

Very Long Instruction Word (VLIW)

A counterpart to RISC technology is that of Very Long Instruction Word (VLIW) technology. While it may seem to be the opposite of RISC techniques, VLIW is actually a refinement of RISC designed to take advantage of superscalar architectures. Each very long instruction is made up from several RISC instructions; in a typical implementation, eight 32-bit RISC instructions combine to make one VLIW. Ordinarily, combining RISC instructions would not contribute greatly to the overall speed. As with RISC technology, the advantage is mostly in the software. The compiler that produces the object code must choose carefully which instructions to combine into a VLIW, so that they all execute at the same time (or as close as possible) in parallel processing units inside the superscalar microprocessor. Thus the VLIW system takes advantage of the preprocessing done in the optimizing compiler to make the final code and microprocessor more efficient.

VLIW technology also takes advantage of the wider bus connections on the latest generation microprocessors. Existing chips link their support circuitry with 64-bit buses, and many have 128-bit internal buses. The 256-bit VLIWs push a little bit further and allow the microprocessor to load several cycles of work in a single memory cycle. There are currently no implementations of VLIW systems, although there have been implementations in the past: the Trace 200 family of systems (Trace 7/200, 14/200, and 28/200), capable of executing 7, 14, or 28 parallel operations respectively.

Single Instruction Multiple Data (SIMD)

An analogy will help to identify what SIMD is all about. Consider an Army drill sergeant facing an entire platoon. If the sergeant wants the soldiers to turn around, he could give the
same instruction, "About face," to each soldier one at a time. But drill sergeants are naturally SIMD-oriented devices: the sergeant doesn't give the order to each soldier individually, he gives the same command to every soldier in the platoon at the same time, and each of the soldiers executes the command simultaneously. Thus the sergeant is the SI (single instruction) and the soldiers are the MD (multiple data).

Intel has adapted SIMD technology into both the MMX units of the Pentium MMX, Pro, P2, and P3, as well as the Streaming SIMD Extensions of the P3, to enhance their 3D processing power. As its name implies, SIMD allows a single microprocessor instruction to operate across several bytes or words, or even larger blocks of data. In the MMX scheme of things, the SIMD instructions are matched to the 64-bit data buses. Regardless of the original format of the data, whether it be a byte, a word, etc., it is packed into a 64-bit package that is loaded into a 64-bit register inside the MMX unit. One MMX instruction (recall that there are 57 different instructions in the MMX unit) then manipulates the entire 64-bit block.

Streaming SIMD Extensions (SSE)

The P3 microprocessor has been augmented with 70 new instructions, formerly known as Katmai New Instructions (KNI, to go along with the original code name for the P3, which was Katmai), which allow elaborate three-dimensional processing functions to be included in a single command. These 70 commands also support streaming audio and video from the Internet, as well as speech recognition capabilities. The SSE extensions to the P3 push the number of transistors in this microprocessor to about 9.5 x 10^6.

A brief overview of Streaming Audio and Video

While the Internet began as a text-only medium, it has quickly evolved into a multimedia medium. Compared to your CD-ROM and DVD players, the Internet is slow. The problem with audio, video, and graphics is that typically very large
amounts of data are required in order to process the medium on your local system. The basic problem with this is the bandwidth that is available via the Internet. Bandwidth basically refers to how much data you can push across a network, a bus, or any other data path you can think of. Basically, a wider bandwidth means that more information can flow across or around the network. Compression techniques can be used to make better use of the bandwidth, since the files are smaller when they are compressed. However, compression techniques that are suitable for your CD-ROM or DVD are not suitable for the slow Internet. Thus content providers (whoever puts out the video or audio clip) typically provide lower quality than what you would get from a similar CD-ROM. This means that they might reduce the range of sound available, restrict the number of high and low pitches that you can hear, or reduce the number of colors and frames for video. Also, typically the size of the window is reduced and the length of the multimedia clip is shortened.

The first audio and video clips that were available on the Internet were short clips, because your computer had to completely download the file to its hard drive before it could begin playing the file. A newer technology called streaming extends the length of a multimedia clip from a few seconds to hours. Streaming enables your computer to begin playing the file as soon as the first bytes begin arriving, instead of forcing it to wait until the entire file has been downloaded.

Streaming does not use the same file transmission protocol. The protocol is the set of rules that two computers use to govern how they connect to one another, how they will break up the data into packets, and how they will synchronize sending them back and forth. Most data (textual) uses the Transmission Control Protocol (TCP); however, streaming uses the User Datagram Protocol (UDP). The basic difference between the two protocols is in how they check for transmission errors. For example, if you are downloading a
game from the Internet and a passing electrical interference garbles one of the packets, then TCP suspends the download while it asks the sending computer to resend the bad packet. However, with video and audio, if you miss a frame or two, or a word here and there, the loss is not very crucial; you probably won't even notice it. You would notice it, however, if the protocol took the extra time needed to ask for and receive a retransmission. Thus UDP allows the connection to occasionally lose packets without a problem.

CGS 3269 CPUs and Microprocessors 30

Operating Modes

Overview

As Intel developed the various microprocessors in the X86 line, one underlying factor in the development of each new processor was its compatibility with its predecessors. For example, the instruction set from one generation was carried over to the next. Another example is that of the registers, wherein a newer generation processor can use half of a 32-bit register as if it were a 16-bit register. It is this desire to maintain backward compatibility that has led to the rather odd structure and operation of many modern microprocessors. For example, Intel's use of segmented memory for its first generation microprocessors was to maintain this backward compatibility. The use of segmented memory was to allow the 1 MB address range to look like sixteen of the 64 KB address spaces found on the earlier generation chips. Unfortunately, addressing became convoluted and program writing was complicated, since the chips treated memory as sixteen separate blocks instead of a single broad range of addresses. Intel's answer to the backward compatibility and memory problems in subsequent generations was to create different operating modes. As a result, modern Intel microprocessors have three primary operating modes: real mode, protected mode, and virtual 8086 mode. All of these operating modes have been available in every Intel microprocessor since the introduction of the 80386. In the next several paragraphs we'll look at how each of these modes operates.
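The segmented addressing scheme just described can be made concrete with a short Python sketch of the standard 8086 address calculation (physical address = segment x 16 + offset, an arithmetic detail the notes do not spell out; the function name is invented for illustration):

```python
def real_mode_address(segment, offset):
    """Real-mode segment:offset -> 20-bit physical address.
    The 16-bit segment is shifted left 4 bits (multiplied by 16)
    and the 16-bit offset is added to it."""
    return (segment << 4) + offset

# The bottom and top of the 1 MB real-mode range:
assert real_mode_address(0x0000, 0x0000) == 0x00000
assert real_mode_address(0xFFFF, 0x000F) == 0xFFFFF   # 1 MB - 1

# Segments start every 16 bytes, so they overlap: many different
# segment:offset pairs name the same physical byte.
assert real_mode_address(0x1234, 0x0010) == real_mode_address(0x1235, 0x0000)

# The quirk behind the "high memory area": with the carry (21st
# address) bit honored, FFFF:FFFF reaches almost 64 KB past 1 MB.
assert real_mode_address(0xFFFF, 0xFFFF) == 0x10FFEF  # > 0xFFFFF
```

The last assertion is the source of the 1088 KB figure mentioned in the real-mode discussion: 1024 KB plus the nearly 64 KB reachable above the 1 MB boundary.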
Real Mode

This is the basic operating mode of the Intel microprocessors, and the only mode available on the first generation of Intel processors. Even in the most advanced Intel chip currently available for PCs, the Pentium III Xeon, real mode still emulates the 8086 processor and all of its limitations. Real mode derives its name from the exact correspondence between physical memory and the logical addresses used by the processor to address it. The logical addresses specified in programs that operate in real mode indicate the actual physical addresses in the memory, determined by the design of the computer hardware. In real mode the processor can directly address up to 1 MB of memory, the limit imposed by the 20-bit memory addresses generated by the first generation processors (note: 2^20 = 1,048,576 bytes = 1024 KB = 1 MB). To effect backward compatibility, Intel segmented the memory accessed in real mode. Instead of a single wide range of addresses, the processor locates memory in 64 KB segments. To specify a location, the processor uses the segment as a base address and the location within the segment as an offset.

Note: more recent versions of Intel's microprocessors extend the range of real mode addressing using a quirk of the architecture. In addition to the 20-bit address values there is also a carry bit, which can be used to indicate another memory segment, for a total of 1088 KB of real mode memory. Transitional operating systems between DOS and modern versions of Windows exploited this real-mode feature to create the high memory area.

All Intel microprocessors since the 8086 boot up in real mode. Software (typically the operating system) then switches the processor to a more advanced mode to take advantage of features such as greater memory addressability and memory protection.

Protected Mode

Intel introduced the Protected Virtual Address Mode in 1982 to give the 80286 processor the ability to reach all 16 MB of its addressable range. This new mode is more commonly
referred to simply as protected mode. As the name implies, protected mode operation allows the processor to protect ranges of memory, so that when multiple tasks are running simultaneously (think Windows here) they do not interfere with each other's memory. When the processor is running in protected mode, additional instructions are available, most of which are aimed at multitasking. In protected mode, software can be assigned one of four priority levels, which prevent applications of lower priority from accessing the memory assigned to an application of higher priority. For example, the operating system would normally be assigned the highest priority, in order to prevent the crash of a program with lower priority from affecting it. While operating in protected mode, the processor first checks all memory references against the protection levels; if the access is not allowed, the processor will not carry out the instruction and will signal an exception, causing an interrupt in the program making the out-of-range memory request.

In protected mode the memory remains segmented, but the segments are now used to manage the memory and tasks rather than acting as a constraint on the addressing possibilities. Segments can be any size, each defined by a special descriptor block that tracks how much memory is allocated to the segment, where in the overall address space the segment resides, and what level of protection has been assigned to the segment (actually, to the application resident in the segment). Protected mode also supports a flat memory model, which treats the entire memory as a single contiguous expanse of addressable memory (this helps support the UNIX OS).

Protected mode also supports demand paging. Demand paging is basically a technique that allows an application (a program) that requires a lot of memory to run with a lesser amount of physical memory. Demand paging works by slicing the memory into small sections called pages, which are managed individually; Intel systems fix a page at 4 KB. Pages of code and
data belonging to one application are swapped in and out of memory as needed — on demand by the application — via the processor. Protected mode imposes no addressing limits on the processor.

Virtual 8086 Mode

To accommodate the old DOS operating system within protected mode, Intel added the virtual 8086 mode to the 80386 processor and to all subsequent processors in the line (this mode is also commonly referred to as virtual X86 mode, or simply virtual mode). The name of this mode arises from the fact that in this mode the single processor's operation is divided into several virtual processors, each capable of running a separate DOS task as if it were a dedicated 8086 chip. Each virtual processor running under virtual mode can access up to the full 1 MB addressing limit of real mode and can use the same instruction set and facilities as would be available with a dedicated 8086 chip. Unlike real mode, virtual mode operates as part of protected mode and affords the same isolation between applications. Thus one or more protected mode applications can run at the same time as one or more virtual mode applications. An operating system running under protected mode usually manages the applications which are running in virtual mode. Virtual mode makes multitasking control software simple because all of the hard work is done in hardware. Off-the-shelf DOS programs will work without modification in virtual mode using any Intel processor from the 80386 onward. As the number of real-mode applications (such as those written for DOS) steadily declines, the importance of virtual mode will also decline. However, it will probably remain a part of the Intel microprocessor family until Intel discards this architecture, which will probably happen with the IA-64 machines.

Microprocessor History and Competition

"Microprocessor power doubles every 18 months," states Moore's Law, named for longtime Intel chairman Gordon Moore. Over the thus far roughly 20-year history of the
PC, the power increase factor is currently 4096 (comparing the 8088 and the P3). The designs of the processors, however, have far exceeded this factor, in that the first microprocessor contained 2300 transistors while the P3 contains over 9.5 x 10^6 transistors. Improvements in semiconductor fabrication technology made the increasing complexity of the modern microprocessor both practical and affordable. In the three decades since the first microprocessor was introduced, the linear dimensions of semiconductor circuits have decreased to 1/50 their original size. This is the result of going from 10 micron design rules to the current limit of 0.18 micron design rules. This means that microprocessor designers can now squeeze 2500 transistors into the same space where originally only one fit. This size reduction also facilitates higher speeds, with today's state-of-the-art microprocessors boasting speeds nearly 10,000 times faster than the first chip out of the Intel factory (1 GHz compared to the 108 KHz of the first chip).

Thanks to the unbelievable rise in popularity of the PC, Intel Corp. is now the largest manufacturer of microprocessors and the largest independent manufacturer of semiconductors in the world. They are, however, not alone. Today there are three major competitors which both clone Intel products and produce their own designs. These three companies are AMD, Cyrix, and IDT. A brief history and discussion of their current microprocessors appears in Appendix F.

The Future (at least some possible futures)

In 1997 Gordon Moore made a revision to his widely known "Moore's Law," stating that transistor miniaturization would hit a wall in about the year 2017. The wall he is referring to is a literal one, namely the atoms stacked between adjacent traces in a microchip. Traces are the microchip equivalent of the wires that carry electrical current in your home. Narrow beams of light are used to draw the pattern of the traces on a disk of
silicon coated with a photosensitive film. A chemical bath dissolves the film where the light struck the disk, and it is replaced with aluminum. With current microchip technology, the narrowest trace that can be made is about 0.10 microns, roughly the distance across several hundred atoms (a micron is one millionth of a meter). The Pentium III currently uses 0.18 micron traces, and Intel has plans to begin using 0.13 micron traces by the year 2001. The Taiwanese Semiconductor Corporation announced recently that it will begin production using 0.13 micron technology in October 2000. Chip makers have experimented with X-rays to draw the traces instead of light beams, since X-rays are much narrower than light beams.

As the traces grow smaller, the problems increase as the physical limit is approached. Gaps in the metal increase the electrical resistance, blocking the flow of electrons. Copper, whose conductivity is greater than that of aluminum (and which is therefore more efficient), is subject to corrosion and scratching. At some point the distance between the traces becomes as important as the width of the traces themselves. If the traces are too close together, the electromagnetic fields accompanying the flow of electrons begin to sap and distort the signals carried by adjacent traces. IBM recently introduced silicon-on-insulator (SOI) technology that protects transistors from stray electromagnetic fields.

Even if you work out all of the kinks — figuring out traces that will reliably and efficiently carry the current yet not melt from their own heat — you will reach a point where the walls that separate the traces are only about 5 atoms across. At that scale the insulating walls tend to spring electrical leaks, as electrons spill out of the sides of the traces on narrow turns (the tunneling effect). At this point you have gone just beyond the absolute physical limit to the number of transistors that will fit on a chip. Such a chip will contain about 19
trillion transistors, roughly 20 times the number of neurons in the human brain.

These hard physical limits to the number of transistors that can reside on a single chip have more or less dictated that tomorrow's super microprocessors will utilize parallel processing techniques. Designers are already at work on microprocessor systems that will combine anywhere from two to eight processors (depending upon the processor) in parallel. In parallel processing, pieces of a calculation are fed to different processors at the same time; each works on its piece independently of the others, and their results are combined at the end to produce the final result. Parallel processing requires overhead that a single processor does not experience: the calculations must be broken up and distributed across the various processors in an attempt to balance their loading, and the individual results must be combined in a smooth fashion, with no serious delays caused by any single processor. Microchips will only be able to go so far in this direction.

Of course, silicon is not the only answer. Xerox's Palo Alto Research Center has envisioned a microchip built from diamonds — complex, stable crystals whose structure provides a theoretical basis for building the logical structures that make up transistors from just a few atoms. Such a processor would contain more than one hundred billion bytes in a volume about the size of a sugar cube.

Danish scientists at the Danish Institute of Technology have reported that, using a tunneling-scanning microscope, they have been able to remove a single hydrogen atom from the pair of hydrogen atoms attached to one silicon atom in a hydrogen surface layer on a silicon chip. This leaves the remaining hydrogen atom jumping back and forth between the two positions. The possibility of storing information at the atomic level means that the information which today is contained on 1 million CD-ROMs could be stored on a
single CD-ROM using this technology. Practical application of this technology appears to be about 2 decades away.

Last month a breakthrough was announced in the development of a transistor controlled by a single molecule of carbon-60 (known as a buckyball) sandwiched between gold electrodes. The transistor was built at Lawrence Berkeley National Laboratory. This will eventually lead to the development of nanocircuits, which will further reduce the size and increase the speed of computers built from these nanoscale devices.

At an even smaller level, the subatomic level, we have the quantum processor. In the quantum world the normal rules of science begin to break down. A subatomic particle can be thought of as existing in several different states simultaneously; only through the act of "measuring" the particle is it "forced" into a single state. Researchers at NEC wanted to find out if quantum states, rather than voltage levels, could be used to encode and process information. The state of a quantum bit, or qubit, is determined only when it is observed. Physically, the qubit exists in an ambiguous "superposition of states," so a string of processor operations has the potential to simultaneously represent all possible strings of bits. Thus a large enough quantum computer could potentially hold all answers simultaneously. For example, to factorize a 200-digit number would take about one trillion years on today's fastest microprocessor; the same problem would require about an hour with a quantum processor.

What exactly the future holds for microprocessors is not clear; however, based on the past, we can expect them to be smaller, faster, and more powerful than ever before.