ECEN 5253 Digital Computer Design

Cache Memory (November 23, 2005)

The design of a memory for a general purpose computer must take into account that memory is not uniform but is organized in a hierarchy according to cost and speed:

  - register
  - cache
  - main memory
  - secondary memory (disk, network connection)

Moving down the hierarchy, cost per bit and speed decrease while size increases. Typical costs and speeds for these components are shown at the bottom of p. 469.

In recent years the widespread use of pipeline techniques in processor design has made processor performance increase much more rapidly than memory performance (fig. 7.37, p. 554). It is now necessary to have a cache hierarchy (fig. 7.3, p. 473) to keep up with the processor. It is now common practice to have several levels of on-chip and off-chip cache memory. Note that there is a separate L1 cache for instructions and data.

(Figure: the processor chip contains separate L1 instruction and L1 data caches, backed by L2 and L3 caches; speed and cost per bit increase toward the processor, while size increases away from it.)

The pipelined processor designs require one instruction per clock, which requires the data and instruction memories to respond within one clock. Memories this fast are most economical to build inside the same silicon chip as the processor. Both on-chip caches can be filled over the same external data bus, which helps reduce cost. Note that this implies the external caches and main memory hold both instructions and data. The on-chip caches cannot be very large since they would take up too much of the chip area. A larger external cache is usually provided that is still very fast and expensive. Cheaper but slower memory can be used for main memory.

At first glance the cache memories may not seem to be very effective at increasing overall system speed. We can retrieve data and instructions at a very high speed out of the cache memories and keep up with the CPU until we run out of space in the cache memories. At this point we must read in instructions and data from the slower main memory, so the average throughput seems to be limited by the slower main memory. The cache memories actually do increase system throughput because of locality of reference.

Locality of Reference

The French word cache (pronounced like "kawsh") means a place to store things that will be used later. The idea is that any instructions or data that are currently being used by the CPU are likely to be used again in the near future. If they are used again, the instructions and data will already be in cache and no main memory access delays will be necessary. If the desired instructions and data are found in the cache most of the time (a high cache hit rate), then the average memory response time can be made close to the fast cache memory response time rather than the slower main memory response time.

The technical term for using the same data and instruction addresses over and over again is locality of reference:

  - temporal locality: program loops produce similar sequences of addresses until the loop finishes
  - spatial locality: each segment of a program (instructions, data, etc.) is contained in a set of addresses that are close to each other
  - sequentiality: if address n is used, address n+1 is likely to be used next. This is true for instructions and for arrays of data.

Locality of reference implies that a block of contiguous addresses should be stored in cache at one time. Typical block sizes are from 4 to 32 words (32 to 256 bytes if a word is 4 bytes).
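The effect of locality can be seen in ordinary loop code. The fragment below is a minimal illustration, not an example from the course notes: the scalar sum and the loop instructions themselves exhibit temporal locality, while the sequential walk over a[] exhibits spatial locality and sequentiality, so one block fill serves several consecutive references.

```c
#include <stdio.h>

#define N 1024

/* Illustration of locality of reference (hypothetical example):
 *  - temporal locality: 'sum' and the loop instructions are reused every iteration
 *  - spatial locality / sequentiality: a[i] touches consecutive addresses, so a
 *    single 32-byte block fill serves 8 consecutive 4-byte elements
 */
int main(void)
{
    static int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;               /* sequential writes: spatial locality */

    for (int i = 0; i < N; i++)
        sum += a[i];            /* sequential reads plus reuse of 'sum' */

    printf("sum = %ld\n", sum);
    return 0;
}
```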
Cache Addressing

To store a block in cache it is necessary to choose a cache address for the block. The processor must then be able to find the block quickly whenever it is addressed again. It is essential that the address translation process between real addresses and cache addresses be extremely high speed, i.e. the entire cache access (address translation plus data transfer) must take place within as few clock cycles as possible.

Fully Associative Memory

The memory is not addressed in the usual manner. Instead it has a key word input which the memory compares to all of its stored key words. If there is a match, the memory outputs the entire word where the key was found. In our case we want the key word to be the real address, and the rest of the word should contain the data or instruction stored in the cache.

(Figure: each cache entry holds a valid bit V, an address tag, and data; a comparator per entry compares the address input against the stored tag and drives the hit output and the data output.)

The valid bit V indicates whether any data is stored at the cache address. If the address field in a cache location matches the address in and the valid bit is set, then a hit occurs, indicating that the cache already contains the desired address. The data can be read on the data out bus (data writes will be discussed later).

This method allows storing the data anywhere in the memory as desired. Note the large number of comparators needed, which makes this type of memory very expensive. The number of comparators is reduced by storing a block in each cache location instead of a single byte or word.

Direct Mapped Cache

A typical direct mapped cache is shown in fig. 7.7, p. 478. The cache address of a block is formed from a subset of the bits in the real address of the block. The block address is usually divided into several fields, as shown in fig. 7.7, p. 478, where the index bits are used as the cache address. The block offset bits determine which byte is selected from within the block. The tag field contains the leftover bits that form the rest of the real address not used in the index or block offset.

A direct mapped cache has only a single comparator, making it much cheaper. This comparator compares the tag of the data in cache with the tag bits in the address. If the tag bits match and the cache has valid data at the index location, then there is a hit. Otherwise there is a miss, and the cache does not contain the data for that address.

Note that addresses with the same cache index all map into the same physical location in the cache (see fig. 7.5, p. 475). Addresses with different tags but the same index cannot be stored simultaneously (we can't put data anywhere we want), which can significantly decrease the cache hit rate.

A direct mapped cache with several words per block (more realistic) is shown in fig. 7.9, p. 486. Part of the address is used as a block offset to control the MUX which selects the correct word to put on the data bus. Note that hit signals are not used in the address decoding. Thus the cache memory puts out data regardless of whether a hit occurred or not. When reading data from cache, the data and tag memories can operate in parallel. This does not work so well for writing, as we shall see.
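A direct-mapped lookup can be sketched in a few lines of C. The field widths below (32-byte blocks, 256 lines, 32-bit addresses) are assumptions chosen for illustration, not parameters taken from the figures; the point is that tag, index, and block offset are simple bit fields of the address and a hit requires only one tag comparison.

```c
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BYTES 32      /* assumed block size: 5 offset bits     */
#define NUM_LINES   256     /* assumed number of lines: 8 index bits */

struct line {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_BYTES];
};

static struct line cache[NUM_LINES];

/* Split a 32-bit address into block offset, index, and tag,
 * then perform the single tag comparison of a direct-mapped cache. */
bool lookup(uint32_t addr, uint8_t *byte_out)
{
    uint32_t offset = addr & (BLOCK_BYTES - 1);          /* low 5 bits     */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_LINES;  /* next 8 bits    */
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_LINES);  /* remaining bits */

    struct line *l = &cache[index];
    if (l->valid && l->tag == tag) {      /* hit */
        *byte_out = l->data[offset];
        return true;
    }
    return false;                         /* miss: block must be filled */
}
```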
Set Associative Cache

This is a compromise between fully associative and direct mapped cache. The cache is divided into several sets. The index bits in the real address are used to select a unique location in each set, just like direct mapped cache. Each set has a comparator that determines a hit from that set, again just like direct mapped cache. However, since there are several sets, it is possible to store several blocks simultaneously that have the same cache index. More blocks with the same index are allowed at the expense of adding additional comparators. Fig. 7.17, p. 503 shows a four-way set associative cache, which allows storing four blocks simultaneously with the same cache index.

If we define memory size as the number of data values stored (don't count the tag store size), then

    cache size = 2^(index bit size) × block size × set size
    main memory size = 2^(tag bit size) × 2^(index bit size) × block size

Direct mapped cache corresponds to a set size of 1, while fully associative cache corresponds to an index bit size of 0. If the cache and main memory size are fixed, the set associative cache must store more bits in the tag as the set size increases. The above two equations can be used to eliminate the index bit size and solve for the tag bit size:

    2^(tag bit size) = set size × main memory size / cache size

Sector Organized Cache

This is similar to a direct mapped cache except that all blocks in the same sector have the same tag. This allows considerable savings in memory size since it is no longer necessary to store a tag entry for each block, as in the previous designs. The price paid is that an entire sector is invalidated on any single cache miss for one of the blocks in the sector (there is no place else to write the block but one sector, so the stored tag for the sector must be changed). The access time for the sector organized cache is the same as the direct mapped cache since only a single sub-block is read in at a time. The hit rate for the sector cache is lower unless the data has a high degree of sequentiality. This is why sector organized cache is often used for instruction cache, since instructions do have a high degree of sequentiality.

Cache Miss on Read

When the desired address is not in the cache (no hit), a cache miss occurs. First, a quick decision must be made about which block must be freed to store the new information. Then the cache must be filled by reading data from the next lower level of memory. All the while this is occurring, the CPU must be stalled until the memory data is available.

Block Replacement

This is not a problem with the direct mapped cache, since there is only one place to store each block. If that one place is already occupied, then that block is replaced. For the set associative caches, a decision must be made as to which member of the set to replace. There are many block replacement algorithms which could be used to replace blocks in cache memory. The decision algorithm must be very fast to reduce the time the CPU is stalled. This short time interval requires that the decision algorithm be implemented in hardware. Then the algorithm must be simple, to keep the hardware expense down and the speed up.

  - Random replacement: a block is chosen at random. A hardware pseudorandom bit generator (see the VLSI course) can be used to decide which block in a set to replace.

  - Least Recently Used (LRU): the block that has been unused for the longest time is chosen for replacement. With a two-way associative cache the LRU algorithm is easy to implement. Since we only have two choices, we simply choose to replace the entry that was not referenced last. This requires only a single bit to keep track of which entry was referenced last (a short sketch of this appears after this list). LRU is much more complex to implement for larger set sizes. Studies have shown that LRU does not do significantly better than random replacement for large caches.
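The single-bit LRU scheme for a two-way set associative cache can be sketched as follows. The structure and sizes are assumptions for illustration only: each set keeps one lru bit naming the way that was not referenced most recently, which is exactly the way chosen as the victim on a miss.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 128            /* assumed number of sets */

struct way {
    bool     valid;
    uint32_t tag;
};

struct set {
    struct way way[2];
    uint8_t    lru;             /* index (0 or 1) of the least recently used way */
};

static struct set cache[NUM_SETS];

/* Returns the way that hit, or -1 on a miss.
 * On a hit, the other way becomes least recently used. */
int access_set(uint32_t set_index, uint32_t tag)
{
    struct set *s = &cache[set_index];

    for (int w = 0; w < 2; w++) {
        if (s->way[w].valid && s->way[w].tag == tag) {
            s->lru = (uint8_t)(1 - w);      /* the other way is now LRU */
            return w;
        }
    }
    return -1;                               /* miss */
}

/* On a miss, replace the LRU way with the new tag. */
void fill_set(uint32_t set_index, uint32_t tag)
{
    struct set *s = &cache[set_index];
    int victim = s->lru;

    s->way[victim].valid = true;
    s->way[victim].tag   = tag;
    s->lru = (uint8_t)(1 - victim);          /* newly filled way is most recent */
}
```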
Writes with Cache Memory

Writes are much less frequent than reads. For the instruction cache, a read occurs for every instruction fetch and there are no writes from the CPU (only instruction cache fill writes). For the data cache, loads (reads from memory) have a much higher instruction frequency than stores (writes to memory). Writes to data cache should be implemented as efficiently as possible without slowing down reads.

In a typical cache (for example fig. 7.17, p. 503) we are able to read a data block and read the tag in parallel, since reading an incorrect block does no harm if there is a cache miss. Unfortunately, we must first read the tag and get a hit before writing into the cache block. If a miss occurs, data would be written into an incorrect block, which would be a disaster. We must wait to write data into cache until after it is certain that the tags match. This means that writes take longer (read tag followed by write data) than reads (read tag and read data at once), or we must slow reads down to match writes.

Cache Hit on Write

When the CPU wants to write data to a cache, a quick decision must be made about whether to write the data to cache, to lower level memory, or to both memories. Let us first assume that the cache block containing the data address is already in cache; what is done when the block is not in cache will depend on what we do if it is. The block valid bit must be reset if the new data is not written into cache. Resetting the block valid bit requires a cache write operation, so we might as well write the data into cache, since it takes the same time. Therefore a cache hit on CPU write always causes a cache write. We have two choices about whether to write to lower level memory on a cache hit (a short code sketch contrasting them appears at the end of this subsection):

1. write through: The data is written to both cache and lower level memory at once. This is easiest to implement in hardware.

2. write back: The data is written to cache immediately. The block is marked as modified (dirty) by setting a flag bit in the cache. Later, when it is necessary to replace the block, dirty blocks will be written back to lower level memory. Unmodified (clean) blocks do not need to be written back to lower level memory.

The write back scheme is attractive because it reduces the number of lower level memory write operations if more than one CPU write to the same block occurs, and can increase system throughput. The maximum benefit is achieved by leaving the block in cache as long as possible so that we can write repeatedly to it without having to go to lower level memory. Thus, when using write back cache, lower level memory writes occur only when one of the following things happen:

1. A cache miss causes the replacement of a dirty block. The replaced dirty block is written to lower level memory (not the block that caused the miss).

2. A process switch causes all dirty blocks in cache to be written back to lower level memory (cache flush) and all blocks to be marked as invalid. This is necessary since the new process will have different instructions and data at the same addresses used by the old process.

3. The CPU issues an I/O instruction. There are no separate I/O instructions in many processors, for example the MIPS instructions. In this case the I/O devices are wired to look like memory (memory mapped I/O). To function correctly, the communication between the CPU and the I/O devices must be direct and not buffered by the cache.

4. Any data shared between processes (printer queues, etc.) or between processors in a multiprocessor environment should usually be written to main memory immediately. If it is not, one must deal with the cache consistency problem, which we will discuss next semester.
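The following sketch contrasts the two hit-on-write policies. It is an illustrative model, not hardware from the notes: write_word_to_memory is a hypothetical stand-in for the lower level memory interface, and the dirty field is the flag bit the write back scheme sets on a modified block.

```c
#include <stdint.h>
#include <stdbool.h>

struct cache_line {
    bool     valid;
    bool     dirty;              /* used only by the write-back policy       */
    uint32_t tag;
    uint32_t data[8];            /* assumed 32-byte block of 4-byte words    */
};

/* Hypothetical lower-level memory interface (stub for illustration). */
static void write_word_to_memory(uint32_t addr, uint32_t value)
{
    (void)addr; (void)value;     /* a real system would access the next level */
}

/* Write-through: on a write hit, update cache and lower-level memory at once. */
void write_hit_through(struct cache_line *line, uint32_t addr,
                       uint32_t word_offset, uint32_t value)
{
    line->data[word_offset] = value;
    write_word_to_memory(addr, value);      /* every write goes to memory */
}

/* Write-back: on a write hit, update only the cache and mark the block dirty.
 * The block is written to lower-level memory later, when it is replaced.    */
void write_hit_back(struct cache_line *line, uint32_t word_offset, uint32_t value)
{
    line->data[word_offset] = value;
    line->dirty = true;                     /* remember to write back on replacement */
}
```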
Cache Miss on Write

Now we can consider what to do when a CPU-write cache-miss condition occurs.

1. no fetch on write miss (no write allocate, or write around): The data is not written to cache but is written to lower level memory only. Since the data is not in cache, it is not necessary to change any cache block valid bits. This is the easiest to implement in hardware, and is most often used with write through cache.

2. fetch on write miss (write allocate): The data word and its corresponding block are read from lower level memory and put into cache. The data is then written into cache. Filling the block into cache has the advantage that whatever block is written into will probably be read or written again soon. It is used most often with write back cache, since future reads and writes to this location will be to cache only, saving lower level memory accesses.

Write Buffers

Note that a cache miss on write always requires some kind of lower level memory operation, whether or not fetch-on-write is done. However, it is not necessary for the CPU to wait for the write operation to complete in lower level memory (quite different from a read, since the CPU must wait for the data it wants before it can proceed). It makes sense, then, to store the write data in a place called the write buffer and let the CPU continue with the next instruction. As mentioned above, the CPU must wait for memory reads but can proceed on memory writes by storing the write data temporarily in the write buffer. CPU delays are reduced if we allow a memory read operation to take precedence over any unfinished writes in the write buffer.

The trouble with write buffers is that it is difficult to ensure that the correct data gets out of the write buffer before it is used by a read. We would have to check all of the data addresses in the write buffer before allowing a read to proceed. This makes the write buffer into another cache. Since the write buffer usually has only a few words stored in it, it is small enough to be implemented as a fully associative cache.
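A write buffer can be modeled as a tiny fully associative store that is searched before any read is allowed to bypass pending writes. This is a behavioral sketch under assumed sizes and names, not the hardware design from the notes.

```c
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4                 /* assumed write buffer depth */

struct wb_entry {
    bool     valid;
    uint32_t addr;
    uint32_t data;
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* Before letting a read bypass pending writes, check every buffered address
 * (the fully associative search described in the notes). If the read matches
 * a pending write, return the buffered data so the read sees correct data.  */
bool read_check_write_buffer(uint32_t addr, uint32_t *data_out)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data_out = write_buffer[i].data;
            return true;             /* forwarded from the write buffer */
        }
    }
    return false;                    /* safe to read from the next memory level */
}

/* CPU write: place the data in the buffer (if there is room) and continue. */
bool buffer_write(uint32_t addr, uint32_t data)
{
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (!write_buffer[i].valid) {
            write_buffer[i] = (struct wb_entry){ true, addr, data };
            return true;
        }
    }
    return false;                    /* buffer full: the CPU must stall */
}
```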
Floating Point Pipelines (August 27, 2008)

In the MIPS pipeline that we studied last semester, only one pipeline stage (EX) is allowed for the numerical calculations to take place. As long as the numerical calculations can be done by a simple ALU, the amount of time available during one pipeline stage should be adequate. However, more complicated arithmetic operations such as multiply, divide, and floating point operations can require complex hardware with significantly longer delays than a simple ALU.

One solution might be to extend the super-pipelining idea to the EX stage and have multiple EX stages. Well designed integer multiply hardware has about twice the delay of the carry chain in the ALU. This could be incorporated with two EX stages (the integer multiply occupies both stages before EX2/MEM). The throughput for ALU instructions is unaffected by this scheme, but the increased pipeline latency will require additional operand forwarding and longer instruction latencies.

The same technique is not practical for integer division. Multiplication can be compressed into a few pipeline stages because most of the computation can be done in parallel. Unfortunately, division algorithms are highly sequential and only a limited amount of work can be done in each pipeline stage, which requires a large number of stages to complete the algorithm. Floating point operations also require many pipeline stages.

Single Pipelines

When all instructions run through a pipeline with the same number of stages, it is usually not practical to make the pipeline long enough to accommodate the longest multistage instruction, since it would take too many stages. Instead, the multistage instructions are broken down into simpler parts. The EX stage contains all of the hardware for each step, and the EX stage is repeated the appropriate number of times until the multistage instruction finishes. Meanwhile, the IF and ID stages are stalled if EX takes more than one clock cycle.

This scheme preserves the simplicity of the simple linear pipeline and does not increase the latency of the simple ALU instructions (only one EX cycle). The multistage instructions cause many stall cycles (the instructions become essentially unpipelined), but as long as the instruction frequency of the multistage instructions is low, stalls will be infrequent. Typical instruction frequency data show that the multistage instructions (integer multiply, divide, and floating point operations) have very low frequencies except in the most intense floating point applications. Division is frequently implemented using this scheme, even in high performance superscalar processors.

Parallel Pipelines

Different parallel pipelines for different instructions (for example, a separate integer unit and floating point unit) can reduce the number of stalls caused by the multistage instructions. The IF and ID stages are not automatically stalled when a multistage instruction begins execution. The multistage instructions still have high instruction latencies, but the stalls are reduced since a new instruction can start executing every clock cycle. The high instruction latency means that later instructions that need the results of previous multistage instructions have to be stalled until the latency period is over.

Division would require the longest pipeline but has the lowest instruction frequency of the multistage instructions. The low instruction frequency makes it uneconomical to devote extensive hardware to a long division pipeline. Instead, the division instruction repeatedly uses the same EX hardware over and over again.

Since the pipelines for the different instruction types are in parallel, ALU instructions can proceed before the multistage instructions finish. If the ALU instructions use the results of a multistage instruction still in its pipeline, a RAW data hazard is possible. These can be avoided by stalling or operand forwarding, similar to the fixed length pipelines studied previously. The problem is that there are many more situations that can cause stalls, which complicates the design.

If ALU instructions do not use results from any multistage instructions already in the pipeline, then the ALU instructions might finish before a previous multistage instruction, multiplies might finish before previous divides, and so on. Recall that we could ignore WAW and WAR data hazards only because instructions start and finish in order in a normal pipeline. The variable length pipeline described above does not cause WAR hazards, because instructions start and read their operands in ID in order, before the writes of subsequent instructions. Thus subsequent writes will automatically be after previous reads.

Another problem is that more than one instruction might get to MEM and WB at the same time. It is possible to have duplicate MEM and WB stages on the end of each pipeline, but this would require multiple ports on the memory and register file (too expensive, usually). It is true that many processors provide a separate register file for floating point operations, so that an integer and a floating point instruction could be in WB at the same time. However, floating point operations could still cause structural hazards with other floating point operations. A single MEM and WB stage, as described above, creates a structural hazard that must be avoided by stalling instructions so that only one instruction at a time enters MEM and WB.
Maintaining Precise Exceptions

Since instructions can finish out of order in floating point pipelines, it is much more difficult to make precise exceptions. Recall that our previous method of making precise exceptions was to handle them in order as the instructions reached the WB stage. Now instructions do not necessarily reach WB in order.

We have already mentioned that precise exceptions are necessary for virtual memory systems, so that page faults are handled in order. The IEEE floating point standard also requires precise exceptions. The standard defines several exceptions which allow the exception handler the opportunity to fix the arithmetic result causing the exception. Thus the floating point exception handlers must be guaranteed to run before any following instructions use the result of the floating point instruction.

The solutions to the problem of precise floating point exceptions fall into the four categories below.

1. Give the processor two modes of operation, one with precise floating point exceptions and one without. The imprecise exception mode is faster because more floating point instructions can be in the pipeline at the same time. With precise exceptions, new floating point instructions are started only when it is determined that previous floating point instructions will not cause exceptions. This solution is used by the DEC Alpha, IBM Power1 and Power2, and MIPS R8000.

2. Make precise exceptions by storing the results of an instruction temporarily until all previous instructions have finished. Since the instruction results (register or memory writes) are written in order, and only after the exceptions for each instruction are handled, the register file always has the correct values to restart the pipeline after the exception is handled. This is similar to handling all exceptions in the WB stage in the simple MIPS pipeline. This solution requires more sophisticated hardware, which will be studied in more detail later. This technique is used by most superscalar processors.

3. Allow imprecise exceptions and leave it up to the exception handler software to make the exceptions precise. The exception handler software must have access to sufficient information about every instruction in the pipeline. In architectures with a single integer unit, as in the MIPS pipeline, the integer instructions finish in order and only the floating point instructions in the pipeline need to be considered by the exception handler. A queue of pending exceptions is needed to execute the exception handler software in order. It is also difficult to restart the pipeline after the exception handler is finished. Consider the following instruction sequence:

       DIVF F0, F2, F4
       MULF F10, F10, F8
       ADDF F12, F12, F14

   Suppose DIVF causes an exception after ADDF completes but before MULF completes. When the pipeline is restarted, MULF must be executed again but ADDF must not be executed. This solution is used by the SPARC processors.

4. Always require that previous instructions cannot cause exceptions before allowing a new instruction to start execution. This is most effective when exception detection hardware is added early in the multiple EX floating point pipeline stages. This solution is used in the MIPS R2000/3000, MIPS R4000, and the Intel Pentium.
Virtual Memory (December 8, 2005)

All of the addresses generated by the processor must be found in physical main memory at the bottom of the cache memory hierarchy. Main memory size on current systems is usually limited (to about 1 GB currently) by the expense of enlarging the memory. Two problems with a finite amount of main memory can be handled with a virtual memory system design.

1. Programs can be bigger than the available main memory (no matter how big memory is, someone will want more). Without a virtual memory, programs must be divided into pieces called segments that fit into physical memory. The segments are designed by the programmer to be independent so that they do not have to be in memory at the same time. The programmer has to write code to bring in the segments from disk into main memory, overlaying the segments as needed to fit into the available memory space.

2. Memory must be shared between different processes (more than one process must be in memory at the same time for context switching to be efficient). The operating system assigns different parts of memory to each process. The program code for each process must be relocatable, to allow the operating system to put the process code anywhere at any time.

To solve these problems in a manner transparent to users, the address generated by the processor (the logical address or virtual address) is translated by the virtual memory into a physical address that can be anywhere in main memory (fig. 7.19, p. 512). When there is not enough main memory, disk storage is used to store main memory contents until needed. In this way each process can generate addresses for a virtual machine consisting of the entire address space. With the following correspondence in terminology, virtual memory acts very much like another level in the cache hierarchy, with disk as the next lowest level of memory (see fig. 7.22, p. 518):

    cache miss   <->  page fault / address fault
    cache block  <->  page / segment

(Figure: physical memory holding variable-length segments leaves unused gaps between the segments, i.e. external fragmentation; with fixed-size pages only the unused part of a program's last page is wasted, i.e. internal fragmentation.)

Information is transferred between disk and main memory either in variable sized segments or in fixed size pages. With segments, external fragmentation limits memory usage efficiency. With pages, internal fragmentation is a problem.

Program Segments

It has become common practice to divide program code into segments by function. These segments are designed with memory protection in mind rather than overlaying. Four common types of segments are:

1. code: fetched as instructions
2. absolute data: accessed with absolute addressing mode
3. stack data: accessed with a stack base pointer register (bp)
4. data: data accessed any other way

The point of these segments is to restrict the ways different segments can be used. It is taken to be an unintended error, for example, to write data into the program segment or to fetch instructions from one of the data segments. Many systems let the user define as many segments as he or she likes, and also allow the user to control the access of each segment by setting protection bits for each segment.
Segmented Virtual Memory

Typical hardware support required for virtual memory with variable length segments is shown below.

(Figure: segment registers for the code, absolute data, stack, and data segments; the selected segment base is added to the virtual address to form the physical address, a comparator against the segment length signals an address fault, and read/write protection bits signal a protection fault.)

Intel processors allow a much larger number of segments than the 4 shown above. The segment portion of the virtual address selects a segment descriptor from an area of memory called the segment descriptor table. Segment descriptors take the place of the segment registers. Intel processors also have an elaborate protection scheme that allows user processes to call the operating system without using interrupts. This has advantages for pipelined processors.

Memory Paging

The difficulty of dealing with variable sized segments has led most hardware manufacturers to divide memory into fixed size pages, usually several KB in length. Hardware support for a simple paged virtual memory is in fig. 7.21, p. 517. Note that no adder is required, since pages always begin at an address that ends in all 0's. No comparator is needed, since all pages are the same length.

The total amount of memory taken up by the page tables must be kept to a reasonable size. The memory address, consisting of n_a bits, is divided into two fields, n_v bits in the virtual page number and n_o bits in the offset, where

    n_v + n_o = n_a

    address = [ virtual page number (n_v bits) | offset (n_o bits) ]

    number of words in virtual memory = 2^(n_a) = 2^(n_v) × 2^(n_o)
    number of pages in virtual memory = 2^(n_v)
    size of a page in memory = 2^(n_o)

Since there must be one page table entry (PTE) for each page in the memory, the page table looks like the following, where n_p is the total number of bits in the page table entry.

(Figure: the page table has 2^(n_v) entries; each PTE of n_p bits holds a valid bit V, the physical page number, and protection bits.)

For large memories the page tables can get quite large. For example, if the page size is 8K bytes (DEC Alpha AXP 21064), then n_o must be 13 (8K = 2^13). If the address is 43 bits, then n_v must be 30 (43 - 13) and there will be 2^30 = 1G PTEs. The size of the PTE depends on the maximum number of physical pages implemented in memory. Around 8 bytes is usually sufficient for the physical memory address and the protection bits. This gives

    page table size = number of pages × PTE size = 1G × 8 bytes = 8 GB

Thus just the page table would be about as large as, or larger than, a typical physical memory.
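The arithmetic above is easy to check with a few lines of C. The sketch below just evaluates the formulas for the numbers used in the example (43-bit virtual address, 8 KB pages, 8-byte PTEs); nothing here is specific to a real page table implementation.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const unsigned n_a = 43;                 /* virtual address bits           */
    const unsigned n_o = 13;                 /* page offset bits (8 KB = 2^13) */
    const unsigned n_v = n_a - n_o;          /* virtual page number bits = 30  */
    const uint64_t pte_bytes = 8;            /* assumed PTE size               */

    uint64_t num_pages       = 1ULL << n_v;              /* 2^30 = 1G pages    */
    uint64_t page_table_size = num_pages * pte_bytes;    /* 8 GB               */

    printf("pages = %llu, single-level page table = %llu bytes\n",
           (unsigned long long)num_pages,
           (unsigned long long)page_table_size);
    return 0;
}
```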
Multi-level Paging

Multi-level paging can reduce the total size of the page table. Consider the following three-level paging scheme with the same numbers as the previous example. The original logical page number is divided into three fields, each with its own page tables. The page table register points to the level 1 page table. The level 1 and level 2 page table entries contain the physical page number of the next level page table. The level 3 page table contains the physical page number of the desired memory location. The lowest three bits of the PTE address are zero, since each PTE is 2^3 = 8 bytes in length.

(Figure: the 43-bit logical address is divided into three 10-bit fields (level 1, level 2, level 3) and a 13-bit offset; the page table register points to the level 1 page table, each level 1 and level 2 PTE points to a page table at the next level, and the 30-bit physical page number from the level 3 PTE is concatenated with the offset to form the physical address.)

The logical page address has been divided into three 10-bit fields so that the page tables for each of the three levels will be the same size:

    level 1 PTEs = level 2 PTEs = level 3 PTEs = 2^10 = 1K
    PTE size = 8 bytes, therefore page table size = 1K × 8 bytes = 8 KB

A page table fits exactly into one page. This simplifies the task of managing the page tables. A single page table is much smaller than the 8 GB page table with single level paging. Unfortunately, there is now more than one page table. There is one page table at the first level, but each level 1 PTE points to a different level 2 page table. There may be as many as 2^10 level 2 page tables, and for each level 2 PTE there can be as many as 2^10 level 3 page tables. The total memory taken up by page tables can be as much as

    memory for page tables = number of tables × table size
                           = (1 + 2^10 + 2^10 × 2^10) × 8 KB ≈ 1M × 8 KB = 8 GB

which does not save space. However, it is not necessary for all of the level 2 and level 3 page tables to be present if the program segments do not use all of the address space. For example, a program can fit into a single level 3 page table if it is shorter than 2^10 × 2^13 = 8M bytes. Having a single level 1 page table, a single level 2 page table, and a single level 3 page table would require only 24 KB, which is acceptable. Although the maximum amount of memory is not reduced, clearly the average amount of memory used by page tables is reduced by using multi-level paging.

This design has several advantages (a small code sketch of the address fields and the table walk follows this list):

1. Page tables fit naturally into a memory page.
2. The physical address is generated by concatenation, not addition, which is much faster.
3. The physical address in the PTE is concatenated (not added, as with segments) with a field in the virtual address to produce an address for the next level page table, which allows the page tables to be anywhere in memory.
4. The disk address can be stored in the PTE of a page not in memory (invalid page).
5. There are protection and flag bits available for each page, including the page tables themselves. In addition to read, write, and execute protection, additional flag bits usually include a valid page bit, which indicates whether the page is in memory (V = 1) or on disk (V = 0). Other flag bits might indicate whether the page is uncacheable (blocks from this page always generate a cache miss) or unpageable (cannot be swapped back out to disk while the process is running).
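The sketch below models the three-level walk in software. The structure names, the toy backing store, and the PTE bit layout are hypothetical illustrations; the field widths (10/10/10/13) and the 8-byte PTE are the numbers from the example above.

```c
#include <stdint.h>

#define OFFSET_BITS 13                    /* 8 KB pages                 */
#define FIELD_BITS  10                    /* each level indexes 1K PTEs */
#define PTES_PER_TABLE (1u << FIELD_BITS)

/* Hypothetical 8-byte PTE encoding for this sketch only:
 * bit 0 = valid, bits 13..42 = physical page number. */
typedef uint64_t pte_t;
#define PTE_VALID(p)  ((p) & 1u)
#define PTE_PPN(p)    (((p) >> OFFSET_BITS) & ((1ULL << 30) - 1))

/* Toy backing store standing in for physical memory holding page tables. */
#define NUM_TOY_TABLES 8
static pte_t toy_tables[NUM_TOY_TABLES][PTES_PER_TABLE];

static pte_t *table_at(uint64_t ppn) { return toy_tables[ppn % NUM_TOY_TABLES]; }

/* Walk the three levels: each 10-bit field of the virtual address indexes one
 * table, and the final physical address is the level-3 physical page number
 * concatenated with the page offset (concatenation, not addition). */
uint64_t translate(uint64_t va, uint64_t root_ppn)
{
    uint64_t offset = va & ((1ULL << OFFSET_BITS) - 1);
    uint64_t i1 = (va >> (OFFSET_BITS + 2 * FIELD_BITS)) & (PTES_PER_TABLE - 1);
    uint64_t i2 = (va >> (OFFSET_BITS + FIELD_BITS)) & (PTES_PER_TABLE - 1);
    uint64_t i3 = (va >> OFFSET_BITS) & (PTES_PER_TABLE - 1);

    pte_t e1 = table_at(root_ppn)[i1];        /* level 1 */
    pte_t e2 = table_at(PTE_PPN(e1))[i2];     /* level 2 */
    pte_t e3 = table_at(PTE_PPN(e2))[i3];     /* level 3 */

    if (!(PTE_VALID(e1) && PTE_VALID(e2) && PTE_VALID(e3)))
        return (uint64_t)-1;                  /* page fault in a real system */

    return (PTE_PPN(e3) << OFFSET_BITS) | offset;
}
```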
Translation Lookaside Buffer

A paged main memory requires the use of one or more page tables for each memory access. In our previous example, three page tables are read for each main memory access. This seems to mean that a total of four main memory operations (three page table reads and then the data transfer) are needed each time there is a cache miss. Note that these extra memory operations to read the page tables are done by the Memory Management Unit (MMU); the CPU just generates the virtual address. Also, these extra main memory cycles only affect the cache fill time T_fill, and it is still possible to have a short average memory access time T_mem if the cache hit probability is high.

The cache fill time T_fill can be considerably reduced if the page tables themselves can be in cache. The page tables might not be in the same cache as the instruction or data cache, since those caches are designed to respond to addresses generated by the CPU, not the MMU. Furthermore, it is not necessary to store all of the page tables in the cache. All that is needed is a translation between the main memory page number in the virtual address (called the virtual page number, or VPN) and the physical page number (PPN). This cache of translations between the VPN and PPN is called a translation lookaside buffer (TLB). See fig. 7.23, p. 522. The TLB is like a single level page table permanently in cache, but there is not enough room in the TLB to store all of the single level page table entries for all of main memory.

On a cache miss, if one gets a TLB hit, then it is not necessary for the MMU to read the page tables in main memory to find the correct memory location for the CPU. A TLB miss is the only time that the cache fill time would be increased by reading the page tables. A high TLB hit probability would almost eliminate the added delays of reading the page tables. A very high TLB hit rate is achieved with fairly small TLBs (1K entries).

Conflict misses in the TLB are minimal if the VPNs generated by the processor are random. Unfortunately, the low bits of the VPN should be random but are not, because of the way addresses are assigned to the various segments of the program. Lower miss rates are achieved if the bits in the VPN are hashed (mixed up) to generate a pseudorandom cache address at which to store the PPN entry. The hashing function randomizes TLB addresses to minimize contention for the same locations in the TLB, and it must be implemented in hardware to be very fast (a small sketch of a hashed TLB lookup appears at the end of this section).

When a context switch occurs, all of the TLB entries for the current process must be flushed, just as the cache must be flushed. However, it is not necessary to flush those page translations that correspond to the operating system addresses. It is common to use two different TLBs, one for the current process and one for the operating system. Then only the TLB for the process needs to be flushed when doing a context switch.
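The sketch below models a small direct-mapped TLB whose index is a hash of the VPN. The hash shown (an XOR fold of the VPN) and the sizes are illustrative assumptions, not the hardware hashing function referred to above; the point is only that the VPN is scrambled before it selects a TLB location, and that a hit returns the PPN without walking the page tables.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 1024                 /* the "fairly small" TLB size from the notes */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;                        /* virtual page number (the tag) */
    uint64_t ppn;                        /* physical page number          */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Assumed hash: XOR-fold the VPN so that nearby pages in different program
 * segments do not all compete for the same TLB locations. */
static unsigned tlb_hash(uint64_t vpn)
{
    return (unsigned)((vpn ^ (vpn >> 10) ^ (vpn >> 20)) % TLB_ENTRIES);
}

/* TLB hit: return the PPN with no page table reads.
 * TLB miss: the MMU must walk the page tables and refill this entry. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn_out)
{
    struct tlb_entry *e = &tlb[tlb_hash(vpn)];
    if (e->valid && e->vpn == vpn) {
        *ppn_out = e->ppn;
        return true;
    }
    return false;
}

void tlb_refill(uint64_t vpn, uint64_t ppn)
{
    struct tlb_entry *e = &tlb[tlb_hash(vpn)];
    *e = (struct tlb_entry){ .valid = true, .vpn = vpn, .ppn = ppn };
}
```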
Cache with a TLB

The question we have avoided up to now is what address is used for the cache: the virtual address (VA) or the physical address (PA)?

1. Put the TLB in series with the data cache and store PAs as the data cache tag, as in fig. 7.24, p. 525.

   PA tag advantages: there is a one-to-one correspondence between cache and main memory, so a block in main memory is represented by only one block in cache. If data in cache is changed (CPU write), it is not necessary to worry about updating other blocks in cache that correspond to the same main memory location.

   PA tag disadvantages: the cache must wait for the TLB to provide the physical address. This puts the TLB on the critical timing path that determines the processor clock period. This disadvantage can be overcome by putting the TLB in a separate pipeline stage of a superpipelined processor.

2. Put the TLB and data cache in parallel, store VAs in the data cache tag, and use an RTB to remove alias VAs from the data cache.

   (Figure: the CPU's virtual address, split into VPN and offset, indexes and tags the data cache directly while the TLB translates in parallel; on a miss, the reverse translation buffer identifies any stored alias VA to be replaced.)

   VA tag advantages: if the VA is used to address the cache, the cache does not wait for the TLB. The reverse translation buffer (RTB) does not have a significant impact on memory delay, since the RTB is used only on a cache miss.

   VA tag disadvantages: unfortunately, there is not necessarily a one-to-one correspondence between VAs and PAs. For example, the operating system and the current process may use different VAs to refer to the same physical address space. The duplicate VAs are called aliases. When a CPU write modifies data in the cache, data consistency requires that cache locations corresponding to all aliases must also be updated. One way to do this is to allow only one valid alias in cache at a time, and thus avoid having to update aliases. A reverse translation buffer (RTB) is used to store the VA corresponding to each PA. When a PA is generated by the TLB on a cache miss, the RTB is checked to see if another VA already corresponds to this PA. If so, then the old VA entry is invalidated before the new one is read into the data cache. Unfortunately, the RTB would need an entry for each main memory location (each PA), which would be too large. VA-tagged cache is not commonly used because of the difficulty of removing aliases.

Memory Design Example

The memory hierarchy for the DEC Alpha AXP 21064 is described below. Note the following features of the design (a short code sketch of the cache address fields follows the list):

- The CPU generates a 43-bit virtual address and uses 64-bit instruction/data words.
- The memory page size is 8 KB, which means that the least significant 13 bits of the address are the page offset and the most significant 30 bits are the virtual page number.
- There is a separate on-chip TLB for instructions (ITLB) and data (DTLB). The ITLB (12 entries) can be smaller than the DTLB (32 entries) because of the greater sequentiality of instructions. The ITLB and DTLB are small enough to be implemented as fully associative caches, eliminating conflict misses without the use of a hashing function for the TLB address. The TLBs translate the 30-bit virtual page number to a 21-bit physical page number. With 13 bits for the page offset, this gives a 34-bit physical address, sufficient for a 16 GB byte-addressable physical memory.
- There is a separate on-chip direct mapped instruction cache (ICACHE) and data cache (DCACHE) that use physical addresses provided by the ITLB and DTLB respectively. The caches are identical in size, with 32-byte blocks and 256 lines for a total capacity of 8 KB each. The on-chip caches are small enough that the 8-bit cache index can be determined entirely from the high-order 8 bits of the 13-bit page offset (the lower 5 bits are the block offset). This means that the caches do NOT wait for the TLB to provide the physical page number, since the page offset is the same in the physical and virtual address. The cache provides read data at the same time that the TLB provides the physical page number used to compare with the 21-bit tag in the cache.
- An on-chip prefetch buffer, consisting of a single block, is provided for the ICACHE. A 256-bit bus connects the prefetch buffer to the ICACHE so that the prefetched block can be loaded in a single clock cycle. The prefetch buffer tag size is 29 bits (34 bits of physical address less the 5 bits for the block offset).
- The DCACHE uses a write-through strategy with a multi-block write buffer. On a write hit, data is written to the write buffer and a delayed write buffer. On a write miss, data is written to the write buffer only (no write allocate). Since blocks are stored as one entry in the write buffer, the write buffer tag size is also 29 bits.
- Although any off-chip level 2 cache could be used, the Alpha is designed to work with a direct mapped cache with 29 bits of address and 256 bits (32 bytes, or 1 block) of data or instructions at each location. Although the block size is 32 bytes (256 bits), the external memory bus is only 128 bits (16 bytes), so it takes two bus cycles to send the complete block.

(Figure: block diagram of the 21064 memory hierarchy. The CPU's 43-bit virtual instruction and data addresses go to the fully associative ITLB (12 entries) and DTLB (32 entries), which supply 21-bit physical page numbers; the direct mapped ICACHE and DCACHE (256 lines of 32 bytes) are indexed by page offset bits in parallel with the TLBs; a single-block prefetch buffer feeds the ICACHE, and a 4-entry fully associative write buffer plus a single-entry delayed write buffer sit between the DCACHE and the off-chip interface.)
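The key trick in the Alpha design, indexing the on-chip caches with untranslated page offset bits so the cache and TLB work in parallel, can be shown with the field widths from the description above (13-bit offset, 5-bit block offset, 8-bit index, 21-bit physical tag). The code is a sketch of the address slicing only, not of the 21064 hardware.

```c
#include <stdint.h>

/* DEC Alpha AXP 21064 on-chip cache addressing (field widths from the notes):
 *   virtual address : [ 30-bit VPN | 13-bit page offset ]
 *   page offset     : [ 8-bit cache index | 5-bit block offset ]
 *   physical address: [ 21-bit PPN | 13-bit page offset ]  -> 21-bit cache tag
 */
struct cache_fields {
    uint32_t block_offset;   /* bits  4..0  of the page offset     */
    uint32_t index;          /* bits 12..5  of the page offset     */
    uint64_t vpn;            /* bits 42..13 of the virtual address */
};

struct cache_fields split_virtual_address(uint64_t va)
{
    struct cache_fields f;
    f.block_offset = (uint32_t)(va & 0x1F);          /* low 5 bits      */
    f.index        = (uint32_t)((va >> 5) & 0xFF);   /* next 8 bits     */
    f.vpn          = va >> 13;                       /* sent to the TLB */
    return f;
}

/* Because index and block offset come entirely from the untranslated page
 * offset, the cache read can start immediately; the 21-bit physical page
 * number arriving later from the TLB is only needed for the tag compare.  */
int tag_matches(uint64_t ppn_from_tlb, uint64_t stored_tag)
{
    return (ppn_from_tlb & ((1ULL << 21) - 1)) == stored_tag;
}
```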
Cache Controller

In the event of a cache miss or TLB miss, the processor must be stalled while the cache fill sequence takes place or the page tables are read. Even under normal conditions (TLB hit and cache hit), extra hardware must check for access privileges and that there is a valid page in memory (it might have been swapped out to disk). This work is done by the cache controller hardware.

The cache controller, along with the on-chip cache, is often called the Memory Management Unit (MMU). The MMU includes all of the hardware necessary to fool the processor into thinking that it has a large, very fast memory. The MMU does this by stalling the processor when necessary while the cache controller in the MMU loads the TLB and caches.

The cache controller is usually implemented as a small state machine (Moore machine). A typical flow chart for the design of the controller state machine is shown in fig. 7.25, p. 526. Part of the work can be done by the processor executing protected code in a special read-only memory (ROM). Doing this requires interrupting the processor, as discussed in a previous section.

Memory System Examples

A photograph of the AMD Opteron (same instruction set as the Intel Pentium IV) is shown in fig. 7.33, p. 546. Note the huge on-chip L2 cache, the smaller L1 instruction and data caches, and the relatively small area for the execution unit. Both of these processors are superscalar processors, which will be discussed next semester. The on-chip TLB organization of the Intel Pentium IV and the AMD Opteron is compared in fig. 7.34, p. 547, and the on-chip cache organization is compared in fig. 7.35, p. 548. Finally, the on-chip memory systems of several processors are compared in fig. 7.36, p. 553.
Cache Memory Performance (November 27, 2005)

Cache Performance Equation

System Performance. To correctly evaluate cache performance we must examine the effect of cache on overall system performance, rather than trying to optimize the cache performance by itself. Recall that overall system performance is measured by total program execution time:

    CPU time = Σ_i IC_i × CPI_i × T_c

We will assume that the pipeline is designed not to stall on a cache hit. Since the L1 cache must provide instructions or data every clock period, the L1 cache access time must be less than one clock period. A cache miss would require stalling until the cache can respond. The number of stall cycles required is called the cache miss penalty, T_miss-penalty. The cache miss penalty can be quite different for reading and writing. For example, if we use write around (no fetch on write miss), there is no write miss penalty as long as there is room in the write buffer. Cache misses occur at a rate that depends on the program being run and on the design details of the cache. Note that larger caches have lower miss rates.

Let's define

    CPI_i = CPI_execution,i + SPI_miss,i

where CPI_execution includes the effects of all stalls except cache misses, and SPI_miss includes stalls due to cache misses only. Then

    CPU time = Σ_i IC_i × (CPI_execution,i + SPI_miss,i) × T_c

which is a more accurate version of the equation in the middle of p. 492.

The evaluation of cache performance depends on the details of the hardware configuration. If we assume a separate instruction cache and data cache, we can find the cache miss stalls per instruction for each instruction type as

    SPI_miss,i = P_instruction-miss × T_read-miss-penalty
               + RPI_i × P_data-miss × T_read-miss-penalty
               + WPI_i × P_data-miss × T_write-miss-penalty

where

    P_instruction-miss is the instruction cache miss rate (miss probability),
    P_data-miss is the data cache miss rate (miss probability),
    T_read-miss-penalty is the instruction miss and read data miss penalty in clock cycles,
    T_write-miss-penalty is the write data miss penalty in clock cycles,
    RPI_i is the number of data reads per instruction (1 for loads, 0 otherwise),
    WPI_i is the number of data writes per instruction (1 for stores, 0 otherwise),

which is a more exact version of the equation at the top of p. 493. Rewriting the performance measures by completing the summation over the instruction types i gives

    CPU time = IC × (CPI_execution + SPI_miss) × T_c

where

    CPI_execution = Σ_i (IC_i / IC) × CPI_execution,i

    SPI_miss = Σ_i (IC_i / IC) × SPI_miss,i
             = P_instruction-miss × T_read-miss-penalty
             + (IC_load / IC) × P_data-miss × T_read-miss-penalty
             + (IC_store / IC) × P_data-miss × T_write-miss-penalty

The above formulas are for separate data and instruction caches. They apply equally well for a single unified instruction and data cache. In this case CPI_execution is larger because of the stalls necessary to avoid the structural hazard caused by contention for the memory by load and store instructions. Even when the instruction cache and data cache are the same cache, P_instruction-miss < P_data-miss because of the higher locality of reference for instructions.

The miss rates (probabilities) P_instruction-miss and P_data-miss have been written as independent of the instruction type index i. This is because a miss occurs when an instruction is fetched from, or uses data at, a certain location in memory, and the miss does not depend on what type of instruction it is. However, when we run a benchmark program we will find different miss rates for different instruction types, just from the chance occurrence of where different instructions happen to be in the program. To get the miss rates we average over all instruction types, which gives miss rates as shown in fig. 7.30, p. 539. Instruction cache has a lower miss rate than data cache of the same size, due to the greater sequential locality of instruction streams over data streams.

The miss penalties T_read-miss-penalty and T_write-miss-penalty depend on how many clock cycles the next level of memory takes to respond to the cache miss (the same second level cache is assumed for both instructions and data). A formula for the read miss penalty is given in the next section. The write miss penalty can be quite difficult to determine, since it depends on how often the write buffer is full. The write miss penalty is the same as the read miss penalty for a write allocate (fetch-on-write-miss) cache, plus any time lost waiting for a slot in the write buffer. The write miss penalty is zero for a write around (no-fetch-on-write-miss) cache if the write buffer is always available.
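The sketch below simply evaluates the CPU-time formula above for one made-up set of inputs. All the numbers are invented for illustration; only the structure of the calculation comes from the notes.

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical program and machine parameters (not measured data). */
    double IC            = 1e9;     /* instruction count                     */
    double Tc            = 1e-9;    /* clock period: 1 ns                    */
    double CPI_execution = 1.2;     /* CPI including all non-cache stalls    */

    double P_imiss = 0.01;          /* instruction cache miss rate           */
    double P_dmiss = 0.04;          /* data cache miss rate                  */
    double T_rmiss = 20.0;          /* read miss penalty, clock cycles       */
    double T_wmiss = 0.0;           /* write-around with a free write buffer */
    double f_load  = 0.25;          /* IC_load  / IC                         */
    double f_store = 0.10;          /* IC_store / IC                         */

    /* SPI_miss = P_imiss*T_rmiss + f_load*P_dmiss*T_rmiss + f_store*P_dmiss*T_wmiss */
    double SPI_miss = P_imiss * T_rmiss
                    + f_load  * P_dmiss * T_rmiss
                    + f_store * P_dmiss * T_wmiss;

    double cpu_time = IC * (CPI_execution + SPI_miss) * Tc;

    printf("SPI_miss = %.3f stall cycles per instruction\n", SPI_miss);
    printf("CPU time = %.3f s\n", cpu_time);
    return 0;
}
```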
Cache Fill and Read Miss Penalty

The time T_fill(l) that it takes to fill the level l cache from level l+1 in the memory hierarchy depends on the block size at level l, the bus width to memory, the cycle time T_mbus(l, l+1) of the memory bus connecting level l and level l+1, and the access time T_mem(l+1) of the next level of memory:

    T_fill(l) = T_mem(l+1) + [block size(l) / bus width(l, l+1)] × T_mbus(l, l+1)

An additional cache read is usually required to get the correct word after the new block is written into the cache. The read miss penalty T_read-miss-penalty is the total time the processor must be stalled until the data transfer completes, which is the fill time for the block plus one cache cycle to do the read/write operation to the selected word in the block as if there had been a hit:

    T_read-miss-penalty(l) = T_fill(l) + T_hit(l)

Memory Module Performance

For higher level caches, the primary impact on performance is the miss penalty for the next higher level in the memory hierarchy. This requires determining the average cache memory access time at level l,

    T_mem(l) = T_hit(l) + P_miss(l) × T_miss-penalty(l)

where T_hit is the normal response time of the cache when a hit occurs, and the miss penalty T_miss-penalty is determined by the fill time from the next highest level in the memory hierarchy.

Although acceptable for higher level cache, it is dangerous to use the average access time as a measure of cache performance for level 1 cache. This cache interfaces directly with the CPU. The hit time of the top level cache is usually more important, since it is incorporated into the CPU pipeline and has an effect on the CPU pipeline clock period and the number of pipeline stages. The average access time includes the amount of time that the processor is stalled during the miss penalty; it determines the overall performance of the memory but does not determine the pipeline clock period.

The miss penalty is usually about 10 to 100 times the hit time:

    10 T_hit < T_miss-penalty < 100 T_hit

This gives an average access time

    (1 + 10 P_miss) T_hit < T_mem < (1 + 100 P_miss) T_hit

T_mem as a function of P_miss is shown below:

    P_miss = 0.1  :  2.0 T_hit < T_mem < 11.0 T_hit
    P_miss = 0.01 :  1.1 T_hit < T_mem <  2.0 T_hit

To get the average access time T_mem within a factor of 2 of the ideal cache memory performance T_hit requires P_miss < 1% to 10%. Very low miss rates (high hit rates) are needed to get the performance benefits of cache memory.

Choosing Cache Design Parameters

We must now try to design a cache that is going to have the very high hit rate required for high system speed. Unfortunately, it is difficult to predict what the cache performance will be for any particular application program, since the performance is highly dependent on the sequence of addresses actually generated by running the application. We will use the results from the benchmark programs, but remember that the actual performance for any other program will be different. The cache design parameters that we consider are:

1. cache size
2. block size
3. set size, where set size varies from 1 (direct mapped) to the number of blocks (fully associative cache)
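The two bounds in the small table above are easy to reproduce. The sketch below evaluates T_mem = T_hit + P_miss × T_miss-penalty at the two ends of the 10x to 100x penalty range; measuring everything in units of T_hit is just a convenience for the illustration.

```c
#include <stdio.h>

int main(void)
{
    double T_hit = 1.0;                       /* measure everything in hit times */
    double P_miss_values[] = { 0.1, 0.01 };

    for (int i = 0; i < 2; i++) {
        double p = P_miss_values[i];
        /* T_mem = T_hit + P_miss * T_miss_penalty, with the penalty between
         * 10*T_hit and 100*T_hit as stated in the notes. */
        double t_low  = T_hit + p * (10.0  * T_hit);
        double t_high = T_hit + p * (100.0 * T_hit);
        printf("P_miss = %.2f : %.1f T_hit < T_mem < %.1f T_hit\n",
               p, t_low, t_high);
    }
    return 0;
}
```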
Cache Size

Increasing the cache size reduces the miss rate (good) but also increases the hit time (bad). Since the hit time must be as short as possible for top level cache, a large on-chip cache is usually not a good choice. Otherwise the increase in the hit time is a small price to pay for the reduced miss rate.

Different results are obtained for an instruction cache and a data cache running the same benchmark programs. The greater degree of sequentiality in the instruction addresses makes even a very small instruction cache highly effective. A larger cache is needed for data to get the same performance.

The general trend is that increasing the cache size always gives some improvement in the miss rate, as shown in fig. 7.31, p. 544. However, the amount of improvement decreases as the cache gets larger (the law of diminishing returns strikes again). Economic considerations usually limit the size of cache. When the cache is large enough that performance improvements are minimal, we are less willing to pay the cost of increasing the size of cache.

To understand the effect of cache size on miss rate, it helps to break down the misses into the three categories on p. 543:

  - compulsory misses: an infinite size cache has only these misses
  - capacity misses: the additional misses of a finite size, fully associative cache
  - conflict misses: the additional misses caused by the mapping (direct mapped or set associative)

(Figure: sketch of P_miss versus cache size; the direct mapped curve lies above the fully associative curve by the conflict misses, the fully associative curve lies above the compulsory-miss floor by the capacity misses, and both fall toward the compulsory misses as the cache size increases.)

Cache size cannot reduce compulsory misses; thus it would not pay to increase the cache size beyond the point where most of the misses are compulsory.

Block Size

The number of compulsory misses is reduced by bigger block sizes, since it is more likely that an address, when it is first referenced, will already have been read in on a previous miss on an address in the same block. Unfortunately, a smaller number of large blocks will fit into a fixed cache size. A smaller number of blocks leads to an increase in capacity and conflict misses. The increase in capacity and conflict misses eventually swamps the reduced compulsory misses when the block size gets large. See fig. 7.8, p. 481.

Increasing the block size can also increase the miss penalty. Recall that the primary contribution to the miss penalty is the fill time

    T_fill = T_mem + (block size / bus width) × T_mbus

The expense of the required large memory bus connecting the cache levels makes it impractical to have a bus width much bigger than about 128 or 256 bits. This means that block sizes bigger than 16 to 32 bytes will take extra memory bus cycles beyond the average access time T_mem of the next level memory module.

Set Size

Increasing the set size (degree of associativity) reduces the miss rate for any given cache size. See fig. 7.15, p. 502. Increasing the set size reduces conflict misses only and has no effect on compulsory or capacity misses. See fig. 7.31, p. 544. The data also show that an 8-way associative cache drives the miss rate to the capacity miss rate for cache sizes above 16 KB. This means that the 8-way associative cache performs almost as well as a fully associative cache for all except the very smallest caches. A fully associative cache is not economical for large cache memories, since performance is not improved enough to justify the added expense of this style of memory.

For larger caches, changing the set size does not seem to make a significant difference in the miss rate. However, the set size does make a significant difference in the cache size necessary to achieve a given miss rate. Using the data in fig. 7.30, p. 539, the cache size needed to achieve a miss rate of 0.02 is as follows:

    set size 1 : 32 KB
    set size 2 : 20 KB
    set size 4 : 16 KB
    set size 8 : 14 KB

The performance results indicate that a two-way associative cache can achieve the same average performance as a direct mapped cache with only about 1/2 the memory size. Increasing the set size does increase the cache hit time, and therefore direct mapped cache (1-way associative cache) is usually used for on-chip cache. Increasing the set size can also cause the average access time to increase for large caches.

Summary of Cache Design Parameters

The effect of the main cache design parameters is summarized in fig. 7.32, p. 515.
Cadence Inverter Cell Design Tutorial

Dr. L.G. Johnson

1.0 Introduction

The following tutorial, which creates a CMOS inverter, allows the user to become familiar with the Cadence Virtuoso layout tools used to generate transistor-level cells. Please read this tutorial carefully.

1. First, make sure that you have copied the profile file from /app1/cadence/profile into your home directory. Log out of tesla and then log back in after you've copied the file.

2. Make a directory that you will use for a design project, which will be called the project directory hereafter. If you are going to use the automatic grader, this directory should be named inverter_proj. It is very important that you cd into your project directory before starting the Cadence tools, since some files are created by the tools here that will need to be sent to the automatic grader later.

3. Type icfb & at the command prompt. The & sign is used to run the process in the background. This brings up the Command Interpreter Window (CIW), which is the main control window for the Cadence software.

4. Before you can create or open a design, you must have a working library where all your cells will be stored. Think of this as your cell library for this project. Although it is usually not necessary, you can make other libraries if you wish. There are other libraries with predefined cells in them available for you to use also (see the library manager window).

5. To create your working library (cell library), in the CIW choose File > New > Library. Type a name for your library (use inverter_lib for compatibility with the automatic grader), leave the path blank for now, and then click on Attach to an existing tech file. Click OK. A new form should pop up prompting you to choose a technology file to attach your library to. Choose AMI 0.60u C5N (3M, 2P, high-res) from the pulldown menu and click OK. You have now attached your library to the American Microsystems Inc. (AMI) 0.6 micron CMOS technology process.

6. For compatibility with the automatic grader, the path to your library must be a relative path, not an absolute path. In the CIW choose Tools > Library Path Editor. In the Library Path Editor window, click on the path to your library (inverter_lib). If you do not see your library, it may be necessary to first scroll down the list. Change the path to your library to a relative path (use inverter_lib for compatibility with the automatic grader). In the Library Path Editor choose File > Save and then File > Exit.

7. To start working on your design, in the CIW choose File > New > Cellview. The Create New File form appears. Set Library Name to the name of your cell library (inverter_lib) that you provided earlier. Type inverter in the Cell Name field. Press the Tab key and type layout for the View Name, and press the Tab key again. Because the view name layout is a recognized name for the layout editor, the Tool field changes to Virtuoso. Click OK to create a new cellview and close the form.

8. The Layout Editor window should pop open at this time. Notice that the Layer Selection Window (LSW) also pops open, displaying the layers available (3 metal layers in the AMI 0.6u process). You can zoom in using Ctrl-z and zoom out using Shift-z.

2.0 Checking Design Rules

The Cadence Virtuoso layout tool does not automatically check for design rule violations. It is a very good idea to save and verify your design frequently as you are doing your layout. Don't wait until you are done to find that you have dozens of rule violations to fix. It will be much easier to fix the errors if there are only a few of them.

The Design Rule Checker (DRC) uses the AMI 0.6 µm design rules defined in the divaDRC.rul file to check for design rule violations. The DRC flags the errors it finds by creating polygons around the errors. Details about the errors are also displayed on the CIW.

1. Choose Design > Save to save your design. It is not necessary to save before verifying your design, but you should save your design frequently to avoid losing your work if the computer crashes.

2. Choose Verify > DRC. Click OK to start the Design Rule Checker. If you have errors, you'll see a highlighted polygon marking the first error. Fix your errors and run DRC until there are no more errors in your design.

DRC violations only tell you if you have made layers too small or too close together. It does not tell you if you are wasting area by making things bigger or farther apart than necessary. You may need to push things together until a design rule violation occurs and then move them out the minimum amount needed to remove the violation.

3.0 Sizing Layout

The Cadence Virtuoso tool measures all distances in µm, not λ. In the 0.6 µm technology, λ is 0.3 µm. The default grid spacing in the layout window is 1.0 µm. You can set the grid spacing to 0.3 µm using the Options > Display command. You can use the Window > Create Ruler command to measure distances with a ruler.
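Since the tool measures in µm while design rules are usually stated in λ, it helps to convert. The snippet below is only a hypothetical illustration: the 2λ and 3λ dimensions are typical SCMOS-style placeholder values, not the actual AMI 0.6 µm rules, which you should take from the divaDRC.rul file.

    # Hypothetical illustration of converting lambda-based dimensions to microns
    # for this tutorial's process, where lambda = 0.3 um.  The example rule
    # values (2*lambda, 3*lambda) are placeholders, not the real AMI 0.6 um rules.
    LAMBDA_UM = 0.3

    def to_um(lambdas):
        return lambdas * LAMBDA_UM

    print(to_um(2))   # a 2-lambda dimension -> 0.6 um
    print(to_um(3))   # a 3-lambda dimension -> 0.9 um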
4.0 Creating Example Inverter Layout

The following subsections contain the detailed steps necessary to create the inverter layout in Figure 1.

4.1 Creating the nMOS Transistor

The nMOS transistor you will be designing is based on the 0.6 µm technology. Make sure that layer spacings are in accord with the 0.6 µm technology rules. See Figure 1. By default, commands from the Create and Edit menu on the Layout Editor always repeat. Press the Esc key to disable any repeating commands.

1. Click on poly on the LSW. This means that the layer you are about to draw is the poly layer.

2. In the layout editor window, choose Create > Rectangle. Draw your poly layer.

3. Click on nactive on the LSW. Draw your nactive layer.

4. Click on nselect on the LSW. Draw your nselect layer around the nactive layer.

5. Now you will have to create the diffusion contacts. Click on metal1 and draw your metal1 layer, and then click on cc on the LSW and create your diffusion contact inside the metal1 layer you just created.

You have just created an nMOS transistor.

4.2 Creating the pMOS Transistor

The pMOS transistor is similar to the nMOS transistor you just created, except that it needs a p-diffusion area instead of an n-diffusion area. Repeat steps 1 through 5, but for step 3 click on pactive and for step 4 click pselect. Press the Esc key when you're done.

[Figure 1: Inverter layout]

You have just created a pMOS transistor.

4.3 Connecting the inputs and outputs

1. In the LSW, click on the poly layer.

2. Choose Create > Path to connect the gates of the two transistors. This opens up the Create Path form.

3. Change the Width to 0.6 µm. Go to your layout and click on the first point of the path (center of the poly line), and then double-click when you reach the second point to complete the path.

4. You can create a contact for the input (poly-metal1 contact) by drawing your poly and metal1 layers and then placing your cc layer inside the two layers. See Figure 1.

4.4 Adding the power and ground connections

1. In the LSW, click the metal1 layer.

2. Use the Create > Polygon command to create the power and ground connections. Polygons are shapes defined by any number of points. Polygons must be closed, that is, the first and the last point must be the same.

3. In the Create Polygon form, set Snap Mode to L90XFirst. The snap mode controls the way segments snap to the drawing grid as you create the polygon. L90 creates two segments at right angles to each other between every pair of points you enter. XFirst means the first segment is parallel to the X axis.

4. Draw your power and ground connections.

4.5 Substrate contacts

Add substrate contacts to your inverter as shown in Figure 1. Your design will not be complete if you do not add your substrate contacts. Note that design rule errors will occur unless the substrate contact is directly adjacent to the power/ground contact as shown.

4.6 Adding the Well

1. In the LSW, click on the nwell layer.

2. Use the Create > Rectangle command to create the nwell layer around the p-diffusion layer. Note: your VDD substrate contact must be included inside the well. See Figure 1.

4.7 Creating pins

You will need to create 4 pins (vdd, gnd, OUT, and IN) to define your terminal names for the inverter cell.

1. In the LSW, click on the metal1 layer. Choose Create > Pin. The Create Symbolic Pin form appears. You'll use shape pins in this tutorial and throughout most of this class. Click the button next to shape pin to open the Create Shape Pin form.

2. In the Create Shape Pin form, type the following Terminal Names: vdd gnd OUT IN. IO Type is inputOutput.

3. Click Display Pin Name to associate the name with the pin.

4. Set the Snap Mode to L90XFirst.

5. Create the rectangle for the vdd pin coincident with the power line at the top of the inverter. The name of the pin (vdd) appears near the cursor after you click on the second corner of your rectangle. A dashed line extends from the first corner of the pin to the name, showing the pin name is attached to the pin. Move the cursor and place the vdd text on the power line. The name vdd disappears from the Create Shape Pin form. The first name listed is now gnd, the ground pin. Note: if you make a mistake, delete the error and then retype the name for Terminal Names and try again.

6. Create the rectangle for the gnd connection coincident with the ground line of your inverter.

7. For the output, set the IO Type to output. Create your output pin coincident with the output of your inverter.

8. For the input, click on the poly layer in the LSW. Set the IO Type to input and create your input pin coincident with the input contact of your inverter.

You have now completed your inverter cell, and it should look like Figure 1.

5.0 Merging the design

Once you complete the DRC and have fixed any errors in your design, select the entire design using the left mouse button. Then merge it by choosing Edit > Merge. This command combines adjacent and overlapping layers of the same type together.

6.0 Extracting the design

The design needs to be extracted to determine the parasitic capacitances and resistances.

1. Extract the design by choosing Verify > Extract.

2. Turn on Join Nets with Same Name.

3. Click on Set Switch and choose Extract_parasitic_caps. The switch names determine the type of extraction to be executed. Click OK.

4. Save the design and close the layout editor window.

7.0 Creating the Netlist

Open up the extracted view of the circuit from the CIW. You'll notice that this view contains the parasitic capacitances and resistances of the circuit. From here we will
create the Netlist le Choose Tools gt Analog Environment This opens up the A irma Analog Circuit Design Environment window This tool can be used for many other applications but for now we will use it to create the netlist of your design N Choose Setup gt SimulatorDirectoryHost This opens up the Choose SimulatorDirectoryHost form Choose spectre for simulator and click OK E Choose Simulation gt Netlist gt Create This creates your netlist le called netlist and stores it in lthomegtcadencesimulationltdesign namegtspectreeXtractednetlist 80 Simulation using Spectre Spectre is a detailed circuit simulator similar to spice that accurately determines circuit voltages and currents as functions of time The calculations are so time consuming that this simulator should not be used on circuits with more than a few hundred transistors 81 Circuit File The netlist le just created is however not a complete Spectre leWe will need to add the MOSFET transistor models and the input and output instructions to perform the simula tion The MOSFET transistor model parameters model deck for the AMI 06 pm can be found in the following directory app 1 cadence spectre ami06uspectre Go into your lthomegtcadencesimulationltdesign namegtspectreeXtractednetlist directory N Add the model deck to the netlist le to create the circuit le by issuing the cat concat enation command cat app 1 cadencespectreami06uspectre netlist gtgt Cellnamescs Note The very rst line of your circuit le should either be blank or contain a comment only 3 You will need to edit the le to enter the input and output instructions Look at the sample circuit le in the next section 4 When nished your circuit le should look like the attached circuit le in this handout Cadence Inverter Cell Design Tutorial February 10 2003 page 7 of 11 I ECEN 5253 Digital Computer Design I 5 Save the circuit le 82 Circuit File Example 6 First line af le left blank simulator langspectre 6 Statement to read Spectre syntax model ami06N bsim3v3 type n version 31 tnom 27 tox 141E8 xJ 15E7 nch 17E17 vthO 07086 k1 08354582 k2 0088431 k3 414403818 k3b 14 w0 6480766E7 nlx 1E10 dvt0w 0 dvtlw 53E6 dvt2w 0032 dvt0 36139113 dvtl 03795745 dvt2 01399976 u0 5336953445 ua 7558023E10 ub 1181167E18 uc 2582756E11 vsat 1300981E5 a0 05292985 ags 01463715 b0 1283336E6 b 1408099E6 keta 00173166 a1 0 a2 1 rdsw 2268366E3 prwg 1E3 prwb 6320549E5 wr 1 wint 2043512E7 lint 3034496E8 xl 0 xw 0 dwg 1446149E8 dwb 2077539E8 voff 01137226 nfactor 12880596 cit cdsc 1506004E4 cdscd 0 cdscb eta0 3815372E4 etab 1029178E3 dsub 2173055E4 pclm 06171774 pdiblcl 0185986 pdiblc2 3473187E3 pdiblcb 1E3 drout 04037723 pscbe1 5998012E9 pscbe2 3788068E8 pvag 0012927 delta 001 mobmod 1 prt 0 ute 15 ktl 011 ktll 0 kt2 0022 ual 431E9 ubl 761E18 ucl 56E11 at 33E4 wl 0 Wln WW wwn 1 le 0 ll 0 lln 1 lw 0 lwn 1 lWl 0 capmod 2 xpart 04 cgdo 199E10 cgso 199E10 cgbo 0 cj 4233802E4 pb 09899238 04495859 cjsw 3825632E10 pbsw 01082556 mjsw 01083618 pvth0 00212852 prdsw 161546703 pk2 00253069 Wketa 00188633 lketa 00204965 model ami06P bsim3v3 type p version 31 tnom 27 tox 141E8 xj 15E7 nch 17E17 vthO 09179952 k1 05575604 k2 0010265 k3 140655075 k3b 23032921 w0 1147829E6 nlx 1114768E10 dvt0w 0 dvtlw 53E6 dvt2w 0032 dvt0 22896412 dvtl 05213085 dvt2 01337987 u0 2024540953 ua 2290194E9 ub 9779742E19 uc 369771E11 vsat 1307891E5 a0 08356881 ags 01568774 b0 2365956E6 b1 5E6 Cadence Inverter Cell Design Tutorial February 10 2003 page 8 of 11 I ECEN 5253 Digital Computer Design I keta 5769328E3 a1 0 a2 1 rdsw 2746814E3 prwg 234865E3 prwb 00172298 wr 
1 wint 2586255E7 lint 7205014E8 xl 0 xvv 0 dwg 2133054E8 dwb 9857534E9 voff 00837499 nfactor 12415529 cit 0 cdsc 4363744E4 cdscd 0 cdscb 0 eta0 011276 etab 29484E3 dsub 03389402 pclm 49847806 pdiblcl 2481735E5 pdiblc2 001 pdiblcb 0 drout 09975107 pscbe1 3497872E9 pscbe2 4974352E9 pvag 109914549 delta 001 mobmod 1 prt 0 ute 15 ktl 011 ktll 0 kt2 0022 ua1 431E9 ub1 761E18 uc1 56E11 at 33E4 W1 0 Wln 1 WW 0 wwn le 0 ll 0 lln 1 lw 0 lwn 1 lWl 0 capmod 2 xpart 04 cgdo 24E10 cgso 24E10 cgbo 0 cj 7273568E4 pb 09665597 mj 04959837 cjsw 3114708E10 pbsw 099 mjsw 02653654 pvth0 9420541E3 prdsw 2312571566 pk2 1396684E3 Wketa 1862966E3 lketa 5728589E3 Library name inverterilib Cell name inverter View name extracted 6 Comments iinst0 OUT IN vdd vdd ami06P Wl 2e06 16e07 as18e12 ad18e12 ps42e06 pd42e06 m1 regionsat iinstl OUT IN gnd gnd ami06N Wl 2e06 16e07 as18e12 ad18e12 ps42e06 pd42e06 m1 regionsat iinst2 vdd 0 capacitor c20912e15 m1 VDD vdd 0 vsource dc50 Vgnd gnd 0 vsource dc00 V1 IN 0 vsource dc50 typepulse va100 vall50 Backslash to indicate continuation on period12n rise3n fall3n Width3n next line osutran tran stop100n 6 Transient analysis with the name osutran osudc dc devV1 paramdc start0 stop5 step1 6 DC analysis with the name osudc save OUT IN 6 Saves selected nodes for postsimulation analysis 83 Simulation To simulate your inverter design you will use a Spicelike simulator called Spectre 1 Run Spectre circuit analyzer from the command line by issuing the following com mand spectre ltcellnamegtscs Cadence Inverter Cell Design Tutorial February 10 2003 page 9 of 11 I ECEN 5253 Digital Computer Design I Spectre will create a ltcellnamegtraw directory where your results will be stored Note Issue this command if you are still in the lthomegtcadencesimulationltdesign namegtspectreeXtractednetlist directory Otherwise specify the full path and the le name 2 To view the graphical version of your analyses results you can use the Analog Wave form Display AWD tool Issue the following command awd dataDir ltcellnamegtraw E This will open up several new windows one of which is the Results Browser window with the name of your project directory You can use your mouse and click on ltcellnamegtraw This will expand the directory structure You should see two branches under the ltcellnamegtraw directory The two branches contains information from the two analyses tran and dc specified in the circuit file 4 Click on the dc analysis branch The highlighted signals are the nodes specified in the circuit file to be saved V39 To plot the signals directly click with the right mouse button on the appropriate node names You should see your plot on the Waveform Window Repeat for the second node Note You can plot both the dc and tran plots on the Waveform Window by choosing Win dow gt Subwindows ClickAdd and then click OK This splits the Waveform Window into two sections Click on the number on the upper right hand comer of each subwindow before you plot 0 Once finished plotting close all windows and return to the command prompt gt1 To look at the ASCII version of your output files run Spectre with the following com mand spectre f psfascii ltcellnamegtscs 00 The results of your transient tran and direct current dc analyses given in the circuit file can be found in 4 1 lthomegtsimulation mm 1 quot iuvmlclJctW The files osutrantran and osudcdc contain the transient and dc analyses results respec tively 90 Simulation using IRSIM To use IRSIM to simulate your designs the spectre netlist file called netlist that was generated after 
extraction of the layout needs to be converted to another file called Design.sim, which is in the format required by IRSIM to simulate. A program called sp2sim is used to accomplish this task. The command file required to test the design is named <filename.cmd>. All the steps followed to use IRSIM are listed below.

1. Go to the directory where the spectre netlist of your layout is stored after extraction by using the command

    cd <home>/cadence/simulation/<design name>/spectre/extracted/netlist

2. Use the command sp2sim at the command prompt. The spectre netlist gets converted to a new netlist file called Design.sim, which supports the format required by IRSIM to simulate. Note that it may take some time (of the order of a few seconds) for this process to complete.

3. Prepare the command file <filename.cmd> that has the simulation input and output commands. This file should look like the example in the next section.

4. Copy the file scmos30.prm into the current directory using the command

    cp /app1/cadence/irsim/scmos30.prm .

5. Invoke IRSIM and simulate the design by using the command

    irsim scmos30.prm Design.sim <filename.cmd>

9.1 Sample IRSIM Command File

    w IN OUT
    h IN
    s 100
    assert OUT 0
    l IN
    s 100
    assert OUT 1
    h IN
    s 100
    assert OUT 0
    l IN
    s 100
    assert OUT 1
    exit

Cache Performance Improvement

There are many hardware techniques for improving cache performance beyond manipulating the three basic design parameters of cache size, block size, and set size. These techniques can be broken down into categories for minimizing miss rate, miss penalty, and hit time.

Reducing Miss Penalty

Processor speed has been increasing faster than memory speed (fig 737 p 554). This means that a given miss penalty time corresponds to more clock cycles while the processor is stalled. To get the full benefit of the high clock rates (short clock periods) of modern processors, it is very important to reduce the miss penalty in modern high performance memory system designs.

Multilevel Cache

As the first level cache is made faster by putting it inside the processor chip, main memory becomes much slower relative to the processor and first level cache. It takes many more processor cycles for the cache fill from main memory to complete, making the miss penalty larger relative to the hit time. The miss penalty can be reduced by inserting a second level cache, usually external to the processor chip, that can respond faster than main memory.

To see clearly how the second level cache increases performance, let's compare two memory designs, one with second level cache and one without.

[Figure: two memory system designs. The 1-level design has a first level cache backed directly by main memory; the 2-level design places a second level cache between the first level cache and main memory.]

Recall that

    SPImiss = Pinstruction miss × Tread miss penalty
              + (ICload / IC) × Pdata miss × Tread miss penalty
              + (ICstore / IC) × Pdata miss × Twrite miss penalty

Let us assume for simplicity that the read miss and write miss penalties are the same (a write allocate cache with no write buffer, for example). The stalls per instruction due to misses in top level cache can then be expressed as

    SPImiss = Pmiss,L1 × Tmisspenalty,L1

where

    Pmiss,L1 = Pinstruction miss + ((ICload + ICstore) / IC) × Pdata miss
    Tmisspenalty,L1 = Tread miss penalty = Twrite miss penalty

Pmiss,L1 is the same in both systems, but Tmisspenalty,L1 depends on whether or not the second level cache is present:

    Tmisspenalty,L1 = Thit,L2 + Pmiss,L2 × Tmisspenalty,L2    (2-level cache)
    Tmisspenalty,L1 = Tmain                                   (1-level cache)

Tmain is the response time of main memory for reading a block (assume for now that the data or instruction is always in main memory and not on disk). Also, in the 2-level cache, Tmisspenalty,L2 = Tmain. It is now clear that the miss penalty Tmisspenalty,L1 is reduced with the 2-level cache if

    Thit,L2 + Pmiss,L2 × Tmain < Tmain
    Thit,L2 < (1 − Pmiss,L2) × Tmain

which is easily satisfied for reasonably low miss rates.

Unfortunately, Pmiss,L2 may not be as low as one might at first think. This is because the misses in first and second level cache are not independent. If the word the processor wants is not in the first level cache, it has a much higher probability than usual of not being in second level cache either. Even so, second level cache that is sufficiently larger than first level cache has miss rates that are low enough to reduce the miss penalty.

Just as for first level cache, increasing the set size and block size of second level cache is going to reduce the second level miss rate. As we have just shown, this also reduces the average miss penalty.
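To make the benefit concrete, the sketch below evaluates SPImiss = Pmiss,L1 × Tmisspenalty,L1 for both designs. All of the numbers are assumed for illustration only; they are not data from these notes.

    # Minimal sketch with assumed numbers (not measurements from the text).
    # 1-level design: every L1 miss pays the main memory time.
    # 2-level design: an L1 miss pays the L2 hit time, plus main memory time
    # only when L2 also misses (local L2 miss rate).

    p_miss_l1 = 0.05    # assumed L1 miss rate per instruction
    t_main    = 60      # assumed main memory block read time, in clocks
    t_hit_l2  = 6       # assumed L2 hit time, in clocks
    p_miss_l2 = 0.25    # assumed local L2 miss rate (L1 misses often miss in L2 too)

    spi_1level = p_miss_l1 * t_main
    spi_2level = p_miss_l1 * (t_hit_l2 + p_miss_l2 * t_main)

    print(spi_1level)   # 3.0  stall cycles per instruction
    print(spi_2level)   # 1.05 stall cycles per instruction

Even with the pessimistic 25% local L2 miss rate assumed here, the L2 hit time (6 clocks) is far below (1 − Pmiss,L2) × Tmain = 45 clocks, so the second level cache cuts the stall cycles substantially.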
Handling Read Miss After Write Miss

We have already seen how a write buffer can be used to eliminate the miss penalty when a write miss occurs. The write data are stored in a write buffer while the data is being written into lower level memory. Meanwhile, the processor continues operation. Suppose that a subsequent read miss occurs before the write buffer has a chance to empty. This is very similar to the RAW data hazard, but this time it is a memory location instead of a processor register that has the hazard.

Since the processor is stalled during the read miss, we want the cache fill sequence for the read to proceed as rapidly as possible. The problem is that the data in lower level memory may not be correct until the write buffer empties. The simplest strategy is to wait for the write buffer to empty before beginning the cache fill sequence for the read. Unfortunately, this means that we have to pay the write miss penalty after all.

We can still avoid paying the write miss penalty by implementing the write buffer as a small fully associative cache. If we get a write buffer hit on the subsequent read miss, then the data block for the read miss can be retrieved from the write buffer instead of going to lower level memory. If we get a write buffer miss, then we allow the cache fill sequence for the subsequent read miss to proceed ahead of whatever is still in the write buffer. Giving read misses priority over previous write misses can in this manner avoid the write miss penalty.

The same technique can be used for writeback cache. When a dirty block is replaced on a read miss, the dirty block is placed in the write buffer and the cache fill sequence proceeds immediately, without waiting to write the dirty block into lower level memory.

Sector or Sub-Block Organized Cache

Sector organized (or sub-block) cache can be used to reduce the miss penalty. If large block sizes are desired, then a larger miss penalty is required simply because there are more data words that must be transmitted within each block. With sector organized cache, only one sub-block is filled and then the processor proceeds, resulting in a smaller miss penalty. Since the other sub-blocks are not valid, a higher miss rate occurs unless the other sub-blocks are loaded into a prefetch buffer (see below).

Early Restart and Requested Word First

On a read miss, the processor has requested a particular word within the block that is sent to fill the cache from lower level memory.

• Early restart: Rather than waiting for the entire block to be filled from lower level memory, restart the processor as soon as the requested word is received. The rest of the cache fill sequence completes normally, but the processor is not stalled.

• Requested word first: Early restart is more effective if the words in the block are not filled in their usual order, but rather the requested word is sent first. This is also called wrapped fetch because the words in the block are sent in a shifted order. In the following example, word2 is the requested word, so the words within the block would be sent in the order word2, word3, word0, word1.
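A tiny sketch of the wrapped-fetch ordering (the helper function is hypothetical, used only to illustrate the idea): starting at the requested word and wrapping around the end of the block.

    # Hypothetical illustration of "requested word first" (wrapped fetch):
    # the block is sent starting at the requested word and wrapping around.

    def wrapped_fetch_order(requested, words_per_block):
        return [(requested + i) % words_per_block for i in range(words_per_block)]

    print(wrapped_fetch_order(2, 4))   # [2, 3, 0, 1] -- matches the example above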
Nonblocking Cache

The reason that we must pay the miss penalty is that the processor must stall until the correct data or instruction is provided. If the processor were able to start executing other instructions while waiting for the cache fill, then a stall would be unnecessary. This would require executing instructions out of order. The simple pipelined processors that we have studied do not allow this to happen. Superscalar processors do allow execution out of order, and nonblocking cache would be very useful for these processors. Nonblocking cache allows the processor to continue to read and write to the cache while the cache waits for lower level memory to provide data from a previous miss. The average miss penalty for nonblocking cache can be substantially lower than regular cache.

Reducing Miss Rate

Victim Cache

We have already seen the utility of a write buffer to save replaced dirty blocks that must be written back to lower level memory (write back cache). This write buffer for a write back cache must really be a small cache, because we must check it to see if it has a block in it that we want to read. A victim cache is very similar, except that each replaced block is put in the victim cache regardless of whether it is dirty or not. Locality of reference implies that the blocks in the victim cache are the most likely to be used again soon.

Victim caches are usually fully associative caches but very small (1-4 blocks) so that victim cache hit time is short. When a cache miss and victim cache hit occurs, the block in the victim cache is swapped with the block in cache. This swapping requires an extra cache write cycle, which is a much smaller penalty than having to fill the cache from lower down in the memory hierarchy.

Prefetching

While the processor is running normally with cache hits, the memory system is free for other activities. Whenever a cache miss occurs and a new block is fetched, it is highly likely that more than one block will eventually be needed. Additional blocks can be fetched after the processor starts running again. The additional blocks are fetched before they are needed (prefetched) and stored temporarily in a prefetch buffer. When a cache miss occurs, there is a good chance that the desired block can be obtained from the prefetch buffer.

The prefetch buffer is really another small cache, similar to the victim cache, in parallel with the regular cache. The difference is that the prefetch buffer is loaded from lower in the memory hierarchy as part of the cache fill sequence, whereas the victim cache is loaded with replaced blocks discarded from the cache. When a cache miss and prefetch buffer hit occurs, an extra cache write cycle is required to put the new block in the cache, but this is a smaller penalty than filling the cache from lower in the memory hierarchy. On a cache miss, the miss penalty is just one cache cycle if the prefetch buffer hits. Thus, the average access time is

    Tcache,pre = Phit × Thit + Pmiss × (Phit,pre × Thit + Pmiss,pre × Tmisspenalty)

so that both the cache and the prefetch buffer have to miss before we pay the miss penalty.

Prefetching only makes sense if there is extra time on the memory bus to fetch more blocks; that is, the cache miss rate must be sufficiently low that it is unlikely that another cache miss occurs while the prefetching is going on. It is not certain that the prefetched block will ever be used. We certainly do not want prefetching to delay the cache fill sequence for a real cache miss block unless we can be more certain that the prefetch block will actually be needed. The optimizing compiler can often predict when prefetching would be advantageous. The compiler can control prefetching by using special prefetch instructions that are provided as part of the instruction set.
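As a rough illustration of the formula above, the sketch below evaluates Tcache,pre for a prefetch buffer that catches some of the misses. The miss rates and timings are assumed values, not data from these notes, and the extra cache write cycle on a prefetch hit is ignored, as in the formula.

    # Minimal sketch with assumed numbers (not from the course notes).
    # T_cache,pre = P_hit*T_hit + P_miss*(P_hit,pre*T_hit + P_miss,pre*T_misspenalty)

    def avg_time_with_prefetch(p_miss, p_miss_pre, t_hit, t_miss_penalty):
        p_hit = 1.0 - p_miss
        return p_hit * t_hit + p_miss * (
            (1.0 - p_miss_pre) * t_hit + p_miss_pre * t_miss_penalty)

    # Assumed: 4% miss rate, 1-cycle hit, 40-cycle miss penalty,
    # and the prefetch buffer catches half of the cache misses.
    print(avg_time_with_prefetch(0.04, 1.0, 1, 40))   # 2.56 cycles, no prefetch buffer
    print(avg_time_with_prefetch(0.04, 0.5, 1, 40))   # 1.78 cycles, with prefetching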
Reducing Hit Time

The hit time of top level cache is important in determining the processor clock rate. The hit time of lower level cache determines the miss penalty of higher level cache.

Small and Simple First Level Cache

In general, a smaller memory is faster, so that smaller caches have shorter hit times. Also, most first level caches are on chip and must be small enough to fit in the chip along with the processor. Direct mapped cache has the simplest hardware and is slightly faster than set associative cache. As the set size increases, there is extra delay with the additional hardware needed for the comparators and multiplexing the data from the different sets (see fig 717 p 503).

With a small on chip cache, a high miss rate is expected. Does it really do any good to have on chip cache, or should we slow the processor down to work with off chip cache that can be big enough to have a low miss rate? To answer this question, let us compare memory systems with and without on chip cache.

[Figure: two memory systems. One has an on-chip first level cache backed by an off-chip second level cache and main memory; the other has only the off-chip cache and main memory.]

Assume that the same off chip cache (2nd-level cache) is used in both systems and that read and write miss penalties are the same. The clock period for the on chip cache system, Tonchip, is faster than for the off chip cache system, Toffchip:

    Tonchip = Thit,L1
    Toffchip = Thit,L2

Since the clock periods are different, we must compare the performance of the two systems by comparing CPU time:

    CPUtime = IC × (CPI + SPI) × T

IC and CPI are the same for both systems since the same processor is used. To simplify the analysis, ignore the effect of stalls other than for the miss penalty. To make on-chip cache attractive, it must have a smaller CPU time:

    CPUtime,onchip / CPUtime,offchip < 1

    ((CPI + SPIonchip) × Tonchip) / ((CPI + SPIoffchip) × Toffchip) < 1

    ((CPI + Pmiss,L1,onchip × Tmisspenalty,L1,onchip) × Thit,L1) / ((CPI + Pmiss,L2,offchip × Tmisspenalty,L2,offchip) × Thit,L2) < 1

Since we are measuring the miss penalty in clock cycles,

    Tmisspenalty,L1,onchip = (Thit,L2 + Pmiss,L2,onchip × Tmisspenalty,L2,onchip) / Tonchip
                           = (Thit,L2 + Pmiss,L2,onchip × Tmain) / Tonchip
                           = (Thit,L2 + Pmiss,L2,onchip × Tmain) / Thit,L1

Similarly,

    Tmisspenalty,L2,offchip = Tmain / Toffchip = Tmain / Thit,L2

Putting this back into the performance comparison gives

    (CPI × Thit,L1 + Pmiss,L1,onchip × (Thit,L2 + Pmiss,L2,onchip × Tmain)) / (CPI × Thit,L2 + Pmiss,L2,offchip × Tmain) < 1

The product of Pmiss,L1,onchip with Pmiss,L2,onchip is the global miss rate, which experimental results show is almost the same as Pmiss,L2,offchip without first level cache, because of correlations between misses in L1 and L2 cache. Then on-chip cache is best when

    (CPI × Thit,L1 + Pmiss,L1,onchip × Thit,L2 + Pmiss,L2,offchip × Tmain) / (CPI × Thit,L2 + Pmiss,L2,offchip × Tmain) < 1

or

    CPI × Thit,L1 + Pmiss,L1,onchip × Thit,L2 < CPI × Thit,L2

which is satisfied when

    Thit,L1 < (1 − Pmiss,L1,onchip / CPI) × Thit,L2

Even a small cache with a large miss rate (for example 0.5 = 50%) would give a performance improvement, provided that Thit,L1 is significantly smaller than Thit,L2.
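The condition derived above is easy to check numerically. The sketch below (assumed values, not data from these notes) compares the two designs' time per instruction using the final comparison expression, with the global L1×L2 miss rate taken equal to the off-chip cache's miss rate as argued above.

    # Minimal sketch with assumed numbers (not from the course notes).
    # On-chip L1 running at T_hit,L1 vs. running the processor at the
    # off-chip cache speed T_hit,L2.  All times are in ns.

    cpi       = 1.5    # assumed base CPI
    t_hit_l1  = 2.0    # assumed on-chip cache hit time (= on-chip clock period)
    t_hit_l2  = 10.0   # assumed off-chip cache hit time (= off-chip clock period)
    t_main    = 80.0   # assumed main memory block read time
    p_miss_l1 = 0.10   # assumed on-chip L1 miss rate
    p_miss_l2 = 0.02   # assumed off-chip cache miss rate (also the global rate)

    time_onchip  = cpi * t_hit_l1 + p_miss_l1 * t_hit_l2 + p_miss_l2 * t_main
    time_offchip = cpi * t_hit_l2 + p_miss_l2 * t_main

    print(time_onchip, time_offchip)   # 5.6 vs 16.6 ns per instruction

With these assumed numbers the on-chip design wins easily: the condition Thit,L1 < (1 − 0.10/1.5) × Thit,L2 ≈ 9.3 ns is comfortably met by the 2 ns first level hit time, even with a 10% L1 miss rate.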
Delayed Write Buffer

In a typical cache (for example fig 717 p 503), we are able to read a data block and read the tag in parallel, since reading an incorrect block does no harm if there is a cache miss. Unfortunately, we cannot write a data block and read the tag in parallel, since writing an incorrect block would be a disaster if there is a cache miss. A clever way around this problem is to use a delayed write buffer.

[Figure: a delayed write buffer placed between the CPU and the cache data array; the CPU address goes to the tag check while the write data is held in the buffer until the hit signal is known.]

Instead of writing directly into cache, all writes (stores) are to the delayed write buffer. Since the cache is not altered, the cache data is not corrupted if a miss occurs. Thus, the write to the delayed write buffer and reading the tag can be done in parallel. Then the delayed write buffer contents are invalidated if the tag comparison indicates a miss.

If writes are to take the same amount of time as reads, there is no time left at this point to write the data to cache from the delayed write buffer. Instead, the write data is left in the delayed write buffer (that is why it is a delayed write buffer). The next time a write occurs, the old contents of the delayed write buffer are written to cache (provided they are valid) while the tag for the new data is read. Meanwhile, new write data is written into the delayed write buffer, and the cycle repeats.

At any given time, the data from the last write are in the delayed write buffer, not the cache. If a subsequent read is from the same location, the data must be provided from the delayed write buffer. For this reason, the delayed write buffer is implemented as an associative cache. A hit on the delayed write buffer overrides the cache hit (incorrect data is in the cache).
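A minimal behavioral sketch of the idea follows. It is hypothetical and heavily simplified (single words rather than blocks, and the tag-miss invalidation case is omitted): stores go into a one-entry buffer, the previously buffered store is retired into the cache array on the next store, and loads check the buffer before the cache.

    # Hypothetical behavioral model of a delayed write buffer (not the notes' hardware).
    # Stores are held in a one-entry buffer; the previously buffered store is
    # written into the cache array on the *next* store, while the new store's tag
    # is checked.  Loads must check the buffer first, since it holds the newest data.

    class DelayedWriteCache:
        def __init__(self):
            self.cache = {}   # address -> data, stands in for the data/tag arrays
            self.buf = None   # (address, data) of the most recent store, or None

        def store(self, addr, data):
            if self.buf is not None:
                old_addr, old_data = self.buf
                self.cache[old_addr] = old_data   # retire the previous buffered store
            self.buf = (addr, data)               # hold the new store in the buffer

        def load(self, addr):
            if self.buf is not None and self.buf[0] == addr:
                return self.buf[1]                # buffer hit overrides the cache
            return self.cache.get(addr)           # otherwise read the cache array

    c = DelayedWriteCache()
    c.store(0x100, 7)
    print(c.load(0x100))   # 7, served from the delayed write buffer
    c.store(0x200, 9)      # retires the store to 0x100 into the cache array
    print(c.load(0x100))   # 7, now read from the cache array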
Pipelined Cache Memory

As processor clocks get faster and faster, it becomes increasingly difficult to design cache memory that is fast enough to keep up (Thit < T), even when it is on chip. One way to handle this is to pipeline the cache memory, just as we did the processor hardware. This is called superpipelining.

We must divide the cache hardware, for example the direct mapped cache in fig 77 p 478. The memory array cannot be divided into stages, but the address decoding for the cache index and block offset can be done in a separate stage, just like the register file decoders. Also, the comparator that does the tag check can be in a separate stage. Here is an example of how that might be done for instruction memory with the MIPS.

[Figure: the MIPS IF stage split into two pipeline stages. The index (and optionally the offset) decoder is in the first stage; the cache memory array, block-to-word multiplexer, and tag comparator are in the second.]

The IF stage has been divided into two, with the cache index decoder in the first stage. The block offset could be decoded in this stage or put in parallel with the memory array in the next stage. The index decoder provides the select lines for the cache lines, and the offset decoder provides the select lines for the block to word multiplexer (implemented with a single level of transmission gates to minimize delay).

The comparator provides a hit or not indication during the decode stage. This means that the word is used to fetch operands out of the register file as usual during the ID stage, even though it is not certain that there has been a hit. This avoids an extra stage in the pipeline. When a miss occurs, the incorrect instruction that was started gets canceled in the EX stage, and the processor stalls until the instruction cache fill is complete. A similar scheme is used in the MIPS R4000 processor.

Pipelining data memory is more complicated because data writes (stores) must be accommodated as well as data reads (loads). Without a delayed write buffer, an extra stage in the data cache pipeline is needed to write the data after the hit signal is known. This is what is done in the MIPS R4000 processor. With a delayed write buffer, the data cache pipeline can be the same length as the instruction cache pipeline.

Superpipelined processors can run at higher clock rates than processors without extra stages for cache access. However, the increased latency of the longer pipelines means there will be more pipeline hazards to avoid (more operand forwarding paths). When the hazards cannot be avoided except by stalling the processor, the length of the stalls is longer. For example, the load stall, which was 1 clock period in the original MIPS, is now 2 clock periods in the MIPS R4000.
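As a rough illustration of this latency trade-off, the sketch below compares time per instruction for a 1-cycle versus a 2-cycle load delay. The instruction mix, the fraction of load delay slots that cannot be filled, and the clock periods are all assumed values, not data from these notes.

    # Minimal sketch with assumed numbers (not from the course notes).
    # A longer load delay raises the stall CPI, but the shorter clock period
    # of the superpipelined design can still win overall.

    loads_per_instr  = 0.25   # assumed fraction of instructions that are loads
    unscheduled_frac = 0.5    # assumed fraction of loads whose delay slots go unfilled

    def time_per_instr(base_cpi, load_delay, clock_ns):
        stall_cpi = loads_per_instr * unscheduled_frac * load_delay
        return (base_cpi + stall_cpi) * clock_ns

    print(time_per_instr(1.0, 1, 10.0))   # original pipeline:  11.25 ns per instruction
    print(time_per_instr(1.0, 2, 7.0))    # superpipelined:      8.75 ns per instruction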

