Computer Architecture Concepts (CGS 3269)
University of Central Florida
This 29-page set of class notes was uploaded by Carmelo Connelly on Thursday, October 22, 2015. The notes belong to CGS 3269 at the University of Central Florida, taught by Staff in Fall.
Cache Memory

Today's high-performance microprocessors operate at speeds that far outpace even the fastest of the memory bus architectures that are commonly available. The most common technique used to match the speed of the memory system to that of the processor is caching. A memory cache interposes a block of fast memory, commonly high-speed static RAM, between the processor and the main memory of the computer. At any given moment in time the cache contains a copy of a portion of the contents of the main memory. A special circuit called a cache controller attempts to keep the cache filled with the data or instructions that the processor is most likely to need next. If the information that the processor requests is held in the cache, it can be sent to the processor without a wait state. This fastest possible operation is called a cache hit. If the requested data is not in the cache, it must be retrieved from the ordinary RAM at whatever speed is possible from the memory subsystem. This is called a cache miss.

Figure 7 illustrates the relationship between the cache and the main memory. Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address. For mapping purposes this memory is considered to consist of a number of fixed-length blocks of K words each; thus there are M = 2^n / K blocks. The cache consists of C lines of K words each, and the number of lines is considerably less than the number of main memory blocks (C << M). At any time some subset of the blocks of main memory resides in the lines of the cache. If a word in a block of memory is read, that block is transferred to one of the lines of the cache. Since there are more blocks than lines, an individual line cannot be uniquely and permanently dedicated to a particular block. Therefore each line includes a tag that identifies which particular block of main memory is currently occupying that line of cache. The tag is usually a portion (a number of bits) of the main memory address, as described later.
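These block and line counts can be checked numerically. The sketch below uses the parameter values of the direct mapped example that appears later in these notes (16-bit byte addresses, 8-byte blocks, 256 cache lines):

```python
# Block and line counts for a cache / main-memory pair
# (values taken from the Figure 9 example later in the notes).
n = 16          # address bits -> 2^n addressable bytes
K = 8           # bytes per block
C = 256         # cache lines

M = 2**n // K   # number of main memory blocks
print(M)        # 8192 blocks
print(C < M)    # True: far fewer lines than blocks, so each line needs a tag
```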
When the processor generates an RA (retrieve address) for a word to be read, this word will either be in the cache (a cache hit) or not (a cache miss). In the event of a cache hit, the word is simply delivered to the processor. Otherwise, the block containing the RA is loaded into the cache and the word is delivered to the processor. This is illustrated in Figure 8. In this configuration the cache connects to the processor via data, control, and address lines, which are attached to the system bus from which main memory is accessed. When a cache hit occurs, the data and address buffers are disabled and communication is only between the processor and the cache, with no system bus traffic. When a cache miss occurs, the RA is loaded onto the system bus by the cache controller and the data is returned through the data buffer to both the cache and the processor in parallel.

Figure 7 - Cache/Main Memory Structure

Figure 8 - Typical Cache Organization

Another common arrangement, although becoming less and less common for reasons we will see later, is to physically impose the cache between the processor and the main memory for all data, address, and control lines. In this case, on a cache miss, the contents of the RA are first loaded into the cache and then transferred by the cache controller from the cache to the processor.

Not all cache memories are created equal: line size, overall size, logical arrangement, mapping function, location, replacement algorithm, write policy, and number of caches are some of the parameters which govern the performance of a cache memory system. In the next several pages we will look at each of these parameters separately.

Cache Size

A major factor in determining how successful the cache memory will be is how much information it can contain. The larger the cache, the more information it will hold, and the more likely a processor request will produce a cache hit.
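The read path just described (hit: serve from the cache; miss: fetch the block, fill a line, then serve) can be sketched as a small behavioral model. This is only a sketch; the dictionary standing in for the tag store and the toy main-memory contents are assumptions of the model, not any particular controller:

```python
# Minimal model of the read path: hit -> serve from cache,
# miss -> fetch the whole block from main memory, fill a line, then serve.
BLOCK = 8  # bytes per block/line

cache = {}                              # block number -> line (list of bytes)
main_memory = bytes(range(256)) * 256   # 64 KB of stand-in data

def read(ra):
    block, offset = divmod(ra, BLOCK)
    if block in cache:                  # cache hit: no system bus traffic
        return cache[block][offset]
    line = list(main_memory[block * BLOCK:(block + 1) * BLOCK])
    cache[block] = line                 # miss: block loaded into a cache line
    return line[offset]                 # word delivered to the processor

print(read(2060) == main_memory[2060])  # True (first access is a miss)
print(read(2060) == main_memory[2060])  # True (second access is a hit)
```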
At the two extremes are a cache which is the same size as the main memory, and thus duplicates its entire contents (in which case you wouldn't need the DRAM at all, and every request is a cache hit), and, at the other end of the scale, a cache consisting of a single byte (assuming a byte-addressable machine), which would virtually guarantee that every request was a cache miss. In general you would like the cache to be small enough that the overall average cost per bit is close to that of main memory alone, and large enough that the overall average access time is close to that of the cache alone. There are several other reasons to minimize the size of the cache:

1. The larger the cache, the larger the number of gates involved in addressing the cache. The result is that large caches tend to be slightly slower than small ones, even when built with the same IC technology and put in the same place on the chip.
2. Cache size is also limited by the available space on the chip.

Typical cache sizes today range from 1K words to several megawords. Multitasking systems tend to favor 256K words as a nearly optimal size for main memory cache. The only real disadvantage of a larger cache is the cost: faster SRAM chips cost more than DRAM chips and add to the overall cost of the system. Approximately 7% of the total cost of a PC in today's market (as of 10/19/2000) is in the memory subsystem. Some manufacturers have scalable cache systems which allow you to add SRAM as you can afford it. The performance of the cache is very sensitive to the nature of the work load, making it impossible to arrive at a single optimum cache size.

Cache Level

Cache memory is typically described by its logical and electrical proximity to the microprocessor's core design. At one time there was only a single level of caching. Newer microprocessors, such as AMD's K6-3 and K7 and Intel's Pentium III designs, offer three levels of caching. L1, or Level 1, cache is
often called the primary cache. It is usually the smallest cache and the one electrically closest to the microprocessor's core logic, and it may be built into the processor itself, as it is in the AMD and Intel designs mentioned above. It operates at the same clock speed as the processor. Current microprocessors have L1 caches that range from 16 KB to 128 KB in size.

L2, or Level 2, cache is often called the secondary cache. At one time all secondary caches were separate from the microprocessor and linked the processor to the memory system. Modern microprocessors have subsumed the L2 cache and placed it directly on the chip with the processor logic. This trend began with the Pentium Pro and has continued with Intel's line of microprocessors as well as many others (note, however, that with the Pentium II and beyond, the L2 cache is not on the microprocessor chip itself but in the same module with the microprocessor). Level 2 caches can operate anywhere in the range from core logic speed down to the speed of the memory system, with faster obviously being better. In the Pentium II, for example, the L2 cache operates at one half the speed of the core logic.

Level 3 cache will typically be (and so far in all implementations is) external to the microprocessor; it operates at the same speed as the RAM memory but is fast enough not to impose wait states on the processor.

Why Cache?

Cache is primarily used as a buffer between fast processors and slow memory subsystems. The problem with memory systems is illustrated by the following example. Assume that we have a main memory system composed of 70 nsec, 4 MB DRAM chips. Assuming that the address bus drivers at the processor have a 40 nsec delay and bus propagation requires 10 nsec, the request arrives at the memory module 50 nsec after the request was issued. Board-select decoder chip time typically adds another 20 nsec, with potentially another 30 nsec required to send the select signal to the memory chip, the board-to-chip
driver delay, and of course another 20 nsec delay at that board for decoding. The memory request has now reached the correct memory module 120 nsec after the processor issued the initial request. The memory module has a 70 nsec access time, which means that 190 nsec after the initial request the memory will have placed valid data on the memory module's output. With the 30 nsec chip-to-board driver delay, valid data will reach the data bus drivers 220 nsec after the request. The memory bus drivers and bus propagation will take the same 50 nsec in this direction, sending the data to the processor, that it took to issue the request. Thus 270 nsec have elapsed since the processor issued the request. This is considerably longer than the chip's 70 nsec access time.

While cache acts as a buffer between main memory and the processor, for it to be truly effective the ratio of cache hits to cache misses must be predominantly in favor of cache hits. A typical cache hit requires 1-2 clock cycles. A typical cache miss requires a few tens of clock cycles to bring in the required data from main memory. A main memory miss, however, incurs a terrible penalty for the processor: it will typically run from the hundreds of thousands of clock cycles to several million clock cycles, depending upon processor speed and secondary storage speed. As a result, cache miss rates in the neighborhood of 1-2% are tolerable, but main memory misses must be 0.001% or less or the system will suffer serious performance degradation.

Cache Performance

Quantifying the performance of a cache system is difficult, since its performance varies with the demands of the software. For example, a small utility program may fit entirely inside the internal cache of the microprocessor, so that every request for data from the program would be a cache hit. However, the same cache system might suffer a cache miss with virtually every request when the processor is multitasking several Windows applications.
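The hit and miss figures quoted above can be combined into an average memory access time. The sketch below just multiplies out representative cycle counts; the 2-cycle hit time and 30-cycle miss penalty are illustrative assumptions drawn from the ranges in the text, not measurements:

```python
# Average memory access time from hit time, miss rate, and miss penalty.
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    return hit_cycles + miss_rate * miss_penalty_cycles

# Assume a 2-cycle hit and a 30-cycle penalty for a main-memory fetch:
print(amat(2, 0.02, 30))   # 2.6 cycles at a 2% miss rate
print(amat(2, 0.10, 30))   # 5.0 cycles at 10%: misses dominate quickly
```

Even a modest rise in the miss rate swamps the hit time, which is why the notes insist the hit/miss ratio must be "predominantly in favor of cache hits."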
The most common means of designating the performance of a PC's memory system is by listing the number of clock cycles required for each access needed to transfer a line of memory. For example, 80486 memory systems required four transfers to read or write an entire line of memory (recall that a line of memory is the smallest chunk of memory which can be read or written; on most 486 or better machines a line consists of 4 double words, or 16 bytes). Thus the performance of this system is described as a sequence of four numbers. The best possible performance would be represented by 1-1-1-1, which would represent one clock cycle for each double word. Burst mode on Pentium-class processors requires two clock cycles for the first transfer (one to set the address and one to read or write the data), so the best performance in practice is 2-1-1-1. Many systems are substantially slower, some operating at rates of 6-4-4-4. Since current processors all require a line at a time from memory, the total of these numbers is the important figure, with five the current optimum and many systems operating at more than 20 cycles per line transferred.

Cache Mapping

The logical configuration of the cache system involves how the cache is physically arranged, how it is addressed, and how the processor determines if the information requested is in the cache or not. There are three basic choices for how the cache itself is arranged: direct mapped, fully associative, and n-way set associative. We will look at each configuration separately.

A direct mapped cache divides the cache into small units called lines (corresponding to the Intel memory line of four double words, allowing for addressing in the 16-byte multiples, 128 bits, covered earlier), each of which is identified by an index. Main memory is then divided into blocks the same size as the cache, and the lines in the cache correspond to the locations within such a memory block. Each line can be drawn from a different memory block,
but only from the location corresponding to the location in the cache. Which block the line is drawn from is identified by a special tag. The cache controller can determine if a given byte is in the cache simply by checking the tag for a given index value. Direct mapped cache has a problem if a program regularly moves between addresses with the same indexes in different blocks of memory. In this situation the cache must be continually refreshed, which translates into cache misses. Although such swapping is rare in a single-tasking system, in multitasking systems it can occur quite often and thus slow down a direct mapped cache. Figure 9 illustrates a direct mapped cache example.

In Figure 9 the main memory is byte addressable and consists of 2^13 = 8192 blocks, each block 8 bytes, arranged in a 256-block x 32-block array. This means that a main memory block address requires 13 bits, and a main memory byte address requires 16 bits (8192 blocks x 8 bytes/block = 65536 bytes = 2^16). The 256 x 32 memory is mapped to a cache which consists of 256 lines, each 8 bytes wide (1 block).

Figure 9 - Direct Mapped Cache System (cache/MM address: 5-bit tag | 8-bit group | 3-bit byte)

Notice in Figure 9 that no two main memory blocks with the same tag map into the same cache line. If this were the case, it would not be possible to uniquely identify which main memory block was occupying the cache line. Columns in the main memory array contain 256 blocks (rows), and therefore 8 bits (2^8 = 256) are required to uniquely identify the row in the main memory; this is the group number to which a cache line belongs. Each row in the main memory grid contains 32 blocks (columns), and therefore 5 bits (2^5 = 32) are required to uniquely identify the column in the main memory address; this is the tag
number of any current line in the cache. Finally, since each block contains 8 bytes, uniquely identifying a specific byte within a block requires 3 bits (2^3 = 8) to select the byte which was addressed by the CPU. The cache controller must keep track of which main memory block is currently occupying a specific cache line. Consider the specific examples shown below.

Example 1: Address mapping with a direct mapped cache. Suppose the CPU has generated the RA 2060 = 080C (hex) = 0000100000001100 (binary). An RA of 2060 (decimal) will be located in block 257: block 256 contains byte addresses 2048-2055, block 257 contains byte addresses 2056, 2057, 2058, 2059, 2060, 2061, 2062, 2063, block 258 contains byte addresses 2064-2071, and so on (all in decimal). Byte address 2060 is located in block 257 and is therefore in main memory column number 1 and row number 1 (group 1, tag 1). Note that the five MSB of the binary representation of the RA are 00001, which indicates tag = 1; the next eight bits represent the group number and are 00000001, which indicates group = 1. Finally, the RA is the fourth byte (starting from byte 0) within block 257, so the three LSB of the address are 100, which indicates a byte address offset within the block of 4.

Example 2: Address mapping with a direct mapped cache. Suppose the CPU has generated the RA 2029 = 07ED (hex) = 0000011111101101 (binary). An RA of 2029 (decimal) will be located in block 253: block 253 contains byte addresses 2024, 2025, 2026, 2027, 2028, 2029, 2030, and 2031 (all in decimal). Byte address 2029 is therefore in main memory column number 0 and row number 253 (group 253, tag 0). Note that the five MSB of the binary representation of the RA are 00000, which indicates tag = 0; the next eight bits represent the group number and are 11111101, which indicates group = 253. Finally, the RA is the fifth byte (starting from byte 0) within block 253, so the three LSB of the address are 101, which indicates a byte address offset within the block of 5.

Example 3: Address mapping with a direct mapped cache. Suppose the CPU
has generated the RA 32567 = 7F37 (hex) = 0111111100110111 (binary). An RA of 32567 (decimal) will be located in row (group) 230, since 11100110 (binary) = 230, and column (tag) 15, since 01111 (binary) = 15. Byte address 32567 is therefore in main memory column number 15 and row number 230 (group 230, tag 15), with a byte offset within the block of 111 (binary) = 7 (decimal). Note that the block address of the block containing byte address 32567 is obtained by using the 13 MSB of the RA, which gives 0111111100110 (binary) = 4070 (decimal). Block 4070 contains byte addresses 32560 through 32567, so the RA is in byte offset 7 of block 4070.

A fully associative cache allows each line of the cache to correspond to, or be associated with, any part of the main memory. Lines of bytes from diverse locations throughout main memory can be piled into the cache. The cache controller must now check the address of every line in the cache to determine if a given memory address is currently resident in the cache. This is not done using common searching techniques, but rather through the use of associative memory, also called Content Addressable Memory (CAM). CAM basically allows the entire tag memory to be searched in parallel. Unlike typical RAM, CAM associates logic with each memory cell in the memory. A single clock cycle will allow all of the memory cells in the CAM to check their contents and compare the value with a mask register used to specify the search pattern. Thus access to the CAM is based upon content, not address as with ordinary RAM. CAMs are considerably more expensive in terms of gates than ordinary access-by-address memories (RAMs), and this limits their use, at least with current technologies, to relatively small memory systems such as caches. Using Figure 10 (see next page) as an example, we see that for the fully associative cache there is only a tag field within the cache address; the group number has been dropped. Any main memory block can now map into any cache line. The cache controller will need to uniquely identify every main memory block, which will require 13 bits (2^13 = 8192) since there are 8192 blocks in the main memory.
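The bit-slicing worked through in Examples 1-3 (5-bit tag, 8-bit group, 3-bit byte) can be checked with a short sketch:

```python
# Split a 16-bit RA into (tag, group, byte) for the Figure 9 direct mapped cache.
def decompose(ra, tag_bits=5, group_bits=8, byte_bits=3):
    byte = ra & ((1 << byte_bits) - 1)
    group = (ra >> byte_bits) & ((1 << group_bits) - 1)
    tag = ra >> (byte_bits + group_bits)
    return tag, group, byte

print(decompose(2060))    # (1, 1, 4)    -> Example 1
print(decompose(2029))    # (0, 253, 5)  -> Example 2
print(decompose(32567))   # (15, 230, 7) -> Example 3
print(32567 >> 3)         # 4070: the 13 MSB give the block address
```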
As before, 3 bits will be required to uniquely identify a particular byte within a specific block. Since any cache line can hold any main memory block at any time, the cache controller needs fast access to the contents of the cache (i.e., looking up what's in the cache); this is where the CAM comes into play. The entire contents of the cache can be compared simultaneously with the CAM. A cache hit is indicated when one of the CAM cells' contents matches the search address. If none of the CAM cells' contents matches the search address, then a cache miss has occurred and the proper block from main memory will need to be loaded into one of the cache lines. At this point the cache controller will invoke the algorithm which selects the cache line to be replaced to make room for the new incoming block. We will look at these line replacement algorithms in a later section.

Figure 10 - Fully Associative Cache Mapping (cache/MM address: 13-bit tag | 3-bit byte)

Example 4: Address mapping with a fully associative cache. Suppose the CPU has generated the RA 61458 = F012 (hex) = 1111000000010010 (binary). An RA of 61458 (decimal) will be located in main memory block 7682, since 1111000000010 (binary) = 7682. Main memory block 7682 will contain byte addresses 61456 through 61463. Byte address 61458 is therefore in main memory block 7682, with a byte offset within the block of 010 (binary) = 2 (decimal). Note that the block address of the main memory block containing byte address 61458 requires the 13 MSB of the RA.

A set associative cache is basically a compromise between the easy control of the direct mapped cache and the less restrictive mapping of the fully associative cache. In the set associative cache the total cache memory is divided into smaller direct mapped areas.
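The parallel CAM tag match described above can be modeled in software with a hash lookup. This is only a behavioral sketch: a real CAM compares every stored tag simultaneously in hardware, which the dictionary merely imitates, and the tag-store contents are invented for the example:

```python
# Behavioral model of a fully associative lookup: the whole 13-bit block
# number is the tag, and the tag store is searched "all at once".
BYTE_BITS = 3

tags = {}  # tag (block number) -> cache line index

def lookup(ra):
    tag = ra >> BYTE_BITS               # 13-bit block number
    offset = ra & ((1 << BYTE_BITS) - 1)
    hit = tag in tags                   # one "cycle": match against every tag
    return hit, tag, offset

print(lookup(61458))   # (False, 7682, 2) -> miss; Example 4's block and offset
tags[7682] = 0         # the controller loads block 7682 into some line
print(lookup(61458))   # (True, 7682, 2)  -> now a hit
```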
The cache is then described in terms of the number of ways in which it is divided; for example, a four-way set associative cache is basically the equivalent of four smaller direct mapped caches. Although this technique resolves the problem of moving between blocks with the same tag number, it is also more complex, which increases the cost of the cache. Also, the more "ways" there are to the cache, the longer the cache controller will take to determine if a given request is a cache hit or a miss. Most PC manufacturers have determined that a four-way set associative cache seems to be the optimal compromise between performance and complexity. Figure 11 illustrates a 2-way set associative cache mapping organization.

Figure 11 - 2-way Set Associative Cache Mapping (cache/MM address: 5-bit tag | 8-bit set | 3-bit byte)

Recall that the definition of a set is a collection of items without duplication. The n-way set associative cache organization divides the total cache memory into n distinct direct mapped areas (think of the direct mapped case with n times as much cache memory). For now, let's restrict our discussion to the 2-way set associative cache organization shown in Figure 11. As was the case with the direct mapped cache, the two-way set associative cache organization divides the main memory into rows, now called sets (they were called groups in the direct mapped cache), and columns (tags). The number of cache lines available will be equal to two times the number of sets (rows) in the main memory. For each main memory set, the cache is capable of holding two different main memory blocks (columns within the specific row) simultaneously. A four-way set associative cache could hold up to four main memory blocks from the same set simultaneously, while an eight-way set
associative cache would be capable of holding up to eight main memory blocks from the same set simultaneously, and so on. Whenever the CPU issues an RA, the cache controller for the 2-way set associative cache will need to check the contents of two different cache lines for the particular set to which the RA belongs. Once again, the use of CAM allows the contents of both cache lines to be checked in parallel (the four-way organization would require checking the contents of four cache lines simultaneously, and so on). If one of the two cache lines contains the RA, a cache hit occurs; otherwise a cache miss occurs and one of the two cache lines belonging to that set will need to be selected for replacement.

Example 5: Address mapping with a 2-way set associative cache. Suppose the CPU has generated the RA 18444 = 480C (hex) = 0100100000001100 (binary). An RA of 18444 (decimal) will be located in row (set) number 1, since 00000001 (binary) = 1, and column (tag) number 9, since 01001 (binary) = 9. Byte address 18444 is therefore in main memory column number 9 and row number 1 (set 1, tag 9), with a byte offset within the block of 100 (binary) = 4 (decimal). Note that the block address of the block containing byte address 18444 is obtained by using the 13 tag/set bits of the RA, which gives 0100100000001 (binary) = 2305 (decimal). Block 2305 contains byte addresses 18440 through 18447, so the RA is in byte offset 4 of block 2305. In Figure 11 this block is currently in the cache in line 2 of cache set 1, so a cache hit will be registered for this RA.

Intel Pentium II Cache Structure

The Intel Pentium II chip has separate instruction and data caches. Each of the I and D caches is a two-way set-associative structure, and each is 8K (2^13) bytes in size. Each cache line is 32 bytes, requiring a 5-bit byte field. This means that there are 2^13 / (2 x 2^5) = 2^7 sets, which requires a 7-bit set (or group) field. Since addresses are 32 bits long, the tag field must be 32 - 12 = 20 bits wide: tag = 20 bits, set (group) = 7 bits, byte = 5 bits.
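The field widths just derived follow mechanically from the cache size, associativity, and line size; a small sketch of that arithmetic:

```python
# Derive (tag, set, byte) field widths from cache geometry.
def field_widths(cache_bytes, ways, line_bytes, addr_bits=32):
    byte_bits = (line_bytes - 1).bit_length()     # offset within a line
    sets = cache_bytes // (ways * line_bytes)     # lines per way
    set_bits = (sets - 1).bit_length()
    tag_bits = addr_bits - set_bits - byte_bits   # whatever address bits remain
    return tag_bits, set_bits, byte_bits

# Pentium II L1 as described here: 8 KB, 2-way, 32-byte lines, 32-bit addresses.
print(field_widths(8 * 1024, 2, 32))   # (20, 7, 5)
```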
Figure 12 - 4-way Set Associative Cache Mapping (main memory and cache address fields are the same as for the 2-way set associative mapping)

Figure 12 illustrates a 4-way set associative cache mapping. Note that the only difference between the 2-way and the 4-way mapping is the additional cache lines available, and the additional tag memory required, in the 4-way organization. The advantage that the 4-way mapping has over the 2-way mapping is that twice as many lines (4 instead of 2) from the same set of main memory blocks can be loaded into the cache simultaneously. By far the most popular cache organization is the 2-way set associative mapping. The hit-to-miss ratio for the 2-way set associative mapping is significantly better than for a direct mapped cache. The 4-way set associative mapping provides a slight improvement in the hit-to-miss ratio, but at the expense of an additional space requirement. It is this space requirement that typically precludes its use in most desktop systems today, where space on many of the chips is at a premium. Most designers have decided that the additional improvements the 4-way mapping can provide are not justified in terms of the space expense.

Advanced Cache Techniques

There have been many different techniques used to enhance the performance of cache memories. Many different hybrid organizations have been proposed, and some implemented. Sector mapping is one that is fairly commonly used; it is based upon the set associative mapping, but rather than mapping main memory blocks, main memory sectors are mapped (this organization tends to be more closely tied to the hardware than to the OS). Another of these techniques is to utilize special memory for the
cache; highlights of this technique are discussed below.

Burst Mode Caches

Pentium-class processors are the current standard of high-performance microprocessors and thus require the fastest possible caches to minimize waiting on these high-throughput processors. One technique for speeding up the cache is to operate it in a burst mode. Just as in main memory, burst mode in a cache eliminates the need to send a separate address for each memory read or write operation. Instead, the cache, just like the main memory, reads or writes a contiguous sequence of addresses in a quick burst. Depending upon whether the system is reading or writing, operation in burst mode can cut cache access time by just over 50%. The largest improvement occurs when performing write operations in a burst. Ordinary static RAM chips do not have fast enough response times to support cache operations in burst mode. As a result, two special types of SRAM have been developed to support burst mode cache operations.

Synchronous Burst SRAM uses an internal clock to count up to each new address after each memory operation. Since the chip automatically increments the address, it doesn't require the processor to send it the next address (assuming sequential addressing). Since this type of cache chip must run in lock step with the processor, the timing constraints on the chip are critical for fast, error-free operation. This makes Synchronous Burst SRAM inherently more expensive than conventional asynchronous cache designs.

Pipelined Burst SRAM. The very tight timing constraints of Synchronous Burst SRAM not only make it more expensive but also more difficult to manufacture (kind of goes without saying, though, doesn't it). Pipelined Burst SRAM achieves the same level of performance but without the need for a synchronous internal clock. This type of cache chip includes an internal register that holds the next chunk of data in the sequence to be read. While the register holds this value, the chip continues to run
internally and accesses the next address to load the pipeline. As soon as the processor reads the output register, the pipeline can unload the data from the next address and place this data into the register to be ready for the next read operation. Since the pipeline in the chip keeps a supply of data always ready, this form of memory can run as fast as the processor requires data, limited only by the access time of the pipeline register. Table 1 summarizes the improvement of burst mode cache compared to non-burst mode cache for different types of requests to the cache.

Cycle Type               Non-Burst Mode Cache   Burst Mode Cache   Improvement
burst read               3-2-2-2                3-1-1-1            33%
burst write              4-3-3-3                3-1-1-1            54%
single read              3                      3                  none
single write             4                      3                  25%
back-to-back burst read  3-2-2-2, 3-2-2-2       3-1-1-1, 2-1-1-1   39%

Table 1 - Non-Burst and Burst Mode Cache Performance

Cache Line Replacement Algorithms

When a new line is loaded into the cache, one of the existing lines must be replaced. Before the line can actually be replaced, a check must be made to determine if the line has been changed since it entered the cache (see the section below on cache write policies for more details). For direct mapped caches there is only one possible line that can be used for any main memory block within the same group, so there is no choice to be made. With the fully associative and set associative mappings a replacement algorithm is needed, as there is a choice to be made, and the criteria upon which the replacement is made must be well defined. To achieve high speed, the replacement algorithm must be implemented in hardware. A large number of algorithms are possible and many have been implemented. Four of the most common cache line replacement algorithms are: 1. LRU (Least Recently Used), in which the cache line that was last referenced in the most distant past is selected for replacement; 2. FIFO (First In, First Out), in which the cache line from the set that was loaded in the most distant past is selected for replacement; 3.
LFU (Least Frequently Used), in which the cache line that has been referenced the fewest number of times is selected for replacement; and 4. random selection of the cache line to be replaced. For most applications, LRU tends to provide the best performance. For a 2-way set associative mapping, the LRU algorithm is easily implemented in hardware by adding a single bit, commonly called a use bit, to each cache line. Whenever a cache line is referenced, its use bit is set to 1 and the use bit of the other cache line in the same set is set to 0. The line selected for replacement at any specific time is the line whose use bit is currently 0. The principle of locality of reference means that a recently used cache line is more likely to be referenced again, which is why LRU tends to give the best performance. The FIFO replacement policy is again easily implemented in hardware, by treating the cache lines as queues. The LFU replacement algorithm is implemented by associating with each cache line a counter which increments on every reference to the line; whenever a line needs to be replaced, the line with the smallest counter value is selected, as it will be the cache line that has experienced the fewest references. While it may seem that random selection of a cache line to replace would be a poor replacement line selection method, in reality it performs only slightly worse than any of the other three algorithms we mentioned. The reason for this is easy to see if you consider a 2-way set associative cache: since there are only two cache lines per set, any replacement algorithm must select one of the two, so the random selection method has a 50-50 chance of selecting the same one that the LRU algorithm would select, yet the random algorithm has no overhead (i.e., there wouldn't be any use bit).

Cache Write Policies

Before a cache line can be replaced, it is necessary to determine if the line has been modified since its arrival in the cache. Recall that the contents of the main memory block and the cache line that corresponds to that block are essentially copies of each other and should therefore look the same.
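The use-bit scheme for 2-way LRU described above can be sketched in a few lines. This is a behavioral model only; in hardware the use bit is a single flip-flop per line, not a Python list:

```python
# 2-way set associative LRU via one "use bit" per line:
# referencing a line sets its bit and clears its partner's bit;
# the victim is whichever line's bit is 0.
use_bit = [0, 0]   # one set, two lines

def reference(line):
    use_bit[line] = 1
    use_bit[1 - line] = 0

def victim():
    return use_bit.index(0)   # the not-recently-used line

reference(0)
print(victim())   # 1: line 0 was just used, so line 1 would be replaced
reference(1)
print(victim())   # 0: now line 1 is the recently used one
```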
and the cache line that corresponds to that block are essentially copies of each other and should therefore look the same. The question then becomes how to keep the two copies of the same data identical. If cache line X has not been modified since its arrival in the cache and it has been selected for replacement, then writing its contents back into the main memory block the line corresponds to is not required prior to removing it from the cache; the incoming cache line can simply overwrite the existing one. On the other hand, if the cache line has been modified since its arrival in the cache, at least one write operation has been performed on the cache line, and thus the corresponding main memory block must be updated. There are many different strategies that can be employed to ensure that the cache and main memory contents look the same. There are, however, two major problems to consider: (1) more than one device may have access to the main memory (DMA, Direct Memory Access, devices such as I/O modules); and (2) in multi-processor systems in which each processor has its own local cache, the caches may contain different values for the same main memory address in the different processors (the line may be altered in one of the processors but not in the others, nor in main memory). Handling these two problems will be discussed below. Different types of cache designs handle writes to memory in different ways. Most cache systems make no attempt to speed up write operations. Assuming a cache hit (a write hit), they typically push write operations through the cache immediately, writing both to the cache and to main memory (with its normal wait-state delays) at the same time. This technique is called a write-through cache. This is a safe technique, since it guarantees that the main memory and the cache are always in agreement with respect to the values in memory. Most Intel microprocessors use a write-through cache design. A faster alternative is the
write-back cache, sometimes called a posted-write cache, which allows the processor to write changes to the cache and then immediately resume processing; it does not perform the corresponding write to main memory. The cache controller will eventually write the changed data back into main memory when it finds the time to do so; commonly this occurs only when the cache line is selected for replacement due to a cache miss. The main problem with this technique is that there will be times when the contents of the cache and the main memory do not agree on the value that a particular memory location holds. This is the cache coherency problem, which is an active research topic. Cache coherency becomes an issue, for example, when a hard disk is read and information is transferred into main memory through the DMA system, which does not involve the processor. The cache controller must constantly monitor the changes made in main memory and ensure that the contents of the cache properly track these changes. There are many techniques that have been employed to allow the cache controller to "snoop" the memory system, but once again these add complexity and expense. In the PC environment there are special cache controller chips that can be added which handle all the responsibilities of supervising the cache system. If a write operation has been requested and a cache miss results, one of two options for handling the write miss can be employed. The line may be brought into the cache and then updated (written), which is termed a write-allocate policy, or the block may be updated directly in main memory without being brought into the cache, which is termed a write-no-allocate policy. Typically, write-through caches employ a write-no-allocate policy, while write-back caches utilize a write-allocate policy. In a bus architecture in which more than one device (typically processors) has a cache and the main memory is shared, the cache
coherency problem becomes acute. If the data in one cache is altered, this invalidates not only the corresponding word in main memory but also that same word in every other cache in which a copy of that word is currently resident. Even if a write-through policy is used, the other caches may contain invalid data. To maintain cache coherency in this environment, more elaborate schemes are necessary, such as:

- Bus snooping with write-through: each cache controller snoops the address lines (watches the address bus) to detect write operations to memory by other bus masters (devices currently having control or use of the bus). If another bus master writes to an address in the shared memory that resides in the snooper's cache, the cache controller must invalidate its cache entry. This strategy requires that all cache controllers on the same bus use a write-through policy.

- Hardware transparency: additional hardware is utilized to ensure that all updates to main memory via a cache are reflected in all caches. Thus, if one processor modifies a word in its cache, this update is written to main memory. In addition, any matching words in other caches are similarly updated.

- Non-cacheable memory: only a portion of the main memory is shared by more than one processor, and this portion is designated as non-cacheable. In such a system, all accesses to shared memory are cache misses, because the shared memory is never copied into the cache. The non-cacheable memory is identified using chip-select logic or high-order address bits.

Line Size

Another element in the design of a cache system is the line size. When a block of data is retrieved from main memory and placed in the cache, not only the requested word is loaded but also some number of adjacent words (those in the same block). As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality. However, as the block becomes even bigger, the hit ratio will
begin to decrease, because the probability of using the newly fetched information becomes less than the probability of reusing the information that has been replaced. Two specific effects need to be considered:

- Larger blocks reduce the number of blocks that will fit into the cache. Because each fetch overwrites older cache contents, a small number of blocks in the cache results in data being overwritten shortly after it has been loaded.

- As a block becomes larger, each additional word is farther from the requested word and is therefore less likely to be needed in the near future (principle of locality).

The relationship between block size and the hit ratio is complex and depends heavily on the locality characteristics of a particular program. No definitive optimum value has been found; a size of two to eight words seems in practice to work reasonably close to optimum.

Number of Caches

When cache systems were first introduced, the typical system had a single cache. More recently, the use of multiple caches has become the norm. Two aspects of cache organization are important here: the number of levels of caching employed, and whether the cache is unified or split. As logic density has increased, it has become standard to place some cache on the same chip as the processor. This on-chip cache is designated the L1, or Level 1, cache. The L1 cache is reachable by the processor without external bus activity and therefore contributes to faster execution, reduced bus traffic by the processor, and increased overall system performance. The on-chip cache leaves open the question of whether an off-chip cache is still required. Typically the answer is yes, and most modern systems include both on-chip and external caches. The most common implementations are two-level systems with an L1 on-chip cache and an L2 external cache. As we have seen, the L2 cache is commonly implemented with SRAM, which is typically fast enough to match the bus speed and
allow L1 cache misses to be handled by the L2 cache using a zero-wait-state transaction, the fastest possible type of bus transfer. While it is very difficult to quantify the improvement a two-level cache provides compared to a single-level cache, studies have shown that in general the second level does improve performance. When on-chip cache first appeared, many designs consisted of a single cache that held both data and instructions simultaneously. This design is called a unified cache. Contemporary systems commonly split the L1 cache into two separate caches (both still considered L1 caches), one dedicated to instructions for the processor and the other dedicated to the data on which the instructions operate. This approach is called a split cache, also sometimes referred to as the Harvard architecture. There are several potential advantages of a unified cache:

- For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically. That is, if an execution pattern involves many more instruction fetches than data fetches, the cache will tend to fill up with instructions, increasing the probability that a cache hit will occur. If the execution pattern consists of more data fetches than instruction fetches, the cache will tend to fill with data, which again increases the probability of a cache hit, since the probability of a request for data is then higher than the probability of a request for an instruction.

- Only one cache needs to be designed and implemented.

Despite these advantages, the trend is toward split caches, particularly in superscalar machines such as the Pentiums and Athlons. The key advantage provided by the split cache is that cache contention between the instruction fetch unit and the execution units is eliminated. This is very important in any system that relies on the pipelining of instructions.
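The contention argument can be illustrated with a toy Python model. This is purely illustrative, not from the notes: it assumes each cache has a single port, so a unified cache must serialize simultaneous requests while split I- and D-caches service them in the same cycle. The function names are made up for the sketch.

```python
# Toy model of cache port contention (illustrative only; real caches
# arbitrate in hardware). Each True argument is a pending request this cycle.

def cycles_unified(instruction_fetch, data_access):
    """One shared port: simultaneous requests must be serialized."""
    return int(instruction_fetch) + int(data_access)

def cycles_split(instruction_fetch, data_access):
    """Separate I-cache and D-cache: the requests proceed in parallel."""
    return max(int(instruction_fetch), int(data_access))

# A prefetch and a data access in the same cycle cost 2 cycles unified, 1 split.
print(cycles_unified(True, True), cycles_split(True, True))
```

When only one request is pending per cycle the two organizations cost the same; the split cache only wins when the prefetcher and an execution unit collide, which is exactly the pipelined-execution case described above.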
Recall from our discussions on processors that the processor typically fetches instructions ahead of time to keep the pipeline filled with instructions to be executed. If a unified cache were used, then when an execution unit performs a memory access to load or store data, the request is submitted to the unified cache. If at the same time the instruction prefetcher issues a read request to the unified cache for an instruction, that request will be blocked so that the unified cache can service the execution unit first, enabling it to complete the currently executing instruction. This cache contention can degrade performance by interfering with efficient use of the instruction pipeline. The split cache structure overcomes this difficulty. In the case of the Pentium, recall that the L1 cache is split into two caches, called the I-cache (Instruction cache) and the D-cache (Data cache). Thus, at the L1 level the Pentium uses the Harvard architecture, even though it does not employ this architecture in the L2 cache, which is a unified cache. RISC machines that are based on the Harvard architecture split the caches based upon the assumption that the operating system separates code and data in main memory. The Pentium processors do not make this assumption; therefore no attempt is made to separate the instructions from the data, and both may appear in both the I-cache and the D-cache. The reason for splitting the cache in the Pentium is solely to eliminate the cache contention problem.

Errors in Memory

Our last section of notes concerning memory deals with the detection and correction of errors that occur during operations on semiconductor memory. In general, an error detection/correction scheme enhances the reliability of the memory at the expense of added complexity. As we will see, this added complexity comes primarily in two forms: time and space. The time aspect relates to the overhead in time that is required to perform the error detection/correction scheme; we will see
more details on this later. The space aspect relates to the amount of additional logic and memory required to implement the error detection/correction scheme. For example, the IBM 30xx implementations utilize a scheme that requires an additional 8 bits for every 64 bits of actual main memory. This means that the actual size of the memory is about 12.5% greater than is apparent to the user and represents a significant space overhead that must be incorporated in the memory. Similarly, Digital VAX systems employ a technique which requires an additional 7 bits for each 32 bits of memory. This represents approximately a 22% overhead for implementing the error detection/correction scheme. A number of contemporary DRAM chips require an additional 9 bits for every 128 bits of data for the error detection/correction scheme, which represents roughly a 7% space overhead. In this section we will examine the different types of errors that occur in memory and some of the techniques that are employed to detect and correct these errors.
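The space-overhead figures quoted above follow directly from the ratio of check bits to data bits. A quick Python sketch to verify the arithmetic (the function name is illustrative):

```python
# Space overhead of an error detection/correction scheme is simply
# check bits divided by data bits.
def ecc_overhead(check_bits, data_bits):
    return check_bits / data_bits

# The three schemes mentioned in the notes:
print(f"IBM 30xx: 8 check bits per 64  -> {ecc_overhead(8, 64):.1%}")   # 12.5%
print(f"DEC VAX:  7 check bits per 32  -> {ecc_overhead(7, 32):.1%}")   # ~21.9%
print(f"DRAM:     9 check bits per 128 -> {ecc_overhead(9, 128):.1%}")  # ~7.0%
```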