In this exercise we look at memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

    for (I = 0; I < 8; I++)
        for (J = 0; J < 8000; J++)
            A[I][J] = B[I][0] + A[J][I];
Computer science / Computer Organization and Design 5 / Chapter 5 / Problem 5.2.2
Textbook Solutions for Computer Organization and Design
Question
Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses.
3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253
For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
Solution
The first step is to establish the cache geometry from the question: a direct-mapped cache with two-word blocks and a total of 8 blocks. Each word address therefore maps to a block address (the word address divided by 2), a 3-bit index (the block address modulo 8), and a tag (the remaining upper bits of the block address).
From the textbook chapter Large and Fast: Exploiting Memory Hierarchy you will find a few key concepts needed to solve this.
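The full worked table is not reproduced here, but the mechanics can be checked with a short simulation. This is a minimal sketch, not the textbook's official solution; it implements the geometry stated above (two-word blocks, 8 blocks, direct-mapped, initially empty):

```python
def simulate(word_addresses, block_words=2, num_blocks=8):
    """Direct-mapped cache: returns (tag, index, 'hit'/'miss') per reference."""
    cache = [None] * num_blocks          # one tag slot per block, initially empty
    results = []
    for addr in word_addresses:
        block = addr // block_words      # word address -> block address
        index = block % num_blocks       # low bits of the block address
        tag = block // num_blocks        # remaining high bits
        if cache[index] == tag:
            results.append((tag, index, "hit"))
        else:
            cache[index] = tag           # allocate/replace on a miss
            results.append((tag, index, "miss"))
    return results

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
outcomes = [r[2] for r in simulate(refs)]
# Only the references to 2, 190, and 181 hit; the other 9 references miss.
```

Running this reproduces the usual answer pattern for this exercise: addresses 2 and 3 share block 1, and 190/191 and 180/181 share blocks 95 and 90 respectively, which is where the three hits come from.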
Chapter 5 textbook questions
In this exercise we look at memory locality properties of matrix computation. The following code is written in C, where elements within the same row are stored contiguously. Assume each word is a 32-bit integer.

    for (I = 0; I < 8; I++)
        for (J = 0; J < 8000; J++)
            A[I][J] = B[I][0] + A[J][I];

- How many 32-bit integers can be stored in a 16-byte cache block?
- References to which variables exhibit temporal locality?
- References to which variables exhibit spatial locality?
- How many 16-byte cache blocks are needed to store all 32-bit matrix elements being referenced?
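Two of these counts can be checked directly. A sketch, assuming 4-byte (32-bit) integers and counting the distinct elements of A the loop nest touches by brute force:

```python
# 32-bit integers are 4 bytes, so a 16-byte block holds 16 // 4 of them.
ints_per_block = 16 // 4

# Count the distinct A elements the loop nest touches: A[I][J] and A[J][I]
# for I in 0..7 and J in 0..7999, overlapping only in the 8x8 corner.
touched = set()
for I in range(8):
    for J in range(8000):
        touched.add((I, J))   # A[I][J]
        touched.add((J, I))   # A[J][I]
# 64000 + 64000 - 64 shared corner elements = 127936 distinct A elements
# (plus the 8 elements of B[I][0]).
```

How many blocks those elements occupy additionally depends on the declared dimensions and alignment of A, which the exercise leaves to the table in the book, so only the element count is computed here.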
Caches are important to providing a high-performance memory hierarchy to processors. Below is a list of 32-bit memory address references, given as word addresses: 3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253

- For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with 16 one-word blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
- For each of these references, identify the binary address, the tag, and the index given a direct-mapped cache with two-word blocks and a total size of 8 blocks. Also list if each reference is a hit or a miss, assuming the cache is initially empty.
- You are asked to optimize a cache design for the given references. There are three direct-mapped cache designs possible, all with a total of 8 words of data: C1 has 1-word blocks, C2 has 2-word blocks, and C3 has 4-word blocks. In terms of miss rate, which cache design is the best? If the miss stall time is 25 cycles, and C1 has an access time of 2 cycles, C2 takes 3 cycles, and C3 takes 5 cycles, which is the best cache design?
- There are many different design parameters that are important to a cache's overall performance. Below are listed parameters for different direct-mapped cache designs: Cache Data Size: 32 KiB, Cache Block Size: 2 words, Cache Access Time: 1 cycle. Calculate the total number of bits required for the cache listed above, assuming a 32-bit address. Given that total size, find the total size of the closest direct-mapped cache with 16-word blocks of equal size or greater. Explain why the second cache, despite its larger data size, might provide slower performance than the first cache.
- Generate a series of read requests that have a lower miss rate on a 2 KiB 2-way set associative cache than the cache listed above. Identify one possible solution that would make the cache listed have an equal or lower miss rate than the 2 KiB cache. Discuss the advantages and disadvantages of such a solution.
- The formula shown in Section 5.3 shows the typical method to index a direct-mapped cache, specifically (Block address) modulo (Number of blocks in the cache). Assuming a 32-bit address and 1024 blocks in the cache, consider a different indexing function, specifically (Block address[31:27] XOR Block address[26:22]). Is it possible to use this to index a direct-mapped cache? If so, explain why and discuss any changes that might need to be made to the cache. If it is not possible, explain why.
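The C1/C2/C3 comparison is mechanical enough to sketch in code. This is an illustrative simulation under the stated parameters (8 words of data, 25-cycle miss stall, access times of 2/3/5 cycles), not the book's worked answer:

```python
def count_misses(word_addresses, block_words, total_words=8):
    """Direct-mapped cache holding `total_words` words of data; returns misses."""
    num_blocks = total_words // block_words
    cache = [None] * num_blocks
    misses = 0
    for addr in word_addresses:
        block = addr // block_words
        index, tag = block % num_blocks, block // num_blocks
        if cache[index] != tag:
            cache[index] = tag           # fill/replace on a miss
            misses += 1
    return misses

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
designs = {"C1": (1, 2), "C2": (2, 3), "C3": (4, 5)}   # (block words, access cycles)
totals = {}
for name, (bw, access) in designs.items():
    m = count_misses(refs, bw)
    totals[name] = len(refs) * access + m * 25         # 25-cycle miss stall
# C1: 12 misses, C2: 10, C3: 11 -> C2 has both the lowest miss count
# and the lowest total cycle count for this reference stream.
```

Note that C1's one-word blocks miss on every reference here because no word address repeats, which is exactly why the slightly slower C2 wins overall.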
For a direct-mapped cache design with a 32-bit address, the following bits of the address are used to access the cache.

- What is the cache block size (in words)?
- How many entries does the cache have?
- What is the ratio between total bits required for such a cache implementation over the data storage bits?
- Starting from power on, the following byte-addressed cache references are recorded. How many blocks are replaced?
- What is the hit ratio?
- List the final state of the cache, with each valid entry represented as a record of <index, tag, data>.
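The bit-field table this exercise refers to is not reproduced above, so the sketch below uses a hypothetical split (tag = bits 31-10, index = bits 9-5, offset = bits 4-0) purely to show how each quantity follows from the field widths:

```python
# Hypothetical field widths (the printed bit-field table is not shown here):
# assume Tag = bits 31-10, Index = bits 9-5, Offset = bits 4-0.
offset_bits, index_bits, tag_bits = 5, 5, 22

block_bytes = 2 ** offset_bits        # 32-byte blocks
block_words = block_bytes // 4        # 8 four-byte words per block
entries = 2 ** index_bits             # 32 cache entries

# Storage per entry: tag bits + 1 valid bit + data bits.
data_bits = block_bytes * 8
bits_per_entry = tag_bits + 1 + data_bits
overhead_ratio = bits_per_entry / data_bits   # total bits / data bits
```

With these assumed widths the block size is 8 words, the cache has 32 entries, and the total-bits-to-data-bits ratio is 279/256; with the book's actual table, substitute its widths into the same three formulas.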
Recall that we have two write policies and write allocation policies, and their combinations can be implemented either in the L1 or L2 cache. Assume the following choices for the L1 and L2 caches:

- Buffers are employed between different levels of the memory hierarchy to reduce access latency. For this given configuration, list the possible buffers needed between the L1 and L2 caches, as well as between the L2 cache and memory.
- Describe the procedure of handling an L1 write miss, considering the components involved and the possibility of replacing a dirty block.
- For a multilevel exclusive cache configuration (a block can only reside in one of the L1 and L2 caches), describe the procedure of handling an L1 write miss, considering the components involved and the possibility of replacing a dirty block.
- Consider the following program and cache behaviors. For a write-through, write-allocate cache, what are the minimum read and write bandwidths (measured in bytes per cycle) needed to achieve a CPI of 2?
- For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the minimal read and write bandwidths needed for a CPI of 2?
- What are the minimal bandwidths needed to achieve a CPI of 1.5?
Media applications that play audio or video files are part of a class of workloads called "streaming" workloads; i.e., they bring in large amounts of data but do not reuse much of it. Consider a video streaming workload that accesses a 512 KiB working set sequentially with the following address stream: 0, 2, 4, 6, 8, 10, 12, 14, 16, ...

- Assume a 64 KiB direct-mapped cache with a 32-byte block. What is the miss rate for the address stream above? How is this miss rate sensitive to the size of the cache or the working set? How would you categorize the misses this workload is experiencing, based on the 3C model?
- Re-compute the miss rate when the cache block size is 16 bytes, 64 bytes, and 128 bytes. What kind of locality is this workload exploiting?
- "Prefetching" is a technique that leverages predictable address patterns to speculatively bring in additional cache blocks when a particular cache block is accessed. One example of prefetching is a stream buffer that prefetches sequentially adjacent cache blocks into a separate buffer when a particular cache block is brought in. If the data is found in the prefetch buffer, it is considered a hit, moved into the cache, and the next cache block is prefetched. Assume a two-entry stream buffer, and assume that the cache latency is such that a cache block can be loaded before the computation on the previous cache block is completed. What is the miss rate for the address stream above?
- Cache block size (B) can affect both miss rate and miss latency. Assuming a 1-CPI machine with an average of 1.35 references (both instruction and data) per instruction, help find the optimal block size given the following miss rates for various block sizes. What is the optimal block size for a miss latency of \(20 \times B\) cycles?
- What is the optimal block size for a miss latency of \(24 + B\) cycles?
- For constant miss latency, what is the optimal block size?
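The base miss rate for this stream can be verified by brute force. A minimal sketch, assuming the stream consists of byte addresses stepping by 2 through the 512 KiB working set and a plain direct-mapped cache with no prefetching:

```python
def stream_miss_rate(cache_bytes, block_bytes, working_set_bytes, stride=2):
    """Direct-mapped cache driven by the byte-address stream 0, stride, 2*stride, ..."""
    num_blocks = cache_bytes // block_bytes
    cache = [None] * num_blocks
    misses = accesses = 0
    for addr in range(0, working_set_bytes, stride):
        block = addr // block_bytes
        index, tag = block % num_blocks, block // num_blocks
        accesses += 1
        if cache[index] != tag:
            cache[index] = tag
            misses += 1
    return misses / accesses

rate = stream_miss_rate(64 * 1024, 32, 512 * 1024)
# Each 32-byte block serves 16 two-byte-stride accesses, and no block is
# ever revisited, so only the first access to each block misses: rate = 1/16.
```

Because every block is touched exactly once, the result is insensitive to cache size (these are compulsory misses in the 3C model), and doubling the block size halves the miss rate, which is what the follow-up parts probe.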
In this exercise, we will look at the different ways capacity affects overall performance. In general, cache access time is proportional to capacity. Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions. The following table shows data for L1 caches attached to each of two processors, P1 and P2.

- Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?
- What is the Average Memory Access Time for P1 and P2?
- Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 and P2? Which processor is faster?
- For the next three problems, we will consider the addition of an L2 cache to P1 to presumably make up for its limited L1 cache capacity. Use the L1 cache capacities and hit times from the previous table when solving these problems. The L2 miss rate indicated is its local miss rate. What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?
- Assuming a base CPI of 1.0 without any memory stalls, what is the total CPI for P1 with the addition of an L2 cache?
- Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need in its L1 cache to match P1's performance? If P2 is faster, what miss rate would P1 need in its L1 cache to match P2's performance?
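The P1/P2 table is not reproduced above, so the sketch below only encodes the two formulas these parts rely on, with explicitly made-up example numbers (0.7 ns hit time, 5% miss rate) standing in for the table's values:

```python
# The 70 ns memory access time and 36% memory-access instruction mix are
# from the exercise; the hit time and miss rate below are hypothetical.
MEM_NS = 70.0

def amat_ns(hit_ns, miss_rate, penalty_ns=MEM_NS):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_ns + miss_rate * penalty_ns

def total_cpi(base_cpi, hit_ns, miss_rate, mem_access_per_instr=0.36):
    """CPI with memory stalls. The L1 hit time sets the clock cycle, so the
    70 ns penalty is converted into cycles of that clock. Every instruction
    is fetched plus 36% make a data access, hence (1 + 0.36) references."""
    penalty_cycles = MEM_NS / hit_ns
    stalls = (1 + mem_access_per_instr) * miss_rate * penalty_cycles
    return base_cpi + stalls

example_amat = amat_ns(0.7, 0.05)          # 0.7 + 0.05 * 70 = 4.2 ns
example_cpi = total_cpi(1.0, 0.7, 0.05)    # 1.0 + 1.36 * 0.05 * 100 = 7.8
```

Substituting the table's actual hit times and miss rates for P1 and P2 into these two functions answers the first three parts; the L2 parts extend `amat_ns` with a second miss-rate/penalty term.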
This exercise examines the impact of different cache designs, specifically comparing associative caches to the direct-mapped caches from Section 5.4. For these exercises, refer to the address stream shown in Exercise 5.2.

- Using the sequence of references from Exercise 5.2, show the final cache contents for a three-way set associative cache with two-word blocks and a total size of 24 words. Use LRU replacement. For each reference identify the index bits, the tag bits, the block offset bits, and if it is a hit or a miss.
- Using the references from Exercise 5.2, show the final cache contents for a fully associative cache with one-word blocks and a total size of 8 words. Use LRU replacement. For each reference identify the index bits, the tag bits, and if it is a hit or a miss.
- Using the references from Exercise 5.2, what is the miss rate for a fully associative cache with two-word blocks and a total size of 8 words, using LRU replacement? What is the miss rate using MRU (most recently used) replacement? Finally, what is the best possible miss rate for this cache, given any replacement policy?
- Multilevel caching is an important technique to overcome the limited amount of space that a first-level cache can provide while still maintaining its speed. Consider a processor with the following parameters: Calculate the CPI for the processor in the table using: 1) only a first-level cache, 2) a second-level direct-mapped cache, and 3) a second-level eight-way set associative cache. How do these numbers change if main memory access time is doubled? If it is cut in half?
- It is possible to have an even greater cache hierarchy than two levels. Given the processor above with a second-level, direct-mapped cache, a designer wants to add a third-level cache that takes 50 cycles to access and will reduce the global miss rate to 1.3%. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third-level cache?
- In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first-level cache. While this allowed for large second-level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second-level cache ran at a lower frequency. Assume a 512 KiB off-chip second-level cache has a global miss rate of 4%. If each additional 512 KiB of cache lowered global miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second-level direct-mapped cache listed above? Of the eight-way set associative cache?
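The fully associative LRU-versus-MRU part uses only parameters stated in the question (two-word blocks, 8 words of data, the Exercise 5.2 references), so it can be checked by simulation. A sketch, not the book's worked table:

```python
def fully_assoc_misses(word_addresses, block_words, total_words, policy="LRU"):
    """Fully associative cache; `recency` holds resident block addresses,
    least recently used at the front, most recently used at the back."""
    capacity = total_words // block_words
    recency = []
    misses = 0
    for addr in word_addresses:
        block = addr // block_words
        if block in recency:
            recency.remove(block)          # hit: refresh its recency
        else:
            misses += 1
            if len(recency) == capacity:
                victim = 0 if policy == "LRU" else -1
                recency.pop(victim)        # evict the LRU or the MRU block
        recency.append(block)
    return misses

refs = [3, 180, 43, 2, 191, 88, 190, 14, 181, 44, 186, 253]
lru = fully_assoc_misses(refs, 2, 8, "LRU")   # 10 misses out of 12
mru = fully_assoc_misses(refs, 2, 8, "MRU")   # also 10 misses out of 12
```

For this particular stream both policies land on the same miss count, which makes the final sub-question (the best achievable miss rate under any policy) the interesting one.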
Mean Time Between Failures (MTBF), Mean Time To Replacement (MTTR), and Mean Time To Failure (MTTF) are useful metrics for evaluating the reliability and availability of a storage resource. Explore these concepts by answering the questions about devices with the following metrics.

- Calculate the MTBF for each of the devices in the table.
- Calculate the availability for each of the devices in the table.
- What happens to availability as the MTTR approaches 0? Is this a realistic situation?
- What happens to availability as the MTTR gets very high, i.e., a device is difficult to repair? Does this imply the device has low availability?
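The device table is not reproduced above, but the two definitions these parts use are standard, so here is a minimal sketch with an explicitly hypothetical device (MTTF = 1000 hours, MTTR = 10 hours):

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean Time Between Failures = time to fail + time to repair/replace."""
    return mttf_hours + mttr_hours

def availability(mttf_hours, mttr_hours):
    """Fraction of time the device is operational: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical device, not from the original table:
a = availability(1000, 10)   # 1000 / 1010, roughly 0.99
```

The limiting cases in the last two parts fall out of the formula: as MTTR approaches 0 the ratio approaches 1, and a very large MTTR drives availability toward 0 unless MTTF grows even faster, so a hard-to-repair device can still be highly available if it almost never fails.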
This exercise examines the single-error-correcting, double-error-detecting (SEC/DED) Hamming code.

- What is the minimum number of parity bits required to protect a 128-bit word using the SEC/DED code?
- Section 5.5 states that modern server memory modules (DIMMs) employ SEC/DED ECC to protect each 64 bits with 8 parity bits. Compute the cost/performance ratio of this code relative to the code from 5.9.1. In this case, cost is the relative number of parity bits needed, while performance is the relative number of errors that can be corrected. Which is better?
- Consider a SEC code that protects 8-bit words with 4 parity bits. If we read the value 0x375, is there an error? If so, correct the error.
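The parity-bit counts follow from the Hamming condition that p check bits must name any of the d + p codeword positions or "no error", i.e. 2^p >= d + p + 1, with one extra overall parity bit for double-error detection. A sketch:

```python
def sec_parity_bits(data_bits):
    """Smallest p with 2**p >= data_bits + p + 1 (single error correction)."""
    p = 1
    while 2 ** p < data_bits + p + 1:
        p += 1
    return p

def sec_ded_parity_bits(data_bits):
    """SEC/DED adds one extra overall parity bit for double-error detection."""
    return sec_parity_bits(data_bits) + 1

# 8-bit words need 4 SEC bits (matching the 0x375 sub-question);
# 64-bit words need 7 + 1 = 8 SEC/DED bits (matching the DIMM example);
# 128-bit words need 8 + 1 = 9.
```

The 64-bit result reproducing the DIMM's 8 check bits is a useful sanity check that the formula is being applied as the chapter intends.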
For a high-performance system such as a B-tree index for a database, the page size is determined mainly by the data size and disk performance. Assume that on average a B-tree index page is 70% full with fixed-size entries. The utility of a page is its B-tree depth, calculated as \(\log_2(\text{entries})\). The following table shows that for 16-byte entries, and a 10-year-old disk with a 10 ms latency and 10 MB/s transfer rate, the optimal page size is 16K.

- What is the best page size if entries now become 128 bytes?
- Based on 5.10.1, what is the best page size if pages are half full?
- Based on 5.10.2, what is the best page size if using a modern disk with a 3 ms latency and 100 MB/s transfer rate? Explain why future servers are likely to have larger pages.
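The stated optimum can be reproduced by maximizing utility per unit of disk time, assuming cost is the time to fetch one page (latency plus transfer). A sketch of that interpretation:

```python
import math

def page_utility_per_cost(page_bytes, entry_bytes, fill=0.7,
                          latency_s=0.010, bandwidth_Bps=10e6):
    """Utility = log2(entries per page); cost = disk time to fetch the page."""
    entries = fill * page_bytes / entry_bytes
    utility = math.log2(entries)
    cost = latency_s + page_bytes / bandwidth_Bps
    return utility / cost

candidates = [2 ** k * 1024 for k in range(1, 8)]   # 2K .. 128K
best_16B = max(candidates, key=lambda p: page_utility_per_cost(p, 16))
best_128B = max(candidates, key=lambda p: page_utility_per_cost(p, 128))
# best_16B reproduces the table's 16K optimum; with 128-byte entries the
# optimum shifts up to 32K under the same assumptions.
```

The same function answers the remaining parts by varying `fill`, `latency_s`, and `bandwidth_Bps`: faster disks shrink the latency term relative to transfer time, which pushes the optimal page size upward, which is why future servers are expected to favor larger pages.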
Keeping "frequently used" (or "hot") pages in DRAM can save disk accesses, but how do we determine the exact meaning of "frequently used" for a given system? Data engineers use the cost ratio between DRAM and disk access to quantify the reuse time threshold for hot pages. The cost of a disk access is $Disk/accesses_per_sec, while the cost to keep a page in DRAM is $DRAM_MiB/page_size. The typical DRAM and disk costs and typical database page sizes at several time points are listed below:

- What are the reuse time thresholds for these three technology generations?
- What are the reuse time thresholds if we keep using the same 4K page size? What is the trend here?
- What other factors can be changed to keep using the same page size (thus avoiding software rewrite)? Discuss their likelihood given current technology and cost trends.
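The cost table itself is not reproduced above, so the sketch below only encodes one reading of the break-even formula, with made-up generation figures: a page reused every T seconds consumes 1/T of the disk's access-rate capacity, and the threshold is where that cost equals the DRAM cost of holding the page.

```python
def reuse_threshold_s(disk_dollars, accesses_per_sec,
                      dram_dollars_per_mib, page_bytes):
    """Break-even reuse interval: pages re-read more often than once per
    this many seconds are cheaper to keep in DRAM than to re-fetch from disk.
    This is one interpretation of the cost model, not the book's table."""
    cost_per_access_rate = disk_dollars / accesses_per_sec        # $ per (access/s)
    dram_page_cost = dram_dollars_per_mib * page_bytes / (1024 * 1024)
    return cost_per_access_rate / dram_page_cost

# Made-up generation: $1000 disk at 100 accesses/s, $100/MiB DRAM, 4 KiB pages.
t = reuse_threshold_s(1000, 100, 100, 4096)   # 25.6 seconds
```

Plugging in the table's three generations shows the trend the second part asks about: DRAM prices fall faster than disk access rates improve, so the threshold grows over time unless page sizes grow with it.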
As described in Section 5.7, virtual memory uses a page table to track the mapping of virtual addresses to physical addresses. This exercise shows how this table must be updated as addresses are accessed. The following data constitutes a stream of virtual addresses as seen on a system. Assume 4 KiB pages, a 4-entry fully associative TLB, and true LRU replacement. If pages must be brought in from disk, increment the next largest page number. 4669, 2227, 13916, 34587, 48870, 12608, 49225

- Given the address stream shown, and the initial TLB and page table states provided above, show the final state of the system. Also list for each reference if it is a hit in the TLB, a hit in the page table, or a page fault.
- Repeat 5.11.1, but this time use 16 KiB pages instead of 4 KiB pages. What would be some of the advantages of having a larger page size? What are some of the disadvantages?
- Show the final contents of the TLB if it is 2-way set associative. Also show the contents of the TLB if it is direct mapped. Discuss the importance of having a TLB to high performance. How would virtual memory accesses be handled if there were no TLB?
- There are several parameters that impact the overall size of the page table. Listed below are key page table parameters. Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available.
- Given the parameters shown above, calculate the total page table size for a system running 5 applications that utilize half of the memory available, given a two-level page table approach with 256 entries. Assume each entry of the main page table is 6 bytes. Calculate the minimum and maximum amount of memory required.
- A cache designer wants to increase the size of a 4 KiB virtually indexed, physically tagged cache. Given the page size shown above, is it possible to make a 16 KiB direct-mapped cache, assuming 2 words per block? How would the designer increase the data size of the cache?
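The first step of the TLB walk is splitting each virtual address into a virtual page number and offset, which can be checked mechanically for the given stream (the initial TLB and page-table tables are not reproduced here, so only the split is shown):

```python
PAGE_BYTES = 4096   # 4 KiB pages -> 12 offset bits

def split(vaddr, page_bytes=PAGE_BYTES):
    """Virtual address -> (virtual page number, page offset)."""
    return vaddr // page_bytes, vaddr % page_bytes

stream = [4669, 2227, 13916, 34587, 48870, 12608, 49225]
vpns = [split(a)[0] for a in stream]
# Page numbers 1, 0, 3, 8, 11, 3, 12: page 3 repeats, so the reference to
# 12608 is the only one that can hit in a TLB warmed solely by this stream.
```

Re-running with `page_bytes=16 * 1024` answers the 16 KiB variant: larger pages collapse more addresses onto the same page number, raising TLB hit rates at the cost of coarser allocation.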
In this exercise, we will examine space/time optimizations for page tables. The following list provides parameters of a virtual memory system.

- For a single-level page table, how many page table entries (PTEs) are needed? How much physical memory is needed for storing the page table?
- Using a multilevel page table can reduce the physical memory consumption of page tables by keeping only active PTEs in physical memory. How many levels of page tables will be needed in this case? And how many memory references are needed for address translation if the TLB misses?
- An inverted page table can be used to further optimize space and time. How many PTEs are needed to store the page table? Assuming a hash table implementation, what are the common-case and worst-case numbers of memory references needed for servicing a TLB miss?
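The parameter list this exercise refers to is not reproduced above, so the sketch below uses hypothetical but typical values (32-bit virtual addresses, 4 KiB pages, 4-byte PTEs) to show how the single-level size is derived:

```python
# Hypothetical parameters, not the book's table:
VA_BITS, PAGE_BYTES, PTE_BYTES = 32, 4096, 4

offset_bits = PAGE_BYTES.bit_length() - 1        # 12 bits of page offset
num_ptes = 2 ** (VA_BITS - offset_bits)          # one PTE per virtual page
table_bytes = num_ptes * PTE_BYTES               # flat single-level table
# 2^20 PTEs * 4 bytes = 4 MiB per process for a single-level table, which
# is exactly the overhead that multilevel and inverted tables attack.
```

With the book's actual parameters, substitute its virtual-address width, page size, and PTE size into the same two lines.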
The following table shows the contents of a 4-entry TLB.

- Under what scenarios would entry 2's valid bit be set to zero?
- What happens when an instruction writes to VA page 30? When would a software-managed TLB be faster than a hardware-managed TLB?
- What happens when an instruction writes to VA page 200?
Chapter 5: Problem 5 Computer Organization and Design 5
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0
Read more -
Chapter 5: Problem 5 Computer Organization and Design 5
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0 Assuming an LRU replacement policy, how many hits does this address sequence exhibit?
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0 Assuming an MRU (most recently used) replacement policy, how many hits does this address sequence exhibit?
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0 Simulate a random replacement policy by flipping a coin. For example, “heads” means to evict the first block in a set and “tails” means to evict the second block in a set. How many hits does this address sequence exhibit?
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0 Which address should be evicted at each replacement to maximize the number of hits? How many hits does this address sequence exhibit if you follow this “optimal” policy?
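The "optimal" policy here is Bélády's algorithm: on a miss, evict the resident block whose next use lies furthest in the future (a block never used again is the ideal victim). A sketch for the same 2-set, 2-way cache, again assuming one-word blocks:

```c
#include <assert.h>

#define NSETS 2

/* Index of the next reference to `block` at or after `from`;
 * returns n if the block is never referenced again. */
static int next_use(const int *addr, int n, int from, int block) {
    for (int i = from; i < n; i++)
        if (addr[i] == block) return i;
    return n;
}

static int count_optimal_hits(const int *addr, int n) {
    int blk[NSETS][2], valid[NSETS][2] = {{0, 0}, {0, 0}}, hits = 0;
    for (int i = 0; i < n; i++) {
        int set = addr[i] % NSETS, w;
        for (w = 0; w < 2; w++)
            if (valid[set][w] && blk[set][w] == addr[i]) break;
        if (w < 2) { hits++; continue; }
        int victim;                         /* prefer an empty way */
        if (!valid[set][0]) victim = 0;
        else if (!valid[set][1]) victim = 1;
        else victim = next_use(addr, n, i + 1, blk[set][1]) >
                      next_use(addr, n, i + 1, blk[set][0]) ? 1 : 0;
        valid[set][victim] = 1;
        blk[set][victim] = addr[i];
    }
    return hits;
}

static int demo_optimal(void) {
    static const int seq[] = {0, 2, 4, 8, 10, 12, 14, 16, 0};
    return count_optimal_hits(seq, 9);
}
```

For this sequence the optimal choice is to keep block 0 resident throughout, turning the final reference into the sequence's only hit.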
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0 Describe why it is difficult to implement a cache replacement policy that is optimal for all address sequences.
In this exercise, we will examine how replacement policies impact miss rate. Assume a 2-way set associative cache with 4 blocks. To solve the problems in this exercise, you may find it helpful to draw a table like the one below, as demonstrated for the address sequence “0, 1, 2, 3, 4.” Consider the following address sequence: 0, 2, 4, 8, 10, 12, 14, 16, 0 Assume you could make a decision upon each memory reference whether or not you want the requested address to be cached. What impact could this have on miss rate?
To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called “shadow paging” duplicates each virtual machine’s page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables \((\mathrm{VA} \Rightarrow \mathrm{PA}\text{ and }\mathrm{PA} \Rightarrow \mathrm{MA})\) and can walk such tables purely in hardware. Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch.
To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called “shadow paging” duplicates each virtual machine’s page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables \((\mathrm{VA} \Rightarrow \mathrm{PA}\text{ and }\mathrm{PA} \Rightarrow \mathrm{MA})\) and can walk such tables purely in hardware. Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch. What would happen for the given operation sequence for shadow page table and nested page table, respectively?
To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called “shadow paging” duplicates each virtual machine’s page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables \((\mathrm{VA} \Rightarrow \mathrm{PA}\text{ and }\mathrm{PA} \Rightarrow \mathrm{MA})\) and can walk such tables purely in hardware. Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch. Assuming an x86-based 4-level page table in both guest and nested page table, how many memory references are needed to service a TLB miss for native vs. nested page table?
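One common way to count the walk cost (assuming no page-walk caching, so every nested walk takes its full length): a native g-level walk costs g references, while under NPT each of the g guest PTE fetches yields a guest-physical address that itself needs an h-level nested walk, and so does the final guest-physical address of the data, giving \((g+1)(h+1)-1\) references in total.

```c
#include <assert.h>

/* Page-walk memory references, assuming no page-walk caches.
 * Native: one reference per level.  Nested (2D) walk: each of
 * the g guest PTE fetches costs h nested references plus the
 * PTE fetch itself, and the final guest-physical address costs
 * one more h-level nested walk. */
static int native_refs(int g)        { return g; }
static int nested_refs(int g, int h) { return (g + 1) * (h + 1) - 1; }
```

For 4-level guest and nested tables this gives 4 references natively and 24 under NPT.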
To support multiple virtual machines, two levels of memory virtualization are needed. Each virtual machine still controls the mapping of virtual address (VA) to physical address (PA), while the hypervisor maps the physical address (PA) of each virtual machine to the actual machine address (MA). To accelerate such mappings, a software approach called “shadow paging” duplicates each virtual machine’s page tables in the hypervisor, and intercepts VA to PA mapping changes to keep both copies consistent. To remove the complexity of shadow page tables, a hardware approach called nested page table (NPT) explicitly supports two classes of page tables \((\mathrm{VA} \Rightarrow \mathrm{PA}\text{ and }\mathrm{PA} \Rightarrow \mathrm{MA})\) and can walk such tables purely in hardware. Consider the following sequence of operations: (1) Create process; (2) TLB miss; (3) page fault; (4) context switch. Among TLB miss rate, TLB miss latency, page fault rate, and page fault handler latency, which metrics are more important for shadow page table? Which are important for nested page table?
Assume the following parameters for a shadow paging system. For a benchmark with native execution CPI of 1, what are the CPI numbers if using shadow page tables vs. NPT (assuming only page table virtualization overhead)?
Assume the following parameters for a shadow paging system. What techniques can be used to reduce page table shadowing induced overhead?
Assume the following parameters for a shadow paging system. What techniques can be used to reduce NPT induced overhead?
One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior.
One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior. Calculate the CPI for the system listed above assuming that there are no accesses to I/O. What is the CPI if the VMM performance impact doubles? If it is cut in half? If a virtual machine software company wishes to obtain a 10% performance degradation, what is the longest possible penalty to trap to the VMM?
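The parameter table is not reproduced in this extract, so the numbers below are placeholders; the structure of the calculation, though, is simply that each trap to the VMM adds its penalty to the base CPI, weighted by how often traps occur:

```c
#include <assert.h>

/* CPI under virtualization (illustrative sketch):
 *   CPI_vm = CPI_base + (traps per instruction) * (cycles per trap)
 * The parameters are placeholders, not the book's table. */
static double cpi_vm(double cpi_base, double traps_per_instr,
                     double trap_penalty) {
    return cpi_base + traps_per_instr * trap_penalty;
}
```

For example, with a hypothetical base CPI of 1.5, one trap every 2500 instructions, and a 1000-cycle trap penalty, the virtualized CPI would be 1.5 + (1/2500)·1000 = 1.9.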
One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior. I/O accesses often have a large impact on overall system performance. Calculate the CPI of a machine using the performance characteristics above, assuming a non-virtualized system. Calculate the CPI again, this time using a virtualized system. How do these CPIs change if the system has half the I/O accesses? Explain why I/O bound applications have a smaller impact from virtualization.
One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior. Compare and contrast the ideas of virtual memory and virtual machines. How do the goals of each compare? What are the pros and cons of each? List a few cases where virtual memory is desired, and a few cases where virtual machines are desired.
One of the biggest impediments to widespread use of virtual machines is the performance overhead incurred by running a virtual machine. Listed below are various performance parameters and application behavior. Section 5.6 discusses virtualization under the assumption that the virtualized system is running the same ISA as the underlying hardware. However, one possible use of virtualization is to emulate non-native ISAs. An example of this is QEMU, which emulates a variety of ISAs such as MIPS, SPARC, and PowerPC. What are some of the difficulties involved in this kind of virtualization? Is it possible for an emulated system to run faster than on its native ISA?
In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your own finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block. Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to main memory while the processor is working.
In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your own finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block. Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to main memory while the processor is working. What should happen if the processor issues a request that hits in the cache while a block is being written back to main memory from the write buffer?
In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your own finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block. Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to main memory while the processor is working. What should happen if the processor issues a request that misses in the cache while a block is being written back to main memory from the write buffer?
In this exercise, we will explore the control unit for a cache controller for a processor with a write buffer. Use the finite state machine found in Figure 5.40 as a starting point for designing your own finite state machines. Assume that the cache controller is for the simple direct-mapped cache described on page 465 (Figure 5.40 in Section 5.9), but you will add a write buffer with a capacity of one block. Recall that the purpose of a write buffer is to serve as temporary storage so that the processor doesn’t have to wait for two memory accesses on a dirty miss. Rather than writing back the dirty block before reading the new block, it buffers the dirty block and immediately begins reading the new block. The dirty block can then be written to main memory while the processor is working. Design a finite state machine to enable the use of a write buffer.
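One way to sketch the extended controller in code (state names and the single-entry buffer flag are our own simplification, not the book's figure): on a dirty miss the victim block goes to the write buffer and the controller moves directly to Allocate, stalling only when the buffer is still draining from an earlier dirty miss.

```c
#include <assert.h>

/* States extending the four-state FSM of Figure 5.40, with the
 * Write-Back state replaced by a wait-on-buffer state. */
typedef enum { IDLE, COMPARE_TAG, ALLOCATE, WAIT_BUFFER } State;

typedef struct { int buffer_full; } Cache;   /* one-block write buffer */

/* Next state after COMPARE_TAG, given hit/dirty status. */
static State after_compare(const Cache *c, int hit, int victim_dirty) {
    if (hit) return IDLE;                    /* hit: request complete */
    if (!victim_dirty) return ALLOCATE;      /* clean miss: just fetch */
    return c->buffer_full ? WAIT_BUFFER      /* buffer busy: must drain */
                          : ALLOCATE;        /* buffer the dirty block */
}

static State demo(int buffer_full, int hit, int dirty) {
    Cache c = { buffer_full };
    return after_compare(&c, hit, dirty);
}
```

The key point the sketch captures: a dirty miss no longer serializes a write-back before the read, except in the corner case where the buffer has not yet drained.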
Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits.
Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits. List the possible values of the given cache block for a correct cache coherence protocol implementation. List at least one more possible value of the block if the protocol doesn’t ensure cache coherency.
Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits. For a snooping protocol, list a valid operation sequence on each processor/cache to finish the above read/write operations.
Cache coherence concerns the views of multiple processors on a given cache block. The following data shows two processors and their read/write operations on two different words of a cache block X (initially X[0] = X[1] = 0). Assume the size of integers is 32 bits. What are the best-case and worst-case numbers of cache misses needed to execute the listed read/write instructions?
Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0). List the possible values of C and D for an implementation that ensures both consistency assumptions on page 470.
Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0). List at least one more possible pair of values for C and D if such assumptions are not maintained.
Memory consistency concerns the views of multiple data items. The following data shows two processors and their read/write operations on different cache blocks (A and B initially 0). For various combinations of write policies and write allocation policies, which combinations make the protocol implementation simpler?
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies:
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies: Which cache design is better for each of these benchmarks? Use data to support your conclusion.
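The benchmark table is not reproduced in this extract, so the latencies and miss rates below are illustrative placeholders; the comparison itself reduces to per-instruction L2 stall time.

```c
#include <assert.h>

/* Per-instruction L2 stall time (a sketch with placeholder numbers):
 *   (L1 misses per instruction)
 *     * (L2 hit latency + L2 miss rate * off-chip memory latency)
 * A private L2 typically has lower hit latency but a higher miss
 * rate than a shared L2 of the same total capacity. */
static double l2_stall_per_instr(double l1_mpi, double l2_hit_lat,
                                 double l2_miss_rate, double mem_lat) {
    return l1_mpi * (l2_hit_lat + l2_miss_rate * mem_lat);
}
```

With one L1 miss per 32 instructions and hypothetical values (private L2: 5-cycle hit, 10% miss rate; shared L2: 20-cycle hit, 4% miss rate; 100-cycle memory), the private design stalls (1/32)·15 ≈ 0.47 cycles per instruction versus (1/32)·24 = 0.75 for the shared design; the same formula with the book's table decides each benchmark.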
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies: Shared cache latency increases with the CMP size. Choose the best design if the shared cache latency doubles. Off-chip bandwidth becomes the bottleneck as the number of CMP cores increases. Choose the best design if off-chip memory latency doubles.
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies: Discuss the pros and cons of shared vs. private L2 caches for single-threaded, multi-threaded, and multiprogrammed workloads, and reconsider them assuming on-chip L3 caches.
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies: Assume both benchmarks have a base CPI of 1 (ideal L2 cache). If having non-blocking cache improves the average number of concurrent L2 misses from 1 to 2, how much performance improvement does this provide over a shared L2 cache? How much improvement can be achieved over private L2?
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies: Assume new generations of processors double the number of cores every 18 months. To maintain the same level of per-core performance, how much more off-chip memory bandwidth is needed for a processor released in three years?
Chip multiprocessors (CMPs) have multiple cores and their caches on a single chip. CMP on-chip L2 cache design has interesting trade-offs. The following table shows the miss rates and hit latencies for two benchmarks with private vs. shared L2 cache designs. Assume L1 cache misses once every 32 instructions. Assume the following hit latencies: Consider the entire memory hierarchy. What kinds of optimizations can improve the number of concurrent misses?
In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows: Assume the following processing function for the log:
In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows: Assume the following processing function for the log: Which fields in a log entry will be accessed for the given log processing function? Assuming 64-byte cache blocks and no prefetching, how many cache misses per entry does the given function incur on average?
In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows: Assume the following processing function for the log: How can you reorganize the data structure to improve cache utilization and access locality? Show your structure definition code.
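The book's structure definition is not reproduced in this extract, so the hypothetical entry below only illustrates the usual fix, hot/cold field splitting: if the processing function touches just a couple of small fields, packing those "hot" fields into their own dense array lets one 64-byte cache block hold many entries instead of a fraction of one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical log-entry fields, not the book's definition.
 * The large string fields are "cold": rarely touched, so they
 * are kept out of the array the hot loop scans. */
struct entry_cold { char user_agent[256]; char referer[256]; };
struct entry_hot  { uint16_t status; uint32_t bytes; };

struct log {
    struct entry_hot  *hot;   /* scanned sequentially, cache-friendly */
    struct entry_cold *cold;  /* same index, touched only on demand */
};
```

With these placeholder fields `struct entry_hot` is 8 bytes after padding, so a 64-byte block holds eight entries' worth of hot data, versus well under one entry for the combined layout.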
In this exercise we show the definition of a web server log and examine code optimizations to improve log processing speed. The data structure for the log is defined as follows: Assume the following processing function for the log: Give an example of another log processing function that would prefer a different data structure layout. If both functions are important, how would you rewrite the program to improve the overall performance? Supplement the discussion with code snippet and data.
For the problems below, use data from “Cache Performance for SPEC CPU2000 Benchmarks” (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table. For 64 KiB data caches with varying set associativities, what are the miss rates broken down by miss types (cold, capacity, and conflict misses) for each benchmark?
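A common way to produce that breakdown (the 3C model) classifies each miss as: cold if the block was never referenced before, capacity if a fully associative LRU cache of the same total size would also miss, and conflict otherwise. A tiny sketch with illustrative sizes, using a direct-mapped cache as the target:

```c
#include <assert.h>

#define CSIZE 4                    /* blocks in the cache */

static int seen[1024];             /* ever-referenced block addresses */

/* Fully associative LRU of CSIZE blocks; returns 1 on hit. */
static int fa_access(int *lru, int b) {
    int i;
    for (i = 0; i < CSIZE && lru[i] != b; i++)
        ;
    int hit = (i < CSIZE);
    if (i == CSIZE) i = CSIZE - 1; /* miss: evict the LRU slot */
    for (; i > 0; i--) lru[i] = lru[i - 1];
    lru[0] = b;                    /* move to MRU position */
    return hit;
}

static void classify(const int *addr, int n,
                     int *cold, int *cap, int *conf) {
    int dm[CSIZE], dmv[CSIZE] = {0};
    int lru[CSIZE];
    for (int i = 0; i < CSIZE; i++) lru[i] = -1;
    *cold = *cap = *conf = 0;
    for (int i = 0; i < n; i++) {
        int b = addr[i], set = b % CSIZE;
        int dm_hit = dmv[set] && dm[set] == b;
        int fa_hit = fa_access(lru, b);
        int first  = !seen[b];
        seen[b] = 1;
        if (!dm_hit) {
            if (first)        (*cold)++;
            else if (!fa_hit) (*cap)++;
            else              (*conf)++;
        }
        dmv[set] = 1;
        dm[set] = b;
    }
}

/* Blocks 0 and 4 collide in the direct-mapped cache but fit
 * together in the fully associative one, so their repeat misses
 * are conflicts.  Packs counts as conf*100 + cap*10 + cold. */
static int demo_3c(void) {
    static const int trace[] = {0, 4, 0, 4};
    int cold, cap, conf;
    classify(trace, 4, &cold, &cap, &conf);
    return conf * 100 + cap * 10 + cold;
}
```

The same classification, run over the SPEC traces at the cited URL, yields the per-benchmark breakdowns the problem asks for.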
For the problems below, use data from “Cache Performance for SPEC CPU2000 Benchmarks” (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table. Select the set associativity to be used by a 64 KiB L1 data cache shared by both benchmarks. If the L1 cache has to be directly mapped, select the set associativity for the 1 MiB L2 cache.
For the problems below, use data from “Cache Performance for SPEC CPU2000 Benchmarks” (http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/) for the pairs of benchmarks shown in the following table. Give an example in the miss rate table where higher set associativity actually increases miss rate. Construct a cache configuration and reference stream to demonstrate this.