
General Notes on Cache Hierarchy


The widening gap between the latencies of the processor, DRAM, and secondary storage devices (hard disks) has created the need for a memory hierarchy organized by speed, size, and cost. Historically, the speed gap between the processor, clocked on the order of gigahertz, and main memory has grown exponentially. This problem is sometimes called the "memory wall" (previously the "memory gap"). A hierarchy of caches of various sizes and speeds is employed to alleviate the speed gap and reduce latency.
Caches are made of cache lines, each of which is mapped to, and hosts the data of, a memory block of the same size. Since a cache is much smaller than main memory, a particular cache line may be mapped to different memory locations at different times. If the address requested by the processor is found in the cache at some level, the access is a cache hit; if not, it is a cache miss.

[Figure omitted. Source: UC Berkeley]
Cache Hierarchies
Caches are designed under the observation that the addresses requested by the processor during the execution of a program are highly predictable. Consider the pattern exhibited by the following simple loop:
loop: ADD  R2, R1, R1   ; the same three instructions are fetched on every iteration
      SUBI R3, R3, #1   ; decrement the loop counter
      BNEZ R3, loop     ; branch back to loop while R3 is not zero
Observe that locations already referenced are likely to be referenced again in the near future (temporal locality), and that nearby locations are also likely to be referenced in the near future (spatial locality). Caches exploit temporal and spatial locality by remembering recently accessed locations and by prefetching the blocks around them.

A cache is made of two memories: a directory memory and a data memory. The directory memory contains the ID (tag) of the memory block residing in each cache line and some state bits, including at least a validity bit. The data memory contains the cached copy of the memory block. The mapping between a memory block and a cache line is based on the block address. There are three types of cache mapping: direct mapping, set-associative mapping, and fully associative mapping.
Direct mapped caches:
In a direct-mapped cache, a given memory block is always mapped to the same cache line by a hash function. The simplest hash function takes the least significant bits of the address: the lowest bits select a byte within the line, and the next bits select one of the cache lines. The remaining, most significant bits are stored as the address tag of that cache line.
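
To make the mapping concrete, the following minimal C sketch splits an address into tag, index, and offset for a hypothetical direct-mapped cache with 64-byte lines and 256 lines; the geometry and the names are illustrative assumptions, not taken from any particular design.

#include <stdint.h>
#include <stdio.h>

/* Assumed geometry: 64-byte lines (6 offset bits), 256 lines (8 index bits). */
enum { LINE_BYTES = 64, NUM_LINES = 256 };

int main(void) {
    uint32_t addr = 0x1234ABCDu;

    uint32_t offset = addr % LINE_BYTES;                /* byte within the line    */
    uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;  /* which cache line        */
    uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);  /* stored in the directory */

    printf("addr=0x%08X -> tag=0x%X, index=%u, offset=%u\n",
           (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}

With these sizes, the low 6 bits are the byte offset, the next 8 bits are the line index, and the remaining upper bits form the tag.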


Read accesses proceed as follows:
1. Cache indexing. The directory entry is fetched while the data memory is accessed in parallel, using the line index (plus the block offset for the data memory). Cache indexing is carried out at standard SRAM speed.
2. Tag checking. The tag stored in the cache line is compared with the corresponding most significant bits of the block address. If they match, a validity check is performed on the state bits. If both checks succeed, the access is a cache hit and the data is forwarded to the requester; otherwise a cache miss is triggered.
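
The two steps can be sketched in C as follows; the structure layout, the function name cache_read_byte, and the geometry carried over from the previous sketch are all assumptions.

#include <stdbool.h>
#include <stdint.h>

enum { LINE_BYTES = 64, NUM_LINES = 256 };

typedef struct {
    bool     valid;              /* state bit: the line holds valid data     */
    uint32_t tag;                /* directory memory: ID of the cached block */
    uint8_t  data[LINE_BYTES];   /* data memory: cached copy of the block    */
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Returns true on a hit and copies one byte into *out; false signals a miss. */
bool cache_read_byte(uint32_t addr, uint8_t *out) {
    uint32_t offset = addr % LINE_BYTES;
    uint32_t index  = (addr / LINE_BYTES) % NUM_LINES;
    uint32_t tag    = addr / (LINE_BYTES * NUM_LINES);

    cache_line_t *line = &cache[index];       /* step 1: cache indexing           */
    if (line->valid && line->tag == tag) {    /* step 2: tag check + validity bit */
        *out = line->data[offset];
        return true;                          /* cache hit                        */
    }
    return false;                             /* cache miss                       */
}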
Set-associative caches:
The advantage of a direct-mapped cache is its fast access time on a hit. However, many memory locations map to the same cache line, which can cause high miss rates when a large number of memory blocks compete for the same line. To mitigate this problem, most caches are set-associative. A set-associative cache groups cache lines into sets; each memory block is mapped to a single set but can reside in any line of that set. In a sense, this is similar to a hash table whose buckets each hold a small, fixed number of entries.

[Figure omitted. Source: MIT 6.823]
During a read access, the tags of all lines in the set must now be checked for a match (together with their validity bits). The rest of the read-access procedure is the same as in a direct-mapped cache, which can be viewed as a one-way set-associative cache. The number of cache lines per set (the associativity) is typically between 2 and 8. Too many lines per set require complex decoding and checking logic that increases the read latency; a better solution is to increase the number of sets instead.
A write access in a set-associative cache takes at least two cycles. The first cycle performs the tag and validity check; if it hits, the second cycle writes the data. These two cycles can be pipelined if write accesses occur in bursts.
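
The set lookup and the two-cycle write hit described above might be sketched as follows; the 4-way, 64-set geometry and all names are assumptions, and the parallel tag comparison done in hardware is modelled as a loop.

#include <stdbool.h>
#include <stdint.h>

enum { LINE_BYTES = 64, NUM_SETS = 64, WAYS = 4 };   /* assumed 16 KiB, 4-way cache */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} line_t;

static line_t cache[NUM_SETS][WAYS];

/* First cycle of any access: index the set and compare the tags of all ways
 * (in parallel in hardware). Returns the matching way, or -1 on a miss.     */
static int lookup(uint32_t addr) {
    uint32_t set = (addr / LINE_BYTES) % NUM_SETS;
    uint32_t tag = addr / (LINE_BYTES * NUM_SETS);
    for (int way = 0; way < WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return way;
    return -1;
}

bool read_byte(uint32_t addr, uint8_t *out) {
    int way = lookup(addr);
    if (way < 0)
        return false;                                           /* read miss */
    *out = cache[(addr / LINE_BYTES) % NUM_SETS][way].data[addr % LINE_BYTES];
    return true;
}

bool write_byte(uint32_t addr, uint8_t value) {
    int way = lookup(addr);                                     /* cycle 1: tag and validity check */
    if (way < 0)
        return false;                                           /* write miss */
    cache[(addr / LINE_BYTES) % NUM_SETS][way]
        .data[addr % LINE_BYTES] = value;                       /* cycle 2: data write */
    return true;
}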
Fully associative caches:
The organization of a fully associative cache is drastically different from that of a set-associative cache. A memory block can be mapped anywhere in a fully associative cache, and as a result its tag is the entire block address. A read or write access takes two cycles. In the first cycle, the block address is compared with the tags of all cache lines in parallel. On a hit, the data is read from or written to the matching cache line in the second cycle. A fully associative cache can be seen as a set-associative cache consisting of a single set.
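
A fully associative lookup can be sketched the same way, with the whole block address serving as the tag and every line compared; the 32-line size is an arbitrary assumption.

#include <stdbool.h>
#include <stdint.h>

enum { LINE_BYTES = 64, NUM_LINES = 32 };   /* a single set of 32 lines */

typedef struct {
    bool     valid;
    uint32_t tag;                /* the entire block address, since there is no index */
    uint8_t  data[LINE_BYTES];
} line_t;

static line_t cache[NUM_LINES];

/* Cycle 1: compare the block address against every tag (in parallel in hardware). */
int fa_lookup(uint32_t addr) {
    uint32_t block = addr / LINE_BYTES;     /* block address = tag */
    for (int i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].tag == block)
            return i;                       /* cycle 2 reads or writes cache[i] */
    return -1;                              /* miss */
}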


Replacement Policies
When a requested memory block is not cached, a cache miss is triggered. The missing block must then be brought into the cache, replacing a resident block. In a direct-mapped cache, replacement is straightforward because only one cache location is possible. In an associative cache, however, there are many possible candidates for replacement. The procedure that selects a victim block in an associative cache is called the replacement policy.
The simplest replacement policy is random: it requires no access history to make a decision, but it is clearly not optimal. Most replacement policies attempt to reduce future miss rates. The LRU (least recently used) policy relies on the temporal locality of accesses to each block and keeps track of the order in which the blocks of a set were last referenced; it replaces the block accessed the longest time ago. LRU is easy to maintain when the set size is small. One problem with LRU is that the history bits must be accessed (and updated) on every cache access, be it a hit or a miss. Other policies, such as FIFO (first in, first out), are used in highly associative caches.
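
A minimal sketch of LRU bookkeeping for a single 4-way set, assuming one access counter per line; real hardware usually uses more compact encodings such as pseudo-LRU.

#include <stdint.h>

enum { WAYS = 4 };

/* Per-set LRU state: last_use[w] records when way w was last accessed. */
static uint64_t last_use[WAYS];
static uint64_t now;

/* Called on every access, hit or miss, for the way that was touched. */
void touch(int way) {
    last_use[way] = ++now;
}

/* Called on a miss: the victim is the way whose last access is oldest. */
int select_victim(void) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_use[w] < last_use[victim])
            victim = w;
    return victim;
}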
Write Policies
Write-through caches:
In write-through caches, all stores update the lower-level cache. A store buffer is employed to hide the latency of accesses to the lower level. The store buffer keeps pending store values and addresses until the controller moves them to the next level when the bus is available. The processor only needs to stall when the buffer becomes full, to prevent overflow. To maintain coherence in the hierarchy, a load miss must first look up the store buffer for pending stores to the same address. If a match is detected, the load must either return the latest value from the store buffer or wait for the matching store to retire before accessing the next-level cache. Write-through caches generate a large amount of write traffic but simplify cache coherence; they are preferred for small first-level caches. For larger caches, the write-back policy is preferred.
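
The store-buffer lookup performed on a load miss might look like the following sketch; the 8-entry buffer, the word-sized entries, and the names are assumptions, and entries are assumed to be held in program order.

#include <stdbool.h>
#include <stdint.h>

enum { SB_ENTRIES = 8 };

/* Pending stores waiting for the bus to the next level, oldest first. */
typedef struct {
    bool     valid;
    uint32_t addr;
    uint32_t value;
} sb_entry_t;

static sb_entry_t store_buffer[SB_ENTRIES];

/* On a load miss, scan for pending stores to the same address; the latest one wins.
 * If none matches, the load miss proceeds to the next-level cache.                 */
bool store_buffer_forward(uint32_t addr, uint32_t *value) {
    bool found = false;
    for (int i = 0; i < SB_ENTRIES; i++) {
        if (store_buffer[i].valid && store_buffer[i].addr == addr) {
            *value = store_buffer[i].value;
            found = true;
        }
    }
    return found;
}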

Write-back caches:
In a write-back cache, stores to the next level are deferred until the modified cache block is victimized by the replacement policy. A dirty bit is associated with each cache line; it is set when a store modifies the line and reset every time a block is loaded into the line on a miss. When a dirty block is selected by the replacement policy, it must be written back to the next level; a clean block is not written back upon replacement. On a load miss, a cache line is first allocated by victimizing a block, and the miss request is sent to the next level. While the load miss is pending, if the victim has been modified, it is saved in a write-back buffer in order to free the cache line before the missing block returns. The write-back buffer is then written to the next level in the background.
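
A sketch of the eviction path driven by the dirty bit, with the victim parked in a write-back buffer; the helper name evict, the single-entry buffer, and leaving the background drain abstract are all assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { LINE_BYTES = 64 };

typedef struct {
    bool     valid;
    bool     dirty;                 /* set by stores, cleared when a block is loaded on a miss */
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} line_t;

typedef struct {
    bool     busy;
    uint32_t tag;
    uint8_t  data[LINE_BYTES];
} wb_buffer_t;

static wb_buffer_t wb;              /* drained to the next level in the background */

/* On a load miss: free the victim line before the missing block returns. */
void evict(line_t *victim) {
    if (victim->valid && victim->dirty) {
        wb.busy = true;                               /* park the dirty block ...     */
        wb.tag  = victim->tag;
        memcpy(wb.data, victim->data, LINE_BYTES);    /* ... in the write-back buffer */
    }
    victim->valid = false;                            /* a clean block is simply dropped */
    victim->dirty = false;
}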

