Simulation configurations for two-level exclusive cache.
Data from main memory are processed in the CPU of computer in any application like database-management system (DBMS), numerical applications. The computation time can be improved by the addition of processor caches. The cache takes advantage of the locality of reference in any application/calculation. This chapter discusses the two prevalent cache architectures, namely inclusive and exclusive. A new cache architecture for data only called two-type data cache proposed in the literature is presented in the following section. The performance of two-type data cache model is compared with inclusive and exclusive architectures. The energy consumed by inclusive and exclusive caches is mentioned. Methods to reduce the energy consumption are proposed for inclusive and exclusive cache architectures. The hardware and software methods for energy saving are proposed. The proposed models are simulated using SPEC2000 benchmarks. The benchmarks are for compression, combinatorial optimization, word processing, place and route simulator, object oriented database, field-programmable gate array (FPGA) circuit placement and routing. The results are presented.
- average memory access time
- exclusive cache
- inclusive cache
- two-type data cache
The computer has three major parts, namely, processor, memory and input–output. The modern computer is based on Von Neumann architecture of stored program concept. Any application/program is fetched into main memory and executed. The program has instructions. Any instruction is fetched into the processor (CPU) and executed based on pipeline concepts. The instructions act on data. Any program has 80% of its instructions in loops. Repeated access of data gives the concept of locality of reference. If any data is accessed repeatedly it has temporal locality. Data that are in vicinity of access are said to have spatial locality. Taking advantage of locality of reference, the concept of caches was introduced in computers. The data/instruction from main memory is fetched into cache before accessed by the CPU. This improves the performance. The memory hierarchy thus runs from registers, caches, main memory, secondary memory, tertiary memory, etc. The performance of memory system is measured as average memory access time (AMAT). Various cache levels are prevalent in modern processors.
A processor cache is denoted by the tuple (C, k, L) where C is the capacity, k the associativity and L the line size. Based on the various values of k, three types of caches are known. These are direct mapped cache with k = 1, set associative cache with k > 1, fully associative cache with one set and n blocks. An address a is mapped to set given by a mod S, with tag value a div S for S sets. If a line is present in cache, it is cache hit, else it is cache miss. Cache misses are of three kinds: cold, capacity and conflict. A cache miss for first occurrence is called cold miss. The difference between misses in cache and fully associative cache of same capacity is the capacity miss. If a line maps to occupied set or way, it is called conflict miss. A computer system usually has many cache levels. A line can reside in cache level and higher cache levels in inclusive caches. A line resides in only one cache level in exclusive caches. Usually processors have caches dedicated to instructions and data separately. These are called instruction cache and data cache, respectively. Certain systems have same cache for both instruction and data. These are called unified caches. A system with caches of two or more kinds (direct mapped, set associative, fully associative) is called multilateral or hybrid cache. As the address mapping may vary with each cache level, the average memory access time (AMAT) is used to measure the cache performance.
As the number of computer components active increases, the energy consumed also increases. The power consumed by cache depends on number of active components. The energy is given as E = power × time. The energy is given by the formula E = for electronic component where c is the capacity, v the voltage and f the frequency.
The performance of CPU caches is measured by execution time for various applications. Benchmarks are used to measure the CPU performance. The SPEC2000 benchmarks are one of standard benchmarks. The integer benchmark suites of SPEC2K are given as follows:
|300.twolf||Place and route simulator|
|175.vpr||FPGA circuit placement and routing|
The memory performance is improved by adding caches. The inclusive, exclusive and two-type data cache models are presented in this chapter. The proposed models are simulated using SPEC2000 benchmarks. The benchmarks are run using Simplescalar Toolkit for simulations.
2. Inclusive caches
Consider cache system of n cache levels, main memory. Let the cache levels are . Let the cache be inclusive. Then, . Denote this system as . This is shown in Figure 1. The cache sizes grow with the level number [1, 2]. Consider three-level cache system.
Let the levels be . Let an address trace have R references. Let be number of hits in level one, level two, level three, respectively. Let be the access time to level one, level two, level three, transfer time between level two and level one and transfer time between level three and level two cache levels, respectively. Let M be the miss penalty. The average memory access time is given by
The first three terms in Eq. (1) are the access time of level one, level two and level three cache hits. The last term is the miss penalty. This expression can be extended to any number of cache levels.
Energy consumed in cache depends on number of active components. The individual lines can be selectively switched on in caches using certain software techniques or hardware circuits. The total power consumed can be reduced through this technique. Consider w-way set associative cache of S sets. Let the power consumed per line be p watts. The total power consumed is wpS. Consider a circuit which enables lines if occupied. This is shown in Figure 2. If the power consumed by the circuit is q watts, the number of occupied lines is y, the total power consumed is q + yp. An improvement in power consumption is observed if q + yp < wpS.
Power saving using software techniques involves mapping lines to fixed ways by address mapping techniques .
3. Exclusive caches
Exclusive caches have a line in one cache level only. There is no containment property of the inclusive caches. Let the cache levels be .The exclusive cache has the property that , for i not equal to j. The exclusive cache is depicted in Figure 3.
Consider cache with two cache levels. The placement, replacement algorithm as proposed by Jouppi and Wilton  is given as follows:
Check if line is present in level one. If present, access the line and stop.
Check if line is present in level two. If present, swap with level one cache line, access and stop.
If line is not present in cache, put the line in level one cache evicting the victim in first level cache to second level cache.
The number of sets in both levels has to be equal in the abovementioned design. Let be level one hits and level two hits in trace of R references. Let be access time to level one, level two, transfer time between level one and level two caches, transfer time between level one and main memory. Let M be miss penalty. Denote this exclusive cache system as . The average memory access time is given by Eq. (2).
The first term in Eq. (2) is level one hit time. The second term is level two hit time. The factor of 2 in this expression is because of swapping of the lines. The third term is the miss penalty.
Another logic to realize exclusive caches is proposed by Subha . Consider two-level cache system. The placement, replacement logic is given as follows:
Initialize all lines to be in level zero.
If the line is present in level one or level two cache (cache hit), let the line be in logical level one and stop.
Check if level one cache line is free. If so, place the line in level one cache, consider it as logical level one cache and stop.
Check if level two cache line is free. If so, place the line in level two cache; consider it as logical level one cache and stop.
Check the status of physical level one and physical level two caches. If physical level one cache has status of logical level two and vice-versa, place the block in level one cache and change its status to logical level one, change the status of physical level two cache to logical level two cache and stop. Else, place the block in physical level two, treat physical level one as logical level two, physical level two as logical level one and stop. [Put in logical level two, flip the level indices].
The abovementioned model does not require the two cache levels to have equal sets. There is a path between main memory and level two. Let be the decrease in level one hits from model proposed in , increase in level two hits from model proposed in , transfer time between level two and main memory. Let this exclusive cache system be denoted as . The average memory access time of this system is given by Eq. (3).
where are level one and level two hits and . An improvement in AMAT is observed given by Eq. (4).
The AMAT is depicted in Figure 4.
The power consumed by exclusive depends on the number of active cache lines. One method to reduce the number of active lines is to have separate cache called tag cache . The tag cache contains the tag values of all cache levels. The address mapping proceeds as follows in the tag cache.
Compute the following
Set1 = a mod .
Tag1 = a div .
Set2 = a mod .
Tag2 = a div .
Check in tag cache for the matching of Tag1. If match is found, the line is present in level one cache as it is level one hit condition. Access the line in level one cache and stop. If there is level one cache miss, check for level two cache hit by inspecting match of Tag2. If there is level two cache hit, access the line in level two cache and stop.
This step is for cache miss condition. Place the least recently used line in Set2 of level two in main memory. Transfer the least recently used line of Set1 in level one cache in the evicted level two line. Place the line with address a in level one cache. Update the entries in the tag cache.
In the abovementioned algorithm, a level one cache line or higher level cache line is enabled only on cache hit or cache miss in all levels. This saves the energy consumed by the cache system. The cache is exclusive in nature and the exclusive algorithm proposed by Jouppi and Wilton  is used. The proposed model has scalability. The architecture is shown in Figure 5.
Let trace be R references. Let be he hits in level one cache, hits in level two cache, misses filled in vacant level one cache, misses filled in vacant level two cache, conflict misses in level one and level two caches, respectively. Let be tag cache access time, level one cache access time, level two cache access time, transfer time between level one and main memory, transfer time between level two and main memory, transfer time between level one and level caches, respectively. Let the tag cache system be denoted as . The average memory access time is given by Eq. (5).
The tag cache access time is the first term in Eq. (5). The hit time accesses to level one and level two caches are given by second and third terms, respectively. The time taken to fill vacant way in level one cache is given by the fourth term. This involves accessing tag cache in level one, fetching the line to level one cache and updating the tag cache entry. The time taken to fill vacant second level cache line is given by the fifth term. This includes accessing the tag cache to check for match in level one cache and level two cache and placing the line in level two cache, updating the tag cache entry. The time taken to replace existing line is given by the sixth term. This involves checking for tag match in level one cache, level two cache, replacing the level one cache line and updating the tag cache entries. As the tag cache contains the tags in consecutive locations, it may be the case that the tag entries in level one and level two are in two different cache blocks.
The energy consumed in exclusive tag cache is calculated in the following section. Let us assume that the cache operates in two modes: high-power mode and low-power mode. On accessing cache way, its corresponding set is placed in high-power mode from low-power mode. Let , be the energy consumed by the cache way in the proposed tag cache model in high-power mode and low-power mode, respectively. Let be the energy consumed by one cache line in tag cache in high-power mode and low-power mode, respectively. Let be the difference in energy level for cache way and tag cache way between the two modes of operation, respectively. When no cache operation is performed, the energy consumed in the cache system is for -way set associative cache at level one and -set associative cache in level two. For level one cache hit, the energy consumed is . This is because the tag cache entry and the set in level one cache are enabled in high energy mode. For level two cache hit, the energy consumed is . This is because the tag cache is searched for match in level one cache and level two cache and the level two cache set containing the line is enabled. For the unfilled level one cache situation, the energy consumed is as the tag cache is searched to confirm that it is a free level one cache and the way in the level one cache is filled with the block. For the free level two cache way, the energy consumed is . This is because the tag cache entry in level one and level two caches are searched to confirm that there is free level two cache way, and the corresponding level two cache set is enabled to fill the way. For a miss, the total energy consumed is given by . This is because the tag entries for level one and level two caches are enabled, and the line replaces a filled level one cache set and level two cache set. Consider a program with R references. The total energy consumed for this address trace is given by Eq. (6).
The first term in Eq. (6) is the energy consumed when the cache is not accessed. The second term is the energy consumed in level one hits. This is equal to enabling the line in tag cache and accessing the ways in the level one cache for the set. The third term gives the energy consumed for level two cache hits. The fourth term is the energy consumed for filling a vacant level one cache line. The fifth term is the energy consumed to fill a vacant level two cache line. The sixth term is the energy consumed to replace an existing line in level one cache on a conflict miss. Consider a traditional exclusive cache with the same algorithm given in Section 3. The energy consumed for R references is calculated as follows. Let be the energy consumed per way in the cache system. All the sets in both the cache levels are enabled in high energy power mode for all references. The energy consumed is given by Eq. (7).
A saving in energy consumption is observed as given by Eq. (8).
The simulations for the above proposed tag cache model of exclusive cache is presented in the following section . The simulation parameters are given in Table 3. The energy consumed in low-power mode is 5 J and high-power mode is 15 J. The AMAT was calculated using C routines. Figure 6 gives the AMAT and Figure 7 gives the energy consumed. As seen from Figure 6, the AMAT performance is comparable with traditional exclusive cache. The energy is saved by 23% as seen from Figure 7 when compared with traditional exclusive cache.
The simulations for the model presented in  are presented in the following section. The simulation parameters are shown in Table 4. The power consumed is shown in Table 5. The AMAT is shown in Figure 9. The t is traditional model and p is the proposed model. There is 49% improvement in power consumption with no change in AMAT for this model compared with the exclusive model proposed in tag cache model of exclusive cache (t in this discussion) as proposed in .
4. Two-type data cache model
The caches discussed so far in this chapter are inclusive and exclusive. A two-level data cache model  which selectively makes the cache ways in various levels as inclusive or exclusive is presented in the following section. This model makes the cache ways inclusive based on access. A two-level data cache is chosen for discussion. Initially, both cache levels are exclusive. The first occurrence of address places the data in one of cache levels making the occupied way exclusive. Preference is given to place the data in level one in this case. On consecutive access to data in level one cache, cache way is made inclusive with level two cache. During this process, data in level two cache may be replaced. The least recently used algorithm (LRU) is used for data replacement in level two cache. If the data have temporal access in level two cache, the way is made exclusive. On a cold miss with both the levels occupied, the block is placed in level one cache making it exclusive. As the number of ways to place the data increases, the performance increases in terms of the average memory access time (AMAT). Figure 10 shows this architecture.
The algorithm for two-type data cache model is given in the following section.
Algorithm for two-type data cache: Given two-level data cache, this algorithm places the line in the cache system. Input is the address. An index array is maintained per cache for each way. It is zero to indicate inclusive and one to indicate exclusive.
A level one hit occurs if the address is found in level one. The block is made inclusive by placing a copy of it in level two cache. The index array entry is set to 00 to indicate this in both the levels. The block is accessed and the process stops.
A level one miss and level two hit occurs if the address is not found in level one but is found in level two cache. The block is made exclusive in this case. This is indicated by setting index array entry of block to 11. The block is accessed and the process stops.
A cache miss occurs if the block is not present in level one and level two cache. If level one cache is vacant, it is placed in level one cache. Else, if level two cache way is vacant, it is placed in level two cache. The block is made exclusive. The block is accessed and process stops. If the level one set is full, the block replaces an existing block based on least recently used algorithm. The corresponding index array entry is made exclusive by setting it to 11. The mapping process stops. Table 6 gives the algorithm.
Let R, be the number of references, number of hits in level one, number of hits in level two, number of hits in level one after first hit, number of new level one hits, misses that are filled in vacant level two ways, respectively. Let be level one access time, level two access time, transfer time between level one and level two, transfer time between level two and main memory, transfer time between level one and main memory, respectively. Let the proposed system be denoted as . Let k be the time taken to update index array. The AMAT is given by Eq. (9).
In Eq. (9) the first term gives inclusive cache type hits, the second term indicates first time level one hits. The third term indicates the level two hits. The fourth term indicates placing a block in vacant level two way/set. The fifth term indicates either placing a block in free level one cache set/way or replacing a level one cache set/way on cache full scenario. It is assumed that exchange of a block from level one to memory can be done in parallel. The term 2 k is to update the index arrays of both the cache levels. The first term gives the hit time in level one cache. The second term gives the hit time in level two cache. This involves accessing level one and level two caches. The third term gives the time to service the misses. This involves accessing level one and level two caches, determining if it is missed in both levels, accessing main memory and fetching the block into level one. The level one block is written to the memory before the new block is fetched. Simultaneously, the requested block is sent to the processor.
Denote the inclusive cache as and exclusive cache as . Let be the number of level one hits and level two hits in inclusive cache, number of level one hits and number of level two hits in exclusive cache, respectively.
The AMAT for inclusive cache is given by Eq. (10).
The first term is the level one hit time. The level two hit time is given by second term. This involves accessing level one, level two and transferring data from level one to memory and data from level two to level one cache. The miss time is given by third term. This involves accessing level one, level two to determine it is a miss in both levels, write from level one to memory and level two to memory the existing blocks. The new block from main memory is fetched into level two cache and from there to level one cache. Simultaneously, it is given for processing. An improvement in performance is observed as given by Eq. (11).
The AMAT for the exclusive system is given by Eq. (12).
The first term in Eq. (12) is the level one hit time. The second term is for level two cache hits. This needs access of level one cache, level two cache and exchange the contents. On a miss in both the levels, the contents of level two are updated in the main memory, level one cache block is sent to level two and new block is fetched from main memory to level one cache. Simultaneously, the requested block is sent to the processor. The terms in Eq. (9) and Eq. (12) differ due to the architectural differences. A performance improvement in the proposed model over exclusive cache is seen as given by Eq. (13).
In all the models, it is assumed that the cache block when fetched into the cache is simultaneously sent to the processor.
The proposed two-type data cache model is simulated using SPEC2000 benchmarks. The proposed model is compared with inclusive, exclusive caches described in this chapter. The simulation parameters are given as follows (Tables 7 and 8):
The AMAT values are given in the graph in Figure 11. It is compared with inclusive and exclusive caches.
As observed from Figure 11, improvement in AMAT in proposed system compared with inclusive cache is seen. There is a decrease in AMAT by 3% compared with exclusive caches. The proposed model has better performance for systems where elements of set are accessed more than two times after a sequence of other cold misses mapped to the same set such that the number of total misses is a multiple of number of elements of level one cache.
The AMAT values are shown in Table 9.
The author thanks Santa Clara University, Santa Clara, CA, USA for providing SPEC2000 benchmarks.