The OST Cache

by

Ming Meng
B.A.Sc., Xi'an Jiaotong University, China, 1992

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES (ELECTRICAL AND COMPUTER ENGINEERING)

THE UNIVERSITY OF BRITISH COLUMBIA

October 2005

© Ming Meng, 2005

Abstract

Current split data caches classify data as having either spatial locality or temporal locality. The cache usually comprises two subcaches, with each subcache used to cache one type of data. More precisely, however, data can exhibit not only temporal or spatial locality, but also both localities, or no locality. Current split data caches do not take advantage of this more precise characterization. In addition, run-time locality change is not supported completely by current split data caches, leading to lower system performance. This thesis presents a new data classification method which divides data into five types in terms of the locality they present. A new kind of split data cache, called the OST cache, which has three subcaches, and an innovative control mechanism, which can detect data locality accurately and supports flexible runtime locality change, are also proposed in this thesis. Experimental results show that the OST cache can improve system performance greatly compared to a conventional cache. Compared to other split caches (SS/NS and NTS), the OST cache also shows better overall system performance. The OST cache can be used in embedded systems which have a small L1 cache, an off-chip L2 cache, and a narrow bus between the L1 cache and the L2 cache.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgement
1 Introduction
2 Review of Cache Memory
2.1 Cache Memory Overview
2.2 Cache Memory Principles
2.3 Cache Performance
2.4 The Principle of Locality
2.4.1 Temporal Locality
2.4.2 Spatial Locality
3 Related Works
3.1 Overview
3.2 Split Cache
3.2.1 The Dual Data Cache
3.2.2 The Split Temporal/Spatial Cache
3.2.3 The Split Spatial/Nonspatial Cache
3.2.4 The Nontemporal Streaming Cache
3.2.5 The Assist Cache
3.3 Discussion
4 The OST Cache
4.1 New Data Classification
4.2 The OST Cache
4.3 Architectural Structures
4.4 Control Mechanisms
4.4.1 Locality Detection
4.4.2 Control Mechanism
4.5 Advantages of the OST Cache
5 Simulation
5.1 Benchmarks and Simulator
5.2 Simulation Metrics
5.3 Performance Evaluation
5.3.1 OST Cache Performance Under Different Sizes of ONC
5.3.2 OST Cache Performance Under Different Sizes of SLC
5.3.3 The Effectiveness of the Bypass Operation
5.3.4 Performance Comparisons
6 Summary
6.1 Future Work
References

List of Tables

1 Caches Used by Common Processors
2 Summary of Related Work
3 Data Locality in L1
4 Data Locality in L2
5 FCU Control Mechanism
6 RCU Control Mechanism
7 Example of RCU Control Mechanism
8 Summary of the Split Caches
9 Parameters of Locality Change
10 Simulation Metrics
11 OST Configuration
12 Cache Configurations
13 Memory Access Cost
14 Data Bus Width
15 Cache Complexity

List of Figures

1 Memory Hierarchy
2 System Performance
3 Cache Structure
4 Cache Organization
5 Block Address
6 Address Map
7 Dual Data Cache
8 SS/NS Cache
9 NTS Cache
10 The OST Cache
11 The OST Cache Structure
12 State Transition Diagram of Localities
13 Simulator Structure
14 Performance Under Different Sizes of ONC
15 Performance Under Different Sizes of SLC
16 Performance Under Different Sizes of SLC and ONC
17 Impact of the Bypass
18 Comparison of the OST and CVN Cache
19 Miss Rate
20 Average Access Cost
21 Average Bus Traffic
22 System Performance Improvement

Acknowledgement

I would like to thank Professor Mabo Ito for his supervision and advice.

1 Introduction

The performance gap between the CPU (SRAM) and main memory (DRAM) has grown so large over the last twenty years that system performance is reduced because the CPU has to wait for the slower memory operations. The widely used solution is the cache. A cache is located between the CPU and the main memory. Its main objective is to reduce the performance gap between the CPU and the main memory by providing data to the CPU when the data is requested. There are many technologies employed in cache design. One of them is the idea of the split cache, which splits a cache into several subcaches based on data locality. Several split data caches have been presented in the open literature, among them STS, NTS, and SS/NS. All of these caches define two types of data and split the L1 data cache into two subcaches. Data is cached in one of these two subcaches based on its locality. This thesis presents a new data classification method which divides data into five types in terms of the locality they present.
A new kind of split data cache, called the OST cache (O stands for unknown locality, S stands for spatial locality, and T stands for temporal locality), is also proposed. The OST cache has three subcaches: one is the temporal data subcache; another is the spatial data subcache; the third is dedicated to storing data with no or unknown locality. The OST cache is designed to be used in embedded systems which have a small cache size and a narrow bus width. An innovative control mechanism, which can detect data locality accurately and supports flexible runtime locality change, is also proposed in this thesis. Simulations show that the OST cache can greatly improve system performance (average miss rate, average access time, and average bus traffic) compared to a conventional cache. The OST cache also has better overall system performance than the NTS and SS/NS cache designs on the benchmarks used.

This thesis is organized as follows: background knowledge about caches is introduced in the second section. The third section provides explanations of the related split caches. A detailed description of the OST cache design is given in the fourth section. Simulations and conclusions are presented in the last two sections.

2 Review of Cache Memory

This section introduces background knowledge about caches, including cache memory principles, cache performance, and the principle of locality.

2.1 Cache Memory Overview

Cache memory is a small amount (normally less than 1MB) of high-speed memory residing on or close to the CPU. It is intended to provide a memory system which has a speed approaching that of the memory system's fastest level and costs almost as little as the memory system's cheapest level. Cache memory is only one part of the memory hierarchy. The registers in the CPU represent the highest level in this hierarchy; below these registers are multiple levels of caches, followed by main memory. The registers in the CPU have the highest speed and the highest cost. The main memory has the lowest speed and the lowest cost. A cache sits between them, and its speed lies between them too. This concept is illustrated in Figure 1.

Figure 1: Memory Hierarchy

Some caches are built into the micro-architecture of microprocessor implementations. The main reason why a cache is placed between the CPU and the main memory is that the performance gap between the CPU (SRAM) and the main memory (DRAM) has become very large over the last twenty years. System performance is reduced because the fast CPU has to wait for the slow main memory operations. See Figure 2. From 1980, CPU performance increased by about 35% per year until 1986, and about 55% per year thereafter, whereas memory performance has only increased by 7% per year [11].

Figure 2: System Performance

SRAM is too expensive for more than a small amount of it to be placed in the CPU, as is done for registers.
Meanwhile, although the DRAM used in main memory is slow compared to SRAM, it is still worthwhile for its large storage space because it is cheaper. How can the speed of main memory operations be increased without dramatically increasing the cost? The solution to this problem is the cache, which is located between the CPU and the main memory. The cache has a speed almost as fast as the CPU and a cost almost as low as the main memory. The goals of a cache are the improvement of system performance and the reduction of cost.

Generally, a cache is larger than the registers in the CPU, but it is much smaller than the main memory. So, it is impossible to store all the data and instructions used by a program in a cache; only a small portion of the data and instructions can stay in the cache at any one time. The ideal cache provides the CPU with data or instructions exactly when the CPU needs them, so the CPU need not wait for the slow main memory operations. Each cache has a control mechanism, which fetches the proper data or instructions just before the CPU needs them and releases space for more useful data or instructions by removing the appropriate data or instructions from the cache.

2.2 Cache Memory Principles

A cache is made of high-speed static RAM (SRAM) instead of the slower and cheaper dynamic RAM (DRAM) used for main memory. It is composed of cache blocks (lines). Throughout the thesis, 'line' is used to mean the traditional cache line in an L2 data cache; 'block' tends to mean a cache line in an L1 data cache. Each cache block is composed of data and the data address. For example, in Figure 3, each cache block has two parts, address and data, and there are four words in the data part.

Figure 3: Cache Structure

When the CPU requests data (an instruction or operand), it first determines whether the data is in the cache. If the data is in the cache, it is brought into the CPU from the cache. If the data is not in the cache, it is brought into the CPU from the next level of the memory hierarchy. Meanwhile, the whole cache block to which this data belongs is brought from the next level of the memory hierarchy into the cache. If there is no space in the cache for this block, the cache control mechanism removes an old block from the cache and inserts the new one.

Four aspects of cache design are presented in this part:

• Cache Organization: how to map a block from main memory to a cache
• Block Identification: how to check if data is in a cache
• Cache Replacement: how to remove a block from a cache
• Write Policy: the data coherence problem in a cache memory system

Cache Organization

According to the location of a block in a cache, there are three categories of cache [11]:

1. Directly mapped cache: each block can only be put in one place in a cache,
2. Fully associative cache: each block can be placed anywhere in a cache, and
3. Set associative cache: each block can be placed in a set of places in a cache.

Figure 4 shows these three different caches from top to bottom. The number in each cache block is the block address. Block 3 and block 11 in the main memory can only be mapped into block 3 (3 mod 8, 11 mod 8) in the direct mapped cache. In the fully associative cache, block 3 and block 11 can be mapped anywhere in the cache. In the set associative cache, block 3 and block 11 can be mapped into any block in set 3 (3 mod 4, 11 mod 4), which means they can be mapped into block 6 or block 7.

Figure 4: Cache Organization
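To make the mappings concrete, here is a small C sketch of the index calculations for the example of Figure 4 (the constants and function names are illustrative, not from the thesis; a fully associative cache needs no index function, since a block may occupy any slot):

    #include <stdio.h>

    /* Example geometry of Figure 4: a cache of 8 blocks; the set
       associative variant has 2 blocks per set, giving 4 sets. */
    #define NUM_BLOCKS     8
    #define BLOCKS_PER_SET 2

    /* Directly mapped: a main-memory block maps to exactly one slot. */
    unsigned direct_mapped_slot(unsigned block_addr) {
        return block_addr % NUM_BLOCKS;
    }

    /* Set associative: a block maps to one set and may occupy any of
       the BLOCKS_PER_SET places within it. */
    unsigned set_index(unsigned block_addr) {
        return block_addr % (NUM_BLOCKS / BLOCKS_PER_SET);
    }

    int main(void) {
        /* Blocks 3 and 11 both land in slot 3 (direct mapped) and in
           set 3 (set associative), as described in the text. */
        printf("block 3:  slot %u, set %u\n", direct_mapped_slot(3), set_index(3));
        printf("block 11: slot %u, set %u\n", direct_mapped_slot(11), set_index(11));
        return 0;
    }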
Block Identification

How do we find data in a cache once it is requested? The address of the requested data is used to search the cache. As mentioned before, each block has an address part, and each address part has two sections: one is the block address; the other is the block offset. The block address itself has two sub-parts, Tag and Index. The block address is saved along with the cache block when the cache block is fetched into the cache. Figure 5 shows the construction of the address part in a cache block.

Figure 5: Block Address (block address = Tag + Index, followed by the block offset)

When the CPU tries to locate data in a cache, the address of the data is used to generate the Tag, the Index, and the offset, which are compared with the block addresses in the cache to check whether the data is present. For example, assume a 4-way set associative cache of 1K bytes, in which each cache block contains 4 words, i.e., the block size is 4 words (16 bytes). There are therefore 16 (= 1024 / 16 / 4) sets in the whole cache. The address of the requested data is 32 bits. Bits 4 to 7 are the Index; they are used to select one set from the 16 sets in the cache. The lowest four bits (bit 0 to bit 3) of the data address are the offset; they are used to select one word from the cache block. The rest of the address is the Tag, which is compared with the Tag field in the cache block to check for a cache hit. In the example of Figure 6, the data is located in set 7 (0111), and it is the second word (offset 0111) of the cache block. By comparing all four Tag fields in set 7 simultaneously with the upper 24 bits of the data address, we can find out whether the data is in the cache.

Figure 6: Address Map
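The field extraction for this 1KB, 4-way, 16-byte-block example can be sketched in C as follows (the function names and the sample address are mine, not the thesis's):

    #include <stdint.h>
    #include <stdio.h>

    /* 1KB, 4-way set associative, 16-byte blocks -> 16 sets.
       Field widths: 4 offset bits, 4 index bits, 24 tag bits. */
    #define OFFSET_BITS 4
    #define INDEX_BITS  4

    static uint32_t offset_of(uint32_t addr) {
        return addr & ((1u << OFFSET_BITS) - 1);                 /* bits 0..3  */
    }
    static uint32_t index_of(uint32_t addr) {
        return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* bits 4..7  */
    }
    static uint32_t tag_of(uint32_t addr) {
        return addr >> (OFFSET_BITS + INDEX_BITS);               /* bits 8..31 */
    }

    int main(void) {
        uint32_t addr = 0x00000077;   /* sample address: index 7, offset 7 */
        printf("tag=0x%06X index=%u offset=%u\n",
               (unsigned)tag_of(addr), (unsigned)index_of(addr),
               (unsigned)offset_of(addr));
        return 0;
    }

On a lookup, the hardware reads the four blocks of set index_of(addr) and compares each stored Tag with tag_of(addr) in parallel; any match is a hit.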
Cache Replacement

Generally, a cache is much smaller than the main memory, and more than one block can be mapped into the same place in the cache. If the destination of the requested data is occupied, one of the blocks in those destination places must be removed from the cache. In the directly mapped cache, the old block is simply replaced with the new one, because the new block has only one place to go. In the fully associative and set associative caches, there are three primary strategies for selecting which block will be replaced [11]:

1. Random: select a block randomly to be replaced,
2. Least-recently used (LRU): the block that has not been used for the longest time is replaced, and
3. First in, first out (FIFO): the block that has been in the cache for the longest time is replaced.

Random replacement is simple to implement. As the number of blocks in a set increases, the LRU strategy becomes complex and expensive to implement. FIFO is a simple approximation of LRU, but it is not very accurate.

Write Policy

Generally, the data in a cache has more than one copy in the whole memory hierarchy: at least one copy is in the main memory, and one copy is in the cache. If the data is never modified by the CPU while it is in the cache, the two copies of the data remain identical. If the data in the cache is modified, there will be two different copies of the same data in the memory system, one in the cache and the other in the main memory. To solve this coherence problem, two policies are employed for cache writes:

1. Write through: every time the cache is written, update the main memory immediately, and
2. Write back: only update the main memory when the modified block is replaced from the cache.

The write through policy maintains data coherence all the time, but it causes more bus traffic. Write back causes less bus traffic, but it cannot guarantee coherence all the time.

2.3 Cache Performance

As mentioned before, the main goal of a cache is to reduce the time the CPU spends waiting for memory accesses. So, average memory access time is used as a cache performance measure:

Average memory access time = Hit time + Miss rate x Miss penalty

where Hit time is the time to hit in the cache, Miss rate is the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the number of accesses), and Miss penalty is the cost of getting the requested data from the lower level of the memory hierarchy when a cache miss occurs.
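As a worked example with illustrative numbers (not measurements from the thesis): with a hit time of 1 cycle, a miss rate of 5%, and a miss penalty of 100 cycles, the average memory access time is 1 + 0.05 x 100 = 6 cycles. Halving the miss rate to 2.5% brings it down to 1 + 0.025 x 100 = 3.5 cycles, which is why the three terms of the formula are the natural targets for optimization.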
Based on the above formula, average memory access time can be decreased by reducing any term in the formula. So, there are three categories of methods used for cache performance improvement:

1. technologies used to reduce the miss penalty,
2. technologies used to reduce the hit time, and
3. technologies used to reduce the miss rate.

The following subsections provide more information concerning these technologies.

Reducing the Miss Penalty

The gap between the processor and the memory has caused the relative cost of the miss penalty to increase over time. The most important technology to mitigate this is the multilevel cache, which can match the speed of the fast processor in the first-level cache and provide a large enough space in the second-level cache, thereby lessening the effective miss penalty. The second technology sends the requested word to the processor as soon as it arrives in the cache and fetches the other data in the same block after that (critical word first with early restart). This technology reduces processor stall time because the CPU does not need to wait for the whole block to be fetched into the cache. The third technology gives priority to reads over writes, since the processor generally waits for reads but continues after launching writes. The fourth technology merges several replaced blocks into one big block in a write buffer and writes them into the main memory at the same time, which is faster than writing them one by one [11]. The fifth technology is the victim cache, which adds a small fully associative cache, called the victim cache, between a cache and its refill path. Every time a cache miss occurs, the victim cache is checked before the data is fetched from the lower level of the memory hierarchy.

Reducing the Hit Time

The hit time includes the time it takes for the index field of an address to read the tag field and the time it takes to compare the tag with the address. The most effective technology for reducing the hit time is to use small and simple caches. Smaller hardware reduces the indexing and comparison time. A simple cache (e.g., a directly mapped cache) can reduce the hit time by overlapping the tag comparison with the data transmission. There are also other technologies that can reduce the hit time, e.g., avoiding address translation during the indexing of a cache, pipelining cache accesses, and trace caches.

Reducing the Miss Rate

There are three types of misses:

• Compulsory
• Capacity
• Conflict

The compulsory miss is also called the cold-start miss. It occurs when a cache block is referenced for the first time. The capacity miss occurs when a cache cannot contain all the blocks needed during the execution of a program. The conflict miss occurs in set associative and directly mapped caches when too many blocks are mapped into some popular sets. The compulsory miss can occur in any type of cache and is independent of cache size. The capacity miss occurs even in a fully associative cache, and it decreases as the cache capacity increases. The conflict miss occurs in a set associative cache, and it decreases as the associativity increases.

Increasing the block size can reduce compulsory misses. Increasing the cache size can reduce capacity misses, and increasing the cache associativity can reduce conflict misses. But all of these methods have drawbacks. Larger blocks may decrease the number of blocks in a cache, which increases the conflict misses and the miss penalty. The obvious drawbacks of a larger cache are a longer hit time and higher cost. Higher associativity also comes with a longer hit time [11]. Another way to reduce the miss rate is compiler optimization, which can rearrange instructions and data at compile time. Hardware and software prefetching technologies can also help to reduce the miss penalty and the miss rate; this kind of technology tries to fetch the requested data or instructions into the cache before the processor needs them.

2.4 The Principle of Locality

The cache plays a very important role in system performance improvement. Why does it work? It works because of the principle of locality, which this section presents. The principle of locality states that programs tend to reuse data and instructions they have used recently. This implies that we can predict what instructions and data a program will use in the near future based on its accesses in the recent past. The principle of locality applies to both instructions and data. There are two different types of locality: temporal locality and spatial locality.

2.4.1 Temporal Locality

Temporal locality says that recently accessed items are likely to be accessed again in the near future. So, if temporal locality data is replaced after it is referenced for the first time, many cache misses may occur on the following references to the same temporal data. Temporal data does not require a long block size, since only the temporal data itself will be reused.

2.4.2 Spatial Locality

Spatial locality states that items whose addresses are near one another tend to be referenced close together in time. If spatial locality data is cached together with the data near it when it is referenced, many cache hits on this nearby data may follow, because it has spatial locality. So, spatial locality data needs a longer cache block, which caches more of the nearby data and increases the hit rate.
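The two kinds of locality can be seen in a few lines of C (an illustrative sketch, not taken from the thesis):

    #include <stdio.h>

    int main(void) {
        int a[1024];
        int sum = 0;                  /* sum is reused on every iteration:
                                         temporal locality                    */
        for (int i = 0; i < 1024; i++)
            a[i] = i;                 /* consecutive words of the same cache
                                         blocks are written: spatial locality */
        for (int i = 0; i < 1024; i++)
            sum += a[i];              /* both patterns at once                */

        printf("%d\n", sum);
        return 0;
    }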
3 Related Works

Cache performance is critical for the whole system, so improving cache performance is a very active research area. Many technologies have been presented so far; among them, the split cache is one valuable technology. This section presents some related split cache designs: the Dual cache [2], [5], the Split Temporal/Spatial cache (STS) [3], the Split Spatial/Non-spatial cache (SS/NS) [6], the HP-7200 Assist cache [1], and the Non-Temporal Streaming cache (NTS) [4].

3.1 Overview

Splitting caches is not a brand new idea. Many modern processors rely on split data and instruction caches. See Table 1 (http://www.cs.umass.edu/). Except for the PPC601, all of these processors use split L1 caches.

                PPC601  PPC603  PPC604  PPC750  PPC970  Pentium  P-2     P-3   P-4
Split           N       Y       Y       Y       Y       Y        Y       Y     Y
Data size       32K     8K      16K     32K     32K     8K       16K     16K   8K
Instruction     -       8K      16K     32K     64K     8K       16K     16K   96K
Associativity   8       2       4       8       D2/I1   2        D4/I2   2     D4/I-trace
Line size       64B     32B     32B     32B     64B     32B      32B     32B   32B

Table 1: Caches Used by Common Processors

Instructions and data have different memory access patterns. During runtime, few programs modify instructions, but most programs perform both read and write operations on data. If instructions and data are separated into different caches, different configurations can be used for each. For example, in a split L1 cache, the I-cache (instruction cache) is usually not designed to support write-back to the L2 cache or main memory, since few programs need that. The D-cache (data cache) is designed to support both reads and writes to the L2 cache or main memory. The split L1 cache can reduce bus traffic by reducing the number of write-back operations and increase system speed by allowing instruction fetching and data accessing to occur in the same clock cycle.

Although splitting L1 caches into a data cache and an instruction cache is very common these days, L1 data caches use no separation based on the nature of the locality exhibited by different data references. L1 data caches handle all memory references in a uniform manner. Whenever a miss occurs, a whole block in the cache is replaced by a new block, and there is only one block size and only one type of associativity, replacement, and write-back policy. Generally, spatial data requires a large block size to increase the hit rate, while temporal data needs only a small block size to reduce bus traffic. If a uniform cache with a large block size is used for both temporal and spatial data, unnecessary data transfers between the L1 cache and the lower level of the memory hierarchy may occur when temporal data is requested. On the other hand, if the block is too small, a higher miss rate is inevitable for spatial data.

3.2 Split Cache

Since data behaviour varies widely among applications, and even within the same application, many researchers have concentrated on the split data cache architecture. Many split L1 data cache technologies have been proposed in the open literature, including the Dual cache [2], [5], the Split Temporal/Spatial cache (STS) [3], the Split Spatial/Non-spatial cache (SS/NS) [6], the HP-7200 Assist cache [1], and the Non-Temporal Streaming cache (NTS) [4]. This section presents a brief survey of these cache designs according to the following three aspects:

• Structure of the split cache: number of subcaches, block size, cache size, and cache associativity
• Classification of data: how data is classified and where it is cached
• Control mechanism: how the split cache works
3.2.1 The Dual Data Cache

The dual data cache was proposed in [2]. In this design, the L1 data cache is split into a temporal subcache and a spatial subcache. The temporal subcache is designed to exploit temporal locality only. The spatial subcache is designed to exploit both spatial and temporal localities. The temporal subcache is 1/3 of the total data cache capacity, and the spatial subcache is larger, at 2/3 of the total data cache capacity. The block size in the temporal subcache is eight bytes, while in the spatial subcache it is 16 or 32 bytes. Both subcaches work independently and in parallel. There is a locality prediction table, which is used to predict the locality of data. Figure 7 shows the structure of the dual data cache.

Figure 7: Dual Data Cache

When the processor issues a memory access, both subcaches are looked up simultaneously. If the access hits in only one subcache, the data is read from or written into that subcache. If it hits in both subcaches, the data is read from the temporal subcache, or written into both subcaches in parallel. If it misses in both subcaches, one cache block will be fetched from the next level of the memory hierarchy and put into one of the subcaches, or not cached at all, based on the locality prediction table. This locality prediction table contains information about recently used load/store instructions. Each entry in the table includes six fields: Instruction address, Last address, Stride, Length, State, and Prediction. The locality prediction table can predict the locality of the data used by load/store instructions based on the running history. Every time a load/store instruction is executed, the prediction table is consulted.

• If a miss occurs and the instruction address is not in the table, a new entry will be allocated for the instruction address, and the data will be fetched into the temporal subcache.
• If a miss occurs and the instruction has an entry in the table, the data will be fetched from the next level of the memory hierarchy. Then, after the prediction table is updated, the data will be put into the spatial subcache or the temporal subcache, or bypassed, based on the information in the State and Prediction fields of the prediction table.
• If the access hits in the L1 cache, the prediction table will also be updated.

From the above we can see that this dual data cache predicts data locality (or classifies data) by analyzing the addresses of data used by load/store instructions at runtime. Each datum is associated with at least one instruction.
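The six fields of a prediction-table entry can be pictured as a record. The following C struct is only a sketch: the field widths and the encodings of State and Prediction are assumptions, since the text does not specify them.

    #include <stdint.h>

    typedef enum { PREDICT_TEMPORAL, PREDICT_SPATIAL, PREDICT_BYPASS } prediction_t;

    /* Sketch of one locality prediction table entry (dual data cache). */
    typedef struct {
        uint32_t     instr_addr;  /* address of the load/store instruction   */
        uint32_t     last_addr;   /* last data address referenced by it      */
        int32_t      stride;      /* difference between successive addresses */
        uint16_t     length;      /* how long the current stride has held    */
        uint8_t      state;       /* state of the stride-detection automaton */
        prediction_t prediction;  /* subcache to use, or bypass              */
    } lpt_entry_t;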
3.2.2 The Split Temporal/Spatial Cache

The split temporal/spatial cache was introduced in [3]. In this design, the L1 data cache is split into two subcaches. One is the temporal subcache, which is intended to cache temporal locality data; the other is the spatial subcache, which is designed to store spatial locality data. The temporal subcache has a one-word block, and the spatial subcache has a four-word block. A secondary cache is included only for the temporal subcache. The cache controller classifies data as temporal or spatial and decides which subcache to use at compile time, at runtime, or both. Two counters are assigned to each block in the spatial subcache. One counter records the number of accesses to the block. The other detects temporal locality by holding the difference between the number of accesses to the upper part of the block and the number of accesses to the lower part of the block. When this difference is greater than a specified threshold and the first counter saturates, the cache block is marked as temporal and removed from the spatial subcache. Initially, all blocks are cached in the spatial subcache. Once a block is marked as a temporal locality block, it goes to the temporal subcache and never changes back to a spatial locality block.

3.2.3 The Split Spatial/Nonspatial Cache

The Split Spatial/Nonspatial Cache (SS/NS) was introduced in [6]. In this design, the L1 data cache is split into two subcaches. One is the spatial subcache; the other is the non-spatial subcache. The spatial subcache has a four-word (32-byte) block and is designed to cache data with useful spatial locality. The block of the non-spatial subcache is one word long, and this subcache is intended to cache data without useful spatial locality. Both subcaches have the same number of blocks and the same associativity, so the spatial subcache is four times larger than the non-spatial subcache.

Figure 8: SS/NS Cache

SS/NS uses a flag-based method to detect the spatial locality of data. Each word in the spatial subcache is assigned a one-bit flag, so each block has four flag bits. See Figure 8. When a block is fetched from the L2 cache, all four flags are set to 0. Whenever a word is accessed during its cache lifetime (the time the data remains in the cache), the corresponding flag is set to 1. When a block is replaced from the L1 data cache, all four flags are checked. If more than two flags are set to 1, the block is marked as a spatial locality block; otherwise, it is a non-spatial locality block. This locality information goes to the L2 cache along with the block.

When an L1 data cache miss occurs, a block is fetched from the L2 cache. If the block has been marked as a spatial locality block, or it has not been marked because it is referenced for the first time, the whole block is fetched into the spatial subcache, and the flags of the block are all set to 0. If the block is marked as a non-spatial locality block, only the requested word is fetched into the non-spatial subcache. The use of the data in the non-spatial subcache is monitored during its whole cache lifetime. Once two data items belonging to the same block are present simultaneously in the non-spatial subcache, the block is marked as a spatial locality block and cached in the spatial subcache. The word already in the non-spatial subcache is transferred into the spatial subcache directly, while the other words of the same block are fetched from the secondary cache. The locality information of each block is stored in the secondary-level cache while the block is in the secondary cache. For blocks which are not in the secondary cache, a one-bit spatial locality predictor is used.
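The SS/NS flag mechanism just described can be summarized in C (a behavioural sketch; the structure and names are mine):

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_BLOCK 4

    /* One spatial-subcache block carries one flag bit per word. */
    typedef struct {
        uint8_t flags;    /* bit w set => word w was accessed this lifetime */
        /* ... tag and data words omitted ... */
    } ssns_block_t;

    void on_fill(ssns_block_t *b)             { b->flags = 0; }  /* fetched from L2 */
    void on_access(ssns_block_t *b, int word) { b->flags |= (uint8_t)(1u << word); }

    /* Checked when the block is replaced from the L1 data cache:
       more than two flags set => spatial locality block. */
    bool is_spatial_on_evict(const ssns_block_t *b) {
        int used = 0;
        for (int w = 0; w < WORDS_PER_BLOCK; w++)
            used += (b->flags >> w) & 1;
        return used > 2;
    }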
3.2.4 The Nontemporal Streaming Cache

The nontemporal streaming cache (NTS) was introduced in [4]. This design splits the L1 data cache into two subcaches: the temporal subcache and the non-temporal subcache. The temporal subcache is larger and is directly mapped. The non-temporal subcache is smaller and is fully associative. The two subcaches have the same block size of 32 bytes (4 words).

A block is considered temporal if, during its last cache lifetime, at least one of its words was referenced more than once. A block is considered non-temporal if no word was referenced more than once during its last cache lifetime. A non-temporal data detection unit (NTDU) is used to detect the locality of data during runtime. See Figure 9. Each word in the temporal subcache is assigned a reference bit. All these bits are laid out as a matrix whose width equals the number of words in one cache block and whose depth equals the number of blocks in the temporal subcache. This hardware bit-map structure is called the NTDU. In addition, each block in the temporal subcache and in the secondary cache has an NT bit, which is used to indicate where the block should go (temporal subcache or non-temporal subcache) when it is fetched from the secondary cache. The NT bit is transferred to the secondary cache when the corresponding block is replaced from the temporal subcache.

Figure 9: NTS Cache

When a memory access is requested by the processor, both the temporal subcache and the non-temporal subcache are checked simultaneously.

• If the access hits in the non-temporal subcache, only the requested data is delivered to the processor. No other operations are needed.
• If the access hits in the temporal subcache, then besides the delivery of the requested data, the reference bit of the requested data in the NTDU and the NT bit of the block to which the requested data belongs are updated as follows: the NT bit is reset if the reference bit is already set (the word is being reused); otherwise, the NT bit is kept unchanged. Then the reference bit is set to 1. These two operations are performed in parallel.
• If the access misses in the L1 cache, the requested data is fetched from the secondary cache. If the NT bit of the block in the secondary cache is 1, the block goes to the non-temporal subcache. If it is 0, the block goes to the temporal subcache.

The NT bit is lost when the corresponding block is replaced from the secondary cache. The NT bit of a new block, fetched from the next level of the memory hierarchy into the secondary cache, is initialized to 0.
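The per-word bookkeeping on a temporal-subcache hit can be sketched as follows (a simplification of the update rules above; the types and names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* One temporal-subcache block: ref_bits is the block's row of the
       NTDU bit matrix; nt is its NT bit (1 = non-temporal). */
    typedef struct {
        uint8_t ref_bits;   /* one reference bit per word          */
        bool    nt;         /* where the block should go next time */
    } nts_block_t;

    /* Called on a hit in the temporal subcache. */
    void on_temporal_hit(nts_block_t *b, int word) {
        bool seen_before = (b->ref_bits >> word) & 1;
        if (seen_before)
            b->nt = false;                    /* a word was reused: temporal */
        b->ref_bits |= (uint8_t)(1u << word); /* record this reference       */
    }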
3.2.5 The Assist Cache

The Assist Cache is incorporated into the HP PA7200 CPU and was introduced in [1]. In this scheme the L1 data cache is also split into two subcaches. One is a larger, directly mapped main cache. The other is a smaller, fully associative assist cache. The goal of the assist cache is to reduce conflicts in the main cache.

This scheme does not need additional hardware to detect the locality of data; locality detection is performed at compile time. During compilation, some memory access instructions are marked as "spatial locality only" by the compiler. At runtime, data accessed by these spatial-locality-only instructions is considered spatial locality data and placed in the assist cache. The spatial locality data in the assist cache bypasses the main cache and goes directly back to the secondary cache when it is replaced from the assist cache. Data not accessed by instructions marked as spatial locality only is considered as having both temporal and spatial locality, and it also goes to the assist cache first. When such data is replaced from the assist cache, it goes to the main cache instead of the secondary cache.

From the above introduction, we can see that data does not go to the main cache until it is replaced from the assist cache. This process reduces conflicts in the main cache. The bypassing of data accessed by the spatial-locality-only instructions further reduces main cache conflicts.

3.3 Discussion

This section presented several split data cache designs. Table 2 is a summary of them.

Cache         Subcaches   L2 cache   Detection method   Detection time   Initial locality   Runtime locality change   Bypass
Dual Cache    S, T        N          LPT                RT               T                  S <-> T                   Y
STS           S, T        Y          Counters           RT               S                  S -> T                    N
SS/NS         S, NS       Y          Flags              RT               S                  S <-> NS                  N
NTS           T, NT       Y          NTDU               RT               T                  T <-> NT                  N
Assist cache  S, T        N          Compiler           CT               /                  /                         N

Table 2: Summary of Related Work

First, all the cache designs in the above table split the L1 data cache into two subcaches. The dual cache, the STS cache, and the assist cache each have one spatial subcache and one temporal subcache, because they assume that data has either temporal locality or spatial locality. The SS/NS cache and the NTS cache have two subcaches each as well, but they use different methods to classify data. The SS/NS cache divides data into spatial locality data and non-spatial locality data, so it splits the L1 data cache into a spatial subcache and a non-spatial subcache. The NTS cache divides data into temporal locality data and non-temporal locality data, so it splits the L1 cache into a temporal subcache and a non-temporal subcache.

Actually, during runtime, data in a program can present not only spatial locality or temporal locality but also both localities, or no locality. See the following piece of code:

    int i, n, s;
    int StartVal = 0;
    int Grid[30];

    n = 30;
    s = 5;
    Grid[0] = StartVal;
    for (i = 0; i < n; i++) {
        Grid[i] = Grid[0] + s;
        Grid[0] = i;
    }

Variables s, i, and n are temporal locality data, since they are used several times in the for loop. Elements of Grid are referenced in the same for loop, so they are spatial locality data. The first element of Grid, Grid[0], is referenced n times in the loop, so it has temporal locality as well as spatial locality. StartVal does not present any locality because it is used only once after being declared. Thus, data can be divided into four types according to the locality it presents during a period of time. In addition, when data is referenced for the first time, its locality is unknown. So, there are five types of data:

• spatial locality data: presents spatial locality
• temporal locality data: presents temporal locality
• dual locality data: presents both spatial and temporal localities
• null locality data: presents no locality
• unknown locality data: referenced for the first time

Since each type of data has its own memory reference pattern, each should be stored and dealt with individually. Spatial locality data needs a large block size to increase cache hit rates. Temporal locality data needs only a small block size, which reduces bus traffic. Dual locality data can be treated as spatial locality data and needs large blocks too. Null locality data should not go into the cache, as it is unlikely to be referenced again.
The cache designs in Table 2 deal with only some of these situations. They either assume that data can have only spatial locality or temporal locality, or they do not separate no-locality data and unknown-locality data from temporal locality data and spatial locality data. The following are some issues that may occur when data cannot be put into an appropriate subcache or dealt with separately.

• Putting no-locality data into the cache: this wastes cache space and causes extra bus traffic between the cache and the main memory.
• Putting dual locality data into the temporal cache: since only the referenced data is stored in the temporal cache, subsequent references to the data near it will cause many cache misses, and bus traffic as well.
• Putting unknown-locality data into the spatial cache or the temporal cache: the unknown-locality data could be spatial or temporal, so several cache misses will occur if spatial locality data is put into the temporal cache; if temporal locality data is put into the spatial cache, much useless data is moved into the spatial cache, causing extra bus traffic and replacement of useful spatial locality data.

Second, apart from the assist cache, all the other cache designs in Table 2 assign a default locality to data when it is referenced for the first time. The dual cache and the NTS cache assume that data initially has temporal locality, and the STS cache and the SS/NS cache assign spatial locality to data when it is referenced for the first time. These assumptions will cause conflicts in the spatial cache or the temporal cache, because the locality of the data is unknown when it is referenced for the first time. It is possible that data which exhibits spatial locality is assigned temporal locality and goes into the temporal cache, or that data which has temporal locality goes into the spatial cache. This may lead to further problems:

• Caching spatial locality data in a temporal subcache may cause multiple misses in the temporal cache.
• Caching temporal locality data in a spatial subcache may cause a lot of useless bus traffic and replacement of useful spatial locality data, leading to more spatial cache misses.

So, unknown-locality data should be isolated from data that has a definite locality, to avoid conflicts in the temporal subcache or the spatial subcache. A dedicated subcache for data with unknown locality would be a good idea. When data is referenced for the first time, it is brought into this dedicated subcache. In addition, this dedicated subcache can monitor the use of the data in it. When data is replaced from this subcache, it is marked as temporal locality data, spatial locality data, dual locality data, or no-locality data based on its usage. When the data is referenced again, it is cached and treated separately according to its locality.

Third, in all the cache designs introduced above, when temporal locality data is detected, all data in the same block as the temporal locality data is considered as having temporal locality and treated in the same manner. Theoretically, one datum presenting temporal locality does not mean that all the data in the same block is also temporal locality data; it could present spatial locality or no locality. This unified handling may cause many cache misses when the data near the temporal datum presents spatial locality. So, temporal locality should be assigned only to the data that actually presents temporal locality.
Other data belonging to the same block should not be marked as temporal locality data if it does not present any temporal locality.

Last, not all the cache designs in Table 2 provide full locality-change ability at runtime. In theory, data can present different localities in different time periods while a program is running. These changes can happen either between spatial and temporal locality, or between no locality and any locality. To increase cache performance, a cache should reflect the real situation accurately and support locality change during runtime.

4 The OST Cache

4.1 New Data Classification

Although there are only two conventional localities, for more precise analysis the data locality property can be divided into four classes according to the locality that is present during a period of time. In addition, when data is referenced for the first time, its locality is unknown. So, five property classes can be defined as follows:

• Spatial locality class - contains spatial locality data
• Temporal locality class - contains temporal locality data
• Dual locality class - contains data that exhibits both spatial and temporal localities
• Unknown locality class - contains data that is referenced for the first time and could be any of the other four types
• Null locality class - contains data that presents no locality

The spatial locality class needs a large block size to increase cache hit rates. The temporal locality class needs only a small block size, which reduces bus traffic and cache size. Dual locality data can be treated as spatial locality data and needs large blocks as well. The null locality class does not go into the L1 cache, as it is unlikely to be referenced again. Since each class of data has its own memory reference pattern, each should be stored and dealt with separately. Otherwise, problems can occur, such as higher cache miss rates, extra bus traffic, replacement of useful spatial locality data, and wasted cache space.

4.2 The OST Cache

Based on the above data classification and discussion, a new split cache design is proposed. This new cache is called the OST cache. It groups data into five classes, uses a dedicated subcache to hold data of unknown locality, and stores data in the temporal subcache or the spatial subcache only when its locality has been definitely detected. If data does not present any locality, it is bypassed by the L1 data cache until it presents locality. This design can handle temporal locality data at the granularity of a single word. It also supports all types of locality change during runtime.

4.3 Architectural Structures

Figure 10 shows the structure of the new cache design. The main cache is split into the s-cache and the p-cache. The s-cache is used to cache data requested by the CPU for the first time. The p-cache is further split into two subcaches. One is called the spatial locality cache (SLC), which is intended to cache spatial locality data and dual locality data. The other is called the temporal locality cache (TLC), which is used to cache temporal locality data.

Figure 10: The OST Cache
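As an illustration of this three-subcache organization, here is a C sketch. It is a sketch only: the capacities, geometry, and field names below are assumptions, not the configuration used in the thesis, and the ONC here plays the role of the s-cache of Figure 10.

    #include <stdint.h>

    #define WORDS_PER_LINE 4

    /* Full-width blocks for first-time and spatial data. */
    typedef struct {
        uint32_t tag;
        uint32_t words[WORDS_PER_LINE];
        uint8_t  locality_bits;   /* 1 block bit + 4 word bits; see 4.4.1 */
    } wide_block_t;

    /* Temporal data is kept at single-word granularity. */
    typedef struct {
        uint32_t tag;
        uint32_t word;
    } word_block_t;

    typedef struct {
        wide_block_t onc[64];     /* unknown/no locality (the s-cache)    */
        wide_block_t slc[64];     /* spatial and dual locality data (SLC) */
        word_block_t tlc[128];    /* temporal locality data (TLC)         */
    } ost_l1_t;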
4.4 Control Mechanisms

This section explains two aspects of the cache control mechanism in detail: how the data locality is detected, and how the cache works.

4.4.1 Locality Detection

Hardware locality detection is used in the OST cache because software locality detection (used in the Assist Cache) cannot support runtime locality change. Currently, there are two types of hardware methods used to detect data locality: the first is based on instruction addresses (used in the dual cache); the second uses data addresses (used in SS/NS, STS, and NTS). Both have advantages and drawbacks.

The first type uses a history table to make the prediction. It keeps all the load/store instruction addresses, the referenced addresses used by those instructions, and some other information in a hardware table. During runtime, the locality of the data referenced by an instruction can be detected based on an analysis of the relationship between the instruction address and the referenced data address. The main drawback of this detection method is that data locality detection is based on the load/store instruction addresses. As we know, the relationship between data and instructions is not one-to-one; it is very common for the same data to be used by more than one instruction. So, it is possible that the same data presents different localities in different instructions. If the same data presents both temporal and spatial locality during a period of time, there will be two copies of the same data in the L1 data cache, one in the temporal subcache and the other in the spatial subcache, which leads to a coherence problem. In addition, the table overhead is also a problem.

The second type adds additional tags to each cache block. By monitoring the references to these cache blocks, data localities can be detected. The major drawback of this design is the tag or counter overhead.

Figure 11: The OST Cache Structure

A modified version of the second locality detection method is used in the OST cache design. To detect data locality in the L1 cache, each word in a cache block is assigned one locality bit, and the whole block has one additional locality bit. See Table 3. In the L2 cache, in order to identify localities for both the line and every word in the line, six bits (rather than five) are used for each cache line. As mentioned before, spatial locality means that data belonging to the same line will probably be referenced soon, so identifying spatial locality based on the line rather than the word is natural for spatial locality data. But this is not true for temporal locality data, because temporal locality concerns only one word. We cannot say that all the words belonging to the same cache line are temporal locality data when only one word in the line is presenting temporal locality. So, identifying temporal locality should be based on the word rather than the block. To identify localities for each word in a line, four additional bits (for the four words in one line) are added to record the related information. The six locality bits in the L2 cache can represent four different data localities. See Table 4.

In addition, three control units are added: the Locality Detection Unit (LDU), the Fetch Control Unit (FCU), and the Replacement Control Unit (RCU). The LDU is used to detect the locality of data in the L1 cache.
The FCU is used to select an appropriate subcache for data when the data is brought into the L1 cache. When data is replaced from the L1 cache, the RCU generates locality bits for the data according to its references. These locality bits go to the L2 cache along with the data when it is replaced, and they are reused when the data is brought from the L2 cache into the L1 cache again. Figure 11 shows the detailed architectural structure of the OST cache.

4.4.2 Control Mechanism

As mentioned before, there are three simple control units in the OST cache design: the FCU, the LDU, and the RCU. When a cache line is brought into the L1 cache, the FCU directs it to an appropriate subcache. While a cache block is in the L1 cache, the LDU is responsible for monitoring its usage and recording the related information in the corresponding locality bits. No extra cycles are needed for LDU operations, because the LDU and the caches work in parallel. When a cache block is replaced from the L1 cache, the RCU uses the locality information recorded by the LDU to generate new locality bits, which are then stored in the L2 cache along with the whole cache block. RCU operations are simultaneous with write-backs to the L2 cache, so the RCU does not reduce write-back speed. Since the FCU, LDU, and RCU work simultaneously with the OST cache, no extra CPU cycles are needed in the OST cache.

Locality bits in L1             Locality of the L1 block
0 xxxx                          Only one word was referenced, once
1 xxxx, only one x = 1          The same word was referenced at least two times
1 xxxx, more than one x = 1     More than one word was referenced

Table 3: Data Locality in L1

Locality bits in L2     Locality of the L2 line
00 0000                 Referenced for the first time
00 xxxx                 No locality was detected before
01 xxxx                 Temporal locality was detected before
10 xxxx                 Spatial locality was detected before

Table 4: Data Locality in L2

FCU control mechanism

The main responsibility of the FCU is to decide into which subcache a cache block should be placed. The FCU makes this decision based on the locality bits of the cache line in the L2 cache. See Table 5.

Locality bits   ONC   SLC   TLC   Bypass
00 0000         X
00 xxxx         X                 X
01 xxxx         X     X     X
10 xxxx               X

Table 5: FCU Control Mechanism

• The locality bits are 00 0000, which means the line is accessed for the first time. The line is brought into the ONC subcache.
• The locality bits are 00 xxxx (at least one x is 1), which means that some words in this line have been referenced before and did not show any locality. The cache line is bypassed by the L1 data cache if the word requested this time was referenced before; otherwise the whole line is brought into the ONC subcache.
• The locality bits are 01 xxxx (at least one x is 1), which means that some words in the line presented temporal locality before. If one of those temporal words is referenced again this time, the word goes to the TLC; if the word referenced this time is not one of those temporal words, the whole line goes to the ONC. But before directing the line into one of the subcaches, the FCU checks whether a word belonging to this line is already in the TLC. If so, the whole line goes to the SLC, and the word in the TLC is moved to the SLC too. If no word belonging to this line is in the TLC, the above operations are performed.
• The locality bits are 10 xxxx (at least one x is 1): the whole cache line goes to the SLC.
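The fetch-time decision can be condensed into C (a behavioural sketch of Table 5 and the rules above; the names, the bit layout, and the word_in_tlc flag, which stands for a TLC lookup the hardware would perform, are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { TO_ONC, TO_SLC, TO_TLC, BYPASS } fcu_choice_t;

    /* lb2 holds the six L2 locality bits: bits 5..4 are the line code
       (00 unknown/none, 01 temporal, 10 spatial) and bits 3..0 are the
       per-word bits. word is the word requested this time. */
    fcu_choice_t fcu_select(uint8_t lb2, int word, bool word_in_tlc) {
        uint8_t code  = (lb2 >> 4) & 0x3;
        uint8_t wbits = lb2 & 0xF;

        if (code == 0 && wbits == 0)   /* 00 0000: first access           */
            return TO_ONC;
        if (code == 0)                 /* 00 xxxx: no locality so far     */
            return ((wbits >> word) & 1) ? BYPASS : TO_ONC;
        if (code == 1) {               /* 01 xxxx: temporal words exist   */
            if (word_in_tlc)           /* a word of this line is already  */
                return TO_SLC;         /* in the TLC: promote line to SLC */
            return ((wbits >> word) & 1) ? TO_TLC : TO_ONC;
        }
        return TO_SLC;                 /* 10 xxxx: spatial                */
    }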
After selecting the subcache for a cache block, the FCU generates initial locality bits, which are stored along with the whole cache line in the L1 cache. The initial block locality bit is set to 0; the locality bit of the referenced word is set to 1; and all other locality bits in the same cache block are set to 0. For example, if the second word in a cache block is referenced, the initial value of the locality bits is 0 0010. This way of setting the initial locality bits can predict data locality more accurately and can distinguish data with temporal locality from data without any temporal or spatial locality. This cannot be done in the SS/NS cache, which sets all locality bits to 0 when a cache line is fetched into the L1 cache.

LDU control mechanism

The LDU monitors and records the usage of each cache block during its cache lifetime. When a cache block is accessed, the LDU updates the locality bits for the block. First, the block locality bit is set to 1. Then the locality bit associated with the currently referenced word is set to 1. These locality bits are used by the RCU to identify the locality when the block is replaced from the L1 data cache. For example, when a cache block is brought into the L1 cache with locality bits 0 0010, the second word in the cache block has been accessed. During its cache lifetime, if the second word is accessed again, the LDU updates the locality bits to 1 0010; if the third word is then accessed, the LDU updates the locality bits to 1 0110; if no word is accessed during the whole cache lifetime, the locality bits of this cache block remain 0 0010.
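The LDU update itself is just two bit-set operations per access. Here is a C sketch of the example above (the bit layout is my assumption: bit 4 is the block locality bit and bit w the locality bit of word w, so the 0 0010 of the text is 0x02):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_BIT (1u << 4)

    uint8_t ldu_update(uint8_t locality_bits, int word) {
        locality_bits |= BLOCK_BIT;             /* the block was accessed again */
        locality_bits |= (uint8_t)(1u << word); /* mark the referenced word     */
        return locality_bits;
    }

    int main(void) {
        uint8_t lb = 0x02;       /* 0 0010: filled by a reference to word 1 */
        lb = ldu_update(lb, 1);  /* word 1 accessed again -> 1 0010 (0x12)  */
        lb = ldu_update(lb, 2);  /* word 2 accessed       -> 1 0110 (0x16)  */
        printf("locality bits = 0x%02X\n", lb);
        return 0;
    }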
RCU control mechanism

When a cache block is replaced from L1 cache to L2 cache, the RCU generates the new locality bits for the cache line from the locality information gathered during its L1 cache lifetime and the corresponding locality bits in the L2 cache.

Figure 12: State Transition Diagram of Localities

Figure 12 is the state transition diagram of the localities; it shows that the OST cache can support every kind of locality change. Here, locality means the locality of a cache line in L2 cache. Initially, all lines have unknown locality (N). Under condition A, locality changes to T; under condition B, locality changes to S; under condition O, locality remains N. If a line has locality T, then under O1, O2, A, and B its locality changes to T, N, T, and S, respectively. If a line has locality S, then under O3, O4, A1, A2, and B its locality changes to S, N, T, S, and S, respectively. These conditions are detailed in Table 6, which also shows how the RCU generates the locality bits. For clarity, assume that LB1 is the current locality bits of the cache block being replaced from L1 cache, LB2 is the previous locality bits of this cache block in L2 cache, and LB is the new locality bits that will be stored in L2 cache. LB2 can be 00 0000, 00 xxxx, 01 xxxx, or 10 xxxx (at least one x in xxxx is 1); LB1 can be 0 yyyy or 1 yyyy (at least one y in yyyy is 1). Table 7 gives some examples.

• If LB2 = 00 0000, which means this L2 line was referenced for the first time, and

  - If the corresponding L1 block was not referenced again during its last L1 cache lifetime (LB1 = 0 yyyy), then mark this L2 line as having no locality (LB = 00 yyyy, only one y is 1)

  - If only one word in the corresponding L1 block was referenced during the last L1 cache lifetime (LB1 = 1 yyyy, only one y is 1), then mark the L2 line as having temporal locality (LB = 01 yyyy)

  - If more than one word in the corresponding L1 block was referenced during the last L1 cache lifetime (LB1 = 1 yyyy, more than one y is 1), then mark the L2 line as having spatial locality (LB = 10 yyyy)

• If LB2 = 00 xxxx, which means the L2 line has been marked as having no locality, and

  - If the corresponding L1 block was not referenced again during the last L1 cache lifetime (LB1 = 0 yyyy), then this L2 line still has no locality: LB = 00 (xxxx OR yyyy)

  - If only one word in the corresponding L1 block was referenced during the last L1 cache lifetime (LB1 = 1 yyyy, only one y is 1), then mark this L2 line as having temporal locality (LB = 01 yyyy)

  - If more than one word in the corresponding L1 block was referenced during the last L1 cache lifetime (LB1 = 1 yyyy, more than one y is 1), then mark this L2 line as having spatial locality (LB = 10 yyyy)

• If LB2 = 01 xxxx, which means one or more words in this L2 line have been marked as having temporal locality, and

  - If the corresponding L1 block was not referenced again during the last L1 cache lifetime (LB1 = 0 yyyy), and

    * If the referenced word did not exhibit temporal locality during its previous L1 lifetime, then this L2 line is still a temporal line (LB = LB2)

    * If the referenced word presented temporal locality before, reset the corresponding locality bit in LB to 0. If all four locality bits are zero after the reset, mark this line as having no locality (LB = 00 yyyy); if they are not all zero, mark this L2 line as having temporal locality: LB = 01 (xxxx XOR yyyy)

  - If only one word in the corresponding L1 block was referenced during the L1 cache lifetime (LB1 = 1 yyyy, only one y is 1), mark this L2 line as having temporal locality: LB = 01 (xxxx OR yyyy)

  - If more than one word in the corresponding L1 block was referenced during the cache lifetime (LB1 = 1 yyyy, more than one y is 1), mark this L2 line as having spatial locality (LB = 10 yyyy)

• If LB2 = 10 xxxx, which means this L2 line was marked as having spatial locality before, and

  - If the corresponding L1 block was not referenced again during the last L1 cache lifetime (LB1 = 0 yyyy): if the word referenced this time was never referenced before, the line keeps its spatial locality (LB = 10 xxxx); if it was referenced before but more than one word in the L2 line still presents spatial locality, the line also stays spatial (LB = 10 (xxxx XOR yyyy)); if only one word in the L2 line still has spatial locality, mark the L2 line as having no locality (LB = 00 yyyy)

  - If only one word in the corresponding L1 block was referenced during the last L1 cache lifetime (LB1 = 1 yyyy, only one y is 1): if this word was not referenced before, the line stays spatial (LB = 10 xxxx); if it was referenced before but more than one word in this L2 line still has spatial locality, the line also stays spatial (LB = 10 (xxxx XOR yyyy)); otherwise, mark this L2 line as having temporal locality (LB = 01 yyyy)

  - If more than one word in the corresponding L1 block was referenced during the last L1 cache lifetime (LB1 = 1 yyyy, more than one y is 1), mark this L2 line as having spatial locality: LB = 10 (xxxx OR yyyy)
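The case analysis above (tabulated in Table 6 below) reduces to a small function over bit masks. The following C sketch reproduces it, with xxxx and yyyy encoded as 4-bit masks; the encodings and names are assumptions made for the example, but the function agrees with every row of Table 7.

```c
enum loc { LOC_N, LOC_T, LOC_S };

struct lb { enum loc s; unsigned m; };   /* locality + 4-bit word mask */

static int bits4(unsigned m)             /* population count, 4 bits   */
{
    return (m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
}

/* RCU: combine the old L2 bits LB2 = (s, x) with the L1 usage bits
 * LB1 = (b, y) of the block being written back. */
struct lb rcu_merge(struct lb lb2, int b, unsigned y)
{
    unsigned x = lb2.m, rest;
    struct lb out;

    switch (lb2.s) {
    case LOC_N:                                    /* 00 0000 / 00 xxxx */
        if (!b)                 { out.s = LOC_N; out.m = x | y; }
        else if (bits4(y) == 1) { out.s = LOC_T; out.m = y; }
        else                    { out.s = LOC_S; out.m = y; }
        break;
    case LOC_T:                                    /* 01 xxxx           */
        if (b && bits4(y) > 1)  { out.s = LOC_S; out.m = y; }       /* B  */
        else if (b)             { out.s = LOC_T; out.m = x | y; }   /* A  */
        else if (!(x & y))      { out.s = LOC_T; out.m = x; }       /* O1 */
        else {
            rest  = x ^ y;          /* clear the re-used word: O1 or O2 */
            out.s = rest ? LOC_T : LOC_N;
            out.m = rest ? rest  : y;
        }
        break;
    default:                                       /* 10 xxxx           */
        if (b && bits4(y) > 1)  { out.s = LOC_S; out.m = x | y; }   /* B  */
        else if (!(x & y))      { out.s = LOC_S; out.m = x; }    /* O3/A1 */
        else {
            rest = x ^ y;                  /* spatial words left over   */
            if (bits4(rest) > 1) { out.s = LOC_S; out.m = rest; }
            else if (b)          { out.s = LOC_T; out.m = y; }      /* A2 */
            else                 { out.s = LOC_N; out.m = y; }      /* O4 */
        }
        break;
    }
    return out;
}
```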
LB2      Condition  LB1                            LB                                           Locality
00 0000  O          0 yyyy (only one y is 1)       00 yyyy                                      N
         A          1 yyyy (only one y is 1)       01 yyyy                                      T
         B          1 yyyy (more than one y is 1)  10 yyyy                                      S
00 xxxx  O          0 yyyy (only one y is 1)       00 (xxxx OR yyyy)                            N
         A          1 yyyy (only one y is 1)       01 yyyy                                      T
         B          1 yyyy (more than one y is 1)  10 yyyy                                      S
01 xxxx  O1         0 yyyy (only one y is 1)       if (xxxx OR yyyy) != xxxx: 01 xxxx           T
                                                   if (xxxx XOR yyyy) != 0: 01 (xxxx XOR yyyy)  T
         O2         0 yyyy (only one y is 1)       if (xxxx XOR yyyy) = 0: 00 yyyy              N
         A          1 yyyy (only one y is 1)       01 (xxxx OR yyyy)                            T
         B          1 yyyy (more than one y is 1)  10 yyyy                                      S
10 xxxx             let zzzz = (xxxx AND yyyy)
         O3         0 yyyy (only one y is 1)       if zzzz = 0: 10 xxxx                         S
                                                   if zzzz != 0 and more than one bit of
                                                   (xxxx XOR yyyy) is 1: 10 (xxxx XOR yyyy)     S
         O4         0 yyyy (only one y is 1)       if zzzz != 0 and only one bit of
                                                   (xxxx XOR yyyy) is 1: 00 yyyy                N
         A1         1 yyyy (only one y is 1)       if zzzz = 0: 10 xxxx                         S
                                                   if zzzz != 0 and more than one bit of
                                                   (xxxx XOR yyyy) is 1: 10 (xxxx XOR yyyy)     S
         A2         1 yyyy (only one y is 1)       if zzzz != 0 and only one bit of
                                                   (xxxx XOR yyyy) is 1: 01 yyyy                T
         B          1 yyyy (more than one y is 1)  10 (xxxx OR yyyy)                            S

Table 6: RCU Control Mechanism

LB2      LB1     LB       Old locality  New locality
00 0000  0 0100  00 0100  –             N
00 0000  1 0100  01 0100  –             T
00 0000  1 1100  10 1100  –             S
00 0100  0 1000  00 1100  N             N
00 0100  1 0010  01 0010  N             T
00 0100  1 1100  10 1100  N             S
01 1100  0 0100  01 1000  T             T
01 0100  0 0010  01 0100  T             T
01 0100  0 0100  00 0100  T             N
01 0100  1 0010  01 0110  T             T
01 0100  1 0110  10 0110  T             S
10 1110  0 0001  10 1110  S             S
10 1110  0 0100  10 1010  S             S
10 1100  0 0100  00 0100  S             N
10 1110  1 0001  10 1110  S             S
10 1110  1 0100  10 1010  S             S
10 1100  1 0100  01 0100  S             T
10 1100  1 0110  10 1110  S             S

Table 7: Examples of the RCU Control Mechanism

4.5 Advantages of the OST Cache

1. Reducing the conflicts between data that have different types of locality (spatial, temporal, or no locality). The OST cache has three subcaches: the ONC subcache is designed to cache data with no locality or unknown locality, while data with a determinate locality is cached in the temporal subcache or the spatial subcache. These three subcaches minimize interference between data with different localities and exploit the characteristics of each type of locality to increase system performance. All the cache designs mentioned in Section 3 split the L1 cache into two subcaches; they have no dedicated subcache for data with no or unknown locality and simply cache such data in the spatial subcache or the temporal subcache.

2. Changing localities during runtime. Data may present different localities over the course of a run, so a cache system should be able to detect changes in data locality and cache the data in an appropriate subcache according to its current locality. The more accurate the locality detection, the less the conflict between different localities and the higher the system performance. The OST cache supports data locality change not only when the data is replaced from the main cache but also while it is in the L1 data cache: temporal locality data can be moved to the spatial subcache immediately after its locality change (from temporal to spatial) is detected.
The locality change supported by the OST cache is not limited to changes between spatial and temporal locality; it also includes changes between spatial or temporal locality and no locality. In the NTS cache, by contrast, once a cache block is marked as temporal, it has no chance to change to spatial. In the SS/NS cache, although data locality can change from temporal to spatial and from spatial to temporal, temporal or spatial data can never change to no-locality data. The flexibility of the data locality change enables the OST cache to reflect data locality more accurately, reducing the conflicts between data with different localities and leading to higher performance.

3. Handling temporal locality more accurately. The OST cache handles temporal locality at word precision instead of at the granularity of a whole cache block, which is generally more than one word. This minimizes interference between temporal locality words and other words that may have spatial locality or no locality. In the NTS and SS/NS caches, when a cache block is marked as temporal because of repeated references to one word in it, every word in that block is handled as temporal locality data: in the NTS cache, the whole temporal block goes to the temporal cache when one of its words is referenced; in the SS/NS cache, each word of the temporal block goes to the temporal subcache when that word is referenced. In reality, one word presenting temporal locality does not guarantee that the other words in the same cache block all have temporal locality, and handling all the words in the block uniformly as temporal locality data may decrease performance. For example, suppose a cache block is marked temporal because of repeated references to one word during its cache lifetime, while the other three words in the block have spatial locality. If all four words are handled as temporal locality data and only the currently referenced word is placed into the TLC (as in the SS/NS cache), three cache misses result. On the other hand, if all four words really do present temporal locality and there is no spatial locality between them, fetching all four words into the TLC (as in the NTS) may cause more bus traffic, since the other three words may never be referenced during the current cache lifetime. So, in the OST cache, only the word that presents temporal locality is fetched into the TLC; any reference to another word in the same cache block causes the whole block to be fetched into the ONC, where the localities of these words can be detected. There is, however, a potential cache coherence problem with this method: with a temporal word in the TLC, a reference to another word with no (or unknown) locality in the same cache block would fetch the whole block from the second-level cache into the ONC, leaving two copies of the same temporal word in the main cache, one in the ONC and the other in the TLC. In the OST cache design, this case never happens: every time a word in a temporal cache block is referenced, the OST control mechanism checks whether a temporal word of the same cache block is currently in the TLC.
If so, the whole cache block is fetched into the SLC as a spatial locality block and the temporal word is moved from the TLC to the SLC as well; otherwise, the whole block is fetched into the ONC. Thus, the potential cache coherence problem is avoided.

4. Bypassing words that present neither temporal nor spatial locality. If a word did not present any locality during its last cache lifetime and the whole cache block also has no definite locality, the word bypasses the L1 data cache and is sent directly to the CPU when it is referenced again. This does not mean that the word has no chance of changing its locality and coming back into the L1 cache. For example, when another word in the same cache block is referenced for the first time, the whole cache block, including the word with no locality, is fetched into the OST cache; if the locality of the whole block changes to something other than no locality during this cache lifetime, the word will not be bypassed the next time it is referenced. Bypassing words with no locality reduces bus traffic between the main cache and the second-level cache and saves valuable main cache space, reducing conflict misses and improving system performance.

Table 8 summarizes all the related caches, including the OST cache.

Caches        Subcaches  L2 cache  Locality detection  Detection time  Initial locality  Runtime locality change  Bypass
Dual cache    S, T       N         LPT                 RT              T                 S ↔ T                    Y
STS           S, T       Y         Counters            RT              S                 S → T                    N
SS/NS         S, NS      Y         Flags               RT              S                 S ↔ NS                   N
NTS           T, NT      Y         NTDU                RT              T                 T → NT                   N
Assist cache  S, T       N         Compiler            CT              –                 –                        N
OST           S, T, N    Y         LPU                 RT              N                 S ↔ T, S/T ↔ N           Y

Table 8: Summary of the Split Caches

5 Simulation

This section has three parts. The first part is a brief description of the simulator and benchmarks used; the second part gives a detailed description of the simulations that look for the best configuration of the OST cache design; the last part evaluates the OST cache design by comparing it to three other cache designs.

5.1 Benchmarks and Simulator

All simulations are based on the SimpleScalar simulator (sim-outorder 3.0). The SimpleScalar tool set is a system software infrastructure used to build modeling applications for program performance analysis, detailed microarchitectural modeling, and hardware-software co-verification. It was created by Todd Austin; today, SimpleScalar is developed and supported by SimpleScalar LLC (www.simplescalar.com). The SimpleScalar tools are widely used for computer architecture research and instruction; for example, in 2002, more than one third of all papers published in top computer architecture conferences used the SimpleScalar tools to evaluate their designs [8].

The simulation runs under SunOS release 5.8 on a Sparc workstation. The simulator is implemented by modifying most parts of cache.c and a few lines of sim-outorder.c; these two C files are included in the SimpleScalar tool set, and the other parts of SimpleScalar are unaltered. The default organization and configuration of the CPU are used: up to four outstanding loads are supported; there are 32 integer registers, 32 single-precision floating-point registers, and 2 double-precision floating-point registers; and four parallel pipelines are included.
Since the target applications are embedded systems such as multimedia and communications applications, and it is difficult to obtain all source files and test data, only JPEG, MPEG, and PEGWIT are used in the simulation. JPEG is a standardized compression method for full-color and gray-scale images; it is lossy, meaning that the output image is not exactly identical to the input image. MPEG is used for lossy compression of video and is a standard for high-quality digital video transmission. PEGWIT is a program for public key encryption and authentication; it uses an elliptic curve over GF(2^255), SHA-1 for hashing, and the symmetric block cipher Square.

All these benchmarks were compiled with the GNU GCC-based cross-compiler provided by SimpleScalar LLC. After being compiled, the benchmarks run on the modified simulator based on the SimpleScalar sim-outorder. During runtime, the L1 cache miss rate, average access cost, average bus traffic, and some other parameters are collected. Figure 13 shows the structure of the simulator environment.

Figure 13: Simulator Structure (benchmarks → GNU gcc → compiled benchmarks → modified sim-outorder, with input data in and output parameters out)

5.2 Simulation Metrics

For evaluation purposes, average access cost (AAC), L1 cache miss rate (MR), and average bus traffic (ABT) are used as performance measures.

The L1 cache miss rate is obtained by dividing the total number of misses by the total number of accesses.

The average bus traffic is obtained by multiplying the number of each type of data transfer between the main cache and the second-level cache (fetch or writeback) by its data width (32 bits for data to and from the TLC; 4 × 32 bits for the others), summing the products, and dividing by the total number of accesses. Since embedded systems with a small L1 cache and a narrow bus between the L1 and L2 caches are the target systems, a dedicated bus with a bandwidth of 4 bytes per clock is assumed between the L2 cache and the L1 cache. The effects of the L2 cache are ignored, i.e., a 100% hit rate in the L2 cache is assumed. Each L1 cache hit has a latency of one cycle; an L1 cache miss that hits in the L2 cache has a latency of four cycles for the first word and one cycle for each word thereafter.

The average access cost is obtained by multiplying the number of each type of data access (hit or miss) by its cost (a hit costs 1 cycle, a miss costs 7 cycles), summing the products, and dividing by the total number of accesses. The average access cost therefore depends on cache access costs and miss rates.

In the OST cache, data is put into the ONC when it is referenced for the first time. While the data is in the ONC, its use is monitored; when the data is replaced from the ONC, it is marked in the L2 cache as temporal locality data, spatial locality data, or no-locality data. When the data is referenced again, it is put into one of the subcaches based on its locality. Since different subcaches have different configurations, system performance can be improved by taking advantage of the characteristics of each type of data, while putting data into the wrong subcache may lower system performance. Thus, the correctness of data locality detection is critical for the OST cache. To analyze the accuracy of the data locality detection, the total number of data items for each locality is gathered separately; see Table 9.

changed_t   # of data items marked as T when first replaced from the ONC
changed_n   # of data items marked as N when first replaced from the ONC
changed_s   # of data items marked as S when first replaced from the ONC

Table 9: Parameters of Locality Change
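As a cross-check of the three measures defined above, here is how MR, AAC, and ABT would be computed from the collected counts. This is a sketch only; the simulator's actual accounting lives in the modified cache.c, and the function names are ours.

```c
/* Hit = 1 cycle; miss = 7 cycles (4 for the first word of a
 * four-word line plus 1 for each of the remaining three). */
double miss_rate(long misses, long accesses)
{
    return (double)misses / (double)accesses;
}

double avg_access_cost(long hits, long misses)
{
    return (hits * 1.0 + misses * 7.0) / (double)(hits + misses);
}

/* counts[i] = number of transfers of kind i (fetch or writeback),
 * width[i]  = amount of data moved per transfer of kind i.       */
double avg_bus_traffic(const long *counts, const double *width,
                       int kinds, long accesses)
{
    double total = 0.0;
    for (int i = 0; i < kinds; i++)
        total += counts[i] * width[i];
    return total / (double)accesses;
}
```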
Runtime data locality change is a unique feature of the OST cache, and its impact should be evaluated in simulation. So the total number of fetched data for each locality, the total number of replaced data for each locality, the total number of data for each locality change, and the total number of bypassed data are gathered to record the data locality changes; see Table 10 for the details.

simulation.bypass       # of data items bypassed
simulation.ttos         # of cache lines changed from T to S
simulation.fetched_n    # of no-locality cache lines fetched from the secondary cache
simulation.fetched_s    # of S cache lines fetched from the secondary cache
simulation.fetched_t    # of T cache lines fetched from the secondary cache
simulation.fetched_tn   # of N-locality data in a T-type cache line fetched
simulation.replaced_n   # of N-locality cache lines replaced from the main cache
simulation.replaced_s   # of S-locality cache lines replaced from the main cache
simulation.replaced_t   # of T-locality cache lines replaced from the main cache
simulation.changed_nt   # of cache lines changed from N locality to T locality
simulation.changed_ns   # of cache lines changed from N locality to S locality
simulation.changed_nn   # of cache lines changed from N locality to N locality
simulation.changed_tt   # of cache lines changed from T locality to T locality
simulation.changed_ts   # of cache lines changed from T locality to S locality
simulation.changed_tn   # of cache lines changed from T locality to N locality
simulation.changed_st   # of cache lines changed from S locality to T locality
simulation.changed_ss   # of cache lines changed from S locality to S locality
simulation.changed_sn   # of cache lines changed from S locality to N locality
simulation.changed_tnt  # of cache lines in which one word changed from T to N, but the whole line is T
simulation.changed_sns  # of cache lines in which one word changed from S to N, but the whole line is S

Table 10: Simulation Metrics

5.3 Performance Evaluation

The performance evaluation includes the following:

• Evaluating the performance of the OST cache by checking the impact of the ONC subcache, the SLC subcache, and the bypass operation

• Finding the best configuration for the OST cache

• Comparing the OST cache design with other related cache designs in complexity and system performance

5.3.1 OST Cache Performance Under Different Sizes of ONC

Simulation parameters: The size of the ONC subcache is important in the OST cache design. If the ONC is too small, the time data remains in the ONC subcache is too short for accurate locality detection. On the other hand, if the ONC is too large, locality detection accuracy matters less, because the effectiveness of the split cache eventually decreases to the level of a conventional cache. For the purpose of evaluation, assume that the TLC is 1k bytes and the SLC is 4k bytes, and let the size of the ONC vary from 4k bytes to 16k bytes. From Figure 14 we can see the following:

• The miss rate (MR) decreases as the size of the ONC increases in all benchmarks.
Increasing the ONC size extends the time data stays in the ONC, which may lead to more cache hits as well as less bus traffic and lower access cost.

• The decrease in MR slows as the size of the ONC grows; see, for example, the jpeg part of Figure 14. When the ONC grows by 2k bytes, from 4k bytes to 6k bytes, the miss rate decreases by 22%; when the ONC grows by a further 10k bytes, from 6k to 16k, the miss rate again decreases by only 22%. The improvement gained by increasing the size of the ONC becomes less effective once the ONC is large enough.

• Average access cost (AAC) and average bus traffic (ABT) show a similar relationship to the size of the ONC: the AAC and ABT decrease as the size of the ONC increases, but the improvement slows once the ONC size exceeds a certain point.

Figure 14: Performance Under Different Sizes of ONC

5.3.2 OST Cache Performance Under Different Sizes of SLC

Simulation parameters: To analyze the impact of the size of the SLC, assume that the TLC is 1k bytes and the ONC is 6 or 8k bytes, and let the SLC vary from 4k bytes to 12k bytes. First, the SLC should be large enough to avoid most conflict misses; second, increasing the size of the SLC results in larger hardware overhead, because each cache line carries 5 bits of additional tag. So finding the best SLC size range for the OST cache design is also important. The simulation results are as follows:

• In Figure 15, the miss rate (MR), average access cost (AAC), and average bus traffic (ABT) decrease as the size of the SLC increases. As the SLC becomes larger, e.g., greater than 8k bytes, the rate of decrease lessens, although the MR, AAC, and ABT are still decreasing.

• In Figure 16, increasing the size of the ONC has less effect than increasing the size of the SLC. For example, in the jpeg miss rate in Figure 15, the upper line is for ONC = 8k bytes and the bottom line is for ONC = 6k bytes: increasing the SLC from 4k bytes to 6k bytes gains a better miss rate than increasing the ONC from 4k bytes to 6k bytes. In addition, this effect diminishes as the size of the SLC becomes larger, especially in the jpeg and mpeg applications. So, when the size of the total L1 data cache is fixed, more space should be assigned to the SLC than to the ONC.

Figure 15: Performance Under Different Sizes of SLC

Figure 16: Performance Under Different Sizes of SLC and ONC

In the remaining simulations, ONC = 6k bytes and SLC = 8k bytes.

5.3.3 The Effectiveness of the Bypass Operation

To test the effectiveness of the bypass operation, the bypass function is removed from the OST cache design; all the above simulations are then repeated and the results are compared to the previous results, which include the bypass operation.
The OST cache was configured as follows:

                   ONC   SLC   TLC
Cache size (kb)    6     8     1
Block size (byte)  16    16    4
Associativity      4     4     fully
Write policy       wb    wb    wb

Table 11: OST Configuration

The bypass operation works as follows: when data in a cache line presents no locality, the data and the cache line it belongs to are marked as no locality. When the data is referenced again, the cache line it belongs to is not fetched into the main cache; only the referenced data is sent to the requesting unit, i.e., the CPU. The purpose of the bypass is to reduce bus traffic and increase system performance.
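The check itself is tiny. A sketch of the condition, using the same illustrative locality-bit encoding as in the earlier sketches (state 0 means no or unknown locality, ref is the per-word reference mask):

```c
/* A word is bypassed when its line carries no locality (state 00)
 * and the word's own reference bit is already set, i.e. the word
 * was referenced before without showing any locality.  The word is
 * then sent straight to the CPU and no L1 block is allocated. */
int should_bypass(unsigned state, unsigned ref, int word)
{
    return state == 0 && ((ref >> word) & 1u);
}
```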
The impact of the bypass operation on average bus traffic, average access cost, and miss rate is analyzed. Figure 17 shows the simulation results both with and without the bypass operation.

Figure 17: Impact of the Bypass

In Figure 17, for the pegwit applications, the bypass operation reduces average bus traffic significantly (by 6.83% and 6.77%), but it also has some negative impact on the miss rate and the average access cost: the miss rates increase by 0.72% and 0.37%, and the average access costs increase by 0.28% and 0.16%. Since these penalties in miss rate and average access cost are small compared to the improvement in average bus traffic, using the bypass operation in pegwit-like applications is a good alternative.

In the mpeg and jpeg applications, the bypass operation has little impact on bus traffic. The average bus traffic decreases only very slightly, by 0.017%, in the jpeg application; in the mpeg application, bus traffic even increases by 0.25%. The reason is that the number of bypass operations in the mpeg dec application (581) is too small compared to the numbers in the pegwit applications (68777 and 40443). The miss rate and the average access cost also become worse with the bypass operation. So the bypass operation is not suitable for applications like mpeg and jpeg, which have small numbers of no-locality data, while applications like pegwit, which have a large number of no-locality data, benefit from the bypass operation.

5.3.4 Performance Comparisons

Simulation parameters: In this part, the OST cache is compared first to a conventional cache (CVN) and then to two other similar cache designs, the NTS and the SS/NS. In the simulation, all four cache designs have a similar cache size. The CVN is 16k bytes. The NTS cache has a 1k-byte temporal cache and a 16k-byte non-temporal cache, for a total of 17k bytes. The SS/NS has a 3k-byte non-spatial subcache and a 12k-byte spatial subcache, for a total of 15k bytes. Based on the previous simulation results, the OST cache has a 1k-byte temporal subcache, an 8k-byte spatial subcache, and a 6k-byte no-locality subcache, for a total of 15k bytes. All the subcaches in the SS/NS have 3-way associativity. The temporal subcache is fully associative in the OST and the NTS caches; the other subcaches and the CVN have 4-way associativity. The write-back policy is used in all the cache designs. All the subcaches have a 16-byte block size, except the temporal subcache in the OST cache and the non-spatial subcache in the SS/NS, which have a 4-byte block size.

L1 cache miss rate (MR), average access cost (AAC), and average bus traffic (ABT) are again used as the performance measures; their definitions remain the same as before. The benchmarks used to compare these cache designs are the jpeg, mpeg, and pegwit applications. The configurations of the cache designs are as follows:

                   CVN   NTS         SS/NS       OST
Cache size (kb)    16    17 (1+16)   15 (3+12)   15 (1+8+6)
Block size (byte)  16    16/16       4/16        4/16/16
Associativity      4     fully/4     3/3         fully/4/4
Write policy       wb    wb          wb          wb

Table 12: Cache Configurations

Cost (cycles)        CVN   NTS   SS/NS   OST
Main cache hit       1     1     1       1
Secondary cache hit  7     7     7       7

Table 13: Memory Access Cost

Locality   CVN   NTS   SS/NS   OST
S          32B   32B   32B     32B
N          –     –     4B      0/32B
T          –     32B   –       4/32B

Table 14: Data Bus Width

Complexity: For the purpose of a more logical and reasonable comparison, the complexities of these caches are evaluated first. The total number of bits required to implement a cache is used as the measure of complexity. Only L1 caches are included here, because in the target systems the L2 caches are often off-chip and cost less than the L1 cache. Since all caches are three- or four-way associative, two bits are enough to encode the LRU state information. One validity bit is needed for each cache block. The NTS needs four reference bits and one NT bit per block; the SS/NS includes four bits (flags) to detect spatial locality in its non-spatial subcache; and the SLC and ONC subcaches of the OST cache include 5 bits per block to detect locality (the one-word TLC blocks carry 1 bit each, as in Table 15). Assume that the caches are physically tagged, with the physical address being 36 bits wide.

Number of blocks = CacheSize ÷ BlockSize
Number of sets = Number of blocks ÷ Associativity
Bits/Block = ReplacementPolicy (LRU) + ValidityBit (V) + LocalityBits (L) + Tag + Data

                   CVN     NTS          SS/NS        OST
Cache size (kb)    16      17 (1+16)    15 (3+12)    15 (1+8+6)
Block size (byte)  16      16/16        4/16         4/16/16
Associativity      4       fully/4      3/3          fully/4/4
Number of blocks   1024    64/1024      768/768      256/512/384
Number of sets     256     0/256        256/256      0/128/96
LRU+V+L            2+1+0   2+1+5/2+1+5  2+1+4/2+1+0  2+1+1/2+1+5/2+1+5
Tag bits           26      34/26        27/29        35/27/34
Data bits          128     128/128      32/128       32/128/128
Bits/block         157     170/162      66/160       70/163/170
Total bits         160768  176768       173568       166656

Table 15: Cache Complexity

Table 15 shows that all four caches have comparable complexities.
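The totals in Table 15 can be re-derived directly from the formulas. For example, the following snippet reproduces the OST entry (the other three columns work the same way); the bits-per-block values are taken straight from the table.

```c
#include <stdio.h>

int main(void)
{
    /* {cache size in bytes, block size in bytes, bits per block} */
    const struct { long size; int block; int bits; } sub[3] = {
        { 1 * 1024,  4,  70 },   /* TLC */
        { 8 * 1024, 16, 163 },   /* SLC */
        { 6 * 1024, 16, 170 },   /* ONC */
    };
    long total = 0;
    for (int i = 0; i < 3; i++)
        total += (sub[i].size / sub[i].block) * (long)sub[i].bits;
    printf("OST total bits: %ld\n", total);  /* 166656, as in Table 15 */
    return 0;
}
```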
First, the OST is compared to the CVN cache; see Figure 18.

Figure 18: Comparison of the OST and CVN Cache — (a) miss rate, (b) average bus traffic, (c) average access cost

These simulation results show that, compared to the conventional cache on all the benchmarks used, the OST cache greatly improves miss rates, average cache access time, and average bus traffic, which means that the OST cache design does have a positive impact on system performance.

Then, the OST is compared to the SS/NS and the NTS. Figure 19 shows the miss rates for the OST, NTS, and SS/NS on each benchmark; the OST has the best miss rate on all of the benchmarks used. The average access cost shows almost the same characteristics as the miss rate in the simulation, as seen in Figure 20.

Figure 19: Miss Rate

Figure 20: Average Access Cost

The OST cache design has the best average access cost on all the benchmarks used. Figure 21 shows the average bus traffic of the NTS, the SS/NS, and the OST on the different benchmarks. The SS/NS has the best ABT values on all the benchmarks used; the ABT values of the OST cache design lie between those of the SS/NS and the NTS on all the benchmarks; and the ABT values of the NTS are the worst on all benchmarks used.

Figure 21: Average Bus Traffic

Analyzing the simulation results, the number of times a non-spatial block is brought into the non-spatial subcache and the number of times a non-spatial block is replaced from it are the main components of the bus traffic in the SS/NS cache. The non-spatial subcache block is only one word (4 bytes) long, much less than the four-word (16-byte) cache block in the spatial subcache; since only one word needs to be brought into the non-spatial subcache, the bus traffic is reduced significantly. That is why the SS/NS has the lowest bus traffic. In the OST cache, by contrast, the number of times a spatial block is brought into the spatial subcache and the number of times a spatial block is replaced from it are the main components of the bus traffic. The SLC subcache has a four-word cache block: when data is brought into L1 cache, the whole cache block (16 bytes) goes into L1 cache at the same time, which may cause more bus traffic compared to the SS/NS cache. Both subcaches of the NTS have 16-byte blocks; that is why the NTS has the highest bus traffic.

6 Summary

This thesis proposes a new data classification, which defines five data classes: a temporal locality class, a spatial locality class, a dual locality class, a no-locality class, and an unknown locality class. Each class should be treated separately, since each of them has its own access pattern. In addition, this thesis suggests that temporal locality data should be treated on a single-word basis rather than on the basis of a whole cache line.

Based on the above classification, this thesis also presents a new cache design, called the OST cache, which splits the L1 data cache into three subcaches: the ONC, the SLC, and the TLC. The ONC subcache is designed to cache data without locality or with unknown locality; the SLC subcache is intended to cache spatial locality data; and the TLC subcache is used to store data with temporal locality. Temporal locality data and unknown locality data are treated on a single-word basis in the OST cache. The control mechanism used in the OST cache design can identify data locality more accurately and deal with data that presents different localities more flexibly, leading to higher system performance.
The OST cache is designed for embedded systems with a small cache size and a narrow bus width. Simulation results showed that, compared to the CVN cache, the OST cache improved miss rates by 70% on average, access costs by 31% on average, and bus traffic by 75% on average for the benchmarks used, as shown in Figure 22. Compared to the SS/NS and the NTS, the OST cache design improved miss rates by 19% and 51% on average, respectively, and average access costs by 2% and 19% on average, respectively. With respect to bus traffic, although the simulation results showed a 37% increase in ABT compared to the SS/NS cache design, there is still a 59% reduction compared to the NTS cache design. Overall, the OST cache design shows improved system performance for the benchmarks used in the simulations.

Figure 22: System Performance Improvement

6.1 Future Work

Prefetching has been kept separate from the OST cache design, because the main objective of this thesis is the split cache. That does not mean that prefetching is not useful here; on the contrary, adding prefetching to the OST cache design may yield even more system performance improvement. Finding an appropriate prefetching method for the OST cache design is therefore an interesting topic worthy of future exploration.

In the simulation, only multimedia benchmarks are used. The OST cache design can also be used in other application fields, such as scientific computation and network applications, so checking its effects in such applications may prove helpful for expanding the OST cache design to wider use.

References

[1] G. Kurpanek, K. Chan, J. Zheng, E. Delano, and W. Bryg, PA7200: A PA-RISC Processor with Integrated High Performance MP Bus Interface, COMPCON Digest of Papers, Feb. 1994, pp. 375-382.

[2] A. Gonzalez, C. Aliagas, and M. Valero, A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality, in Proceedings of the International Conference on Supercomputing '95, July 1995, pp. 338-347.

[3] V. Milutinovic, M. Tomasevic, B. Markovic, and M. Tremblay, The Split Temporal/Spatial Cache: Initial Performance Analysis, SCIzzL-5, Mar. 1996.

[4] J. A. Rivers and E. S. Davidson, Reducing Conflicts in Direct-Mapped Caches with a Temporality-Based Design, in Proc. 1996 International Conference on Parallel Processing, August 1996.

[5] F. J. Sanchez, A. Gonzalez, and M. Valero, Software Management of Selective and Dual Data Caches, IEEE TCCA Newsletter, March 1997, pp. 11-16.

[6] V. Milutinovic, M. Prvulovic, D. Marinov, and A. Dimitrijevic, The Split Spatial/Non-Spatial Cache: A Performance and Complexity Evaluation, in Newsletter of the Technical Committee on Computer Architecture, IEEE Computer Society, July 1999.

[7] http://www.cs.umass.edu/weems/CmpSci635A/Lecture10/L10.34.html

[8] http://www.simplescalar.com

[9] A. J. Smith, Cache Memories, Computing Surveys, vol. 14, no. 3, pp. 473-530, 1982.

[10] N. P. Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, in Proceedings of the 17th International Symposium on Computer Architecture, June 1990, pp. 364-373.

[11] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd Edition (beta draft), Morgan Kaufmann, Menlo Park, CA, 2001.
[12] K. K. Chan et al., Design of the HP PA7200 CPU, Hewlett-Packard Journal, Feb. 1996, pp. 364-373.

[13] J. H. Lee, J. S. Lee, and S. D. Kim, A New Cache Architecture Based on Temporal and Spatial Locality, Journal of Systems Architecture, vol. 46, pp. 1451-1467, Sep. 2000.

[14] S. G. Abraham et al., Predictability of Load/Store Instruction Latencies, in Proc. of MICRO-26, 1993, pp. 139-152.

[15] Y. Jegou and O. Temam, Speculative Prefetching, in Proc. of the 1993 Int. Conf. on Supercomputing, 1993, pp. 57-66.

[16] T. L. Johnson and W. W. Hwu, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, in Proc. of the 24th International Symposium on Computer Architecture, Denver, Colorado, June 1997.

[17] T. L. Johnson, M. C. Merten, and W. W. Hwu, Run-time Spatial Locality Detection and Optimization, in Proc. of MICRO-30, Research Triangle Park, North Carolina, USA, December 1997.

[18] M. Prvulovic, D. Marinov, and V. Milutinovic, A Performance Reevaluation of the Split Temporal/Spatial Cache, Workshop Digest of the PAID/ISCA-98, Barcelona, Spain, June 1998.

[19] G. Tyson, M. Farrens, J. Matthews, and A. R. Pleszkun, A Modified Approach to Data Cache Management, in Proc. of the 28th Annual International Symposium on Microarchitecture, December 1995, pp. 93-103.

[20] T. L. Johnson, M. C. Merten, and W. W. Hwu, Run-time Spatial Locality Detection and Optimization, in Proc. of MICRO-30, Research Triangle Park, North Carolina, USA, December 1997.

[21] A. Naz, M. Rezaei, K. Kavi, and P. Sweany, Improving Data Cache Performance with Integrated Use of Split Caches, Victim Cache and Stream Buffers, in Proc. of the Workshop on MEDEA/PACT-2004, Antibes Juan-les-Pins, France, 2004.

[22] M. Prvulovic, D. Marinov, A. Dimitrijevic, and V. Milutinovic, The Split Spatial/Non-Spatial Cache: A Performance and Complexity Evaluation, in Newsletter of the Technical Committee on Computer Architecture, IEEE Computer Society, July 1999.

[23] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems, in Proc. of the 30th Annual International Symposium on Microarchitecture, Dec. 1997, pp. 330-335.

[24] M. Prvulovic, D. Marinov, A. Dimitrijevic, and V. Milutinovic, Split Temporal/Spatial Cache: A Survey and Reevaluation of Performance, in Newsletter of the Technical Committee on Computer Architecture, IEEE Computer Society, Spring 1999, pp. 1-10.

[25] J. Sahuquillo and A. Pont, Splitting the Data Cache: A Survey, IEEE Concurrency, vol. 8, no. 3, Jul.-Sept. 2000, pp. 30-35.

[26] J. A. Rivers et al., Utilizing Reuse Information in Data Cache Management, in Proc. 1998 Int'l Conf. on Supercomputing, ACM Press, New York, 1998, pp. 449-456.

[27] J. Sanchez and A. Gonzalez, A Locality Sensitive Multi-Module Cache with Explicit Management, in Proc. 1999 Int'l Conf. on Supercomputing, ACM Press, New York, 1999, pp. 51-59.
