How do caches work?
November 12, 2025
I think it is fair to say that most people have an idea of what caching does at a variety of levels. To someone not in tech, "clearing your cache" (along with those yummy cookies!) is a way to get your browser to update. So it is understood that some data is stored pretty close to you. To someone in tech, you may understand that caching data is a way to keep it accessible so that you don't have to go and retrieve it again. Content Delivery Networks (CDNs) cache webpages (and other distributed data) at different locations so that your performance as a user is faster. Great, makes sense!
For both of these examples, the ideas are based on what comes from CPU hardware. When a program is running and fetching instructions/data, accessing memory can be costly (in terms of cycle time).
Locality
When we fetch data from memory, two different themes emerge. Data that was just accessed is likely to be accessed again, and data that is accessed is likely to be close to the data accessed by the prior instruction. Those are examples of temporal locality and spatial locality, respectively. A good way to think about this is in loops. Let's look at the code below.
int sum = 0;
for (int i = 0; i < 10; i++) {
    sum = sum + arr[i];
}
For each value between 0 and 9, we add to the sum the value from arr at location i. Each time we iterate through the loop, we are accessing the sum (temporal locality) and we are accessing the next position in the array (spatial locality).
Specifically looking at temporal locality, accessing the value of sum from memory each time we use it can become quite a bottleneck in our program, especially when the loop becomes much larger. This is where caching comes in! Instead of going to the main memory every iteration, what if we had a smaller memory module on the chip? It would be closer to all of the work that is being done, which would make it much faster.
For now, I am only going to focus on a single core with a single cache. Caches in hardware can become more complex with multiple levels (L1, L2, or L3) and multiple cores (how does one core know what is in the cache of a different core?).
Hit or Miss?
Whether we have a cache hit or a cache miss is pretty straightforward. If the data we are trying to get is in the cache, it's a hit! And if it is not, then it's a miss. The data we are trying to get is identified by the address of its location in memory.
So what is the actual cache?
The cache is basically just a lookup table. In order to keep it small enough to fit on the chip and provide quick lookups, there are only so many records of data in that table. Now that makes sense, but how do we know if the data is in the table? We will use the address of the data that we are trying to access to index into this table. The "records" that I mentioned are actually called blocks or lines, and their length is called the block/line size. The block size is typically going to be 32-128 bytes.
We know that blocks are the records in the cache, but how do we know how many blocks are in the cache? To calculate this, we just take the size of the cache and divide by the block size. So if we have a 32KB cache and a 64B block size, then we know that we have 512 blocks!
Each block in the cache has multiple positions along it where we can put data. To figure out which block, and which position within that block, the data goes to, we look at the block offset and the block number. We find both of these values from the address.
If we have a 16B block size, the 4 least significant bits of the address will be the block offset and the remaining bits are the block number. Why 4 bits? $2^4$ is 16, so we get 4 bits. If we had a block size of 32B, then we would have 5 bits used for the offset. Say we had the address 0x123, which is 000100100011 in binary: the block offset would be 0011 and the block number would be 00010010.
We have said that the cache is indexed by the address and that the data is stored in the cache. But we also said that a line can contain more than one piece of data, denoted by the offset. So how do we know if the data we are looking for is actually in that line? In addition to the data, each line in the cache also stores a tag. On a lookup, we compare the stored tag against the block number of the address we are accessing.
Until this point, we never really touched on what the values are when the cache is instantiated. The cache may have some random data in it as well as random tags. To know if the data we have in it is actual data or some random gibberish, we will also keep track of a valid bit. This valid bit simply says whether the data is real or not. So, to get a cache hit, we need the tag to match our block number AND the valid bit to be 1.
Types of caches
There are three types of caches:
- Fully Associative
- Set-Associative
- Direct-Mapped
When an address can go to any block, that is an example of a fully associative cache. This is what we have talked about so far. When an address can go to only one block, that is a direct mapped cache. And in the middle, if an address can go to n number of blocks, that is a set associative cache.
Direct-Mapped Cache
Let's say that we have a cache with 4 lines. Of course, we will have more than 4 blocks in memory. In a direct-mapped cache, each block in memory will map to one specific line in the cache. But how? Just like before, the lowest bits are used for the block offset ($2^n$ equals the block size, where $n$ is the number of offset bits). However, instead of all the remaining bits being used for the tag, we take the next few bits for the index, and the bits after that are the tag. How many bits for the index? Similarly to the offset, it is $2^n$ where $n$ is the number that gets you to the number of lines in the cache. In our example with 4 lines, we would have $2^2=4$, so 2 index bits.
For a more concrete example, consider the address 0x12345678 being inserted into a direct-mapped cache with 16 lines and a block size of 32B.
Number of bits:
- Number of offset bits = $2^n=32$, with $n=5$
- Number of index bits = $2^n=16$, with $n=4$
- Number of tag bits = $32$ bits - offset bits - index bits
With the bits above, now we can find the bits that correspond to each chunk. Our hex value in binary is
00010010001101000101011001111000
and we can split it up into tag, index, offset like this
00010010001101000101011|0011|11000.
Looking specifically at the 4 bits for the index (0011, or 3 in decimal), we can easily figure out that this address maps to line 3 of our cache!
Set-Associative Cache
Similarly to a direct-mapped cache, a set-associative cache only has specific lines that the address is able to access. When we say set, we are talking about a set (or group) of lines in the cache that we are able to put the data for a specific address. For example, if we had a cache with 16 lines and the lines were in sets of 4, we would call that a 4 way set associative cache. There would be 4 sets each with 4 lines.
To find which set our address can be inserted into, we first find the number of sets that are in the cache. The general formula is the number of lines in the cache divided by the $n$ ways in each set, where ways is the number of lines in each set. If you had 32 lines in your cache and 2 ways in each set, then you would have 16 sets.
Just like the direct-mapped cache, we use the index bits to find which set to access. In our 32-line, 2-way cache that gave us 16 sets, that would be 4 bits for the index ($2^4=16$).
The direct-mapped cache is simply a 1-way set-associative cache!
Fully-Associative Cache
A fully-associative cache is a cache where any address can go to any line in the cache. When that is the case, we have no index bits and just have the tag and offset bits, because there is no need for indexing.
Replacement Policies
Now that we know how caches are organized and the different types of caches, we have to decide what to do when we are trying to write to the cache and it is full. A few different options could be to randomly kick out an entry, choose whatever was first in the cache (FIFO), or choose whatever was Least Recently Used (LRU).
Least Recently Used
Like the name implies, in an LRU cache we will replace the item in the cache that has least recently been used. To keep track of what has been used when, we will add an LRU counter. Let's say that we have a cache with just 4 lines. Each line will have an LRU counter value between 0 and 3, with none repeating. 0 is the line that was least recently used and 3 is the line that was most recently used. When we need to replace a line, we will place the new data in the line that has the LRU counter value of 0. When we do this, we will set that line's LRU counter to 3 and then decrement all of the other values.
If we are accessing the most recently used line (LRU counter of 3), then we do not need to change any of the values of the LRU counter because they are just the same as before.
But what happens if we want to access something that is in our cache, but it has an LRU counter of, say, 1? The line we access will have its LRU value changed to 3, and then anything with a value larger than the accessed line's old value (2 and 3 in this case) will be decremented. The line with the LRU counter value of 0 will stay the same.
Write Policy
The last topic I will touch on is the write policy. First we have to decide what to do when there is a write miss. In a write miss, we want to write a value to the cache but the block is not in the cache. At that point, do we bring it into the cache or not? Simply put, in a write-allocate cache we will bring that block into the cache, and in a no-write-allocate cache we do not. Write-allocate caches are what is commonly used, and the reasoning is related to temporal locality: any item we are writing, we are likely to access again, so we put it in the cache to make it quicker to access.
In addition to the allocation, we also have to decide when we will write the value to memory. In a write-through cache, we will update the memory immediately on every cache write. The other option is a write-back cache, where we only write to memory when that value is being replaced. This reduces the total number of times we have to access the memory. As you can imagine, write-back caches are much more popular.
Write Back Caches
Further exploring write-back caches, consider two scenarios: we wrote a block to the cache, or we read a block from memory into the cache. In both scenarios, the cache holds a value. These two scenarios work differently when we have to replace that block in the cache. In the first scenario, the cache has the newest data value, so when we replace it, we will have to write the current value in the cache back to memory. But in the second scenario, the cached value is the same value that is currently in memory, so we do not have to write it back. How do we know which scenario we are in? We use what is called a dirty bit. When the dirty bit is 0, we know that the value is "clean", meaning we don't need to write it to memory. If the dirty bit is 1, the value is "dirty" and we have to write it to memory.
Hopefully after reading this you have a better understanding of how caches work. This article only scratched the surface, and most notably only talked about single caches in a single core chip. As you add in additional levels of caches and multiple cores, the complexity of managing caches increases.