CPU cache hierarchy explained: L1, L2, L3 and why they exist

A CPU core clocked at 4-5 GHz completes a cycle in 0.2-0.25 nanoseconds. RAM responds in 50-80 nanoseconds, which is hundreds of cycles. Without cache, a core that went to RAM for every load would spend almost all of its time waiting for data. The cache hierarchy (L1, L2, L3) bridges this gap by keeping frequently accessed data closer to the execution units. Each level is larger but slower than the one before it.
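To put the gap in cycle terms, here is a minimal sketch that converts clock frequency to cycle time and a latency in nanoseconds to a cycle count. The function names are hypothetical, and the 4 GHz and 5 GHz clock speeds are illustrative assumptions; the 50-80 ns RAM latency comes from the text.

```python
def cycle_time_ns(freq_ghz: float) -> float:
    """Duration of one clock cycle in nanoseconds (1 GHz = 1 cycle per ns)."""
    return 1.0 / freq_ghz


def latency_in_cycles(latency_ns: float, freq_ghz: float) -> float:
    """Number of clock cycles that elapse during a latency given in ns."""
    return latency_ns * freq_ghz


# Illustrative clock speeds (assumed, not tied to any specific CPU).
for freq in (4.0, 5.0):
    print(f"{freq} GHz: one cycle = {cycle_time_ns(freq):.2f} ns")

# RAM latency range quoted in the text, expressed in cycles at 4 GHz.
for ns in (50, 80):
    cycles = latency_in_cycles(ns, 4.0)
    print(f"{ns} ns at 4 GHz = {cycles:.0f} cycles")
```

At these assumed clock speeds, a single trip to RAM costs a few hundred cycles during which a stalled core does no useful work.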

How this is calculated

The hierarchy works on the principle of locality: if a program accesses a memory address, it is likely to access nearby addresses soon (spatial locality) and the same address again soon (temporal locality). Data moves between levels in fixed-size cache lines, typically 64 bytes, so a single miss also brings in the neighboring bytes. L1 holds the most recently used lines. L2 backs L1 and holds lines that no longer fit there. L3 backs L2 and serves as a shared pool for inter-core communication. Whether a lower level keeps copies of the lines above it or only their evictions varies by design. Cache design is a trade-off between latency (smaller is faster), hit rate (larger catches more), and power consumption (larger uses more). Modern CPUs spend roughly 30-40% of their die area on cache.
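The effect of locality on hit rate can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of any real CPU: it simulates a single fully associative cache with LRU replacement, 64-byte lines, and a hypothetical 32 KiB capacity. Real caches are set-associative, multi-level, and use prefetchers. The function and variable names are made up for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 64   # bytes per cache line, as in the text
NUM_LINES = 512  # 512 * 64 B = 32 KiB, a hypothetical L1-sized cache


def hit_rate(addresses, num_lines=NUM_LINES, line_size=LINE_SIZE):
    """Simulate a fully associative LRU cache; return the fraction of hits."""
    cache = OrderedDict()  # keys are line numbers, ordered oldest to newest
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / total if total else 0.0


# Spatial locality: sequential 8-byte reads, 8 reads per 64-byte line.
sequential = range(0, 64 * 1024, 8)
# No locality: same number of reads, but each lands on a new line.
strided = range(0, 64 * 1024 * 8, 64)
# Temporal locality: a 4 KiB working set read 16 times over.
repeated = [a for _ in range(16) for a in range(0, 4096, 8)]

print(f"sequential: {hit_rate(sequential):.3f}")
print(f"strided:    {hit_rate(strided):.3f}")
print(f"repeated:   {hit_rate(repeated):.3f}")
```

In this model the sequential scan misses once per line and hits on the other seven reads, the strided scan never hits, and the repeated scan misses only on its first pass because the working set fits in the cache.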

Verdict

The cache hierarchy is the most important architectural feature of a modern CPU that most developers never think about. Understanding it explains why array-of-structs vs struct-of-arrays matters, why linked lists are slow, and why optimizing for cache locality can speed up code by an order of magnitude.
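The array-of-structs versus struct-of-arrays point can be made concrete by counting how many cache lines a scan of one field touches under each layout. The sketch below assumes a hypothetical record of eight 8-byte fields (64 bytes total), 64-byte cache lines, and arrays that start on a line boundary; none of these come from a specific program.

```python
LINE_SIZE = 64          # bytes per cache line
FIELD_SIZE = 8          # bytes per field (assumed)
FIELDS_PER_RECORD = 8   # hypothetical record layout
RECORD_SIZE = FIELD_SIZE * FIELDS_PER_RECORD  # 64 bytes
N = 1000                # number of records scanned


def lines_touched(addresses, access_size, line_size=LINE_SIZE):
    """Count the distinct cache lines covered by a series of accesses."""
    lines = set()
    for addr in addresses:
        first = addr // line_size
        last = (addr + access_size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines)


# Array-of-structs: field 0 of record i sits at i * RECORD_SIZE,
# so each read pulls in a line full of fields the scan never uses.
aos_addresses = [i * RECORD_SIZE for i in range(N)]

# Struct-of-arrays: all field-0 values are contiguous,
# so each line delivers eight useful values.
soa_addresses = [i * FIELD_SIZE for i in range(N)]

aos_lines = lines_touched(aos_addresses, FIELD_SIZE)
soa_lines = lines_touched(soa_addresses, FIELD_SIZE)
print(f"AoS touches {aos_lines} lines, SoA touches {soa_lines} lines")
```

Under these assumptions the struct-of-arrays scan touches one eighth as many lines for the same work. The ratio shrinks as the record gets smaller or as the scan uses more of each record's fields.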

Frequently asked questions

How much faster is L1 cache than RAM?
Roughly 50-80x faster for random access. L1 cache access is about 1 ns; DDR5 RAM is about 50-80 ns end-to-end including controller overhead. That's why keeping hot data in cache dominates real-world CPU performance.
Is NVMe SSD faster than RAM?
No. NVMe is fast for storage, but for random access its latency is around 50-150 µs versus RAM's 50-80 ns. That's a gap of roughly 1,000x. NVMe beats RAM only on raw capacity and persistence, never on latency.
Why is HDD so much slower than SSD?
A spinning HDD has to physically move a read head to the right track and wait for the platter to rotate into position, typically 5-15 ms per random access. An SSD has no moving parts and returns data in under 100 µs, roughly 100x faster for random reads.
What's the point of L3 cache?
L1 and L2 are small (tens of KB for L1, up to a few MB for L2) and typically private to each core. L3 is much larger (tens of MB) and shared across cores, acting as a buffer before requests go to main RAM. It catches data evicted from L1/L2 and data shared between cores.
How many nanoseconds is one CPU cycle?
At 4 GHz, one cycle is 0.25 ns. At 5 GHz, 0.2 ns. An L1 hit typically costs about 4-5 cycles, an L2 hit roughly 10-15, and an L3 hit several tens of cycles. Main memory access costs hundreds of cycles (50-80 ns is 200-320 cycles at 4 GHz), which is why optimizing for cache locality matters enormously in performance-critical code.
Does DDR5 have lower latency than DDR4?
Usually not, when comparing kits of a similar tier. DDR5 improved bandwidth and capacity significantly, but true latency (in ns) for mainstream kits is similar to late-stage DDR4. The gains from DDR5 come from bandwidth and larger capacities, not lower memory latency.