How much VRAM does Llama 3.1 8B need at 128K context? Long-context cost

Llama 3.1 8B at Q4_K_M with the full 128K context needs about 24 GB of VRAM - nearly 4x the 6 GB it uses at 8K context, even though the model itself is unchanged. The KV cache balloons to ~17 GB on its own, exceeding the weight memory and dominating total usage.

Total VRAM required: 23.8 GB (Llama 3.1 8B at Q4_K_M)
Weights: 4.5 GB (8B params)
KV cache: 17.2 GB (128K tokens, FP16 KV)

Calculator

Estimated VRAM required: 23.8 GB

8B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 4.5 GB
KV cache: 17.2 GB
Overhead: 2.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Hardware that fits

RTX 3090 (Consumer): 24 GB, 99% used
RTX 5090 (Consumer): 32 GB, 74% used
A100 40GB (Datacenter): 40 GB, 60% used
Apple M3 Max 64GB (Unified): 48 GB, 50% used
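
The "% used" figures are simply the 23.8 GB estimate divided by each device's listed memory. A minimal sketch of that arithmetic follows; the function name and the dictionary are illustrative, and the capacities are the ones listed on this page (including the 48 GB shown for the M3 Max).

```python
def percent_used(required_gb, capacity_gb):
    """Share of a device's memory the estimate would occupy, in percent."""
    return 100.0 * required_gb / capacity_gb

REQUIRED_GB = 23.8  # total estimate from this page

# Capacities as listed on this page (48 GB is the figure shown for the M3 Max).
devices = {
    "RTX 3090": 24,
    "RTX 5090": 32,
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,
}

for name, capacity_gb in devices.items():
    pct = percent_used(REQUIRED_GB, capacity_gb)
    verdict = "fits" if pct <= 100 else "does not fit"
    print(f"{name}: {pct:.1f}% used ({verdict})")
```

A result close to 100%, as on the 24 GB card, leaves almost no headroom, so framework and driver differences could tip it either way.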

How this is calculated

KV cache size scales linearly with context length and is independent of weight quantization, because the cache has its own dtype. For Llama 3.1 8B (32 layers, 4096 hidden, 8 KV heads of dimension 128), each token uses 4 KB of FP16 KV memory per layer, or 128 KB across all 32 layers, so 128K tokens comes to roughly 17 GB. Switching to Q8 KV roughly halves that to about 8.6 GB, but reducing the working context to what you actually need is by far the biggest lever.
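
The arithmetic behind the KV cache figure can be sketched in a few lines. This is a minimal sketch, not the calculator's actual code: the function name and parameters are illustrative, and the head dimension of 128 is an assumption derived from 4096 hidden divided by 32 attention heads.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Estimate KV cache size in bytes for a single sequence (batch 1)."""
    # One key vector and one value vector per KV head, per layer, per token.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return values_per_token * context_tokens * bytes_per_value

# Llama 3.1 8B: 32 layers, 8 KV heads; head_dim 128 is assumed (4096 / 32 heads).
fp16 = kv_cache_bytes(32, 8, 128, 131_072)
# 1 byte per value approximates 8-bit KV; it ignores any per-block scale overhead.
q8 = kv_cache_bytes(32, 8, 128, 131_072, bytes_per_value=1.0)

print(f"per token: {kv_cache_bytes(32, 8, 128, 1) / 1024:.0f} KiB")
print(f"FP16 KV at 128K: {fp16 / 1e9:.1f} GB")
print(f"8-bit KV at 128K: {q8 / 1e9:.1f} GB")
```

Note that the weight quantization (Q4_K_M here) never appears in the formula, which is why changing it does not shrink the cache.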

Verdict

If you're not actually using 128K of context, don't allocate for it. Inference engines like llama.cpp and vLLM let you cap context at runtime (--ctx-size and --max-model-len, respectively). Setting it to 16K or 32K instead of 128K saves more VRAM than any further weight quantization could: at 32K the FP16 KV cache drops from about 17.2 GB to about 4.3 GB.
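
To put rough numbers on that advice, the sketch below compares the FP16 KV cache at a few context caps. It assumes Llama 3.1 8B's shape (32 layers, 8 KV heads, head dimension 128) and batch 1; the constant and function names are illustrative only.

```python
# K and V vectors, 32 layers, 8 KV heads, head_dim 128 (assumed), 2 bytes per FP16 value.
BYTES_PER_TOKEN_FP16 = 2 * 32 * 8 * 128 * 2

def kv_gb(context_tokens):
    """FP16 KV cache size in decimal GB for one sequence."""
    return context_tokens * BYTES_PER_TOKEN_FP16 / 1e9

full = kv_gb(131_072)
for ctx in (16_384, 32_768, 131_072):
    saved = full - kv_gb(ctx)
    print(f"{ctx:>7} tokens: {kv_gb(ctx):5.1f} GB KV cache, saves {saved:5.1f} GB vs 128K")
```

For comparison, the weights at Q4_K_M are only 4.5 GB in total, so no further weight quantization can recover as much as a lower context cap does.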

Frequently asked questions

Why does long context cost so much VRAM?
Every token in the context window stores a key vector and a value vector for each KV head in each layer. For Llama 3.1 8B at 128K tokens, that's about 8.6 billion values cached for the duration of the request.
Does prompt caching help with long contexts?
Yes, prefix caching reuses the KV cache across requests that share a prompt prefix. It doesn't reduce memory for one request but dramatically reduces re-computation across many.
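
As a toy illustration of the idea (not how any particular engine implements it), the sketch below counts how many leading tokens a new request shares with a cached one. Only the remaining tokens would need their keys and values computed. The token IDs and the function name are hypothetical.

```python
def shared_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: a long shared system prompt plus different questions.
system_prompt = list(range(1000))
request_a = system_prompt + [5001, 5002, 5003]
request_b = system_prompt + [6001, 6002]

reused = shared_prefix_len(request_a, request_b)
to_compute = len(request_b) - reused
print(f"reused {reused} cached tokens, computed {to_compute} new ones")
```

The cache for request_b still holds entries for all of its tokens, which is why the memory footprint of a single request is unchanged.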