How much VRAM does Llama 3.1 8B need at 128K context? Long-context cost

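If you're not actually using 128K of context, don't allocate for it. Inference engines like llama.cpp and vLLM let you cap the context length when you launch them (--ctx-size in llama.cpp, --max-model-len in vLLM). Capping it at 16K or 32K instead of 128K saves more VRAM than any weight-quantization change could: the FP16 KV cache shrinks by roughly 13 to 15 GB, while the Q4_K_M weights total only 4.5 GB.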

Llama 3.1 8B at Q4_K_M with the full 128K context needs about 24 GB of VRAM, nearly 4x the 6 GB it uses at 8K context, even though the model itself is unchanged. The KV cache alone balloons to about 17 GB, nearly four times the 4.5 GB of weights, and dominates total usage.

By TechCompare · Updated July 2026

Estimated VRAM required: 23.8 GB

Llama 3.1 8B (8B params) at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 4.5 GB

KV cache: 17.2 GB (128K tokens, FP16 KV)

Overhead: 2.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
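To put numbers on that advice, here is a minimal sketch that scales the 17.2 GB FP16 KV cache figure from the estimate above down to smaller contexts. It assumes the cache grows strictly linearly with context length; the function names are made up for illustration and are not part of any calculator or inference engine.

```python
# Figures taken from the estimate above; strict linear scaling is an assumption.
KV_GB_AT_FULL_CONTEXT = 17.2   # FP16 KV cache at 131,072 tokens
FULL_CONTEXT_TOKENS = 131_072
WEIGHTS_GB = 4.5               # Q4_K_M weights, for comparison


def kv_cache_gb(context_tokens: int) -> float:
    """Scale the full-context KV cache figure linearly to a smaller context."""
    return KV_GB_AT_FULL_CONTEXT * context_tokens / FULL_CONTEXT_TOKENS


def kv_savings_gb(context_tokens: int) -> float:
    """KV cache memory freed by capping the context below 131,072 tokens."""
    return KV_GB_AT_FULL_CONTEXT - kv_cache_gb(context_tokens)


if __name__ == "__main__":
    for ctx in (8_192, 16_384, 32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens: KV cache {kv_cache_gb(ctx):5.2f} GB, "
              f"saves {kv_savings_gb(ctx):5.2f} GB")
```

Under these assumptions, capping at 32K frees more memory than the Q4_K_M weights occupy in total.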

Hardware that fits

RTX 3090 · Consumer · 24 GB · 99% used

RTX 5090 · Consumer · 32 GB · 74% used

A100 40GB · Datacenter · 40 GB · 60% used

Apple M3 Max 64GB · Unified · 48 GB · 50% used
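The utilization figures above appear to be the 23.8 GB estimate divided by each device's memory. The sketch below reproduces that arithmetic. The capacities are the ones listed on this page, including the 48 GB shown for the 64 GB M3 Max (presumably the share of unified memory available to the GPU, which is an assumption). The helper name is hypothetical.

```python
# Estimated requirement from the calculator output above.
REQUIRED_GB = 23.8

# Memory capacities as listed on this page.
HARDWARE_GB = {
    "RTX 3090": 24,
    "RTX 5090": 32,
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,  # listed capacity, not the full 64 GB
}


def utilization_pct(required_gb: float, capacity_gb: float) -> int:
    """Share of device memory the workload would occupy, as a whole percent."""
    return round(100 * required_gb / capacity_gb)


if __name__ == "__main__":
    for name, capacity in HARDWARE_GB.items():
        pct = utilization_pct(REQUIRED_GB, capacity)
        verdict = "fits" if REQUIRED_GB <= capacity else "does not fit"
        print(f"{name}: {pct}% used ({verdict})")
```

A card at 99% leaves almost no headroom, so on a 24 GB GPU the estimate's error margin matters more than it does on the larger devices.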


How this is calculated

KV cache size scales linearly with context length and is independent of weight quantization, because the cache is stored in its own dtype. For Llama 3.1 8B (32 layers, 4096 hidden, 8 KV heads), each token in context uses 4 KB of FP16 KV memory per layer, or 128 KB across all 32 layers, so 128K tokens comes to roughly 17 GB. Switching to Q8 KV roughly halves that to about 8.6 GB, but reducing the working context to what you actually need is by far the biggest lever.
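The arithmetic can be checked with a short sketch of the standard KV cache formula: 2 tensors (K and V) x layers x KV heads x head dimension x bytes per element. The head dimension of 128 (4096 hidden divided by 32 attention heads) is not stated in the text above and is taken from the published model configuration. The Q8 case is treated as exactly 1 byte per element, ignoring any per-block scale overhead a real quantized cache format adds.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Llama 3.1 8B shape. HEAD_DIM assumes 32 attention heads (4096 / 32).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 4096 // 32
CONTEXT_TOKENS = 131_072

fp16_per_layer = kv_bytes_per_token(1, N_KV_HEADS, HEAD_DIM, 2)
fp16_per_token = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, 2)
fp16_total = fp16_per_token * CONTEXT_TOKENS
q8_total = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM, 1) * CONTEXT_TOKENS

if __name__ == "__main__":
    print(f"FP16 per token, per layer: {fp16_per_layer / 1024:.0f} KB")
    print(f"FP16 per token, all layers: {fp16_per_token / 1024:.0f} KB")
    print(f"FP16 total at 128K: {fp16_total / 1e9:.2f} GB")
    print(f"Q8 total at 128K (approx.): {q8_total / 1e9:.2f} GB")
```

This gives 4 KB per token per layer and 128 KB per token across the model, which at 131,072 tokens is about 17.18 GB in decimal units, in line with the 17.2 GB figure.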
