How much VRAM does Llama 3.1 8B need at 128K context? Long-context cost
Llama 3.1 8B at Q4_K_M with the full 128K context needs about 24 GB of VRAM, roughly 4x the 6 GB it uses at 8K context, even though the model itself is unchanged. The KV cache alone grows to ~17 GB, exceeding the weight memory and dominating total usage.
Calculator
Estimated VRAM required: 23.8 GB (8B params at Q4_K_M, 131,072-token context, batch 1, inference).
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
How this is calculated
KV cache scales linearly with context length, and its cost is independent of weight quantization because the KV cache has its own dtype. For Llama 3.1 8B (32 layers, 4096 hidden, 8 KV heads), each token in context uses 4 KB of FP16 KV memory per layer, or 128 KB across all 32 layers, so 128K tokens comes to roughly 17 GB. Switching to Q8 KV roughly halves that to about 8.6 GB, but reducing the working context to what you actually need is by far the biggest lever.
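The arithmetic above can be reproduced with a short sketch. It assumes the standard grouped-query attention layout, where each layer stores one key and one value vector per KV head per token, and a head dimension of 128 (4096 hidden divided by 32 attention heads; the head count is not stated on this page). The function name `kv_cache_bytes` is illustrative and not part of any library, and real 8-bit KV formats carry a small per-block overhead that this ignores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2.0):
    """Approximate KV cache size in bytes (keys + values, all layers)."""
    # Factor of 2 covers the separate key and value tensors.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    return per_token_per_layer * n_layers * n_tokens


# Llama 3.1 8B: 32 layers, 8 KV heads, head_dim = 4096 / 32 = 128 (assumed).
LLAMA_31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_layer_token = kv_cache_bytes(n_layers=1, n_kv_heads=8, head_dim=128, n_tokens=1)
per_token = kv_cache_bytes(**LLAMA_31_8B, n_tokens=1)
full_fp16 = kv_cache_bytes(**LLAMA_31_8B, n_tokens=131_072)
full_8bit = kv_cache_bytes(**LLAMA_31_8B, n_tokens=131_072, bytes_per_value=1.0)

print(f"per token, per layer: {per_layer_token / 1024:.0f} KB")
print(f"per token, all layers: {per_token / 1024:.0f} KB")
print(f"128K context, FP16 KV: {full_fp16 / 1e9:.1f} GB")
print(f"128K context, 8-bit KV: {full_8bit / 1e9:.1f} GB")
```

Because the result is a plain product, halving either the context length or the bytes per value halves the cache, which is why both levers show up in the estimate.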
Verdict
If you're not actually using 128K of context, don't allocate for it. Inference engines like llama.cpp and vLLM let you cap context at runtime. Setting it to 16K or 32K instead of 128K saves more VRAM than any quantization change you could make: at 32K the FP16 KV cache drops from roughly 17 GB to about 4.3 GB.
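To see how much each context cap saves, here is a rough sketch that tabulates the FP16 KV cache for Llama 3.1 8B at common context lengths. It counts only the KV cache, not weights or framework overhead, and the per-token constant assumes 8 KV heads, a head dimension of 128, and 32 layers. The helper names are illustrative.

```python
# FP16 KV bytes per token for Llama 3.1 8B (assumed layout):
# 2 (K and V) * 8 KV heads * 128 head dim * 2 bytes * 32 layers = 128 KB
KV_BYTES_PER_TOKEN = 2 * 8 * 128 * 2 * 32


def kv_cache_gb(context_tokens):
    """FP16 KV cache size in GB (10^9 bytes) for a given context length."""
    return context_tokens * KV_BYTES_PER_TOKEN / 1e9


def savings_vs_full_gb(context_tokens, full_context=131_072):
    """KV memory saved by capping context below the full 128K window."""
    return kv_cache_gb(full_context) - kv_cache_gb(context_tokens)


for ctx in (8_192, 16_384, 32_768, 65_536, 131_072):
    print(
        f"{ctx:>7} tokens: KV {kv_cache_gb(ctx):5.2f} GB, "
        f"saves {savings_vs_full_gb(ctx):5.2f} GB vs 128K"
    )
```

In practice the cap is set through the engine's own option, for example `--ctx-size` (`-c`) in llama.cpp or `--max-model-len` in vLLM.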
Frequently asked questions
Why does long context cost so much VRAM?
Does prompt caching help with long contexts?