How much VRAM does Llama 3.1 8B need at Q4_K_M? Single GPU local inference

Llama 3.1 8B at Q4_K_M needs about 23.8 GB of VRAM at its native 128K context. That just fits on a 24 GB consumer GPU, which is why long-context local inference with small models is attractive on that class of hardware.

Total VRAM required: 23.8 GB (Llama 3.1 8B at Q4_K_M)
Weights: 4.5 GB (8B params)
KV cache: 17.2 GB (128K tokens, FP16 KV)

Calculator

Estimated VRAM required: 23.8 GB

8B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 4.5 GB
KV cache: 17.2 GB
Overhead: 2.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
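
To make the warning above concrete, here is a minimal sketch of how the FP16 KV cache grows with context and where it overtakes the weights. It assumes the published Llama 3.1 8B architecture (32 layers, 8 KV heads under GQA, head dimension 128) and the page's 4.5 GB weight figure; the function and constant names are illustrative, not taken from the calculator.

```python
# Bytes of KV cache per token: K and V tensors, 32 layers, 8 KV heads,
# head dimension 128, 2 bytes per value (FP16). Architecture values are
# assumed from the published Llama 3.1 8B config.
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # 131,072 bytes = 128 KiB

WEIGHTS_GB = 4.5  # weight size quoted on this page for Q4_K_M


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size in decimal gigabytes for one sequence (batch 1)."""
    return KV_BYTES_PER_TOKEN * context_tokens / 1e9


# Context length at which the KV cache becomes as large as the weights.
crossover_tokens = int(WEIGHTS_GB * 1e9 / KV_BYTES_PER_TOKEN)

if __name__ == "__main__":
    for ctx in (8_192, 16_384, 32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):5.2f} GB KV cache")
    print(f"KV cache matches the weights at about {crossover_tokens} tokens")
```

Under these assumptions the cache stays below the weights up to 32K of context and passes them somewhere between 32K and 64K, which is consistent with the 8K to 64K range suggested above.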

Hardware that fits

RTX 3090 (Consumer): 24 GB, 99% used
RTX 5090 (Consumer): 32 GB, 74% used
A100 40GB (Datacenter): 40 GB, 60% used
Apple M3 Max 64GB (Unified): 48 GB, 50% used
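
The "% used" figures follow from dividing the 23.8 GB estimate by each device's listed memory. A small sketch of that arithmetic, using the capacities exactly as listed above (the dictionary and function names are illustrative):

```python
import math

REQUIRED_GB = 23.8  # total estimate from this page

# Memory figures as listed in the hardware section above.
LISTED_MEMORY_GB = {
    "RTX 3090": 24,
    "RTX 5090": 32,
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,
}


def percent_used(required_gb: float, capacity_gb: float) -> int:
    """Share of device memory taken by the model, rounded half up."""
    return math.floor(100 * required_gb / capacity_gb + 0.5)


def fits(required_gb: float, capacity_gb: float) -> bool:
    """True when the estimate does not exceed the listed memory."""
    return required_gb <= capacity_gb


if __name__ == "__main__":
    for name, capacity in LISTED_MEMORY_GB.items():
        print(f"{name}: {percent_used(REQUIRED_GB, capacity)}% used, "
              f"fits={fits(REQUIRED_GB, capacity)}")
```

At 99% on a 24 GB card there is almost no slack, so anything else using GPU memory (a desktop session, another process) can push the run over the limit.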

How this is calculated

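8B at Q4_K_M is roughly 4.5 GB of weights plus 17.2 GB of KV cache and about 2.2 GB of activation overhead, totaling about 23.8 GB. The rounded parts add up to 23.9 GB, presumably because each part is rounded separately from the total.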
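
The breakdown can be approximated with a few lines of arithmetic. This is a sketch under stated assumptions, not the calculator's actual formula: it takes 4.5 bits per weight (chosen to reproduce the page's 4.5 GB figure; real Q4_K_M files may be somewhat larger), the published Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128), an FP16 KV cache, and a flat 10% overhead that happens to land close to the page's numbers. All names are illustrative.

```python
# Llama 3.1 8B attention shape (assumed from the published model config).
N_LAYERS = 32
N_KV_HEADS = 8   # grouped-query attention
HEAD_DIM = 128


def estimate_vram_gb(
    n_params: float = 8e9,
    bits_per_weight: float = 4.5,     # assumption matching the page's 4.5 GB
    context_tokens: int = 131_072,
    kv_bytes_per_value: int = 2,      # FP16 KV cache
    overhead_fraction: float = 0.10,  # assumption, not the page's stated rule
) -> dict:
    """Return weights, KV cache, overhead and total in decimal GB (batch 1)."""
    weights = n_params * bits_per_weight / 8
    # Factor 2 covers both the K and the V tensor in every layer.
    kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * kv_bytes_per_value
    kv_cache = kv_per_token * context_tokens
    overhead = overhead_fraction * (weights + kv_cache)
    total = weights + kv_cache + overhead
    return {
        "weights": weights / 1e9,
        "kv_cache": kv_cache / 1e9,
        "overhead": overhead / 1e9,
        "total": total / 1e9,
    }


if __name__ == "__main__":
    for name, gb in estimate_vram_gb().items():
        print(f"{name:>8}: {gb:.1f} GB")
    print(f"total at 8K context: {estimate_vram_gb(context_tokens=8_192)['total']:.1f} GB")
```

With the defaults this gives about 4.5 GB of weights, 17.2 GB of KV cache and a total near 23.8 GB; at 8K context the total drops to roughly 6.1 GB. Only the KV term depends on context length, which is why shortening the context is the most effective lever.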

Verdict

Llama 3.1 8B runs locally on most modern PCs once the context is reduced; only the full 128K context calls for a 24 GB GPU. Q4_K_M is the recommended quant: it keeps the weights at about 4.5 GB, and the quality loss vs FP16 is small at this scale. Step up to Q8_0 only if you have headroom and want maximum quality.

Frequently asked questions

Can I run Llama 3.1 8B on a 6 GB GPU?
Only with a reduced context, and 8K is slightly too much. At the native 128K context the KV cache alone is 17.2 GB. Dropping to 8K brings the estimate down to about 6.1 GB, which is still just over a 6 GB card, so plan on a somewhat shorter context to leave room.
Is Llama 3.1 8B good enough for daily use?
For chat, summarization, and simple code completion, yes. For complex reasoning, math, or production-quality writing, you'll notice the gap to 70B-class models. Use 8B for speed and 70B+ for accuracy.
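
For the 6 GB question above, the estimate can be inverted to ask how much context fits in a given amount of VRAM. This is a rough sketch that assumes 4.5 GB of weights, 128 KiB of FP16 KV cache per token (32 layers, 8 KV heads, head dimension 128) and a flat 10% overhead; the overhead rule and all names here are assumptions, not the calculator's own.

```python
import math

KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # K and V, FP16, per token
WEIGHT_BYTES = 4.5e9                        # Q4_K_M weights per this page
OVERHEAD_FRACTION = 0.10                    # assumed flat overhead


def max_context_tokens(vram_gb: float) -> int:
    """Largest context (tokens, batch 1) whose estimate fits in vram_gb."""
    # Remove the overhead share first, then the fixed weight cost.
    budget = vram_gb * 1e9 / (1 + OVERHEAD_FRACTION)
    kv_budget = budget - WEIGHT_BYTES
    if kv_budget <= 0:
        return 0  # the weights alone do not fit
    return math.floor(kv_budget / KV_BYTES_PER_TOKEN)


if __name__ == "__main__":
    for gb in (4, 6, 8, 12, 24):
        print(f"{gb:>2} GB: up to {max_context_tokens(gb)} tokens")
```

Under these assumptions a 6 GB card tops out a little over 7,000 tokens, while a 24 GB card clears the full 131,072. Real limits will be lower when other applications share the GPU.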