How much VRAM does Llama 3.1 405B need at Q4_K_M? Multi-GPU planning

Llama 3.1 405B at Q4_K_M needs roughly 324 GB of VRAM at its native 128K context. That is more than any single GPU offers, so it calls for a multi-GPU datacenter configuration.

Total VRAM required: 324 GB (Llama 3.1 405B at Q4_K_M)
Weights: 227 GB (405B params)
KV cache: 67.6 GB (128K tokens, FP16 KV)

Calculator

Estimated VRAM required: 324 GB

405B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 227 GB
KV cache: 67.6 GB
Overhead: 29.4 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.

Just barely too small

NVIDIA B300 (Datacenter): 288 GB, short by 35.9 GB
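
For multi-GPU planning, a rough lower bound on card count is the total requirement divided by per-card memory, rounded up. The sketch below assumes memory splits perfectly evenly across cards and ignores per-card runtime overhead, so treat the result as a floor rather than a guarantee. The function name `gpus_needed` is illustrative, and the 324 GB figure is the full 128K-context estimate from this page; a shorter context lowers it.

```python
import math

def gpus_needed(required_gb: float, per_gpu_gb: float) -> int:
    """Lower bound on card count, assuming an ideal even split (an assumption)."""
    return math.ceil(required_gb / per_gpu_gb)

# 324 GB is this page's estimate at 131,072 tokens of context.
print(gpus_needed(324, 288))  # 288 GB cards such as the B300 listed above
print(gpus_needed(324, 80))   # generic 80 GB cards
```

In practice the usable count also depends on how the framework shards the model, so a real deployment may need more cards than this floor suggests.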

How this is calculated

Q4_K_M weights take about 226.8 GB. KV cache adds another 67.6 GB at 128K context, and roughly 29.4 GB of overhead brings the total to 324 GB.
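
The arithmetic behind these figures can be sketched as follows. Several inputs are assumptions rather than values stated on this page: the 4.48 bits per weight and the 10% overhead fraction are back-solved from the 226.8 GB and 29.4 GB figures, and the layer count, KV head count, and head dimension (126, 8, 128) are taken from the published Llama 3.1 405B configuration. GB here means 10^9 bytes.

```python
def estimate_vram_gb(
    params: float = 405e9,
    bits_per_weight: float = 4.48,    # assumption: inferred from the 226.8 GB figure
    n_layers: int = 126,              # Llama 3.1 405B config
    n_kv_heads: int = 8,              # grouped-query attention
    head_dim: int = 128,
    context_tokens: int = 131_072,
    kv_bytes: int = 2,                # FP16 KV cache
    overhead_fraction: float = 0.10,  # assumption: inferred from the 29.4 GB figure
) -> dict:
    # Quantized weights: parameter count times average bits per weight.
    weights = params * bits_per_weight / 8 / 1e9
    # KV cache: one key and one value vector per layer, per KV head, per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_tokens / 1e9
    # Overhead modeled as a flat fraction of weights plus KV cache.
    overhead = overhead_fraction * (weights + kv_cache)
    return {
        "weights": weights,
        "kv_cache": kv_cache,
        "overhead": overhead,
        "total": weights + kv_cache + overhead,
    }

estimate = estimate_vram_gb()
for name, gb in estimate.items():
    print(f"{name}: {gb:.1f} GB")
```

Passing a smaller `context_tokens` shows how quickly the KV cache term shrinks, which is the main lever for fitting the model into less memory at a fixed quantization.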

Verdict

405B local inference is achievable but not cheap. The minimum viable rig is around $30-50K of hardware, which only makes sense for serious research, regulated environments that prohibit cloud inference, or pure curiosity. For nearly every other use case, paying per-token to a hosted endpoint wins.


Frequently asked questions

What's the minimum hardware for 405B at Q4?
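Four A100 80GB cards connected over NVLink (320 GB total) is the entry-level professional setup, though it falls about 4 GB short of the 324 GB estimate at the full 128K context, so plan on a shorter context window or a fifth card. A single Mac Studio with 512 GB of unified memory also works, with slower prompt processing but tolerable token generation speeds. The 256 GB model leaves under 30 GB beyond the 227 GB of weights, which is enough only for a short context.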
Is running 405B locally worth it vs the API?
Almost never on cost alone. Hosted APIs price 405B inference at fractions of a cent per thousand tokens. Self-host only when the workload requires data residency, offline operation, or fine-tuning that hosted endpoints do not support.