How much VRAM does Llama 3.1 70B need at Q4_K_M? Memory and GPU guide

Llama 3.1 70B at Q4_K_M (4-bit quantization, GGUF sweet spot) needs roughly 90.4 GB of VRAM with its native 128K context window. That puts it out of reach for consumer setups, but it fits comfortably on two 80 GB datacenter cards, a high-end unified memory device, or split across pooled GPUs.

Total VRAM required: 90.4 GB (Llama 3.1 70B at Q4_K_M)
Weights: 39.2 GB (70B params)
KV cache: 42.9 GB (128K tokens, FP16 KV)

Calculator

Estimated VRAM required: 90.4 GB

70B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 39.2 GB
KV cache: 42.9 GB
Overhead: 8.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
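
To see how much a shorter context saves, here is a minimal sketch of the KV cache size at a few context lengths. It assumes Llama 3.1 70B's published architecture (80 layers, 8 KV heads under GQA, head dimension 128), 2 bytes per FP16 element, and GB meaning 10^9 bytes. The function name kv_cache_gb is illustrative and is not taken from the calculator.

```python
def kv_cache_gb(context_tokens, n_layers=80, n_kv_heads=8,
                head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in GB (10^9 bytes) for one sequence.

    Defaults are assumed Llama 3.1 70B values with an FP16 cache.
    """
    # Leading factor 2: each layer stores one key and one value vector per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * context_tokens / 1e9


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):5.1f} GB")
```

Under these assumptions the cache grows linearly with context, so halving the context halves the KV cache while the weights stay fixed.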

Hardware that fits

Apple M3 Ultra 128GB (Unified): 96 GB, 94% used
H200 141GB (Datacenter): 141 GB, 64% used
Apple M3 Ultra 192GB (Unified): 144 GB, 63% used

Just barely too small

A100 80GB (Datacenter): 80 GB, short by 10.4 GB
H100 80GB (Datacenter): 80 GB, short by 10.4 GB
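
The "used" and "short by" figures in the two lists above follow from comparing the 90.4 GB requirement with each device's available memory. A minimal sketch, using the memory figures exactly as listed on this page (the helper name fit_report is hypothetical):

```python
REQUIRED_GB = 90.4  # total estimate for Q4_K_M at 128K context

# (device, available memory in GB) as listed on this page
DEVICES = [
    ("Apple M3 Ultra 128GB", 96),
    ("H200 141GB", 141),
    ("Apple M3 Ultra 192GB", 144),
    ("A100 80GB", 80),
    ("H100 80GB", 80),
]


def fit_report(required_gb, available_gb):
    """Return a utilization string if the model fits, else the shortfall."""
    if required_gb <= available_gb:
        return f"{required_gb / available_gb:.0%} used"
    return f"short by {required_gb - available_gb:.1f} GB"


if __name__ == "__main__":
    for name, available in DEVICES:
        print(f"{name}: {fit_report(REQUIRED_GB, available)}")
```

This only checks capacity on a single device; it says nothing about throughput or how a framework splits memory across several GPUs.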

How this is calculated

Q4_K_M weights take up 39.2 GB. The KV cache adds another 42.9 GB at 128K context with FP16 KV. Overhead adds roughly 8.2 GB, for a total of about 90.4 GB.
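
The breakdown can be reproduced with a short sketch. Two inputs here are assumptions back-derived from the page's own figures rather than documented calculator settings: an effective 4.48 bits per weight (which yields 39.2 GB for 70B parameters) and overhead modeled as 10% of weights plus KV cache. The KV cache term assumes 80 layers, 8 KV heads, and head dimension 128 at 2 bytes per element. The function name estimate_vram_gb is illustrative.

```python
def estimate_vram_gb(params=70e9, bits_per_weight=4.48,
                     context_tokens=131_072, kv_bytes_per_token=2 * 80 * 8 * 128 * 2,
                     overhead_fraction=0.10):
    """Return (weights, kv_cache, overhead, total) in GB (10^9 bytes).

    bits_per_weight and overhead_fraction are assumed values chosen to
    match the figures on this page, not official constants.
    """
    weights = params * bits_per_weight / 8 / 1e9
    kv_cache = kv_bytes_per_token * context_tokens / 1e9
    overhead = overhead_fraction * (weights + kv_cache)
    return weights, kv_cache, overhead, weights + kv_cache + overhead


if __name__ == "__main__":
    w, kv, oh, total = estimate_vram_gb()
    print(f"Weights {w:.1f} GB, KV cache {kv:.1f} GB, "
          f"overhead {oh:.1f} GB, total {total:.1f} GB")
```

Real usage may differ from this estimate depending on the framework, Flash Attention, and driver overhead, as noted in the accuracy statement.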

Verdict

Q4_K_M is the canonical local-inference recipe for 70B-class models. With a reduced context it runs on hardware mortals can buy (twin 3090s or one used A6000, 48 GB either way) at quality close to FP16; the 90.4 GB figure applies only to the full 128K window. The one wrinkle is throughput: a single RTX 5090 with CPU offload still produces useful tokens per second, just slower than a fully resident multi-GPU setup.


Frequently asked questions

Can I run Llama 3.1 70B on a single RTX 4090?
Not at Q4_K_M with native 128K context: the model needs roughly 90.4 GB total and a 4090 has 24 GB. Two 4090s (48 GB combined) with tensor parallelism can hold the 39.2 GB of weights if you shorten the context, or you can drop to Q3_K_M (~37 GB) and use partial CPU offload at reduced context.
Does Llama 3.1 70B Q4_K_M lose quality vs FP16?
Quality loss is small, typically a 0.1 to 0.3 perplexity increase on standard benchmarks, and almost imperceptible on most chat or coding tasks. Q4_K_M is generally considered the best quality-to-size ratio for local inference.