How much VRAM does Qwen 2.5 72B need at Q4_K_M? Long-context inference

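The KV cache is the hidden cost of long-context models. If you're not actually using the full 128K context, set the context to 8K or 16K in your inference engine; the savings are immediate, because the cache grows linearly with context length. Quantizing the KV cache to Q8 is the other obvious lever: it roughly halves the cache relative to FP16 and is almost always worth it.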

Qwen 2.5 72B at Q4_K_M with the model's native 128K context needs about 91.6 GB of VRAM. Note that at this context length the KV cache (42.9 GB) is larger than the weights themselves (40.3 GB).

By TechCompare · Updated July 2026

Total VRAM required: 91.6 GB (Qwen 2.5 72B at Q4_K_M)

Weights: 40.3 GB (72B params)

KV cache: 42.9 GB (128K tokens, FP16 KV)

Estimated VRAM required: 91.6 GB

72B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 40.3 GB

KV cache: 42.9 GB

Overhead: 8.3 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
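To make the effect of a shorter context concrete, here is a rough sketch of how the KV cache scales with context length and cache precision. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumed from the published Qwen2.5-72B configuration, and the 8-bit case is simplified to exactly one byte per element, ignoring any per-block scale factors a real engine stores. The function name is illustrative, not part of any tool.

```python
# Assumed Qwen2.5-72B attention layout (GQA): not stated on this page.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_gb(context_tokens, bytes_per_element=2.0):
    """Approximate KV cache size in decimal GB for batch size 1."""
    # Factor of 2 covers keys and values, stored per layer and per KV head.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_element
    return per_token * context_tokens / 1e9


for label, tokens in [("8K", 8192), ("16K", 16384), ("64K", 65536), ("128K", 131072)]:
    fp16 = kv_cache_gb(tokens, 2.0)   # FP16 cache
    q8 = kv_cache_gb(tokens, 1.0)     # simplified 8-bit cache
    print(f"{label:>4}: FP16 {fp16:5.1f} GB, 8-bit {q8:5.1f} GB")
```

Under these assumptions the 128K FP16 figure matches the 42.9 GB shown above, while an 8K context needs under 3 GB of cache.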

Hardware that fits

Apple M3 Ultra 128GB · Unified · 96 GB · 95% used

H200 141GB · Datacenter · 141 GB · 65% used

Apple M3 Ultra 192GB · Unified · 144 GB · 64% used

Just barely too small

A100 80GB · Datacenter · 80 GB · short by 11.6 GB

H100 80GB · Datacenter · 80 GB · short by 11.6 GB
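The fit figures above are simple arithmetic on the 91.6 GB estimate. A minimal sketch, using the memory figures listed above as the available capacity; `fit_report` is a hypothetical helper name, not part of the calculator:

```python
REQUIRED_GB = 91.6  # total estimate from this page

# Available memory per device, as listed in the hardware section above.
DEVICES = {
    "Apple M3 Ultra 128GB": 96,
    "H200 141GB": 141,
    "Apple M3 Ultra 192GB": 144,
    "A100 80GB": 80,
    "H100 80GB": 80,
}


def fit_report(available_gb, required_gb=REQUIRED_GB):
    """Return a short fit description for one device."""
    if available_gb >= required_gb:
        used = round(100 * required_gb / available_gb)
        return f"{used}% used"
    return f"short by {round(required_gb - available_gb, 1)} GB"


for name, capacity in DEVICES.items():
    print(f"{name}: {fit_report(capacity)}")
```

A device near 95% utilization leaves very little headroom, so framework and driver overhead beyond the estimate could still push it over.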


How this is calculated

At 128K with FP16 KV, the cache for this model is around 43 GB on its own. Weights take 40.3 GB, and overhead is 8.3 GB, totaling 91.6 GB.
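The breakdown can be reproduced with a short back-of-the-envelope script. This is a sketch under stated assumptions, not the calculator's actual code: the architecture values come from the published Qwen2.5-72B configuration, while the 0.56 bytes per weight for Q4_K_M and the 10% overhead factor are inferred from the figures on this page rather than stated by it.

```python
# Assumed Qwen2.5-72B attention layout (GQA).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128

# Inferred from this page's numbers, not documented by the calculator:
BYTES_PER_WEIGHT = 0.56    # roughly 4.5 bits per weight at Q4_K_M
OVERHEAD_FRACTION = 0.10   # overhead as a share of weights + KV cache
GB = 1e9                   # decimal gigabytes


def estimate_vram_gb(params=72e9, context_tokens=131072, kv_bytes_per_element=2.0):
    """Rough VRAM estimate for batch-1 inference, in decimal GB."""
    weights = params * BYTES_PER_WEIGHT
    # Keys and values (factor 2) for every layer, KV head, and token.
    kv = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * kv_bytes_per_element * context_tokens
    overhead = OVERHEAD_FRACTION * (weights + kv)
    return {
        "weights": weights / GB,
        "kv_cache": kv / GB,
        "overhead": overhead / GB,
        "total": (weights + kv + overhead) / GB,
    }


estimate = estimate_vram_gb()
for part, size in estimate.items():
    print(f"{part:>9}: {size:.1f} GB")
```

With the defaults this yields approximately 40.3 GB of weights, 42.9 GB of KV cache, 8.3 GB of overhead, and 91.6 GB in total, in line with the figures above. Real GGUF files and real engines can differ from these simplified factors.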
