How much VRAM does Qwen 2.5 72B need at Q4_K_M?
Long-context inference
Qwen 2.5 72B at Q4_K_M with the model's native 128K (131,072-token) context needs about 91.6 GB of VRAM. At that length the FP16 KV cache (about 43 GB) is larger than the weights themselves (about 40.3 GB).
Calculator
Estimated VRAM required
91.6 GB
72B params at Q4_K_M, 131,072 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
Hardware that fits
Just barely too small
How this is calculated
At 128K context with an FP16 KV cache, the cache alone is around 43 GB. Weights take 40.3 GB and overhead adds 8.3 GB, for a total of 91.6 GB.
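The breakdown above can be reproduced with a short sketch. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the publicly released Qwen2.5-72B configuration, not from this page. The weights and overhead figures are copied from the text rather than derived. Under those assumptions the cache comes out at roughly 43 GB, and the sum lands within about 0.1 GB of the 91.6 GB shown, with the gap due to rounding.

```python
# Sketch of the VRAM breakdown for a GQA model at batch size 1.
# Architecture values are ASSUMPTIONS (public Qwen2.5-72B config), not from this page.
NUM_LAYERS = 80          # transformer layers (assumed)
NUM_KV_HEADS = 8         # grouped-query attention KV heads (assumed)
HEAD_DIM = 128           # dimension per attention head (assumed)
BYTES_PER_ELEMENT = 2    # FP16 KV cache
CONTEXT_TOKENS = 131_072 # 128K context, as stated on the page


def kv_cache_bytes(tokens: int,
                   layers: int = NUM_LAYERS,
                   kv_heads: int = NUM_KV_HEADS,
                   head_dim: int = HEAD_DIM,
                   bytes_per_element: int = BYTES_PER_ELEMENT) -> int:
    """Bytes needed to store keys and values for `tokens` positions."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_element * tokens


# Figures quoted in the text (decimal GB), not derived here.
WEIGHTS_GB = 40.3
OVERHEAD_GB = 8.3

kv_gb = kv_cache_bytes(CONTEXT_TOKENS) / 1e9
total_gb = kv_gb + WEIGHTS_GB + OVERHEAD_GB

print(f"KV cache: {kv_gb:.2f} GB")
print(f"Total:    {total_gb:.2f} GB")
```

Note that the cache grows linearly with the token count, so it is the only term in the sum that changes when the context length changes.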
Verdict
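The KV cache is the hidden cost of long-context models. If you're not actually using the full 128K context, set the context to 8K or 16K in your inference engine - the savings are immediate, because the cache scales linearly with context length. Quantizing the KV cache to Q8 is the other obvious lever: it roughly halves the cache compared with FP16 and is almost always worth it.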
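To put rough numbers on those two levers, here is a minimal sketch comparing KV cache size across context lengths and cache precisions. It again assumes the public Qwen2.5-72B architecture values (80 layers, 8 KV heads, head dimension 128), and it treats Q8 as exactly 1 byte per element, which slightly understates real quantized formats that also store per-block scales.

```python
# Compare KV cache size for different context lengths and cache precisions.
# Architecture values are ASSUMPTIONS (public Qwen2.5-72B config), not from this page.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_gb(tokens: int, bytes_per_element: float) -> float:
    """KV cache size in decimal GB for one sequence of `tokens` positions."""
    # Factor of 2 covers the separate key and value tensors in each layer.
    total_bytes = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element * tokens
    return total_bytes / 1e9


# (label, context tokens, bytes per cached element)
scenarios = [
    ("128K, FP16 KV", 131_072, 2),
    ("128K, Q8 KV (approx. 1 byte/element)", 131_072, 1),
    ("16K, FP16 KV", 16_384, 2),
    ("8K, FP16 KV", 8_192, 2),
]

for label, tokens, bytes_per_element in scenarios:
    print(f"{label:40s} {kv_cache_gb(tokens, bytes_per_element):6.2f} GB")
```

Under these assumptions, dropping from 128K to 8K shrinks the cache by a factor of 16, while Q8 KV at full context saves about half of the cache.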
Frequently asked questions
Why is the VRAM higher than Llama 3.1 70B at the same quant?
Does Q8 KV cache hurt quality?