How much VRAM does Qwen 2.5 32B need at Q4_K_M? Single 24GB GPU sweet spot

Qwen 2.5 32B at Q4_K_M with native 128K context needs about 57.5 GB of VRAM, making it a fit for pooled cards or unified memory workstations.

Total VRAM required
57.5 GB
Qwen 2.5 32B at Q4_K_M
Weights
17.9 GB
32B params
KV cache
34.4 GB
128K tokens, FP16 KV

Calculator

Estimated VRAM required

57.5 GB

32B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights
17.9 GB
KV cache
34.4 GB
Overhead
5.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Hardware that fits

A100 80GB
Datacenter
80 GB
72% used
H100 80GB
Datacenter
80 GB
72% used
Apple M3 Ultra 128GB
Unified
96 GB
60% used

How this is calculated

32B at Q4_K_M is roughly 17.9 GB of weights, 34.4 GB of KV cache, and 5.2 GB of activation overhead, totaling 57.5 GB.

Verdict

Qwen 2.5 32B Q4_K_M at reduced context is a sweet spot for 24 GB cards, but at its full native 128K context, the KV cache grows to 34.4 GB, pushing the total to 57.5 GB and requiring pooled cards or unified workstations.

More Qwen scenarios

Frequently asked questions

Can I run Qwen 2.5 32B on an RTX 3090?
Yes, at Q4_K_M and reduced context lengths (like 8K context, which uses ~22 GB total). For native 128K context, it will exceed 24 GB and require multiple cards or unified systems.
Is Qwen 2.5 32B better than Llama 3.1 8B?
Substantially, especially for code, math, and multi-step reasoning. The 4x parameter count translates to a meaningful capability jump that matches the memory cost.