How much VRAM does DeepSeek V4 Pro 1.6T need at Q4_K_M? CSA + MLA frontier

DeepSeek V4 Pro 1.6T at Q4_K_M with the full 1M-token context needs about 1012 GB of VRAM with every expert resident - this is the real shape of the model and the number to plan a deployment against. The 1.6T parameter pool at Q4_K_M takes 896 GB of weights on its own; MLA + Compressed Sparse Attention keep the KV cache to ~24 GB, and activations and overhead add ~92 GB at this scale. As a planning escape hatch, llama.cpp `--n-cpu-moe` style active-only loading drops the resident footprint to about 70 GB, but it routes cold experts through system RAM or NVMe at a per-token bandwidth penalty.

Total VRAM required: 1012 GB for DeepSeek V4 Pro 1.6T (MoE) at Q4_K_M
Weights: 896 GB (1600B params)
KV cache: 24.2 GB (1024K tokens, FP16 KV)

Calculator

Estimated VRAM required: 1012 GB

1600B params at Q4_K_M, 1,048,576 token context, batch 1, inference.

Weights: 896 GB
KV cache: 24.2 GB
Overhead: 92.0 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

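Sequence-dim compression applied: This model uses Compressed Sparse Attention. The KV cache shown is 25% of the pure-MLA figure. The real V4 cache is closer to ~10% of pure MLA at 1M context, so the number shown is a deliberately conservative middle estimate that over-counts rather than under-counts.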
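
To make the ratio concrete, here is a minimal sketch that rescales the calculator's 24.2 GB figure to other CSA ratios. It assumes the KV cache scales linearly with the ratio, which is how the note above describes the model. The function and constant names are illustrative and not taken from the calculator's code.

```python
# Sketch: rescale the calculator's KV-cache figure under different CSA ratios.
# Assumption: cache size is linear in the ratio (kept fraction of pure MLA).
# Names below are hypothetical, chosen for illustration only.

CALCULATOR_KV_GB = 24.2       # value the calculator shows at 1M context
CALCULATOR_CSA_RATIO = 0.25   # ratio the calculator applies on top of MLA


def kv_cache_at_ratio(csa_ratio: float) -> float:
    """Return the KV cache in GB if CSA kept `csa_ratio` of the pure-MLA cache."""
    pure_mla_gb = CALCULATOR_KV_GB / CALCULATOR_CSA_RATIO
    return pure_mla_gb * csa_ratio


pure_mla = kv_cache_at_ratio(1.0)    # no sequence-dim compression at all
modelled = kv_cache_at_ratio(0.25)   # what the calculator reports
observed = kv_cache_at_ratio(0.10)   # the ~10% figure quoted for 1M context

print(f"pure MLA:   {pure_mla:.1f} GB")
print(f"calculator: {modelled:.1f} GB")
print(f"~10% case:  {observed:.2f} GB")
```

Under these assumptions the gap between the modelled and the ~10% case is about 14.5 GB, which is small next to the 896 GB of weights.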

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.

How this is calculated

The 1.6T parameter pool dominates: 1600B params * 0.56 bytes/param = 896 GB of weights, regardless of which experts a given token routes to. CSA on top of MLA collapses the KV cache hard even at 1M context - the calculator models this as a 4x sequence-dim compression on top of the per-token MLA compression (1 KV head, head_dim 288), producing roughly 24 GB at the full 1M window. Activation/overhead at this scale lands around 92 GB. The 1012 GB total is the realistic resident-mode budget when you actually want frontier-grade serving throughput - active-only loading is a different deployment shape, useful for low-QPS chat on a single 80 GB GPU but not how you'd build a production endpoint.
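
The arithmetic in this section can be sketched in a few lines. The snippet takes the 0.56 bytes/param Q4_K_M figure, the 24.2 GB KV cache, and the 92 GB overhead directly from the numbers on this page. It does not derive the KV cache or overhead from first principles, and the names are illustrative.

```python
# Sketch of the resident-mode budget using the figures quoted on this page.
# KV cache and overhead are taken as given inputs, not derived here.

PARAMS_BILLIONS = 1600          # total parameter pool (all experts)
BYTES_PER_PARAM_Q4_K_M = 0.56   # effective bytes/param used by the calculator
KV_CACHE_GB = 24.2              # 1M context with MLA + CSA, as reported
OVERHEAD_GB = 92.0              # activations and framework overhead, as reported


def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Billions of params * bytes per param gives decimal GB directly."""
    return params_billions * bytes_per_param


weights = weights_gb(PARAMS_BILLIONS, BYTES_PER_PARAM_Q4_K_M)
total = weights + KV_CACHE_GB + OVERHEAD_GB

print(f"weights: {weights:.0f} GB")
print(f"total:   {total:.0f} GB")
print(f"weights share of total: {weights / total:.1%}")
```

Weights account for nearly nine tenths of the budget, which is why the KV-cache modelling choice barely moves the total.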

Verdict

Resident-mode V4 Pro at 1M context is multi-GPU datacenter territory: ~13 H100 80GB (more than a single 8-GPU node) or ~8 H200 141GB with NVLink, or comparable MI300X capacity. Active-only on a single 80 GB H100 / 96 GB pro card is a chat-grade fallback, not a serving target. For most workloads the hosted DeepSeek API is dramatically cheaper than self-hosting at this scale - self-host only when data residency or bit-exact control is the real requirement.
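
The GPU counts above come from dividing the budget by per-card memory and rounding up. The sketch below assumes memory pools perfectly across cards and ignores per-GPU fragmentation and any parallelism overhead, so treat the results as lower bounds. The helper name is hypothetical.

```python
import math

# Sketch: minimum card count for a given memory budget.
# Assumption: memory pools perfectly across GPUs (no per-card waste,
# no replicated buffers), so real deployments may need more cards.

RESIDENT_BUDGET_GB = 1012   # all experts resident, 1M context
ACTIVE_ONLY_GB = 70         # active-only loading figure from this page


def gpus_needed(budget_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of identical GPUs whose combined memory covers the budget."""
    return math.ceil(budget_gb / gpu_memory_gb)


for name, mem_gb in [("H100 80GB", 80), ("H200 141GB", 141), ("MI300X 192GB", 192)]:
    print(f"{name}: {gpus_needed(RESIDENT_BUDGET_GB, mem_gb)} cards resident, "
          f"{gpus_needed(ACTIVE_ONLY_GB, mem_gb)} active-only")
```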


Frequently asked questions

What's CSA and why does it matter?
Compressed Sparse Attention is V4's sequence-dimension compression layered on top of MLA's per-token compression. At 1M context the real KV cache is roughly 10% of pure MLA. We model it conservatively as a 4x reduction (0.25x ratio), which over-counts by about 2.5x at 1M context (0.25 / 0.10), stays within a small multiple of reality across 128K-1M, and never under-counts.
Can V4 Pro fit on 8x H100?
Not in resident mode at full 1M context - 1012 GB exceeds 8 * 80 = 640 GB. You need ~13 H100 80GB or ~8 H200 141GB to keep all experts resident. 8x H100 works only with active-only loading (where 70 GB fits comfortably) or with aggressive quantization below Q4_K_M.
Should I run V4 Pro or V4 Flash locally?
V4 Flash 284B is the practical answer - 284B * 0.56 = ~159 GB resident, ~21 GB active-only. These fit on a single MI300X 192GB and a 64 GB Mac Studio respectively, while keeping most of V4 Pro's reasoning quality.
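
As a rough check on those two fits, the sketch below computes the memory left over on each device. It assumes the ~159 GB resident figure is weights only, as the arithmetic above implies, so the leftover has to cover KV cache and overhead as well. The ~21 GB active-only figure is taken from this page as given, and the names are illustrative.

```python
# Sketch: headroom left after loading V4 Flash on the two devices named above.
# Assumption: the resident figure is weights only, so the headroom must still
# absorb KV cache and runtime overhead. Names are hypothetical.

FLASH_PARAMS_BILLIONS = 284
BYTES_PER_PARAM_Q4_K_M = 0.56
FLASH_ACTIVE_ONLY_GB = 21.0   # quoted on this page, not derived here


def headroom_gb(footprint_gb: float, device_memory_gb: float) -> float:
    """Memory remaining on the device after the footprint is loaded."""
    return device_memory_gb - footprint_gb


flash_resident_gb = FLASH_PARAMS_BILLIONS * BYTES_PER_PARAM_Q4_K_M

mi300x_left = headroom_gb(flash_resident_gb, 192)      # single MI300X 192GB
mac_left = headroom_gb(FLASH_ACTIVE_ONLY_GB, 64)       # 64 GB Mac Studio

print(f"resident on MI300X: {flash_resident_gb:.0f} GB used, {mi300x_left:.0f} GB left")
print(f"active-only on Mac Studio: {FLASH_ACTIVE_ONLY_GB:.0f} GB used, {mac_left:.0f} GB left")
```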