How much VRAM does DeepSeek V4 Pro 1.6T need at Q4_K_M? CSA + MLA frontier
DeepSeek V4 Pro 1.6T at Q4_K_M with the full 1M-token context needs about 1012 GB of VRAM with every expert resident - this is the real shape of the model and the number to plan a deployment against. The 1.6T parameter pool at Q4 is 896 GB of weights on its own, plus ~24 GB of KV cache thanks to MLA + Compressed Sparse Attention and ~92 GB of activation/overhead at this scale. As a planning escape hatch, llama.cpp `--n-cpu-moe` style active-only loading drops the resident footprint to about 70 GB, but routes cold experts through system RAM or NVMe at a per-token bandwidth penalty.
Calculator
Estimated VRAM required
1012 GB
1600B params at Q4_K_M, 1,048,576 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
Sequence-dim compression applied: This model uses Compressed Sparse Attention. KV cache shown is 25% of pure MLA. Real V4 cache hits ~10% at 1M context; this is a conservative middle estimate.
Hardware that fits
No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
How this is calculated
The 1.6T parameter pool dominates: 1600B * 0.56 B/param = 896 GB of weights regardless of which expert routes for a given token. CSA on top of MLA collapses the KV cache hard even at 1M context - the calculator models this as a 4x sequence-dim compression on top of the per-token MLA compression (1 KV head, head_dim 288), producing roughly 24 GB at the full 1M window. Activation/overhead at this scale lands around 92 GB. The 1012 GB total is the realistic resident-mode budget when you actually want frontier-grade serving throughput - active-only loading is a different deployment shape, useful for low-QPS chat on a single 80 GB GPU but not how you'd build a production endpoint.
Verdict
Resident-mode V4 Pro at 1M context is multi-node datacenter territory: ~13 H100 80GB or ~8 H200 141GB with NVLink, or comparable MI300X capacity. Active-only on a single 80 GB H100 / 96 GB pro card is a chat-grade fallback, not a serving target. For most workloads the hosted DeepSeek API is dramatically cheaper than self-hosting at this scale - self-host only when data residency or bit-exact control is the real requirement.
More DeepSeek scenarios
Frequently asked questions
What's CSA and why does it matter?
Can V4 Pro fit on 8x H100?
Should I run V4 Pro or V4 Flash locally?
Related tools
RAM Latency Calculator
Convert DDR3/DDR4/DDR5 timings (CL, tRCD, tRP, tRAS) into true latency in nanoseconds.
Use tool ➜Power Cost Estimator
Estimate annual electricity costs for your PC, Server, or TV.
Use tool ➜Data Transfer Calculator
Estimate transfer times for files over USB, WiFi, Ethernet, and more.
Use tool ➜Data Read Visualizer
Visualize the massive speed difference between CPU cache, RAM, and storage.
Use tool ➜