How much VRAM does DeepSeek V3 671B need at Q4_K_M with full 128K context?

DeepSeek V3 ships with a 128K native context, so that is the realistic deployment target - few deployments run a frontier model at 8K. At Q4_K_M with 128K context, the practical VRAM footprint is around 420 GB (416 to 423 GB in the estimates below, depending on how the KV cache is counted). The full 671B parameter count must be resident in memory even though only 37B activate per token, which puts this firmly in multi-GPU datacenter territory: 6x H100 80GB, 4x H200 141GB, or 3x MI300X 192GB.

Total VRAM required: 416 GB (DeepSeek V3 671B at Q4_K_M)
Weights: 376 GB (671B params)
KV cache: 2.3 GB (128K tokens, FP16 KV)

Calculator

Estimated VRAM required: 423 GB
671B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 376 GB
KV cache: 9.2 GB
Overhead: 38.5 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
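
The breakdown above can be reproduced with a few lines of arithmetic. The sketch below is not the calculator's actual code: the function name `estimate_vram_gb`, the effective 4.48 bits per weight, and the 10% overhead fraction are all assumptions, back-solved from the 376 GB and 38.5 GB figures shown on this page.

```python
def estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb,
                     overhead_fraction=0.10):
    """Return a weights / KV cache / overhead / total breakdown in decimal GB.

    bits_per_weight and overhead_fraction are assumed values inferred from
    the figures on this page, not published constants.
    """
    # (params_billion * 1e9 params) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    weights_gb = params_billion * bits_per_weight / 8
    # Overhead is modeled as a flat fraction of weights plus KV cache.
    overhead_gb = (weights_gb + kv_cache_gb) * overhead_fraction
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    return {
        "weights_gb": weights_gb,
        "kv_cache_gb": kv_cache_gb,
        "overhead_gb": overhead_gb,
        "total_gb": total_gb,
    }


if __name__ == "__main__":
    # The two KV cache figures that appear on this page.
    for kv_gb in (9.2, 2.3):
        result = estimate_vram_gb(671, 4.48, kv_gb)
        print(f"KV {kv_gb} GB -> total {result['total_gb']:.1f} GB")
```

Under these assumptions, a 9.2 GB KV cache gives a total between 423 and 424 GB, and a 2.3 GB KV cache gives roughly 416 GB, which would account for the two totals quoted above.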

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
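
For a rough sense of how many cards a multi-GPU setup needs, divide the total by each card's usable memory and round up. This is a minimal sketch: the helper `gpus_needed` and the 90% usable-memory fraction are assumptions for illustration, and it ignores parallelism constraints such as needing a GPU count that divides the attention heads evenly.

```python
import math


def gpus_needed(total_gb, gpu_gb, usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory covers total_gb.

    usable_fraction is an assumed headroom factor (runtime buffers,
    fragmentation), not a measured value.
    """
    return math.ceil(total_gb / (gpu_gb * usable_fraction))


if __name__ == "__main__":
    total_gb = 423  # calculator estimate from this page
    for name, gpu_gb in (("H100 80GB", 80), ("H200 141GB", 141), ("MI300X 192GB", 192)):
        print(f"{name}: {gpus_needed(total_gb, gpu_gb)}x")
```

With the 10% headroom assumption, the counts match the 6x H100, 4x H200, and 3x MI300X configurations named in the introduction. Without headroom, three H200s would total exactly 423 GB, which leaves no margin at all.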

How this is calculated

DeepSeek V3 uses Multi-head Latent Attention (MLA), which is the main reason 128K is even runnable. A vanilla transformer pays 2 x layers x hidden x bytes per token for its KV cache; MLA instead caches one compressed latent of roughly 576 values per token per layer. The 9.2 GB KV cache shown in the calculator breakdown is the more conservative of the two figures above; the summary card puts DeepSeek V3 at 128K context at roughly 2.3 GB. Either way, weights dominate at 376 GB. The inference economics are unusual: per-token compute matches a ~37B dense model thanks to sparse expert activation, but the memory footprint matches a 671B dense model.
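
The gap between the vanilla formula and an MLA-style cache can be checked directly. The sketch below assumes 61 layers, a hidden size of 7168, and a 576-value latent (512 compressed KV dimensions plus 64 RoPE dimensions) stored at 2 bytes per value; these come from the published model configuration rather than from this page, so treat them as assumptions. The function names are illustrative only.

```python
def naive_kv_bytes_per_token(layers, hidden, bytes_per_value=2):
    """Vanilla transformer: one key and one value vector per layer."""
    return 2 * layers * hidden * bytes_per_value


def mla_kv_bytes_per_token(layers, latent_dim, bytes_per_value=2):
    """MLA-style cache: one compressed latent per layer, shared by K and V."""
    return layers * latent_dim * bytes_per_value


# Assumed DeepSeek V3 shape parameters (not stated on this page).
LAYERS = 61
HIDDEN = 7168
LATENT_DIM = 512 + 64  # compressed KV latent + decoupled RoPE key
CONTEXT = 131_072      # 128K tokens

naive_gb = naive_kv_bytes_per_token(LAYERS, HIDDEN) * CONTEXT / 1e9
mla_gb = mla_kv_bytes_per_token(LAYERS, LATENT_DIM) * CONTEXT / 1e9

if __name__ == "__main__":
    print(f"naive: {naive_gb:.1f} GB, MLA: {mla_gb:.1f} GB, "
          f"ratio: {naive_gb / mla_gb:.1f}x")
```

Under these assumptions the vanilla formula lands above 200 GB at 128K, and the MLA-style cache lands near the calculator's 9.2 GB. The 2.3 GB summary figure is not reproduced at 2 bytes per value; it would need a narrower per-value width or a different accounting.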

Verdict

Hosting DeepSeek V3 671B locally is for organizations with serious infrastructure budgets; for everyone else, the hosted API is dramatically cheaper. MLA keeps the KV cache small compared to a standard transformer, but the 376 GB of weights still sets the hardware floor.

Frequently asked questions

Why is the KV cache so low in this calculator?
DeepSeek V3 and R1 use Multi-head Latent Attention (MLA), which compresses the KV cache by 10-20x compared to standard multi-head attention. We model MLA compression natively, which is why the 128K KV cache comes to a few GB (2.3 GB in the summary above) rather than the 200+ GB a naive computation gives.
Can DeepSeek V3 run on a single MI300X?
No. Even with MLA reducing the KV cache, the 376 GB of Q4_K_M weights alone exceed a single 192 GB MI300X. You need 3x MI300X, 6x H100 80GB, or 4x H200 141GB, linked by a high-bandwidth interconnect (NVLink on the NVIDIA cards) for tensor parallelism.
Does sparse activation reduce DeepSeek V3 memory?
It reduces compute, not memory. All experts must be resident in VRAM; sparse activation just means only some are used per forward pass.
What about DeepSeek-R1-Distill models?
The R1-Distill variants (Qwen 7B/14B/32B and Llama 8B/70B) are dense models that drop into standard VRAM math. R1-Distill-Llama-70B at Q4 needs about 41 GB at 8K context - very different from the 671B parent.