How much VRAM does DeepSeek V3 671B need at Q4_K_M with full 128K context?

DeepSeek V3 671B local hosting is for organizations with serious infrastructure budgets. For everyone else the hosted API is dramatically cheaper. DeepSeek V3's MLA keeps the KV cache astonishingly small compared to a standard transformer.

DeepSeek V3 ships with a 128K native context, so that is the realistic deployment target; nobody runs a frontier model at 8K. At Q4_K_M with 128K context, the practical VRAM footprint is roughly 416 to 423 GB, depending on which KV cache estimate is used. The full 671B parameter count must be resident in memory even though only 37B parameters activate per token, which puts this firmly in multi-GPU datacenter territory: 6x H100 80GB, 4x H200 141GB, or 3x MI300X 192GB.

By TechCompare · Updated July 2026

Total VRAM required: 416 GB (DeepSeek V3 671B at Q4_K_M)

Weights: 376 GB (671B params)

KV cache: 2.3 GB (128K tokens, FP16 KV)

Estimated VRAM required: 423 GB

671B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 376 GB

KV cache: 9.2 GB

Overhead: 38.5 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
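The two headline totals on this page (416 GB and 423 GB) can be reproduced from the component figures. The sketch below is only arithmetic on the numbers quoted above. It assumes GB means 10^9 bytes and that the displayed totals are truncated rather than rounded; the constant and function names are illustrative, not part of the calculator.

```python
# Component figures quoted on this page (GB assumed to mean 1e9 bytes).
WEIGHTS_GB = 376.0
KV_CALCULATOR_GB = 9.2   # KV cache figure from the calculator breakdown
KV_LOWER_GB = 2.3        # lower KV cache figure from the summary card
OVERHEAD_GB = 38.5
PARAMS = 671e9


def total_gb(kv_gb: float) -> float:
    """Sum weights, KV cache, and overhead into one VRAM estimate."""
    return WEIGHTS_GB + kv_gb + OVERHEAD_GB


calculator_total = total_gb(KV_CALCULATOR_GB)  # about 423.7
summary_total = total_gb(KV_LOWER_GB)          # about 416.8

# Bits per weight implied by 376 GB for 671B parameters (derived, not quoted).
implied_bits_per_weight = WEIGHTS_GB * 1e9 * 8 / PARAMS

# Overhead as a fraction of weights + KV cache; this comes out close to 10%.
overhead_fraction = OVERHEAD_GB / (WEIGHTS_GB + KV_CALCULATOR_GB)

print(f"calculator total: {calculator_total:.1f} GB")
print(f"summary total:    {summary_total:.1f} GB")
print(f"implied bits per weight: {implied_bits_per_weight:.2f}")
print(f"overhead fraction: {overhead_fraction:.3f}")
```

Only the KV cache term differs between the two totals, so the roughly 7 GB gap comes entirely from which cache estimate is used.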

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
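As a rough capacity check, the multi-GPU counts quoted earlier (6x H100 80GB, 4x H200 141GB, 3x MI300X 192GB) follow from dividing the estimated footprint by per-GPU memory and rounding up. This is a minimal sketch that assumes memory splits evenly across devices and ignores per-device runtime overhead and parallelism constraints, so treat the results as lower bounds. The names are illustrative.

```python
import math

# Calculator estimate: weights + KV cache + overhead, in GB.
TOTAL_GB = 376 + 9.2 + 38.5

# Per-GPU memory in GB, as listed on this page.
GPU_MEMORY_GB = {
    "H100 80GB": 80,
    "H200 141GB": 141,
    "MI300X 192GB": 192,
}


def min_gpus(total_gb: float, per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined memory covers total_gb."""
    return math.ceil(total_gb / per_gpu_gb)


counts = {name: min_gpus(TOTAL_GB, cap) for name, cap in GPU_MEMORY_GB.items()}

for name, n in counts.items():
    print(f"{n}x {name} ({n * GPU_MEMORY_GB[name]} GB total)")
```

Three H200s provide exactly 423 GB, which sits just under the 423.7 GB estimate and leaves no headroom, so the count rounds up to four.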

How this is calculated

DeepSeek V3 uses Multi-head Latent Attention (MLA), which is the main reason 128K is even runnable. Instead of storing full keys and values (roughly 2 x layers x hidden x bytes per token in a vanilla transformer), MLA caches a compressed latent of about 576 values per token per layer. This page quotes two KV cache figures for 128K context: the calculator's 9.2 GB estimate and a lower in-practice figure of roughly 2.3 GB. Either way the cache is small next to the weights, which dominate at 376 GB. The inference economics are unusual: per-token compute matches a ~37B dense model thanks to sparse expert activation, but the memory footprint matches a 671B dense model.
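The comparison between an MLA cache and a vanilla cache can be sketched numerically. The architecture constants below (61 layers, hidden size 7168, a 512-dimensional compressed KV latent plus a 64-dimensional RoPE key) are assumptions taken from the published DeepSeek V3 model configuration, not from this page, and the vanilla figure uses the simplified 2 x layers x hidden x bytes formula. With FP16 cache entries the MLA result lands near the calculator's 9.2 GB figure; reaching the lower 2.3 GB figure would need a per-token footprint about four times smaller, which this sketch does not model.

```python
# Assumed DeepSeek V3 architecture values (from the public model config).
LAYERS = 61
HIDDEN = 7168
KV_LATENT_DIM = 512   # compressed KV latent per token per layer
ROPE_KEY_DIM = 64     # decoupled RoPE key per token per layer
CONTEXT_TOKENS = 131_072
FP16_BYTES = 2


def mla_kv_bytes(tokens: int, bytes_per_value: int = FP16_BYTES) -> int:
    """KV cache size when each layer stores only the latent plus RoPE key."""
    values_per_token_per_layer = KV_LATENT_DIM + ROPE_KEY_DIM  # 576
    return tokens * LAYERS * values_per_token_per_layer * bytes_per_value


def vanilla_kv_bytes(tokens: int, bytes_per_value: int = FP16_BYTES) -> int:
    """KV cache size under the simplified 2 x layers x hidden x bytes rule."""
    return tokens * 2 * LAYERS * HIDDEN * bytes_per_value


mla = mla_kv_bytes(CONTEXT_TOKENS)
vanilla = vanilla_kv_bytes(CONTEXT_TOKENS)

print(f"MLA cache:     {mla / 1e9:.1f} GB")
print(f"Vanilla cache: {vanilla / 1e9:.1f} GB")
print(f"Reduction:     {vanilla / mla:.1f}x")
```

Under these assumptions a vanilla cache at 128K context would be larger than half the quantized weights, which is why the latent compression matters so much at this context length.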
