How much VRAM does DeepSeek V3 671B need at Q4_K_M with full 128K context?
DeepSeek V3 ships with a 128K native context, so that is the realistic deployment target: few people run a frontier model at 8K. At Q4_K_M with 128K context, the practical VRAM footprint is around 420 GB (the calculator below estimates 423 GB). All 671B parameters must be resident in memory even though only 37B activate per token, which puts this firmly in multi-GPU datacenter territory: 6x H100 80GB, 4x H200 141GB, or 3x MI300X 192GB.
Calculator
Estimated VRAM required
423 GB
671B params at Q4_K_M, 131,072 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
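The weight portion of the estimate can be approximated as parameters times average bits per weight. The sketch below is not the calculator's actual code. It assumes about 4.5 bits per weight for Q4_K_M, a value chosen here because it roughly reproduces the weight figure quoted on this page; real GGUF files vary by model and tensor mix. The function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: ~4.5 bits per weight on average for Q4_K_M (varies by model).
deepseek_v3_weights_gb = weight_memory_gb(671e9, 4.5)
print(f"Weights: ~{deepseek_v3_weights_gb:.0f} GB")  # about 377 GB
```

Every parameter is counted, not just the 37B active per token, because all experts have to be loaded.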
Hardware that fits
No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
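As a rough check on the multi-GPU configurations mentioned in the introduction, the sketch below divides the estimated footprint by per-GPU capacity. The 90% usable fraction is an assumption made here to leave headroom for framework and driver overhead, and the helper name is hypothetical. It ignores parallelism constraints such as frameworks that prefer particular GPU counts.

```python
import math


def min_gpus(required_gb: float, gpu_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory covers the requirement."""
    # Assumption: only `usable_fraction` of each card is available to the model.
    return math.ceil(required_gb / (gpu_gb * usable_fraction))


required_gb = 423  # estimate shown by the calculator above
for name, capacity_gb in [("H100 80GB", 80), ("H200 141GB", 141), ("MI300X 192GB", 192)]:
    print(f"{name}: {min_gpus(required_gb, capacity_gb)} GPUs")
```

With that headroom assumption the counts come out to 6, 4, and 3 respectively. Without headroom, three H200s would total exactly 423 GB and leave no margin.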
How this is calculated
DeepSeek V3 uses Multi-head Latent Attention (MLA), which is the main reason 128K is even runnable. A vanilla transformer pays the standard 2 x layers x hidden x bytes per token for its KV cache. MLA instead caches roughly 576 values per token per layer (a 512-dimensional compressed latent plus a 64-dimensional RoPE key). The KV cache portion of the estimate above is therefore much higher than reality: in practice DeepSeek V3 at 128K context consumes roughly 2.3 GB of KV cache, with the exact figure depending on cache precision. Weights still dominate at 376 GB regardless. The inference economics are unusual: per-token compute matches a ~37B dense model thanks to sparse expert activation, but the memory footprint matches a 671B dense model.
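The gap between the two KV cache formulas can be sketched numerically. The defaults below are assumptions taken from the published DeepSeek V3 configuration rather than from this page: 61 layers, hidden size 7168, and an MLA cache of 576 values per token per layer (512 latent plus 64 RoPE). The function names are hypothetical, and the bytes-per-value setting depends on the runtime.

```python
def mla_kv_cache_gb(tokens: int, layers: int = 61, latent_dim: int = 512,
                    rope_dim: int = 64, bytes_per_value: float = 2.0) -> float:
    """MLA cache: one compressed latent plus one RoPE key per token per layer."""
    return tokens * layers * (latent_dim + rope_dim) * bytes_per_value / 1e9


def vanilla_kv_cache_gb(tokens: int, layers: int = 61, hidden: int = 7168,
                        bytes_per_value: float = 2.0) -> float:
    """Standard cache: 2 x layers x hidden x bytes per token (full K and V)."""
    return 2 * layers * hidden * bytes_per_value * tokens / 1e9


ctx = 131_072
print(f"vanilla, 16-bit: {vanilla_kv_cache_gb(ctx):.1f} GB")
print(f"MLA, 16-bit:     {mla_kv_cache_gb(ctx):.1f} GB")
print(f"MLA, 4-bit:      {mla_kv_cache_gb(ctx, bytes_per_value=0.5):.1f} GB")
```

Under these assumptions a 16-bit MLA cache at 128K comes to roughly 9 GB, and the ~2.3 GB figure quoted above corresponds to about half a byte per cached value. Either way the MLA cache is about 25x smaller than the vanilla formula at the same precision.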
Verdict
Hosting DeepSeek V3 671B locally is for organizations with serious infrastructure budgets. For everyone else, the hosted API is dramatically cheaper. MLA keeps the KV cache far smaller than a standard transformer's, but the roughly 376 GB of weights still set the hardware floor.
Frequently asked questions
Why is the KV cache so high in this calculator?
Can DeepSeek V3 run on a single MI300X?
Does sparse activation reduce DeepSeek V3 memory?
What about DeepSeek-R1-Distill models?