How much VRAM does Nemotron 3 Super 120B need at Q4_K_M? NVIDIA's serving-tuned MoE

Nemotron 3 Super 120B at Q4_K_M with native 1M context needs about 376 GB of VRAM with all experts resident. It's an MoE (12B active per token / 120B total) that NVIDIA has tuned aggressively for H100/H200 batched-serving throughput - the architecture is 64 layers, hidden 6144, 8 KV heads, head_dim 128. Active-only loading drops the resident footprint to ~310 GB, but Nemotron's whole point is sustained server throughput, which is a resident-mode concern.

Total VRAM required
376 GB
Nemotron 3 Super 120B (MoE) at Q4_K_M
Weights
67.2 GB
120B params
KV cache
275 GB
1024K tokens, FP16 KV

Calculator

Estimated VRAM required

376 GB

120B params at Q4_K_M, 1,048,576 token context, batch 1, inference.

Weights
67.2 GB
KV cache
275 GB
Overhead
34.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Custom architecture - SWA not applied. If you're modeling Gemma 3/4 or Mistral Nemo, pick the preset for accurate KV cache.

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.

How this is calculated

67 GB of weights at Q4_K_M (full 120B pool), 275 GB KV cache at 1M context with standard 8-KV-head GQA, and ~34 GB overhead. The 376 GB resident total fits multi-GPU hardware (such as four 141 GB or 192 GB datacenter cards with tensor parallelism). Active-only is ~310 GB, but at the usual cold-expert PCIe penalty.

Verdict

Pick Nemotron 3 Super 120B in resident mode when you have datacenter hardware and predictable sustained throughput matters - it's the MoE tuned for batched serving rather than peak single-stream speed. For desktop experimentation, active-only works at the cost of variable per-token latency.

Frequently asked questions

Why is Nemotron 3 Super tuned by NVIDIA for serving?
NVIDIA optimizes the kernel-level expert routing, fused attention, and tensor/pipeline parallel splits specifically for H100/H200. The result is markedly higher batched-throughput than a generic MoE of the same active-param count - particularly important for high-QPS endpoints rather than single-stream chat.
Can Nemotron 3 Super fit on a 4090/5090?
No. At native 1M context, the 275 GB KV cache alone vastly exceeds consumer GPU limits. To run the model at full 1M context, pooled datacenter cards are required. Even with active-only offload, the model requires ~310 GB of memory.