How much VRAM does Nemotron 3 Super 120B need at Q4_K_M? NVIDIA's serving-tuned MoE

Pick Nemotron 3 Super 120B in resident mode when you have datacenter hardware and need predictable sustained throughput: it is an MoE tuned for batched serving rather than peak single-stream speed. For desktop experimentation, active-only loading works at the cost of variable per-token latency, but only with a much shorter context, because at 1M tokens the KV cache rather than the weights dominates memory.

Nemotron 3 Super 120B at Q4_K_M with its native 1M context needs about 376 GB of VRAM with all experts resident. It is an MoE (12B active per token out of 120B total) that NVIDIA has tuned aggressively for batched-serving throughput on H100/H200. The architecture is 64 layers, hidden size 6144, 8 KV heads, head_dim 128. Active-only loading drops the footprint to ~310 GB, a modest saving because the 275 GB KV cache dominates at this context length. Nemotron's whole point is sustained server throughput, which is a resident-mode concern.

By TechCompare · Updated July 2026

Total VRAM required: 376 GB

Nemotron 3 Super 120B (MoE) at Q4_K_M

Weights: 67.2 GB (120B params)

KV cache: 275 GB (1024K tokens, FP16 KV)

Estimated VRAM required: 376 GB

120B params at Q4_K_M, 1,048,576 token context, batch 1, inference.

Weights: 67.2 GB

KV cache: 275 GB

Overhead: 34.2 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
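To see how quickly the KV cache shrinks with context, here is a minimal sketch using the architecture figures quoted on this page (64 layers, 8 KV heads, head_dim 128, FP16 KV). The function names are illustrative, and the formula is the standard GQA estimate rather than anything a specific framework is guaranteed to match.

```python
# Rough KV cache size for a GQA model, in decimal GB (as used on this page).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128   # architecture figures from this page
BYTES_FP16 = 2                            # FP16 keys and values

def kv_bytes_per_token() -> int:
    # One K and one V vector of KV_HEADS * HEAD_DIM values per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16

def kv_cache_gb(tokens: int) -> float:
    return kv_bytes_per_token() * tokens / 1e9

if __name__ == "__main__":
    for k in (8, 32, 64, 128, 1024):
        tokens = k * 1024
        print(f"{k:>5}K tokens: {kv_cache_gb(tokens):7.2f} GB")
```

Under these assumptions each token costs 262,144 bytes of cache, so an 8K context needs roughly 2.1 GB and a 64K context roughly 17.2 GB, compared with about 275 GB at the full 1M.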

Custom architecture: sliding-window attention (SWA) is not applied. If you're modeling Gemma 3/4 or Mistral Nemo, pick the preset for an accurate KV cache.

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
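As a rough guide to how many cards a multi-GPU setup needs, the sketch below divides the 376 GB resident estimate by per-card memory. It counts capacity only and ignores per-card framework overhead. Rounding up to a tensor-parallel degree that divides the 8 KV heads is a simplifying assumption made here, not a rule taken from any particular serving framework, and `min_cards` and `tp_cards` are hypothetical helper names.

```python
import math

TOTAL_GB = 376   # resident estimate from this page
KV_HEADS = 8     # KV head count from this page

def min_cards(card_gb: float, total_gb: float = TOTAL_GB) -> int:
    # Fewest cards whose combined memory covers the estimate.
    return math.ceil(total_gb / card_gb)

def tp_cards(card_gb: float, total_gb: float = TOTAL_GB):
    # Assumption: the tensor-parallel degree should divide the KV head count.
    needed = min_cards(card_gb, total_gb)
    for degree in (d for d in range(1, KV_HEADS + 1) if KV_HEADS % d == 0):
        if degree >= needed:
            return degree
    return None  # more than KV_HEADS-way parallelism is outside this sketch

if __name__ == "__main__":
    for card_gb in (80, 141, 192):
        print(f"{card_gb} GB cards: min {min_cards(card_gb)}, "
              f"tensor-parallel {tp_cards(card_gb)}")
```

By capacity alone, three 141 GB cards or two 192 GB cards cover the estimate, but two 192 GB cards leave only about 8 GB of slack, so a larger count is the safer choice for real serving workloads.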


How this is calculated

The total is 67 GB of weights at Q4_K_M (the full 120B expert pool), 275 GB of KV cache at 1M context with standard 8-KV-head GQA, and ~34 GB of overhead (about 10% of weights plus KV cache). The 376 GB resident total fits multi-GPU hardware, such as four 141 GB or 192 GB datacenter cards with tensor parallelism. Active-only loading is ~310 GB, but every token routed to a non-resident expert pays the usual cold-expert PCIe transfer penalty.
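The arithmetic behind these figures can be reproduced with a short sketch. The two constants are assumptions back-derived from the numbers on this page (67.2 GB for 120B parameters implies about 4.48 bits per weight, and 34.2 GB of overhead is about 10% of weights plus KV cache). They are not published constants of the calculator, and the function names are illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the figures on this page

BITS_PER_WEIGHT = 4.48     # assumption: back-derived from 67.2 GB / 120B params
OVERHEAD_FRACTION = 0.10   # assumption: back-derived from 34.2 / (67.2 + 275)

def weights_gb(params: float) -> float:
    return params * BITS_PER_WEIGHT / 8 / GB

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int = 2) -> float:
    # K and V each hold kv_heads * head_dim values per layer per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / GB

def total_gb(loaded_params: float, tokens: int) -> float:
    base = weights_gb(loaded_params) + kv_cache_gb(64, 8, 128, tokens)
    return base * (1 + OVERHEAD_FRACTION)

CONTEXT = 1_048_576
resident = total_gb(120e9, CONTEXT)     # all experts in VRAM
active_only = total_gb(12e9, CONTEXT)   # only the 12B active parameters in VRAM

if __name__ == "__main__":
    print(f"resident:    {resident:.1f} GB")
    print(f"active-only: {active_only:.1f} GB")
```

With these assumptions the sketch lands on roughly 376 GB resident and 310 GB active-only, which shows why offloading experts saves comparatively little at 1M context: the KV cache term is unchanged.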
