How much VRAM does Nemotron 3 Super 120B need at Q4_K_M?
NVIDIA's serving-tuned MoE
Nemotron 3 Super 120B at Q4_K_M with its native 1M-token context needs about 376 GB of VRAM with all experts resident. It's an MoE (12B active per token out of 120B total) that NVIDIA has tuned aggressively for batched-serving throughput on H100/H200. The architecture is 64 layers, hidden size 6144, 8 KV heads, and head_dim 128. Active-only loading drops the GPU-resident footprint to ~310 GB, but Nemotron's whole point is sustained server throughput, and that calls for resident mode.
Calculator
Estimated VRAM required
376 GB
120B params at Q4_K_M, 1,048,576 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
Custom architecture: sliding-window attention (SWA) is not applied to this estimate. If you're modeling Gemma 3/4 or Mistral Nemo, pick the matching preset for an accurate KV cache figure.
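To see why the KV cache dominates at 1M tokens and how quickly it shrinks at shorter contexts, here is a minimal sketch of the standard GQA cache formula using the architecture figures quoted on this page. The 2-byte (fp16/bf16) cache precision and the decimal-GB convention are assumptions chosen because they reproduce the page's figure; the function name is hypothetical and this is not the calculator's actual code.

```python
# Architecture figures quoted on this page.
LAYERS = 64
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assumption: fp16/bf16 KV cache, no KV quantization


def kv_cache_gb(context_tokens: int, batch: int = 1) -> float:
    """Approximate KV cache size in decimal GB for full (non-sliding) GQA attention."""
    # The factor of 2 covers the separate key and value tensors in every layer.
    bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return bytes_per_token * context_tokens * batch / 1e9


# Compare typical local context lengths with the native 1M context.
for ctx in (8_192, 65_536, 1_048_576):
    print(f"{ctx:>9,} tokens: {kv_cache_gb(ctx):6.1f} GB")
```

Under these assumptions the cache is roughly 2 GB at 8K tokens and 17 GB at 64K, against about 275 GB at the full 1,048,576-token context, so the context length is by far the largest lever on total VRAM.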
Hardware that fits
No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
How this is calculated
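67 GB of weights at Q4_K_M (full 120B pool), 275 GB of KV cache at 1M context with standard 8-KV-head GQA, and ~34 GB of overhead (about 10% of weights plus KV cache). The 376 GB resident total fits multi-GPU hardware, such as four 141 GB or 192 GB datacenter cards with tensor parallelism. Active-only loading comes to ~310 GB because only the 12B active parameters' share of the weights stays in VRAM, but inactive experts must then be fetched over PCIe, with the usual cold-expert latency penalty.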
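The breakdown above can be reproduced with a few lines of arithmetic. This is a rough sketch, not the calculator's implementation: the 10% overhead fraction is inferred from the page's own numbers (34 GB on top of 67 + 275 GB), the active-only case assumes weights scale with the 12B/120B active share, and the helper names are hypothetical. The card count is a capacity-only lower bound that ignores per-GPU duplication and activation memory.

```python
import math

WEIGHTS_GB = 67.0            # full 120B pool at Q4_K_M, from this page
KV_CACHE_GB = 275.0          # 1M-token context, from this page
OVERHEAD_FRACTION = 0.10     # assumption: inferred from 34 GB / (67 + 275) GB
ACTIVE_FRACTION = 12 / 120   # 12B active parameters out of 120B total


def total_gb(weights_gb: float, kv_gb: float,
             overhead: float = OVERHEAD_FRACTION) -> float:
    """Weights plus KV cache, scaled by a flat overhead fraction."""
    return (weights_gb + kv_gb) * (1 + overhead)


def min_cards(required_gb: float, card_gb: float) -> int:
    """Fewest cards whose combined memory covers the requirement (capacity only)."""
    return math.ceil(required_gb / card_gb)


resident = total_gb(WEIGHTS_GB, KV_CACHE_GB)
active_only = total_gb(WEIGHTS_GB * ACTIVE_FRACTION, KV_CACHE_GB)

print(f"resident:    {resident:.0f} GB")
print(f"active-only: {active_only:.0f} GB")
for card_gb in (141, 192):
    print(f"{card_gb} GB cards needed: {min_cards(resident, card_gb)}")
```

On capacity alone, three 141 GB cards or two 192 GB cards would cover the resident total. A four-card layout leaves headroom, and tensor-parallel degrees are commonly picked to divide the KV head count (8 here) evenly, which is one plausible reason to prefer four.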
Verdict
Pick Nemotron 3 Super 120B in resident mode when you have datacenter hardware and predictable sustained throughput matters. It's the MoE tuned for batched serving rather than peak single-stream speed. For desktop experimentation, active-only works at the cost of variable per-token latency.
Frequently asked questions
Why is Nemotron 3 Super tuned by NVIDIA for serving?
Can Nemotron 3 Super fit on a 4090/5090?