How much VRAM does Llama 4 Scout (17B active / 109B total) need at Q4_K_M? Native 10M Context

Llama 4 Scout at Q4_K_M with its native 10M context needs roughly 2231 to 2335 GB of VRAM with all 109B parameters resident, depending on how much overhead you budget (about 109 GB in the breakdown under "How this is calculated", 212 GB in the calculator). Size hardware against that resident figure. Scout activates 17B parameters per token out of a 109B pool, has 48 layers and 8 KV heads, and uses Meta's iRoPE interleaved rotary scheme, which natively supports a 10M-token context window. As a single-node escape hatch, llama.cpp-style active-only loading trims the footprint by about 51 GB (to roughly 2180 GB on the lower estimate), at the cost of streaming cold experts through system RAM each token.

Total VRAM required: 2335 GB
Llama 4 Scout (MoE 17B/109B) at Q4_K_M
Weights: 61.0 GB (109B params)
KV cache: 2062 GB (10240K tokens, FP16 KV)

Calculator

Estimated VRAM required: 2335 GB

109B params at Q4_K_M, 10,485,760 token context, batch 1, inference.

Weights: 61.0 GB
KV cache: 2062 GB
Overhead: 212 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.

How this is calculated

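The 109B parameter pool at Q4_K_M is 61 GB of weights (109B * 0.56 bytes per parameter). On top of that sits a massive 2062 GB KV cache (2 for K and V * 8 KV heads * 128 head_dim * 48 layers * 10,485,760 tokens * 2 bytes for FP16) and ~109 GB of overhead. That total of about 2231 GB is the resident-mode budget; the calculator above budgets a more conservative 212 GB of overhead, which gives 2335 GB. Active-only loading keeps the same KV cache and activations but shrinks the weights to 17B * 0.56 = 9.5 GB, for a total of ~2180 GB. At 10M context that saves only about 51 GB, roughly 2% of the budget, so it does not change the hardware class. It matters far more at shorter contexts, where weights rather than KV cache dominate and Scout can fit on a pooled multi-GPU workstation or a high-end unified-memory setup.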
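
The arithmetic above can be reproduced with a short Python sketch. It uses only the constants quoted in the text (48 layers, 8 KV heads, head_dim 128, FP16 KV, 0.56 bytes per parameter, ~109 GB overhead); the function names are illustrative, not part of any library. Note the factor of 2 for storing both K and V, which is what brings the product to 2062 GB.

```python
# Sketch of the sizing arithmetic; every constant comes from the text above.
GB = 1e9  # decimal gigabytes, matching the page's figures


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """KV cache size: K and V each hold n_kv_heads * head_dim values
    per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def weight_bytes(n_params, bytes_per_param=0.56):
    """Quantized weight size; 0.56 bytes/param is the Q4_K_M average
    implied by 109B params -> 61 GB."""
    return n_params * bytes_per_param


OVERHEAD_GB = 109  # the ~109 GB overhead figure quoted in the text

kv_gb = kv_cache_bytes(48, 8, 128, 10_485_760) / GB
resident_weights_gb = weight_bytes(109e9) / GB
active_weights_gb = weight_bytes(17e9) / GB

resident_total_gb = resident_weights_gb + kv_gb + OVERHEAD_GB
active_total_gb = active_weights_gb + kv_gb + OVERHEAD_GB

print(f"KV cache:       {kv_gb:8.1f} GB")
print(f"Resident total: {resident_total_gb:8.1f} GB")
print(f"Active-only:    {active_total_gb:8.1f} GB")
```

The totals land within about 1 GB of the 2231 GB and 2180 GB figures; the small difference comes from rounding the individual components.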

Verdict

Llama 4 Scout in resident mode at native 10M context requires a datacenter-scale setup: at least 28x H100 80GB cards (2240 GB pooled) for the ~2231 GB estimate, or 30 if you size against the calculator's 2335 GB. Active-only loading lowers the requirement only slightly, to ~2180 GB, which still demands an extreme server configuration.
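
As a rough check on the card count, the sketch below divides each VRAM estimate by 80 GB and rounds up. It is a lower bound only: it assumes memory pools perfectly across cards and ignores any per-GPU duplication or parallelism overhead. The helper name `gpus_needed` is hypothetical.

```python
import math


def gpus_needed(total_gb, gb_per_gpu=80):
    """Minimum number of cards whose pooled memory covers total_gb.
    Assumes perfect pooling, so real deployments may need more."""
    return math.ceil(total_gb / gb_per_gpu)


resident_cards = gpus_needed(2231)     # resident mode, ~109 GB overhead
active_only_cards = gpus_needed(2180)  # active-only loading
calculator_cards = gpus_needed(2335)   # calculator total, 212 GB overhead

print(resident_cards, active_only_cards, calculator_cards)
```

Under these assumptions, active-only loading does not remove a single 80 GB card at 10M context, because the KV cache dominates the budget.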

Frequently asked questions

What's iRoPE and why does it support 10M context?
iRoPE interleaves two kinds of attention layer: local-pattern layers that apply rotary positional encodings and global-pattern layers that use no positional embedding. This structural change lets Scout extrapolate well past its training context, up to 10M tokens, with stable attention patterns. Maverick uses the same scheme with a 1M native window.
Can Scout run on a single 24 GB GPU?
Not at native 10M context: the 2062 GB KV cache alone exceeds the memory of any single card, so 10M context needs a multi-GPU setup. If you drop the context to 32K, however, the KV cache shrinks to about 6.4 GB and active-only loading brings the footprint down to ~18 GB, which fits comfortably on a 24 GB card (with the cold experts held in system RAM).
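
To see how the footprint scales with context length, here is a minimal sketch for active-only loading. The 10% overhead fraction is an assumption, inferred from the calculator's 212 GB overhead against roughly 2123 GB of weights plus KV cache; the other constants are the ones quoted on this page, and `footprint_gb` is an illustrative name.

```python
def footprint_gb(n_tokens, n_params=17e9, overhead_frac=0.10):
    """Approximate VRAM in decimal GB for Scout at Q4_K_M with FP16 KV.
    overhead_frac=0.10 is an assumed allowance, not a measured value."""
    kv_bytes = 2 * 48 * 8 * 128 * n_tokens * 2  # K and V, 48 layers, 8 heads, dim 128
    weight_bytes = n_params * 0.56              # Q4_K_M average bytes per parameter
    return (kv_bytes + weight_bytes) * (1 + overhead_frac) / 1e9


for ctx in (8_192, 32_768, 131_072, 10_485_760):
    print(f"{ctx:>10,} tokens: {footprint_gb(ctx):8.1f} GB")
```

At 32K tokens this comes out a little under 18 GB, in line with the ~18 GB figure above; the KV cache grows linearly with context, so the same setup outgrows a 24 GB card well before 128K.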
What about Maverick (17B/400B)?
Maverick has the same 17B active parameters but a roughly 4x larger expert pool. Resident is ~250 GB at Q4 (multi-GPU), active-only is ~22 GB. It offers stronger reasoning than Scout, but its native context window is 1M rather than 10M.