How much VRAM does Llama 4 Scout (17B active / 109B total) need at Q4_K_M? Native 10M Context

With all weights resident, Llama 4 Scout at its native 10M context needs a datacenter-scale setup (e.g., 28x H100 80GB cards, 2240 GB in total). Active-only loading trims the requirement only slightly, to ~2180 GB, which still demands an extreme server configuration.

Llama 4 Scout at Q4_K_M with its native 10M context needs about 2231 GB of VRAM with all 109B parameters resident, and that is the number to size hardware against (the calculator below budgets more overhead and reports 2335 GB). Scout activates 17B parameters per token out of a 109B pool and has 48 layers, 8 KV heads, and Meta's iRoPE interleaved rotary scheme, which natively supports a 10M-token context window. llama.cpp-style active-only loading drops the resident footprint to roughly 2180 GB at the cost of streaming cold experts through system RAM each token, but because the KV cache dominates at this context length, the saving is only about 51 GB.

By TechCompare · Updated July 2026

Total VRAM required: 2335 GB

Llama 4 Scout (MoE 17B/109B) at Q4_K_M

Weights: 61.0 GB (109B params)

KV cache: 2062 GB (10240K tokens, FP16 KV)

Estimated VRAM required: 2335 GB

109B params at Q4_K_M, 10,485,760 token context, batch 1, inference.

Weights: 61.0 GB

KV cache: 2062 GB

Overhead: 212 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
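To make the effect of context length concrete, here is a minimal Python sketch of the KV cache size at a few context lengths. It uses the Scout figures quoted in this article (48 layers, 8 KV heads, head_dim 128, FP16 KV) and assumes every layer caches the full context; the function name `kv_bytes_per_token` is illustrative, not part of any library.

```python
def kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes of KV cache added per token (assumes full-context KV in every layer)."""
    # The leading 2 accounts for storing both a K and a V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Compare typical local contexts against the native 10M window.
for ctx in (8_192, 32_768, 65_536, 10_485_760):
    gb = kv_bytes_per_token() * ctx / 1e9  # decimal GB
    print(f"{ctx:>10,} tokens -> {gb:8.2f} GB KV cache")
```

Under these assumptions the cache grows linearly with context, so moving from the native window to a 64K context shrinks it from about 2062 GB to roughly 13 GB.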

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
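As a rough guide to how many cards a multi-GPU setup would need, the sketch below divides a VRAM budget by per-card memory and rounds up. It is a lower bound only: it assumes memory pools perfectly across cards and ignores per-GPU runtime overhead and any parallelism constraints. The helper name `min_gpus` is hypothetical.

```python
import math


def min_gpus(required_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of identical GPUs whose pooled memory covers required_gb."""
    return math.ceil(required_gb / gpu_memory_gb)


H100_GB = 80  # H100 80GB, the card used as the example in this article

print(min_gpus(2335, H100_GB))  # calculator total -> 30
print(min_gpus(2231, H100_GB))  # resident-mode figure from the intro -> 28
```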


How this is calculated

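The 109B parameter pool at Q4_K_M (about 0.56 bytes per parameter) is 61 GB of weights, plus a massive 2062 GB KV cache (2 for K and V * 8 KV heads * 128 head_dim * 48 layers * 10,485,760 tokens * 2 bytes) and ~109 GB overhead. That 2231 GB total is the resident-mode budget; the calculator above allows 212 GB of overhead instead, which gives its 2335 GB figure. Active-only loading keeps the same KV cache and activations but shrinks weights to 17B * 0.56 = 9.5 GB, totaling ~2180 GB. Because the KV cache dominates, that saves only about 51 GB at 10M context; active-only loading is more useful at shorter contexts, where it can help fit Scout on a pooled multi-GPU workstation or high-end unified memory setup.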
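The arithmetic can be reproduced with a short Python sketch. It uses only figures quoted on this page (0.56 bytes per parameter, 48 layers, 8 KV heads, head_dim 128, FP16 KV, 10,485,760 tokens) and takes the overhead as an input, since the prose (~109 GB) and the calculator (212 GB) use different allowances. Note that the quoted 2062 GB KV cache is only reached with a factor of 2 for the separate K and V tensors. Function names are illustrative.

```python
BYTES_PER_PARAM_Q4_K_M = 0.56  # effective rate used in this article
CONTEXT = 10_485_760           # native 10M-token window


def weights_gb(params_billions: float) -> float:
    """Weight footprint in decimal GB for a parameter count given in billions."""
    return params_billions * BYTES_PER_PARAM_Q4_K_M


def kv_cache_gb(n_tokens: int, n_layers=48, n_kv_heads=8, head_dim=128,
                bytes_per_value=2) -> float:
    """FP16 KV cache in decimal GB, assuming every layer caches the full context."""
    # The leading 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens / 1e9


def total_gb(params_billions: float, n_tokens: int, overhead_gb: float) -> float:
    """Weights + KV cache + a caller-supplied overhead allowance."""
    return weights_gb(params_billions) + kv_cache_gb(n_tokens) + overhead_gb


for label, params in (("resident", 109), ("active-only", 17)):
    for overhead in (109, 212):
        total = total_gb(params, CONTEXT, overhead)
        print(f"{label:>11}, {overhead} GB overhead: {total:7.1f} GB")
```

Keeping the overhead as a parameter makes it easy to see that the weights term barely moves the total at this context length, whichever allowance is used.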
