How much VRAM does Llama 4 Scout (17B active / 109B total) need at Q4_K_M? Native 10M Context
Llama 4 Scout at Q4_K_M with its native 10M context needs about 2231 GB of VRAM with all 109B params resident - that's the number you size hardware against. Scout has 17B active parameters per token out of a 109B pool, 48 layers, 8 KV heads, and Meta's iRoPE architecture, which interleaves rotary-embedding attention layers with layers that use no positional embeddings and is what Meta credits for the 10M-token context window. llama.cpp-style active-only loading keeps just the 17B active parameters in VRAM and drops the footprint to roughly 2180 GB, a saving of only about 51 GB (around 2%) because the KV cache dominates at this context length, and it comes at the cost of streaming cold experts through system RAM each token.
Calculator
Estimated VRAM required
2335 GB
109B params at Q4_K_M, 10,485,760 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
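To make the context-length trade-off concrete, here is a minimal Python sketch that evaluates the KV cache at a few context lengths. It assumes the figures used elsewhere on this page (48 layers, 8 KV heads, head_dim 128, 0.56 bytes per parameter at Q4_K_M), 2-byte cache entries, and plain GQA where every layer caches the full context. The function and variable names are illustrative and are not the calculator's actual code.

```python
# Assumed Scout figures taken from this page; sizes in decimal GB (1e9 bytes).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128
BYTES_PER_ELEM = 2                 # FP16/BF16 cache entries (assumption)
WEIGHTS_GB = 109e9 * 0.56 / 1e9    # all 109B params at Q4_K_M

def kv_cache_gb(n_tokens: int) -> float:
    """K and V caches across all layers, in decimal GB."""
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token_bytes * n_tokens / 1e9

for ctx in (8_192, 65_536, 1_048_576, 10_485_760):
    print(f"{ctx:>10,} tokens: {kv_cache_gb(ctx):8.1f} GB KV cache")

# Context length at which the KV cache matches the quantized weights.
crossover_tokens = WEIGHTS_GB / kv_cache_gb(1)
print(f"KV cache equals weights at ~{crossover_tokens:,.0f} tokens")
```

Under these assumptions the cache stays under 2 GB at 8K and around 13 GB at 64K, and it only overtakes the weights at a few hundred thousand tokens, which is why the warning appears at the 10M setting.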
Hardware that fits
No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
How this is calculated
The 109B parameter pool at Q4_K_M is 61 GB of weights (109B * 0.56 bytes per parameter), plus a massive 2062 GB KV cache (2 for K and V * 8 KV heads * 128 head_dim * 48 layers * 10,485,760 tokens * 2 bytes) and ~109 GB overhead. That 2231 GB total is the resident-mode budget. Active-only loading keeps the same KV cache and overhead but shrinks weights to 17B * 0.56 = 9.5 GB, totaling ~2180 GB. The saving is only about 51 GB because the KV cache dominates at this context length, so active-only loading does not by itself bring Scout within reach of a pooled multi-GPU workstation or high-end unified memory setup at 10M tokens.
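The arithmetic above can be reproduced with a short Python sketch. It assumes decimal gigabytes, 0.56 bytes per parameter for Q4_K_M, 2-byte KV entries, and the ~109 GB overhead taken as a fixed constant from the text. The factor of 2 in the KV formula covers the separate K and V tensors, which is what yields the 2062 GB figure. Function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes

def weights_gb(n_params: float, bytes_per_param: float = 0.56) -> float:
    """Quantized weight size; 0.56 bytes/param is the page's Q4_K_M figure."""
    return n_params * bytes_per_param / GB

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_tokens: int, bytes_per_elem: int = 2) -> float:
    """K and V caches for every layer and token (leading 2 = K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / GB

OVERHEAD_GB = 109  # taken from the text as-is (assumption: context-independent)

kv = kv_cache_gb(48, 8, 128, 10_485_760)
resident_total = weights_gb(109e9) + kv + OVERHEAD_GB  # all experts in VRAM
active_total = weights_gb(17e9) + kv + OVERHEAD_GB     # active params only

print(f"KV cache:       {kv:7.1f} GB")
print(f"Resident total: {resident_total:7.1f} GB")
print(f"Active-only:    {active_total:7.1f} GB")
print(f"Saving:         {resident_total - active_total:7.1f} GB")
```

Both totals land within a gigabyte of the rounded figures quoted above, and the gap between them is just the difference in weight size.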
Verdict
Llama 4 Scout in resident mode at native 10M context requires a datacenter-scale setup: about 2231 GB, or at least 28 H100 80GB cards (2240 GB in aggregate) before any per-GPU overhead. Active-only loading lowers the requirement only slightly, to ~2180 GB, which still demands an extreme server configuration.
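As a rough check on the card count, the sketch below divides each total by per-card memory and rounds up. It assumes ideal packing across 80 GB cards and ignores per-GPU runtime overhead and parallelism constraints, so real deployments would likely need more. The helper name is hypothetical.

```python
import math

def gpus_needed(total_gb: float, gb_per_gpu: float = 80.0) -> int:
    """Minimum card count under ideal packing (no per-GPU overhead)."""
    return math.ceil(total_gb / gb_per_gpu)

resident_cards = gpus_needed(2231)  # all 109B params resident
active_cards = gpus_needed(2180)    # active-only loading

print(f"Resident mode: {resident_cards} x 80 GB")
print(f"Active-only:   {active_cards} x 80 GB")
```

Under this idealized packing both modes come out at the same number of 80 GB cards, which illustrates how little active-only loading helps when the KV cache dominates.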
Frequently asked questions
What's iRoPE and why does it support 10M context?
Can Scout run on a single 24 GB GPU?
What about Maverick (17B/400B)?