How much VRAM does Qwen3.5 122B need at Q4_K_M? The 2026 MoE workhorse

Qwen3.5 122B at Q4_K_M with native 256K context needs about 104 GB of VRAM with all 122B params resident. The architecture is Qwen's Gated Delta Networks at scale: 48 layers, 12B active per token, 2 KV heads, head_dim 256, native 256K context. It's the workhorse MoE of the 2026 Qwen3.5 fleet - frontier-quality reasoning with Apache 2.0 licensing. Active-only loading drops the resident footprint to ~36 GB if you want to run it on a 48 GB GPU at the usual cold-expert bandwidth cost.

Total VRAM required
103 GB
Qwen3.5 122B (MoE) at Q4_K_M
Weights
68.3 GB
122B params
KV cache
25.8 GB
256K tokens, FP16 KV

Calculator

Estimated VRAM required

103 GB

122B params at Q4_K_M, 262,144 token context, batch 1, inference.

Weights
68.3 GB
KV cache
25.8 GB
Overhead
9.4 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

Custom architecture - SWA not applied. If you're modeling Gemma 3/4 or Mistral Nemo, pick the preset for accurate KV cache.

Hardware that fits

H200 141GB
Datacenter
141 GB
73% used
Apple M3 Ultra 192GB
Unified
144 GB
72% used
MI300X
Datacenter
192 GB
54% used

Just barely too small

Apple M3 Ultra 128GB
Unified
96 GB
short by 7.5 GB

How this is calculated

68 GB of weights at Q4_K_M, 26 GB KV cache (at the full 256K native context), and ~9.4 GB overhead. The 104 GB resident total is the deployment number - one H200 141GB, an MI300X, or two 80 GB cards with tensor parallelism. Active-only shrinks weights to 12B * 0.56 = 6.7 GB while keeping the KV cache and overhead stable, totaling ~36 GB.

Verdict

Qwen3.5 122B at 104 GB resident is the configuration to deploy when you have datacenter-grade hardware and want frontier reasoning quality with Apache 2.0 licensing. Active-only on a single 48 GB consumer or pro GPU is the 'just works' fallback for someone asking 'what's the smartest local model I can run?' in 2026 - capability that was hosted-API-only six months ago.

More Qwen scenarios

Frequently asked questions

How does Qwen3.5 122B compare to Llama 4 Scout?
Roughly comparable footprint in active-only mode (~36 GB vs ~237 GB at full context), but Qwen3.5 trades a smaller native context for stronger general reasoning. Pick Qwen3.5 for general use, Scout when you need extreme long context.
What about Qwen3.5 27B (dense) instead?
Qwen3.5 27B is dense - 27B * 0.56 = 15 GB at Q4_K_M, fits on a 24 GB card with full speed. The 122B MoE is stronger on hard reasoning tasks but 27B is faster per token in resident mode. Both are valid choices.
Does Qwen3.5 use Gated Delta Networks?
Yes - Qwen3.5 interleaves Gated Delta (linear attention) with Gated standard attention layers. The KV cache numbers in this calculator approximate the standard-attention path. Gated Delta layers have constant memory regardless of context length, so real long-context behavior is better than what we show.