How much VRAM does Qwen3.5 122B need at Q4_K_M? The 2026 MoE workhorse

Qwen3.5 122B at 104 GB resident is the configuration to deploy when you have datacenter-grade hardware and want frontier reasoning quality with Apache 2.0 licensing. Active-only loading on a single 48 GB consumer or pro GPU is the 'just works' fallback for anyone asking 'what's the smartest local model I can run?' in 2026: capability that was hosted-API-only six months ago.

Qwen3.5 122B at Q4_K_M needs about 104 GB of VRAM at its native 256K context with all 122B params resident. The architecture is Qwen's Gated Delta Networks at scale: 48 layers, 12B active per token, 2 KV heads, head_dim 256. It's the workhorse MoE of the 2026 Qwen3.5 fleet - frontier-quality reasoning with Apache 2.0 licensing. Active-only loading keeps only the 12B active parameters' worth of weights in VRAM, which drops the resident footprint to ~36 GB and fits a 48 GB GPU; the trade-off is the usual bandwidth cost of fetching cold experts that are not resident.
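As a rough check on the KV cache figure, the sketch below applies the standard KV cache formula (one key and one value tensor per layer, per KV head) to the architecture numbers quoted above. It assumes FP16 entries (2 bytes each), decimal gigabytes, batch 1, and that all 48 layers keep a full-length cache, which is a simplification for a Gated Delta Networks model. The function name is illustrative, not part of any library.

```python
def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Return the KV cache size in decimal GB for a single sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    total_bytes = 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
    return total_bytes / 1e9


# Architecture values from the paragraph above; 256K context = 262,144 tokens.
qwen_kv = kv_cache_gb(layers=48, kv_heads=2, head_dim=256, tokens=262_144)
print(f"KV cache: {qwen_kv:.1f} GB")  # prints "KV cache: 25.8 GB"
```

Because the cache grows linearly with the token count, halving the context to 128K would halve this term under the same assumptions.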

By TechCompare · Updated July 2026

Total VRAM required: 103 GB
Qwen3.5 122B (MoE) at Q4_K_M

Weights: 68.3 GB (122B params)
KV cache: 25.8 GB (256K tokens, FP16 KV)

Estimated VRAM required: 103 GB
122B params at Q4_K_M, 262,144 token context, batch 1, inference.

Weights: 68.3 GB
KV cache: 25.8 GB
Overhead: 9.4 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

Custom architecture - SWA not applied. If you're modeling Gemma 3/4 or Mistral Nemo, pick the preset for accurate KV cache.

Hardware that fits

H200 141GB (Datacenter): 141 GB usable, 73% used
Apple M3 Ultra 192GB (Unified): 144 GB usable, 72% used
MI300X (Datacenter): 192 GB usable, 54% used

Just barely too small

Apple M3 Ultra 128GB (Unified): 96 GB usable, short by 7.5 GB
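The utilization and shortfall figures in this list can be reproduced with a few lines of arithmetic. The sketch below is a minimal illustration, assuming the requirement is the unrounded sum of the breakdown (68.3 + 25.8 + 9.4 = 103.5 GB) and that the memory figure listed for each device is what is usable for the model. The function name is hypothetical.

```python
# Unrounded requirement from the breakdown: weights + KV cache + overhead.
REQUIRED_GB = 68.3 + 25.8 + 9.4


def fit_report(name, usable_gb, required_gb=REQUIRED_GB):
    """Describe whether a device fits the estimate, in the page's wording."""
    if required_gb <= usable_gb:
        pct = round(100 * required_gb / usable_gb)
        return f"{name}: {pct}% used"
    return f"{name}: short by {required_gb - usable_gb:.1f} GB"


# Usable memory values as listed for each device above.
devices = [
    ("H200 141GB", 141),
    ("Apple M3 Ultra 192GB", 144),
    ("MI300X", 192),
    ("Apple M3 Ultra 128GB", 96),
]
for name, usable in devices:
    print(fit_report(name, usable))
```

Note that the two Apple entries count 75% of the nameplate unified memory as usable, which is why a 128 GB machine falls short of a 103.5 GB requirement.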

How this is calculated

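The estimate sums three terms: 68.3 GB of weights at Q4_K_M (122B params at about 0.56 bytes each), 25.8 GB of KV cache at the full 256K native context, and ~9.4 GB of overhead, for 103.5 GB, or about 104 GB. That resident total is the deployment number: it fits on one H200 141GB, an MI300X, or two 80 GB cards with tensor parallelism. Active-only shrinks the weights to 12B * 0.56 = 6.7 GB and leaves the KV cache at 25.8 GB. The ~36 GB total implies that overhead shrinks as well, to roughly 3 GB, which is consistent with the 9.4 GB figure being about 10% of weights plus KV cache. Holding overhead fixed at 9.4 GB would instead give about 42 GB, which still fits a 48 GB GPU.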
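The arithmetic can be sketched in a few lines. This is an illustration under stated assumptions, not the calculator's actual code: the 0.56 bytes per parameter comes from the 12B * 0.56 figure above, and the 10% overhead fraction is an inference from 9.4 / (68.3 + 25.8), which is also what makes the active-only total land near 36 GB. All names are hypothetical.

```python
BYTES_PER_PARAM_Q4_K_M = 0.56  # from the article's "12B * 0.56" figure
OVERHEAD_FRACTION = 0.10       # assumption: inferred from 9.4 / (68.3 + 25.8)
KV_CACHE_GB = 25.8             # 256K tokens, FP16 KV, from the breakdown


def estimate_vram_gb(params_billion, kv_cache_gb=KV_CACHE_GB):
    """Return (weights, overhead, total) in GB for a given parameter count."""
    weights = params_billion * BYTES_PER_PARAM_Q4_K_M
    overhead = OVERHEAD_FRACTION * (weights + kv_cache_gb)
    return weights, overhead, weights + kv_cache_gb + overhead


# All 122B params resident versus only the 12B active params resident.
for label, params in [("resident", 122), ("active-only", 12)]:
    weights, overhead, total = estimate_vram_gb(params)
    print(f"{label}: weights {weights:.1f} GB, "
          f"overhead {overhead:.1f} GB, total {total:.1f} GB")
```

Under these assumptions the resident case comes to about 103.5 GB and the active-only case to about 35.8 GB, matching the page's 103 to 104 GB and ~36 GB figures.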
