How much VRAM does GLM-5.1 744B (MoE) need at Q4_K_M? Zhipu 200K frontier

Self-hosting GLM-5.1 744B with all experts resident requires a substantial hardware setup: at least eight 80 GB datacenter cards with NVLink (640 GB total). Four 141 GB H200 cards total 564 GB, about 10 GB short of the 574 GB estimate at full context, so that configuration fits only with a shorter context window or a fifth card. Active-only offload is possible on dual 80 GB cards or a high-end unified memory workstation, but generation speed will drop. For typical applications, the official API is the most economical starting point.

GLM-5.1 744B at Q4_K_M with its native 200K context needs about 574 GB of VRAM when all experts are resident in GPU memory. That is the baseline for sizing production cluster deployments. This Mixture of Experts model from Zhipu activates 40B of its 744B total parameters per token. If you stream inactive experts from host RAM or solid-state storage using expert offload (for example, in llama.cpp), the resident VRAM requirement drops to roughly 140 GB. However, every expert fetched over PCIe adds round-trip latency, so generation slows down.
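The gap between the two modes is mostly weight arithmetic. Here is a minimal sketch, assuming about 0.56 bytes per parameter for Q4_K_M and an overhead of roughly 10% of weights plus KV cache. Both constants are inferred from the figures on this page rather than taken from any official source, and the function name is hypothetical.

```python
# Assumptions inferred from this page's figures, not official constants.
BYTES_PER_PARAM_Q4_K_M = 0.56   # roughly 4.5 bits per weight on average
OVERHEAD_FRACTION = 0.10        # activations, buffers, runtime overhead
KV_CACHE_GB = 105.0             # FP16 KV cache at 200,000 tokens (from the page)


def resident_vram_gb(params_billion: float) -> float:
    """Estimate VRAM in GB when `params_billion` parameters stay on the GPU."""
    weights_gb = params_billion * BYTES_PER_PARAM_Q4_K_M
    return (weights_gb + KV_CACHE_GB) * (1 + OVERHEAD_FRACTION)


all_experts = resident_vram_gb(744)  # every expert resident in VRAM
active_only = resident_vram_gb(40)   # only the 40B active parameters resident

print(f"All experts resident: {all_experts:.0f} GB")
print(f"Active-only offload:  {active_only:.0f} GB")
```

The KV cache does not shrink under offload, which is why the active-only figure stays well above the weight size alone.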

By TechCompare · Updated July 2026

Estimated VRAM required

574 GB

GLM-5.1 744B (MoE) at Q4_K_M: 744B params, 200,000 token context, batch 1, inference.

Weights

417 GB

KV cache

105 GB (200,000 tokens, FP16 KV)

Overhead

52.1 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
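To see what those tolerances mean for the total, the sketch below applies them to the point estimates. It assumes the ~5% GQA tolerance applies to this model and holds overhead fixed at 52.1 GB, since the page gives no tolerance for overhead. The function name is hypothetical.

```python
# Point estimates (GB) and tolerances as stated on this page.
WEIGHTS_GB, WEIGHTS_TOL = 417.0, 0.02
KV_GB, KV_TOL = 105.0, 0.05   # assumption: the GQA tolerance applies here
OVERHEAD_GB = 52.1            # assumption: held fixed, no tolerance is given


def vram_range_gb() -> tuple[float, float]:
    """Return (low, high) total VRAM in GB under the stated tolerances."""
    low = WEIGHTS_GB * (1 - WEIGHTS_TOL) + KV_GB * (1 - KV_TOL) + OVERHEAD_GB
    high = WEIGHTS_GB * (1 + WEIGHTS_TOL) + KV_GB * (1 + KV_TOL) + OVERHEAD_GB
    return low, high


low, high = vram_range_gb()
print(f"Plausible range: {low:.0f} to {high:.0f} GB")
```

Framework and driver differences are not included in this band, so it is worth leaving headroom beyond the upper bound.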

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
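A rough way to size a multi-GPU setup is to divide the requirement by per-card memory and round up. The sketch below assumes memory pools perfectly across cards, which real frameworks only approximate, and uses a hypothetical helper name.

```python
import math


def cards_needed(required_gb: float, card_gb: float) -> int:
    """Minimum card count, assuming memory pools perfectly across cards."""
    return math.ceil(required_gb / card_gb)


scenarios = {"all experts resident": 574, "active-only offload": 140}

for label, required_gb in scenarios.items():
    for card_gb in (80, 141):
        count = cards_needed(required_gb, card_gb)
        print(f"{label}: {count} x {card_gb} GB cards")
```

Treat the result as a lower bound. Each card carries its own runtime overhead, tensor-parallel setups often expect an even split, and a fit with only a gigabyte or so to spare (such as 140 GB on a single 141 GB card) is unlikely to be usable in practice.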


How this is calculated

Zhipu built GLM-5.1 with 64 layers, a hidden size of 8192, and 16 key-value heads. The 744B parameter pool at Q4_K_M takes 417 GB of weights regardless of routing, because every expert has to be stored even though only some are used per token. At the maximum native context window of 200K tokens, the FP16 key-value cache uses 105 GB. Activation and software overhead adds about 52 GB, roughly 10% of weights plus KV cache. If you offload cold experts and keep only the active 40B weights in VRAM, weight memory shrinks to 22.4 GB; with the 105 GB KV cache unchanged and overhead falling to about 13 GB, total resident usage drops to about 140 GB. You'll need high-bandwidth hardware to run this MoE at acceptable speeds.
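The KV cache figure can be reproduced from the architecture numbers with the standard per-token formula: keys plus values, for every layer and every KV head. The page does not state the head dimension, so the sketch below assumes 128, a value chosen because it reproduces the 105 GB figure in decimal gigabytes. The function name is hypothetical.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int = 2, batch: int = 1) -> int:
    """KV cache size in bytes: keys and values for every layer and KV head."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value * batch


HEAD_DIM = 128  # assumption: not stated on the page

total_bytes = kv_cache_bytes(layers=64, kv_heads=16, head_dim=HEAD_DIM,
                             tokens=200_000)  # FP16 means 2 bytes per value
kv_gb = total_bytes / 1e9  # decimal GB, matching the page's figures

print(f"KV cache: {kv_gb:.1f} GB")
```

Because the cache grows linearly with token count, halving the context to 100,000 tokens would halve this figure under the same assumptions.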
