How much VRAM does GLM-5.1 744B (MoE) need at Q4_K_M? Zhipu 200K frontier

GLM-5.1 744B at Q4_K_M with its native 200K context needs about 574 GB of VRAM when all experts are resident in GPU memory. That is the baseline for sizing production cluster deployments. This Mixture of Experts model from Zhipu activates 40B parameters per token out of its 744B total. If you stream inactive experts from host RAM or solid-state storage using expert offload (for example, with llama.cpp), the resident VRAM requirement drops to roughly 140 GB. The trade-off is significant PCIe round-trip latency whenever a cold expert has to be loaded.

Total VRAM required

574 GB for GLM-5.1 744B (MoE) at Q4_K_M, 200,000 token context, batch 1, inference.

Weights: 417 GB (744B params)
KV cache: 105 GB (200,000 tokens, FP16 KV)
Overhead: 52.1 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

Hardware that fits

No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
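
As a rough sanity check on card counts, the sketch below divides the required memory by per-card capacity and rounds up. It looks at raw capacity only. The function name is illustrative, and real deployments also depend on the parallelism layout and on how evenly memory splits across cards.

```python
import math

def cards_needed(required_gb: float, card_gb: float) -> int:
    """Smallest number of identical cards whose combined memory covers required_gb."""
    return math.ceil(required_gb / card_gb)

RESIDENT_GB = 574     # all experts in VRAM, 200K context (figure from this page)
ACTIVE_ONLY_GB = 140  # active-only offload, 200K context (figure from this page)

for label, card_gb in [("80 GB card", 80), ("141 GB H200", 141)]:
    print(label,
          "resident:", cards_needed(RESIDENT_GB, card_gb),
          "active-only:", cards_needed(ACTIVE_ONLY_GB, card_gb))
```

By this count, resident mode takes eight 80 GB cards or five 141 GB cards, since four H200s total 564 GB.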

How this is calculated

Zhipu built GLM-5.1 with 64 layers, a hidden size of 8192, and 16 key-value heads. The 744B parameter pool at Q4_K_M requires 417 GB of weights regardless of routing, because every expert must be stored even though only 40B parameters are active per token. At the maximum native context window of 200K tokens, the FP16 key-value cache uses 105 GB. Activation and software overhead add about 10% on top of weights and KV cache, or roughly 52 GB. If you offload cold experts and keep only the active 40B weights in VRAM, weight memory shrinks to 22.4 GB; with the same 105 GB KV cache and about 13 GB of overhead, total resident usage drops to 140 GB. Offloaded experts then have to cross the PCIe bus, so you'll need high-bandwidth hardware to run this MoE at acceptable speeds.
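
The arithmetic above can be reproduced with a short Python sketch. It assumes a head dimension of 128 (not stated on this page, but it reproduces the 105 GB figure), decimal gigabytes, and a flat 10% overhead inferred from the published totals. None of these are confirmed details of how the calculator is implemented.

```python
GB = 1e9  # decimal gigabytes (assumption)

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """FP16 KV cache size: one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

def estimate_vram_gb(weights_gb, kv_gb, overhead_fraction=0.10):
    """Weights plus KV cache, plus an assumed flat overhead fraction."""
    return (weights_gb + kv_gb) * (1 + overhead_fraction)

# 64 layers and 16 KV heads come from the text; head_dim=128 is an assumption.
kv_gb = kv_cache_bytes(layers=64, kv_heads=16, head_dim=128, tokens=200_000) / GB

resident_weights_gb = 417.0            # all 744B parameters at Q4_K_M
active_weights_gb = 417.0 * 40 / 744   # only the 40B active parameters

print(f"KV cache:    {kv_gb:.1f} GB")
print(f"Resident:    {estimate_vram_gb(resident_weights_gb, kv_gb):.0f} GB")
print(f"Active-only: {estimate_vram_gb(active_weights_gb, kv_gb):.0f} GB")
```

Under these assumptions the script prints a KV cache of 104.9 GB and totals of 574 GB and 140 GB, matching the figures quoted above.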

Verdict

Self-hosting GLM-5.1 744B in resident mode requires a substantial hardware setup. You'll need at least eight 80 GB datacenter cards with NVLink (640 GB total). Four 141 GB H200 cards total 564 GB, just short of the 574 GB estimate, so they only fit with a reduced context window; five or more cover the full 200K. Active-only offload is possible on dual 80 GB cards or a high-end unified memory workstation, but generation speeds will drop. For typical applications, the official API is the most economical starting point.

Frequently asked questions

What hardware is best to run GLM-5.1 in full resident mode?
You'll need a cluster of eight GPU cards with at least 80 GB of memory each, such as H100 or A100. Linking them with NVLink is vital to handle expert routing communication without bottlenecks. Four 141 GB H200 cards provide 564 GB, which falls just short of the 574 GB estimate at the full 200K context, so plan on five or more, or trim the context window.

Can I run this model on consumer hardware?
Not at its full native context size. With active-only offload, the model still needs 140 GB of VRAM at 200K context, most of it KV cache. Two RTX 3090 or RTX 4090 cards (48 GB combined) can hold the 22.4 GB of active weights only if you cap the context length to a much smaller window, and token generation will be slow because cold experts must load over the PCIe bus.
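
To get a feel for how far the context has to be capped, the sketch below inverts the estimate: given a VRAM budget, it solves for the largest KV cache that still fits. It reuses the 64 layers and 16 KV heads quoted earlier and assumes a head dimension of 128, FP16 KV, decimal gigabytes, and a flat 10% overhead. It also treats two 24 GB cards as one 48 GB pool, which is optimistic.

```python
# Bytes of FP16 KV cache per token: K and V, 64 layers, 16 KV heads,
# head_dim 128 (assumed), 2 bytes per value.
BYTES_PER_TOKEN = 2 * 64 * 16 * 128 * 2

def max_context_tokens(vram_gb, weights_gb, overhead_fraction=0.10):
    """Largest context whose KV cache fits beside the weights and assumed overhead."""
    kv_budget_bytes = (vram_gb / (1 + overhead_fraction) - weights_gb) * 1e9
    return max(0, int(kv_budget_bytes // BYTES_PER_TOKEN))

active_weights_gb = 417 * 40 / 744  # about 22.4 GB of active weights at Q4_K_M

tokens_48gb = max_context_tokens(48, active_weights_gb)   # two 24 GB cards
tokens_160gb = max_context_tokens(160, active_weights_gb) # two 80 GB cards
print(tokens_48gb, tokens_160gb)
```

Under these assumptions a 48 GB pool leaves room for roughly 40K tokens of context, while 160 GB clears the full 200K window.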