How much VRAM does GLM-5.1 744B (MoE) need at Q4_K_M?
Zhipu 200K frontier
GLM-5.1 744B at Q4_K_M with its native 200K context needs about 574 GB of VRAM when all experts are resident in GPU memory. That is the baseline for sizing production cluster deployments. This Mixture-of-Experts (MoE) model from Zhipu activates 40B of its 744B total parameters per token. If you stream inactive experts from host RAM or solid-state storage using expert offload, such as an active-only setup in llama.cpp, the resident VRAM requirement drops to roughly 140 GB. However, this offloaded configuration adds significant PCIe round-trip latency, because routed experts that are not in VRAM must be fetched over the bus.
Calculator
Estimated VRAM required
574 GB
744B params at Q4_K_M, 200,000 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
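As a rough illustration of what those tolerances mean for this model, the sketch below applies them to the component figures from the "How this is calculated" breakdown (417 GB weights, 105 GB KV cache, 52 GB overhead). It assumes the standard GQA tolerance of 5% applies to GLM-5.1 and holds overhead fixed. Both are simplifying assumptions, and the helper name `bounds` is purely illustrative.

```python
def bounds(value_gb: float, tolerance: float) -> tuple[float, float]:
    """Return the (low, high) range for a value with a symmetric relative tolerance."""
    return value_gb * (1 - tolerance), value_gb * (1 + tolerance)

WEIGHTS_GB = 417    # from the page's breakdown
KV_CACHE_GB = 105   # from the page's breakdown
OVERHEAD_GB = 52    # held fixed here (assumption)

w_lo, w_hi = bounds(WEIGHTS_GB, 0.02)     # weights within ~2%
kv_lo, kv_hi = bounds(KV_CACHE_GB, 0.05)  # KV cache within ~5% (GQA assumed)

low_total = w_lo + kv_lo + OVERHEAD_GB
high_total = w_hi + kv_hi + OVERHEAD_GB

print(f"Estimated range: {low_total:.0f} to {high_total:.0f} GB")
```

Under these assumptions the 574 GB headline figure sits inside a band of roughly 14 GB either side, before any framework-specific variation.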
Hardware that fits
No single GPU in our catalog has enough memory. Multi-GPU or CPU offload required.
How this is calculated
Zhipu built GLM-5.1 with 64 layers, a hidden size of 8192, and 16 key-value heads. The 744B parameter pool at Q4_K_M requires 417 GB for weights regardless of routing, since every expert must be stored even though only some run per token. At the maximum native context window of 200K tokens, the FP16 key-value cache takes 105 GB. Activation and software overhead scales with the resident footprint and adds about 52 GB here, roughly 10% of weights plus cache. That gives 417 + 105 + 52 = 574 GB. If you offload cold experts and keep only the active 40B weights in VRAM, weight memory shrinks to 22.4 GB and total resident usage drops to 140 GB (22.4 GB of weights, 105 GB of cache, and about 13 GB of overhead). In that mode, speed depends on the bandwidth between host memory and the GPUs, so you'll need high-bandwidth hardware to run this MoE at acceptable speeds.
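The arithmetic behind these figures can be sketched in a few lines. This is a minimal reconstruction, not the calculator's actual code. Three values are assumptions inferred from the published numbers rather than stated on the page: an effective 4.48 bits per weight for Q4_K_M (417 GB / 744B parameters), a head dimension of 128 (which reproduces the 105 GB cache), and a flat 10% overhead on weights plus cache.

```python
BITS_PER_WEIGHT = 4.48    # assumption: effective Q4_K_M density implied by 417 GB / 744B
HEAD_DIM = 128            # assumption: not stated on the page; reproduces the 105 GB cache
OVERHEAD_FRACTION = 0.10  # assumption: inferred from 52 GB / (417 GB + 105 GB)

def weights_gb(params: float, bits_per_weight: float = BITS_PER_WEIGHT) -> float:
    """Weight memory in decimal GB for a given parameter count."""
    return params * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in decimal GB; the leading 2 covers one key and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

def total_gb(weights: float, kv: float,
             overhead_fraction: float = OVERHEAD_FRACTION) -> float:
    """Resident VRAM including activation and software overhead."""
    return (weights + kv) * (1 + overhead_fraction)

kv = kv_cache_gb(layers=64, kv_heads=16, head_dim=HEAD_DIM, context_tokens=200_000)
resident_w = weights_gb(744e9)  # all experts in VRAM
active_w = weights_gb(40e9)     # only the active 40B parameters in VRAM

resident_total = total_gb(resident_w, kv)
active_total = total_gb(active_w, kv)

print(f"KV cache:        {kv:6.1f} GB")
print(f"Resident total:  {resident_total:6.1f} GB")
print(f"Active-only:     {active_total:6.1f} GB")
```

Note that the KV cache is the same in both modes, which is why active-only offload still needs about 140 GB: at 200K context the cache, not the weights, dominates the resident footprint.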
Verdict
Self-hosting GLM-5.1 744B in resident mode requires about 574 GB of pooled VRAM. That means at least eight 80 GB datacenter cards with NVLink (640 GB), or five 141 GB H200 cards (705 GB), since four H200s provide only 564 GB. Active-only offload fits on two 80 GB cards or a high-end unified-memory workstation with at least 140 GB available to the GPU, but generation speed will drop. For typical applications, the official API is the most economical starting point.
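A quick card-count check for the two modes, as a sketch: it simply divides the required VRAM by per-card capacity and rounds up. It assumes memory pools perfectly across cards and ignores per-card framework overhead and any tensor-parallel constraints on card count, so treat the results as lower bounds. The function name `cards_needed` is illustrative.

```python
import math

def cards_needed(required_gb: float, card_gb: float) -> int:
    """Minimum number of cards whose combined memory covers the requirement."""
    return math.ceil(required_gb / card_gb)

RESIDENT_GB = 574     # all experts in VRAM
ACTIVE_ONLY_GB = 140  # cold experts offloaded

for label, need in (("resident", RESIDENT_GB), ("active-only", ACTIVE_ONLY_GB)):
    for card_name, card_gb in (("80 GB card", 80), ("141 GB H200", 141)):
        n = cards_needed(need, card_gb)
        print(f"{label:12s} {card_name:12s} x{n} = {n * card_gb} GB")
```

For resident mode this gives eight 80 GB cards (640 GB) or five H200s (705 GB). For active-only mode it gives two 80 GB cards (160 GB) or one H200 (141 GB), which leaves almost no headroom.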
Frequently asked questions
What hardware is best to run GLM-5.1 in full resident mode?
Can I run this model on consumer hardware?