How much VRAM does Mixtral 8x7B need at Q4_K_M? MoE memory math

Mixtral 8x7B at Q4_K_M with 32K context needs about 34 GB of VRAM. Mixture-of-experts is the wrinkle here: only 2 of the 8 experts activate per token, but all 8 must stay in memory at all times. From a memory standpoint that makes Mixtral 8x7B effectively a 47B-parameter model, even though it computes like a 12B one.

Total VRAM required: 33.7 GB (Mixtral 8x7B at Q4_K_M)
Weights: 26.3 GB (47B params)
KV cache: 4.3 GB (32K tokens, FP16 KV)

Calculator

Estimated VRAM required: 33.7 GB

47B params at Q4_K_M, 32,768 token context, batch 1, inference.

Weights: 26.3 GB
KV cache: 4.3 GB
Overhead: 3.1 GB
Doesn't fit on a 32 GB consumer GPU at Q4_K_M. A smaller quant (21.3 GB) is the suggested option for fitting a single RTX 5090.

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

Hardware that fits

A100 40GB (Datacenter): 40 GB, 84% used
Apple M3 Max 64GB (Unified): 48 GB, 70% used
RTX 6000 Ada (Pro): 48 GB, 70% used

Just barely too small

RTX 5090 (Consumer): 32 GB, short by 1.7 GB
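
The utilisation and shortfall figures in the hardware lists follow directly from the 33.7 GB estimate. A minimal sketch of that check, using the capacities listed on this page; `fit_report` is a hypothetical helper name, not part of any calculator API:

```python
REQUIRED_GB = 33.7  # total estimate from this page (Q4_K_M, 32K context)

# Memory per device as listed on this page. Note that the 64 GB
# M3 Max is listed with 48 GB available.
DEVICES_GB = {
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,
    "RTX 6000 Ada": 48,
    "RTX 5090": 32,
}


def fit_report(required_gb: float, capacity_gb: float) -> str:
    """Return a utilisation string if the model fits, else the shortfall."""
    if required_gb <= capacity_gb:
        return f"{round(100 * required_gb / capacity_gb)}% used"
    return f"short by {required_gb - capacity_gb:.1f} GB"


for name, capacity in DEVICES_GB.items():
    print(f"{name}: {fit_report(REQUIRED_GB, capacity)}")
```

Swapping in a different `REQUIRED_GB` (for example, the estimate for a shorter context) shows how quickly a card moves from "short" to "fits".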

How this is calculated

MoE is a compute optimization, not a memory one. The 8 expert FFN blocks plus the shared attention layers give Mixtral roughly 47B total parameters, all of which must be loaded in VRAM. At Q4_K_M that's about 26 GB of weights, plus ~4.3 GB of KV cache at 32K context and ~3.1 GB of overhead, for 33.7 GB in total. The benefit is throughput - only 2 experts run per token, so inference runs at roughly the speed of a 12B dense model. If it fits, you get 70B-class quality at 12B-class generation speed.
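
The breakdown can be reproduced with a few lines of arithmetic. The sketch below is an approximation under stated assumptions, not the calculator's actual code: the 46.7B parameter count and the layer, KV-head and head-dimension values come from Mixtral's published configuration, while the 4.5 bits per weight for Q4_K_M and the 10% overhead are back-solved from the figures on this page. GB here means 10^9 bytes.

```python
def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """Memory for all weights; every expert counts, not just the active ones."""
    return total_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size: keys and values (the factor 2) per layer, head and token."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumptions (not stated on this page):
TOTAL_PARAMS = 46.7e9      # published Mixtral 8x7B total, shown here as "47B"
BITS_PER_WEIGHT = 4.5      # back-solved effective rate for Q4_K_M
OVERHEAD_FRACTION = 0.10   # back-solved from the 3.1 GB overhead figure

weights = weights_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
kv = kv_cache_gb(32_768, n_layers=32, n_kv_heads=8, head_dim=128)
overhead = OVERHEAD_FRACTION * (weights + kv)
total = weights + kv + overhead

print(f"weights  {weights:.1f} GB")
print(f"kv cache {kv:.1f} GB")
print(f"overhead {overhead:.1f} GB")
print(f"total    {total:.1f} GB")
```

Under these assumptions the components come out at about 26.3, 4.3 and 3.1 GB. The unrounded total is about 33.6 GB, which is consistent with the page's 33.7 GB being the sum of the rounded components.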

Verdict

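Mixtral 8x7B is the rare model where memory is the bottleneck and compute is cheap. At Q4_K_M with the full 32K context it needs about 34 GB, so a 40 GB or 48 GB device is the comfortable choice. A 32 GB card falls 1.7 GB short and only works with a shorter context or a Q8 KV cache. A 24 GB card cannot hold the 26.3 GB of Q4_K_M weights at all, so it needs a smaller quant as well. When it fits, it's one of the fastest high-quality local options.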
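
To see how much a Q8 KV cache and a reduced context actually save, the sketch below recomputes the KV cache for a few scenarios and adds the page's 26.3 GB weight figure. The architecture values (32 layers, 8 KV heads, head dimension 128) are taken from Mixtral's published configuration rather than from this page, and Q8 is treated as exactly 1 byte per value, which slightly understates real quantized-cache formats that also store per-block scales. Overhead is left out.

```python
WEIGHTS_GB = 26.3  # Q4_K_M weights, from this page


def kv_cache_gb(context_tokens: int, bytes_per_value: float,
                n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128) -> float:
    """KV cache in GB (10^9 bytes); the factor 2 covers keys and values."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# (label, context length in tokens, bytes per cached value)
SCENARIOS = [
    ("FP16 KV, 32K context", 32_768, 2.0),
    ("Q8 KV, 32K context", 32_768, 1.0),   # assumption: 1 byte per value
    ("Q8 KV, 16K context", 16_384, 1.0),
    ("Q8 KV, 8K context", 8_192, 1.0),
]

for label, context, bytes_per_value in SCENARIOS:
    kv = kv_cache_gb(context, bytes_per_value)
    print(f"{label}: KV {kv:.2f} GB, weights + KV {WEIGHTS_GB + kv:.1f} GB")
```

Halving the KV precision or the context halves the cache, but the cache is a small share of the total, so the weights remain the dominant term in every scenario.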

Frequently asked questions

Why does Mixtral 8x7B use so much memory if only 2 experts run?
Because all 8 experts must be in memory at all times. The router selects which 2 to activate per token, but switching is per-token, so swapping experts in and out of VRAM would be far slower than just keeping them all loaded.

How fast is Mixtral 8x7B compared to a dense 47B model?
Significantly faster - generation throughput approximates a dense 12-13B model since only 2 of 8 experts compute per token, even though total memory is the full 47B at the chosen quant.
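
Both answers come down to the difference between total and active parameters. A rough count, assuming the dimensions from Mixtral's published configuration (hidden size 4096, FFN size 14336, 32 layers, 32 attention heads with 8 KV heads, vocabulary 32000, 8 experts with 2 routed per token) and ignoring the tiny normalization weights, can be sketched as:

```python
# Assumed dimensions from Mixtral 8x7B's published configuration
# (not stated on this page).
HIDDEN = 4096
FFN = 14336
LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8
VOCAB = 32000
N_EXPERTS = 8
TOP_K = 2  # experts routed per token

head_dim = HIDDEN // N_HEADS

# Attention: Q and O are HIDDEN x HIDDEN; K and V are narrower under GQA.
attention = LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * N_KV_HEADS * head_dim)
router = LAYERS * HIDDEN * N_EXPERTS   # one gating matrix per layer
embeddings = 2 * VOCAB * HIDDEN        # input embedding + output head
shared = attention + router + embeddings

# Each expert is a gated FFN with three HIDDEN x FFN matrices.
per_expert = 3 * HIDDEN * FFN

total_params = shared + LAYERS * N_EXPERTS * per_expert   # must sit in memory
active_params = shared + LAYERS * TOP_K * per_expert      # used per token

print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B")
```

Under these assumptions the count lands at roughly 46.7B total and 12.9B active parameters, which is where the "47B in memory, 12-13B in compute" framing comes from.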