How much VRAM does Mixtral 8x7B need at Q4_K_M? MoE memory math
Mixtral 8x7B at Q4_K_M with 32K context needs about 34 GB of VRAM. Mixture-of-experts is the wrinkle here. Only 2 of the 8 experts in each layer are activated per token, but the router can pick a different pair for the next token, so all 8 must stay in memory at all times. From a memory standpoint that makes Mixtral 8x7B effectively a 47B-parameter model, even though it computes like a 12-13B one (about 12.9B active parameters per token).
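To see where the 47B total and ~13B active figures come from, the sketch below counts parameters from Mixtral 8x7B's shape. The shape values (32 layers, hidden size 4096, expert FFN size 14336, 32 query heads and 8 key/value heads of dimension 128, a 32,000-token vocabulary, 8 experts with top-2 routing) are assumptions taken from the public model config, so check them against the config of the checkpoint you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # Shape values assumed from Mixtral 8x7B's public config; verify against your checkpoint.
    layers: int = 32
    hidden: int = 4096
    ffn: int = 14336          # intermediate size of each expert's SwiGLU FFN
    q_heads: int = 32
    kv_heads: int = 8         # grouped-query attention
    head_dim: int = 128
    vocab: int = 32000
    experts: int = 8
    active_experts: int = 2   # top-2 routing


def param_counts(c: MoEConfig) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts, ignoring biases."""
    attn = c.hidden * c.q_heads * c.head_dim * 2          # q_proj + o_proj
    attn += c.hidden * c.kv_heads * c.head_dim * 2        # k_proj + v_proj
    expert = 3 * c.hidden * c.ffn                         # gate, up and down projections
    router = c.hidden * c.experts                         # one routing logit per expert
    norms = 2 * c.hidden                                  # two RMSNorms per layer
    shared_per_layer = attn + router + norms
    embed = 2 * c.vocab * c.hidden + c.hidden             # input embedding, untied LM head, final norm

    total = c.layers * (shared_per_layer + c.experts * expert) + embed
    active = c.layers * (shared_per_layer + c.active_experts * expert) + embed
    return total, active


total, active = param_counts(MoEConfig())
print(f"total:  {total / 1e9:.1f}B")   # every expert: what has to sit in VRAM
print(f"active: {active / 1e9:.1f}B")  # 2 experts per layer: what each token computes with
```

With these assumed shapes it prints 46.7B total and 12.9B active: the experts account for almost all of the stored weights, but each token only runs through a quarter of them.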
Calculator
Estimated VRAM required: 33.7 GB
Inputs: 47B parameters at Q4_K_M, 32,768-token context, batch size 1, inference.
Estimate accuracy: weights within ~2%; KV cache within ~5% for standard GQA models and ~10% for MLA models (DeepSeek). Actual VRAM use varies with the framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
Custom architecture: sliding-window attention (SWA) is not applied. If you're modeling Gemma 3/4 or Mistral Nemo, pick the preset for an accurate KV cache.
How this is calculated
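MoE is a compute optimization, not a memory one. Each of Mixtral's 32 layers has 8 expert FFN blocks; together with the shared attention layers and embeddings, that adds up to roughly 47B parameters, all of which must be loaded in VRAM. At Q4_K_M that's about 26 GB of weights, plus ~4.3 GB of KV cache at 32K context; the remaining ~3 GB of the estimate covers runtime overhead. The payoff is throughput: each token only uses about 13B of those parameters, so generation runs at roughly the speed of a 12-13B dense model, and you get roughly Llama 2 70B-class quality at 12B-class generation speed if it fits.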
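The two main terms follow from the usual formulas: weights = total parameters × bits per weight / 8, and KV cache = 2 (K and V) × layers × KV heads × head dimension × bytes per element × tokens. The sketch below assumes Mixtral's public shapes (32 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and about 4.5 effective bits per weight for Q4_K_M. That last figure is an assumption chosen to land near the ~26 GB quoted above, since Q4_K_M mixes quantization types across tensors and the exact average differs between files.

```python
GB = 1e9  # decimal gigabytes, as used in the figures above


def weight_bytes(params: float, bits_per_weight: float) -> float:
    # Every expert is stored, so this takes the TOTAL parameter count, not the active one.
    return params * bits_per_weight / 8


def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    # Factor 2 for K and V; one entry per layer, KV head and cached token. fp16 = 2 bytes.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumption: ~4.5 effective bits per weight for Q4_K_M.
weights = weight_bytes(46.7e9, 4.5)
kv = kv_cache_bytes(32_768)
print(f"weights  {weights / GB:.1f} GB")
print(f"KV cache {kv / GB:.1f} GB")
print(f"subtotal {(weights + kv) / GB:.1f} GB before runtime overhead")
```

It reports about 26.3 GB of weights and 4.3 GB of KV cache (128 KiB per token with GQA). The difference between that ~30.6 GB subtotal and the 33.7 GB estimate is the overhead allowance, which this sketch does not model.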
Verdict
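Mixtral 8x7B is the rare model where memory is the bottleneck and compute is cheap. At the full 32K context it slightly overflows a 32 GB card; a 32 GB card works with a Q8 KV cache or a reduced context. A 24 GB card cannot hold the ~26 GB of Q4_K_M weights at all, so it needs a lower-bit quant or partial CPU offload. When it fits, it's one of the fastest high-quality local options.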
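The sketch below checks a few card, cache and context combinations. All numbers are hypothetical round assumptions rather than measurements: ~4.5 bits per weight for Q4_K_M, a flat 3 GB overhead allowance (roughly what the 33.7 GB estimate leaves beyond weights and KV cache), and a q8_0 KV cache sized by llama.cpp's block layout of 32 int8 values plus one fp16 scale.

```python
GB = 1e9  # decimal gigabytes

# Hypothetical assumptions, not measurements:
WEIGHTS_GB = 46.7e9 * 4.5 / 8 / GB     # ~4.5 bits per weight for Q4_K_M
OVERHEAD_GB = 3.0                      # flat allowance for runtime overhead
KV_ELEMS_PER_TOKEN = 2 * 32 * 8 * 128  # K+V x 32 layers x 8 KV heads x head dim 128

FP16 = 2.0      # bytes per cached element
Q8_0 = 34 / 32  # q8_0 block: 32 int8 values + one 2-byte fp16 scale


def total_gb(tokens: int, kv_bytes_per_elem: float) -> float:
    kv_gb = KV_ELEMS_PER_TOKEN * kv_bytes_per_elem * tokens / GB
    return WEIGHTS_GB + kv_gb + OVERHEAD_GB


def fits(card_gb: float, tokens: int, kv_bytes_per_elem: float) -> bool:
    return total_gb(tokens, kv_bytes_per_elem) <= card_gb


scenarios = [
    ("32 GB card, fp16 KV, 32K context", 32, 32_768, FP16),
    ("32 GB card, q8_0 KV, 32K context", 32, 32_768, Q8_0),
    ("32 GB card, fp16 KV, 16K context", 32, 16_384, FP16),
    ("24 GB card, q8_0 KV, 8K context", 24, 8_192, Q8_0),
]
for name, card_gb, tokens, bpe in scenarios:
    verdict = "fits" if fits(card_gb, tokens, bpe) else "does not fit"
    print(f"{name}: needs ~{total_gb(tokens, bpe):.1f} GB, {verdict}")
```

Under these assumptions the full-context fp16 case needs about 33.6 GB and overflows, the q8_0 and 16K variants land around 31.5 GB with very little headroom, and the 24 GB card fails on the weights alone (~26.3 GB) before any KV cache is added.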
Frequently asked questions
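Why does Mixtral 8x7B use so much memory if only 2 experts run?
The router chooses 2 of the 8 experts in each layer separately for every token, so any expert may be needed at the next step. All ~47B parameters therefore have to stay resident; only the compute per token scales with the ~13B active ones.
How fast is Mixtral 8x7B compared to a dense 47B model?
Much faster: each token touches only about 12.9B of the 46.7B parameters, so generation runs at roughly the speed of a 12-13B dense model rather than a 47B one.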
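For a rough number on the speed question, a common simplification is that single-stream decoding is limited by memory bandwidth, so time per token tracks how many weight bytes are read. Under that assumption, and ignoring routing overhead, attention over the KV cache and less efficient MoE kernels (all of which shrink the real gain), the speedup over a dense model of the same total size is simply total ÷ active parameters. The function name below is illustrative only.

```python
def bandwidth_bound_speedup(total_params: float, active_params: float) -> float:
    """Idealized decode speedup of an MoE model over a dense model of the same
    total size, assuming time per token is proportional to weight bytes read
    and that all weights use the same bits per weight."""
    return total_params / active_params


# Approximate Mixtral 8x7B figures: 46.7B stored, ~12.9B read per token.
speedup = bandwidth_bound_speedup(46.7e9, 12.9e9)
print(f"idealized speedup vs a dense 47B model: ~{speedup:.1f}x")
```

This gives an upper bound of about 3.6x; measured speedups will typically be smaller.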