How much VRAM does Mistral 7B need at Q4_K_M? Lightweight local LLM

Mistral 7B at Q4_K_M needs about 9.0 GB of VRAM at its native 32K context. The 9 GB footprint fits cleanly on common 12 GB or 16 GB GPUs.

Total VRAM required
9.0 GB
Mistral 7B at Q4_K_M
Weights
3.9 GB
7B params
KV cache
4.3 GB
32K tokens, FP16 KV

Calculator

Estimated VRAM required

9.0 GB

7B params at Q4_K_M, 32,768 token context, batch 1, inference.

Weights
3.9 GB
KV cache
4.3 GB
Overhead
0.8 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Hardware that fits

RTX 3060
Consumer
12 GB
75% used
RTX 4060 Ti 16GB
Consumer
16 GB
56% used
A100 40GB
Datacenter
40 GB
23% used
Apple M3 Max 64GB
Unified
48 GB
19% used

Just barely too small

RTX 4060
Consumer
8 GB
short by 1.0 GB

How this is calculated

7B at Q4_K_M is about 3.9 GB of weights, 4.3 GB of KV cache, and 0.8 GB of overhead, totaling 9.0 GB.

Verdict

Mistral 7B Q4_K_M is the canonical 'small but useful' local LLM configuration. It's been overtaken on most benchmarks by Llama 3.1 8B and Qwen 2.5 7B, but it's still a fine baseline and it fits anywhere.

More Mistral scenarios

Frequently asked questions

Is Mistral 7B still worth running in 2026?
Llama 3.1 8B and Qwen 2.5 7B generally outperform it on benchmarks, but Mistral 7B is well-supported and remains a solid baseline for fine-tuning experiments and lightweight deployments.
What's the smallest GPU that runs Mistral 7B?
A 12 GB or 16 GB GPU handles native 32K context with substantial buffer. For 6 GB or 8 GB cards, drop to 8K context which lowers the KV cache and brings total VRAM down to 5.5 GB.