How much VRAM does Gemma 2 27B need at Q4_K_M? Single GPU performance

Gemma 2 27B at Q4_K_M with 8K context needs about 19.2 GB of VRAM. That fits on any 24 GB card at roughly 80% utilization, leaving about 4.8 GB of headroom for larger batch sizes. Note that 8K (8,192 tokens) is Gemma 2's full native context window, so the spare memory goes to batching rather than to longer contexts. Gemma 2 punches above its weight on benchmarks, often matching 70B-class models on instruction following and chat quality.

Total VRAM required: 19.2 GB
Gemma 2 27B at Q4_K_M: 27B params, 8,192-token context (8K), FP16 KV cache, batch 1, inference.

Weights: 15.1 GB
KV cache: 2.3 GB
Overhead: 1.7 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
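
As a rough illustration of how those tolerances translate into a range, the sketch below applies the ~2% and ~5% bands to the weights and KV cache figures from this page. Treating the 1.7 GB overhead as fixed is an assumption of this sketch, and the function name is hypothetical. The rounded components sum to 19.1 GB, slightly under the 19.2 GB headline figure, presumably because the displayed values are rounded.

```python
def vram_range_gb(weights_gb, kv_gb, overhead_gb, weights_tol=0.02, kv_tol=0.05):
    """Return (low, nominal, high) VRAM estimates in GB.

    Assumption: overhead is treated as a fixed value with no tolerance band.
    """
    nominal = weights_gb + kv_gb + overhead_gb
    spread = weights_gb * weights_tol + kv_gb * kv_tol
    return nominal - spread, nominal, nominal + spread


# Figures from this page: 15.1 GB weights, 2.3 GB KV cache, 1.7 GB overhead.
low, nominal, high = vram_range_gb(15.1, 2.3, 1.7)
print(f"{low:.1f} to {high:.1f} GB (nominal {nominal:.1f} GB)")
```

With these inputs the band works out to roughly 18.7 to 19.5 GB, before any framework or driver differences.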

Sliding-window attention applied: This model caps 1 of every 2 layers at a 4096-token window. KV cache estimate is 25% smaller than naive full-attention math at this context length.
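
The 25% figure can be reproduced with a short sketch. The 46 layers and the 4096-token window come from this page; the 16 KV heads and head dimension of 128 are assumptions taken from Gemma 2 27B's published configuration, not from this page. Which layers slide (even-indexed here) is also an assumption; it does not change the total.

```python
def kv_cache_bytes(context_tokens, n_layers=46, n_kv_heads=16, head_dim=128,
                   bytes_per_value=2, sliding_window=4096, sliding_every=2):
    """Estimate KV cache size in bytes for batch 1.

    n_kv_heads and head_dim are assumed values (see lead-in).
    Pass sliding_window=None to get the naive full-attention figure.
    """
    # One K and one V vector per KV head, per token, per layer (FP16 = 2 bytes).
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    total = 0
    for layer in range(n_layers):
        is_sliding = sliding_window is not None and layer % sliding_every == 0
        cached_tokens = min(context_tokens, sliding_window) if is_sliding else context_tokens
        total += cached_tokens * per_token_per_layer
    return total


swa = kv_cache_bytes(8192)
naive = kv_cache_bytes(8192, sliding_window=None)
print(f"with sliding window: {swa / 1e9:.2f} GB")
print(f"naive full attention: {naive / 1e9:.2f} GB")
print(f"reduction: {1 - swa / naive:.0%}")
```

Under these assumptions the sliding-window estimate lands near the 2.3 GB shown above, and the saving disappears at contexts of 4096 tokens or fewer because no layer is truncated.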

Hardware that fits

RTX 3090 (Consumer): 24 GB, 80% used
RTX 5090 (Consumer): 32 GB, 60% used
A100 40GB (Datacenter): 40 GB, 48% used
Apple M3 Max 64GB (Unified): 48 GB, 40% used
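
The utilization percentages are simply the 19.2 GB estimate divided by each device's listed memory. A minimal sketch that reproduces them and adds the free headroom; the capacities are copied from the list above as-is (including the 48 GB figure shown for the 64 GB M3 Max), and the function name is hypothetical.

```python
REQUIRED_GB = 19.2  # total estimate from this page

# Capacities in GB, as listed on this page.
DEVICES = {
    "RTX 3090": 24,
    "RTX 5090": 32,
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,
}


def fit_report(required_gb, devices):
    """Return (name, percent_used, free_gb, fits) for each device."""
    rows = []
    for name, capacity_gb in devices.items():
        percent_used = round(required_gb / capacity_gb * 100)
        free_gb = round(capacity_gb - required_gb, 1)
        rows.append((name, percent_used, free_gb, required_gb <= capacity_gb))
    return rows


report = fit_report(REQUIRED_GB, DEVICES)
for name, percent_used, free_gb, fits in report:
    status = "fits" if fits else "does not fit"
    print(f"{name}: {percent_used}% used, {free_gb} GB free ({status})")
```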

How this is calculated

Gemma 2 27B has 46 layers and a hidden size of 4608, a slightly different aspect ratio from Llama-style models. At 8K context you're looking at about 15.1 GB of weights and 2.3 GB of KV cache, with roughly 1.7 GB of overhead on top. The KV figure already reflects the architecture's sliding-window attention: half the layers cache at most 4096 tokens, which saves memory once the context exceeds that window, provided your inference engine supports it.
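
The weights figure follows from parameter count times average bits per weight. A minimal sketch, using decimal gigabytes: the bits-per-weight value computed here is simply what this page's 15.1 GB implies for 27B parameters, not an official Q4_K_M specification, and real quantized files mix quantization types across tensors. Both function names are hypothetical.

```python
def weights_gb(n_params, bits_per_weight):
    """Weight memory in decimal GB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9


def implied_bits_per_weight(size_gb, n_params):
    """Average bits per weight implied by a given weight size."""
    return size_gb * 1e9 * 8 / n_params


# Figures from this page: 27B params, 15.1 GB of weights.
bpw = implied_bits_per_weight(15.1, 27e9)
print(f"implied bits per weight: {bpw:.2f}")
print(f"weights at that rate: {weights_gb(27e9, bpw):.1f} GB")
```

The same function can be reused to compare other quantization levels by passing a different bits-per-weight value.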

Verdict

Gemma 2 27B is the dark horse of local inference. The 19.2 GB footprint at Q4_K_M leaves about 4.8 GB free on a 24 GB card for batched serving, and the model's quality is strong for its size class. A genuinely good pick for a 24 GB GPU.

Frequently asked questions

Is Gemma 2 27B better than Qwen 2.5 32B?
They trade benchmarks. Gemma 2 27B is stronger on instruction following and chat quality, while Qwen 2.5 32B is generally better on code and math. Both fit on a 24 GB GPU at Q4_K_M.
Does Gemma 2 use sliding-window attention?
Yes. Gemma 2 alternates sliding-window and full-attention layers. Inference engines that exploit this can reduce the KV cache significantly at long contexts; llama.cpp and vLLM both support it.