How much VRAM does Gemma 4 E2B need at Q4_K_M? Phone-class via interleaved SWA

Gemma 4 E2B at Q4_K_M with native 128K context needs about 6.2 GB of VRAM total. The 'E' stands for effective parameters, and the model is designed for edge devices: phones, laptops, and integrated graphics. The architecture is 35 layers with interleaved sliding-window attention: every 6th layer does full attention, and the rest cap their KV cache at a 512-token window. As a result, only the full-attention layers add KV cache as the context grows from 8K to 32K to 128K, so long-context memory rises far more slowly than in a model where every layer attends globally.
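The layer layout described above can be sketched in a few lines. This is an illustration only: it assumes layers are numbered from 1 and that "every 6th layer" means layers 6, 12, 18 and so on, which is one reading of the text rather than a confirmed detail of the model. The function names are hypothetical.

```python
# Sketch of an interleaved SWA layer schedule.
# ASSUMPTION: layers are 1-indexed and layers 6, 12, 18, ... are global.

NUM_LAYERS = 35    # from the text
GLOBAL_EVERY = 6   # from the text
WINDOW = 512       # sliding-window size in tokens, from the text


def layer_kinds(num_layers: int, global_every: int) -> list[str]:
    """Return 'global' or 'local' for each layer, in order."""
    return [
        "global" if layer % global_every == 0 else "local"
        for layer in range(1, num_layers + 1)
    ]


def total_cached_tokens(context: int) -> int:
    """Sum, over all layers, of the tokens whose K/V each layer keeps."""
    total = 0
    for kind in layer_kinds(NUM_LAYERS, GLOBAL_EVERY):
        # Global layers keep the whole context; local layers keep at most WINDOW.
        total += context if kind == "global" else min(context, WINDOW)
    return total


kinds = layer_kinds(NUM_LAYERS, GLOBAL_EVERY)
print(kinds.count("global"), kinds.count("local"))  # 5 30

for context in (8_192, 32_768, 131_072):
    print(context, total_cached_tokens(context))
```

Under this reading the schedule yields 5 global and 30 local layers, and the local layers contribute the same 30 x 512 cached tokens at every context length shown.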

Total VRAM required: 6.2 GB
Gemma 4 E2B at Q4_K_M: 5.1B params, 131,072-token context, batch 1, inference.

Weights: 2.9 GB (5.1B params)
KV cache: 2.7 GB (128K tokens, FP16 KV)
Overhead: 0.6 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

Sliding-window attention applied: This model caps 5 of every 6 layers at a 512-token window. KV cache estimate is 85% smaller than naive full-attention math at this context length.
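The 85% figure can be checked without knowing the head dimensions, because the per-token K/V size is the same in every layer and cancels out of the ratio. The sketch below assumes 5 global layers out of 35, as stated in the calculation section; `swa_reduction` is a hypothetical helper name.

```python
# Fraction of KV cache saved by SWA relative to caching the full context
# in every layer. Per-token K/V bytes cancel, so only token counts matter.

def swa_reduction(context: int, num_layers: int = 35, num_global: int = 5,
                  window: int = 512) -> float:
    swa_tokens = (num_global * context
                  + (num_layers - num_global) * min(context, window))
    naive_tokens = num_layers * context
    return 1 - swa_tokens / naive_tokens


print(f"{swa_reduction(131_072):.1%}")  # 85.4%
print(f"{swa_reduction(8_192):.1%}")    # 80.4%
```

The saving depends on context length: it is smaller at 8K, where the 512-token windows are a larger share of the total.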

Hardware that fits

RTX 4060 (Consumer): 8 GB, 77% used
RTX 3060 (Consumer): 12 GB, 51% used
A100 40GB (Datacenter): 40 GB, 15% used
Apple M3 Max 64GB (Unified): 48 GB, 13% used
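The utilization percentages appear to follow from dividing the requirement by each device's listed memory. The sketch below reproduces them from the unrounded component figures quoted in the calculation section (2.86, 2.75, and 0.56 GB); rounding to the nearest percent is an assumption about how the page displays them.

```python
# Reproduce the "% used" column from the listed capacities.
# ASSUMPTION: the page rounds to the nearest whole percent.

REQUIRED_GB = 2.86 + 2.75 + 0.56  # weights + KV cache + overhead

HARDWARE_GB = {
    "RTX 4060": 8,
    "RTX 3060": 12,
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,  # capacity as listed on the page
}


def percent_used(required_gb: float, capacity_gb: float) -> int:
    return round(100 * required_gb / capacity_gb)


usage = {name: percent_used(REQUIRED_GB, gb) for name, gb in HARDWARE_GB.items()}
for name, pct in usage.items():
    print(f"{name}: {pct}% used")
```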

How this is calculated

The total is the sum of three parts: 2.86 GB of weights at Q4_K_M (the 5.1B param count includes embedding + LM head, both at full precision), 2.75 GB of KV cache thanks to SWA (only ~5 of the 35 layers cache the full 128K; the other 30 are capped at 512 tokens), and ~0.56 GB of overhead. SWA is the structural feature that makes E2B usable at long context on tiny hardware: with all 35 layers caching the full 128K, the KV cache would be roughly 7x larger, consistent with the 85% reduction quoted above.
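As a rough reconstruction of this arithmetic, the sketch below lands close to the page's figures. Three inputs are assumptions chosen because they reproduce the quoted numbers, not values stated by the page: about 4.5 bits per weight applied uniformly to all parameters, a combined K/V width (KV heads times head dimension) of 1024, and overhead equal to 10% of weights plus KV cache. GB here means 10^9 bytes.

```python
# Rough VRAM estimate for Gemma 4 E2B at Q4_K_M with 128K context.
# ASSUMPTIONS (inferred to match the page's figures, not confirmed):
BITS_PER_WEIGHT = 4.5      # average bits per parameter, applied uniformly
KV_DIM = 1024              # kv_heads * head_dim
OVERHEAD_FRACTION = 0.10   # overhead as a share of weights + KV cache
GB = 1e9


def weights_bytes(params: float) -> float:
    return params * BITS_PER_WEIGHT / 8


def kv_cache_bytes(context: int, num_layers: int = 35, num_global: int = 5,
                   window: int = 512, bytes_per_elem: int = 2) -> int:
    # K and V each store KV_DIM elements per token per layer (FP16 = 2 bytes).
    per_token_per_layer = 2 * KV_DIM * bytes_per_elem
    global_tokens = num_global * context
    local_tokens = (num_layers - num_global) * min(context, window)
    return (global_tokens + local_tokens) * per_token_per_layer


weights = weights_bytes(5.1e9)
kv = kv_cache_bytes(131_072)
overhead = OVERHEAD_FRACTION * (weights + kv)
total = weights + kv + overhead

print(f"weights  {weights / GB:.2f} GB")
print(f"kv cache {kv / GB:.2f} GB")
print(f"overhead {overhead / GB:.2f} GB")
print(f"total    {total / GB:.2f} GB")
```

The weights term comes out about 0.01 GB above the page's 2.86 GB, which suggests the calculator uses a slightly different parameter count or bit width; real frameworks will deviate further, as the accuracy note says.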

Verdict

Gemma 4 E2B at Q4_K_M is the canonical 'runs on anything' configuration for 2026. It fits on phones with 8 GB+ unified memory via MLC or Llamafile, laptops with integrated graphics, a Raspberry Pi 5 with enough RAM, and the Steam Deck. The interleaved SWA design is what lets the small footprint hold up at long context, where most models would be bottlenecked by the KV cache.

Frequently asked questions

How does interleaved SWA save memory?
Most layers (5 of every 6 in Gemma 4) attend only to the most recent 512 tokens, so their KV cache is capped at 512 tokens regardless of total context. Only the global layers (every 6th) cache the full context. As a result, long-context KV cost grows roughly 6x to 7x more slowly than in a fully global model.

What's the difference between E2B and E4B?
E4B is roughly 8B params with 42 layers and 8 KV heads. It needs about 2x the memory of E2B and has noticeably stronger reasoning. Both use the same SWA scheme. Pick E2B for phones, E4B for laptops with 8+ GB of GPU memory.

Why does the param count say 5.1B for a '2B' model?
The 'E2B' / 'E4B' names count effective dense parameters; the actual file holds ~5.1B parameters because of larger embedding tables and tied weights. The calculator uses the actual param count so that the weights are sized correctly.
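To see why the actual count matters for the last question, compare the weight size implied by the nominal 2B with the one implied by 5.1B. The 4.5 bits per weight used here is an assumed average for Q4_K_M that roughly matches the page's 2.86 GB figure; it is not stated by the page.

```python
# Weight size from nominal vs actual parameter count.
# ASSUMPTION: ~4.5 bits per weight on average; GB = 1e9 bytes.
BITS_PER_WEIGHT = 4.5


def weights_gb(params: float) -> float:
    return params * BITS_PER_WEIGHT / 8 / 1e9


nominal_gb = weights_gb(2.0e9)  # sizing by the "2B" in the name
actual_gb = weights_gb(5.1e9)   # sizing by the parameters in the file

print(f"nominal {nominal_gb:.2f} GB, actual {actual_gb:.2f} GB")
```

Sizing by the name alone would underestimate the weights by more than half.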