How much VRAM does Gemma 4 E2B need at Q4_K_M? Phone-class via interleaved SWA
Gemma 4 E2B at Q4_K_M with native 128K context needs about 6.2 GB of VRAM total. The 'E' stands for Edge - the model is designed for phones, laptops, and integrated graphics. The architecture is 35 layers with interleaved sliding-window attention (SWA): every 6th layer does full attention, and the rest cap their KV cache at a 512-token window. Only the full-attention layers grow their cache with context, so the KV cache grows far more slowly between 8K, 32K, and 128K than it would in a full-attention model.
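The interleaving pattern can be sketched in a few lines. This is a minimal illustration, assuming layers are numbered from 1 and that "every 6th layer" means layers 6, 12, 18, and so on; the exact placement of the full-attention layers in the real model is not stated here, and `layer_types` is a hypothetical helper name.

```python
NUM_LAYERS = 35   # total layers, from the text
FULL_EVERY = 6    # every 6th layer does full attention (placement is an assumption)

def layer_types(num_layers: int = NUM_LAYERS, full_every: int = FULL_EVERY) -> list[str]:
    """Label each layer (1-indexed) as 'full' or 'sliding'."""
    return [
        "full" if layer % full_every == 0 else "sliding"
        for layer in range(1, num_layers + 1)
    ]

types = layer_types()
full_layers = [i for i, t in enumerate(types, start=1) if t == "full"]

print(full_layers)                                   # [6, 12, 18, 24, 30]
print(types.count("full"), types.count("sliding"))   # 5 30
```

Under this assumption, 5 layers keep the whole context in cache and 30 keep at most 512 tokens each.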
Calculator
Estimated VRAM required: 6.2 GB
5.1B params at Q4_K_M, 131,072 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
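As a rough illustration of what these tolerances mean in gigabytes, the sketch below applies them to the component figures from this page's breakdown (2.86 GB weights, 2.75 GB KV cache, 0.56 GB overhead). Treating the overhead as fixed is an assumption, since no tolerance is stated for it, and `vram_band` is a hypothetical helper name.

```python
def vram_band(weights_gb: float, kv_gb: float, overhead_gb: float,
              weights_tol: float = 0.02, kv_tol: float = 0.05) -> tuple[float, float]:
    """Return (low, high) VRAM estimates in GB for the stated tolerances."""
    low = weights_gb * (1 - weights_tol) + kv_gb * (1 - kv_tol) + overhead_gb
    high = weights_gb * (1 + weights_tol) + kv_gb * (1 + kv_tol) + overhead_gb
    return low, high

# Component figures taken from the page; overhead held fixed (assumption).
low, high = vram_band(2.86, 2.75, 0.56)
print(f"{low:.2f} GB to {high:.2f} GB")
```

Framework and driver differences come on top of this band, so it is best read as a lower bound on the uncertainty.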
Sliding-window attention applied: This model caps 5 of every 6 layers at a 512-token window. KV cache estimate is 85% smaller than naive full-attention math at this context length.
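The reduction figure can be checked with a back-of-the-envelope count of cached token slots per layer. This is a minimal sketch, assuming every layer stores the same number of bytes per cached token, so the ratio depends only on layer counts and context length; `kv_reduction` is an illustrative name, not the calculator's actual code.

```python
NUM_LAYERS = 35    # total layers, from the text
FULL_LAYERS = 5    # layers that cache the whole context
WINDOW = 512       # token cap for sliding-window layers

def kv_reduction(context_len: int) -> float:
    """Fraction of KV cache saved versus caching the full context in every layer."""
    swa_layers = NUM_LAYERS - FULL_LAYERS
    with_swa = FULL_LAYERS * context_len + swa_layers * min(context_len, WINDOW)
    naive = NUM_LAYERS * context_len
    return 1 - with_swa / naive

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_reduction(ctx):.1%} smaller than full attention")
```

At 131,072 tokens this comes out at roughly 85%, in line with the figure above, and the saving is somewhat smaller at shorter contexts because the 512-token windows make up a larger share of the total.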
How this is calculated
2.86 GB of weights at Q4_K_M (the 5.1B param count includes embedding + LM head, both at full precision), 2.75 GB of KV cache thanks to SWA (only 5 of the 35 layers cache the full 128K; the other 30 are capped at 512 tokens), and ~0.56 GB of overhead, for about 6.2 GB in total. SWA is the structural feature that makes E2B usable at long context on tiny hardware - an otherwise identical full-attention architecture would need roughly 6 to 7 times as much KV cache memory at 128K, consistent with the 85% reduction reported above.
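The breakdown can be sanity-checked with simple arithmetic. The sketch below sums the three stated components and backs out two implied quantities: average bits per weight and bytes per cached token per layer. Decimal gigabytes (1 GB = 10^9 bytes) are an assumption, and the derived figures are inferences from this page's numbers, not published model specifications.

```python
GB = 1e9                 # assumption: decimal gigabytes
PARAMS = 5.1e9           # parameter count, from the text
WEIGHTS_GB = 2.86        # stated weight footprint
KV_GB = 2.75             # stated KV cache footprint
OVERHEAD_GB = 0.56       # stated overhead
CONTEXT = 131_072        # 128K tokens
FULL_LAYERS, SWA_LAYERS, WINDOW = 5, 30, 512

total_gb = WEIGHTS_GB + KV_GB + OVERHEAD_GB

# Average bits per weight implied by the stated weight footprint.
implied_bits_per_weight = WEIGHTS_GB * GB * 8 / PARAMS

# Cached token slots across all layers, then implied bytes per slot.
slots = FULL_LAYERS * CONTEXT + SWA_LAYERS * WINDOW
implied_bytes_per_slot = KV_GB * GB / slots

print(f"total: {total_gb:.2f} GB")
print(f"implied bits per weight: {implied_bits_per_weight:.2f}")
print(f"implied bytes per cached token per layer: {implied_bytes_per_slot:.0f}")
```

Under these assumptions the total rounds to 6.2 GB, with about 4.5 bits per weight on average and about 4.1 KB per cached token per layer.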
Verdict
Gemma 4 E2B at Q4_K_M is the canonical 'runs on anything' configuration for 2026. It fits on phones with 8 GB+ unified memory via MLC or Llamafile, integrated-graphics laptops, a Raspberry Pi 5 with extra RAM, and the Steam Deck. The interleaved SWA design is what lets the small footprint hold up at long context, where most models would be bottlenecked by the KV cache.