How much VRAM does gpt-oss 20B need at Q4_K_M? OpenAI's first open-weights model since GPT-2

gpt-oss 20B at Q4_K_M with its native 128K context needs about 19.4 GB of VRAM with all experts resident, dropping to roughly 9.3 GB with active-only weight loading. It is a mixture-of-experts (MoE) model with 3.6B active parameters per token, 24 layers, an unusual hidden size of 2880, and head_dim 64. The 20B variant of OpenAI's first Apache 2.0 release is the smallest model in the family and is intentionally sized for laptops and consumer GPUs.

Total VRAM required: 19.4 GB
gpt-oss 20B (MoE) at Q4_K_M

Weights: 11.2 GB (20B params)
KV cache: 6.4 GB (128K tokens, FP16 KV)

Calculator

Estimated VRAM required: 19.4 GB

20B params at Q4_K_M, 131,072 token context, batch 1, inference.

Weights: 11.2 GB
KV cache: 6.4 GB
Overhead: 1.8 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
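
The 6.4 GB KV cache figure can be reproduced from the architecture numbers with a short sketch. The layer count, head_dim, context length, and FP16 cache come from this page; the count of 8 KV heads (grouped-query attention) is an assumption not stated here, and the sketch also assumes every layer caches the full context, which ignores any sliding-window layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Size of a dense KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


kv = kv_cache_bytes(
    n_layers=24,       # from the page
    n_kv_heads=8,      # ASSUMPTION: GQA with 8 KV heads, not stated on the page
    head_dim=64,       # from the page
    n_tokens=131_072,  # 128K context
    bytes_per_elem=2,  # FP16 KV
)
print(f"KV cache: {kv / 1e9:.1f} GB")  # decimal GB, as used on this page
```

Under those assumptions the cache costs 49,152 bytes per token, so halving the context length halves the KV cache.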

Hardware that fits

RTX 3090 (Consumer): 24 GB, 81% used
RTX 5090 (Consumer): 32 GB, 61% used
A100 40GB (Datacenter): 40 GB, 49% used
Apple M3 Max 64GB (Unified): 48 GB, 40% used
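
The utilization figures are just the required VRAM divided by each device's memory. A minimal sketch, using the capacities listed above and a hypothetical helper named `utilization`; results can differ from the listed percentages by a point depending on how the total is rounded.

```python
REQUIRED_GB = 19.4  # all-experts-resident estimate from this page

# Capacities as listed above (the M3 Max entry uses the 48 GB shown, not 64 GB).
DEVICES_GB = {
    "RTX 3090": 24,
    "RTX 5090": 32,
    "A100 40GB": 40,
    "Apple M3 Max 64GB": 48,
}


def utilization(required_gb, capacity_gb):
    """Fraction of device memory the model would occupy."""
    return required_gb / capacity_gb


for name, capacity in DEVICES_GB.items():
    used = utilization(REQUIRED_GB, capacity)
    status = "fits" if used <= 1.0 else "does not fit"
    print(f"{name}: {used:.1%} of {capacity} GB ({status})")
```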

How this is calculated

The total is 11.2 GB of weights at Q4_K_M (20B params * 0.56 bytes per param), plus a 6.4 GB KV cache at the full 128K context window and ~1.8 GB of overhead: 11.2 + 6.4 + 1.8 = 19.4 GB. Active-only loading shrinks the weights to 3.6B * 0.56 = 2.0 GB while keeping the same KV cache, which gives 8.4 GB before overhead and ~9.3 GB in total. That fits on a 12 GB GPU or a unified memory device.
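
The arithmetic can be written as a small sketch. The 0.56 bytes per parameter and the 6.4 GB KV cache come from this page; treating overhead as 10% of weights plus KV cache is an assumption inferred from the published totals (it reproduces both 19.4 GB and 9.3 GB), not a documented rule, and `estimate_vram_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM_Q4_K_M = 0.56  # from the page: 3.6B * 0.56 = 2.0 GB
OVERHEAD_FRACTION = 0.10       # ASSUMPTION: inferred from the page's totals


def estimate_vram_gb(params_billion, kv_cache_gb,
                     bytes_per_param=BYTES_PER_PARAM_Q4_K_M,
                     overhead_fraction=OVERHEAD_FRACTION):
    """Return (weights, overhead, total) in decimal GB."""
    weights_gb = params_billion * bytes_per_param
    overhead_gb = (weights_gb + kv_cache_gb) * overhead_fraction
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    return weights_gb, overhead_gb, total_gb


all_resident = estimate_vram_gb(params_billion=20.0, kv_cache_gb=6.4)
active_only = estimate_vram_gb(params_billion=3.6, kv_cache_gb=6.4)

for label, (weights, overhead, total) in [("all experts resident", all_resident),
                                          ("active-only", active_only)]:
    print(f"{label}: weights {weights:.1f} GB, "
          f"overhead {overhead:.1f} GB, total {total:.1f} GB")
```

Because the KV cache is the same in both cases, it dominates the active-only total; shortening the context is the most effective way to shrink that configuration further.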

Verdict

gpt-oss 20B at Q4_K_M is the model to download in 2026 if you've never run a local LLM before. It is Apache 2.0 licensed, runs on most consumer hardware (a 24 GB GPU with all experts resident, or a 12 GB GPU with active-only loading), and offers decent reasoning for its size along with OpenAI lineage. It is not as strong as Qwen3 30B-A3B at a comparable active-parameter footprint, but the brand recognition matters and the licensing is the cleanest in the open-weights ecosystem.

Frequently asked questions

What does '3.6B active' actually mean for inference speed?
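Per token, the model only computes through about 3.6B parameters' worth of weights, even though all 20B must be reachable because the router can select different experts for each token. With active-only loading and CPU offload, any selected expert that is not already on the GPU has to be fetched over PCIe, which adds latency to every token. Typical numbers are 15-30 tokens/sec on consumer hardware versus 60+ with all experts resident.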
Should I pick gpt-oss 20B or Qwen3 30B-A3B?
Qwen3 30B-A3B is generally stronger at a similar active footprint (3B active vs 3.6B). gpt-oss has cleaner licensing and OpenAI lineage. Both fit on the same hardware, so try whichever fits your stack first.