LLM VRAM Calculator

Estimate the GPU memory needed to run or fine-tune a large language model. Pick a preset or enter your own values.

Inputs: Model preset, Parameters (billions), Quantization, Context length, SWA window, SWA global every, Layers, Hidden size, KV heads (GQA), Head dim, Batch size, KV cache dtype, Weight loading, Mode

Estimated VRAM required: 52.8 GB

9B params at Q4_K_M, 262,144 token context, batch 1, inference.

Weights: 5.0 GB
KV cache: 42.9 GB
Overhead: 4.8 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Hardware that fits

A100 80GB (Datacenter): 80 GB, 66% used
H100 80GB (Datacenter): 80 GB, 66% used
Apple M3 Ultra 128GB (Unified): 96 GB, 55% used

Just barely too small

Apple M3 Max 64GB (Unified): 48 GB, short by 4.8 GB
RTX 6000 Ada (Pro): 48 GB, short by 4.8 GB

How to use this tool

Pick a model
Select a model from the dropdown (Llama 3.1, Qwen 2.5, Mistral, Mixtral, Gemma, DeepSeek, Phi-3, and more). The calculator loads that model's parameter count, layer count, and hidden size automatically.
Choose a quantization format
Pick a quantization level: FP16 (unquantized half precision, 2 bytes/param), Q8_0 (~1 byte/param), Q5_K_M, or Q4_K_M (~0.56 bytes/param). Lower precision shrinks weight memory at a small quality cost.
Set context length and batch size
Enter the context length in tokens (8K, 32K, 128K, etc.) and the batch size. The KV cache grows linearly with both, so at long contexts it can dominate total VRAM, especially when the weights are heavily quantized.
Read the VRAM estimate
The calculator returns total VRAM with a breakdown of weights, KV cache, and runtime overhead. It also lists which GPUs and unified-memory machines have enough memory to run that configuration.

About this tool

The LLM VRAM Calculator estimates how much GPU memory you need to run or fine-tune a large language model. Pick a popular model from the dropdown (Llama 3.1, Qwen 2.5, Mistral, Mixtral, Gemma, DeepSeek, Phi-3, and more), choose a quantization format, set the context length and batch size, and the calculator returns total VRAM along with a breakdown of weights, KV cache, and runtime overhead. It also lists which GPUs and unified-memory machines have enough memory.

Use it before downloading a model to confirm it fits, when shopping for hardware to run a specific size class, or when deciding whether quantization (Q4_K_M, Q5_K_M, Q8_0) buys you enough headroom to run a tier larger. The estimate follows the same weight and KV cache accounting that llama.cpp and vLLM rely on, so the numbers should land close to what you see when you load the model, subject to the accuracy limits noted above.

Formula

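Total VRAM = weights + KV cache + overhead (activations and runtime buffers). Weights = params × bytes_per_param (FP16 = 2, Q8_0 ≈ 1, Q4_K_M ≈ 0.56). KV cache = 2 × layers × KV_heads × head_dim × context × batch × KV_dtype_bytes, where the leading 2 covers keys and values; for models without GQA, KV_heads × head_dim equals hidden_size. Overhead is roughly 10% of weights + KV cache for inference. For training, that overhead is doubled and 4x the weight memory is added to cover Adam optimizer states.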
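
As a minimal sketch of the formula above, the Python below computes the inference estimate. The function name `estimate_inference_vram` and its parameters are hypothetical, not the calculator's actual code. It assumes decimal gigabytes (1 GB = 10^9 bytes) and uses KV heads × head dim as the per-layer KV width, which equals the hidden size only for models without grouped-query attention.

```python
GB = 1e9  # assumption: decimal gigabytes

# Approximate bytes per parameter, taken from the formula text.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q4_K_M": 0.56}


def estimate_inference_vram(params_billion, quant, layers, kv_heads, head_dim,
                            context, batch=1, kv_dtype_bytes=2.0,
                            overhead_fraction=0.10):
    """Return (weights, kv_cache, overhead, total) in GB for inference."""
    weights = params_billion * 1e9 * BYTES_PER_PARAM[quant]
    # Factor 2 covers keys and values; kv_heads * head_dim is the KV width per layer.
    kv_cache = 2 * layers * kv_heads * head_dim * context * batch * kv_dtype_bytes
    overhead = overhead_fraction * (weights + kv_cache)
    total = weights + kv_cache + overhead
    return tuple(x / GB for x in (weights, kv_cache, overhead, total))


# Llama 3.1 70B shape (80 layers, 8 KV heads, head dim 128) at Q4_K_M, 8K context.
w, kv, oh, total = estimate_inference_vram(70, "Q4_K_M", 80, 8, 128, 8192)
print(f"weights={w:.1f} GB  kv={kv:.1f} GB  overhead={oh:.1f} GB  total={total:.1f} GB")
```

With these inputs the sketch gives about 46 GB in total, in line with the Llama 3.1 70B figure in the FAQ.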

When to use it

Ideal for choosing a model size that fits your existing GPU, planning a multi-GPU build for a specific model, or comparing quantization options before downloading multi-GB GGUF files. Pair with the Power Cost Estimator to factor running cost over the lifetime of an inference rig, and the RAM Latency Calculator if you're considering CPU offload as a fallback.

Quantization explained

Modern quantization (Q4_K_M, Q5_K_M, etc.) doesn't just lower precision uniformly. K-quants store most weights at the target bit width but keep a small fraction of the data in higher precision (FP16 scale metadata, occasional Q6 blocks for important weights), which is why effective bytes-per-param numbers like 0.56 for Q4_K_M are slightly higher than the 0.5 that raw 4-bit storage would suggest. The trade-off is favorable - you get most of the memory savings of pure 4-bit with quality close to Q8_0 or FP16.
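
To make the bytes-per-param figure concrete, this small sketch converts it to effective bits per weight. The helper name is hypothetical, and 0.56 is the approximate Q4_K_M value quoted on this page rather than a measured one.

```python
def effective_bits_per_weight(bytes_per_param):
    """Convert an average bytes-per-parameter figure to bits per weight."""
    return bytes_per_param * 8


q4_bits = effective_bits_per_weight(0.56)   # Q4_K_M figure from this page
extra = q4_bits / 4 - 1                     # overhead relative to raw 4-bit storage
print(f"Q4_K_M: {q4_bits:.2f} bits/weight, {extra:.0%} above raw 4-bit")
```

Under that figure, the higher-precision parts cost roughly an extra half bit per weight on average.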

What we don't model

The calculator gives you a static memory ceiling. It doesn't capture transient memory spikes during long generations, batch scheduling overhead in serving engines, or the small reduction from techniques like prefix caching. As a practical rule, leave 10-20% headroom on top of the calculator's estimate to avoid out-of-memory errors during real workloads.
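
The headroom rule can be applied as a quick check. This is a sketch only: the function name and the 15% default are assumptions chosen from the middle of the 10-20% range.

```python
def fits_with_headroom(estimate_gb, capacity_gb, headroom=0.15):
    """Return (fits, required_gb) after padding the estimate by `headroom`."""
    required = estimate_gb * (1 + headroom)
    return required <= capacity_gb, required


# 52.8 GB is the sample estimate shown at the top of this page.
for name, capacity in [("A100 80GB", 80), ("RTX 6000 Ada", 48)]:
    ok, required = fits_with_headroom(52.8, capacity)
    print(f"{name}: need {required:.1f} GB with headroom, fits={ok}")
```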

Popular models and configurations

Pre-computed VRAM estimates for the model and quantization combinations visitors ask about most.

DeepSeek V4 Pro 1.6T (MoE) at Q4_K_M

~1,012 GB resident, ~70 GB active-only - CSA + MLA at 1M context

Llama 4 Scout (17B/109B) at Q4_K_M

~294 GB resident, ~237 GB active-only - iRoPE at 1M context

gpt-oss 20B (MoE) at Q4_K_M

~19.4 GB resident, ~9.3 GB active-only - OpenAI MoE at 128K

Frequently asked questions

How much VRAM do I need to run Llama 3.1 70B locally?

At Q4_K_M quantization with 8K context, Llama 3.1 70B needs about 46 GB of VRAM. That fits on a single 48 GB pro card (RTX 6000 Ada), two 24 GB consumer GPUs, or a Mac with 64 GB+ unified memory. FP16 needs roughly 159 GB, putting it firmly in datacenter territory.

What's the difference between Q4_K_M, Q5_K_M, and Q8_0?

These are GGUF quantization formats with different precision-vs-size trade-offs. Q8_0 stores weights in 8 bits (~1 byte/param) for near-FP16 quality. Q5_K_M and Q4_K_M use roughly 5 and 4 bits per weight respectively, with mixed-precision blocks. Q4_K_M is the usual sweet spot - roughly 3.6x smaller than FP16 (about 0.56 vs 2 bytes/param) with negligible quality loss for most use cases.

Why does context length affect VRAM so much?

The KV cache stores key and value vectors for every token in context, growing linearly with context length. Doubling from 16K to 32K context doubles the KV cache memory. For long-context models (128K+) the cache often exceeds the weight memory, especially when the weights are heavily quantized.
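
To see where the cache overtakes the weights, the sketch below computes the KV bytes added per token and the context length at which the cache equals the weight memory. Function names are hypothetical; the model shape is Llama 3.1 70B (80 layers, 8 KV heads, head dim 128), and the bytes-per-param values are the approximate ones used on this page.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, kv_dtype_bytes=2.0):
    """Bytes of KV cache added per token, per sequence (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * kv_dtype_bytes


def crossover_context(weight_bytes, layers, kv_heads, head_dim, kv_dtype_bytes=2.0):
    """Context length (tokens, batch 1) at which the KV cache equals the weights."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, kv_dtype_bytes)
    return int(weight_bytes // per_token)


per_token = kv_bytes_per_token(80, 8, 128)            # FP16 KV cache
q4_crossover = crossover_context(70e9 * 0.56, 80, 8, 128)
fp16_crossover = crossover_context(70e9 * 2.0, 80, 8, 128)
print(per_token, q4_crossover, fp16_crossover)
```

For this shape the crossover is roughly 120K tokens with Q4_K_M weights and roughly 427K tokens with FP16 weights, which is why heavily quantized models hit the "cache exceeds weights" point sooner.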

What is KV cache and can I reduce it?

The KV cache stores the key and value vectors already computed for earlier tokens so they are not recomputed for every new token. You can reduce it by lowering context length (most effective), switching from an FP16 to a Q8 KV cache (roughly halves memory), or using attention variants like sliding-window attention if your model supports it.
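
The effect of these options can be sketched as follows. The sliding-window model here is an assumption, not necessarily how the calculator or any given engine handles it: every `global_every`-th layer is taken to cache the full context, and the remaining layers cache at most `swa_window` tokens. The Q8 cache is approximated as 1 byte per element, and the 32-layer model is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch=1,
                   kv_dtype_bytes=2.0, swa_window=None, global_every=None):
    """KV cache size in bytes, optionally with sliding-window attention (SWA)."""
    per_token_per_layer = 2 * kv_heads * head_dim * kv_dtype_bytes * batch
    if swa_window is None:
        return layers * per_token_per_layer * context
    # Assumption: one full-attention layer per `global_every` layers.
    global_layers = layers // global_every if global_every else 0
    local_layers = layers - global_layers
    local_tokens = min(context, swa_window)
    return per_token_per_layer * (global_layers * context + local_layers * local_tokens)


# Hypothetical 32-layer model with 8 KV heads of dim 128 at 32K context.
base = kv_cache_bytes(32, 8, 128, 32768)
q8 = kv_cache_bytes(32, 8, 128, 32768, kv_dtype_bytes=1.0)
swa = kv_cache_bytes(32, 8, 128, 32768, swa_window=4096, global_every=4)
print(base / 1e9, q8 / 1e9, swa / 1e9)
```

In this example the Q8 cache is half the FP16 size, and the sliding-window variant drops to about a third of it.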

How much extra VRAM does training or fine-tuning need?

Full fine-tuning with the Adam optimizer needs roughly 4x the per-parameter memory of inference - 1x for weights, 1x for gradients, and 2x for Adam's first and second moments. LoRA reduces this to the inference footprint plus a few GB; QLoRA additionally quantizes the base model for further savings.
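
The per-parameter breakdown works out as in this sketch. The function name is hypothetical, and it assumes all four copies are stored at the same precision, which is a simplification; activations and KV cache are not included.

```python
def full_finetune_param_memory_gb(params_billion, bytes_per_param=2.0):
    """Per-parameter memory for full fine-tuning with Adam, in decimal GB."""
    weights = params_billion * bytes_per_param   # billions of params * bytes = GB
    gradients = weights                          # 1x for gradients
    adam_moments = 2 * weights                   # first and second moments
    return weights + gradients + adam_moments


print(full_finetune_param_memory_gb(7))   # 7B model at 2 bytes/param
```

For a 7B model at 2 bytes/param this comes to 56 GB before activations, against 14 GB for the weights alone.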

Can I split a model across multiple GPUs?

Yes. Tensor parallelism splits each layer across GPUs (best with NVLink), and pipeline parallelism splits whole layers between GPUs (works without NVLink). Most inference engines (vLLM, llama.cpp, exllama) handle this automatically. Multi-GPU adds 5-15% memory overhead for communication buffers.
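
A rough per-GPU budget under that overhead range can be sketched like this. The function is hypothetical and assumes a perfectly even split, which real engines only approximate.

```python
def per_gpu_memory_gb(total_gb, num_gpus, comm_overhead=0.10):
    """Evenly split a model across GPUs and add communication-buffer overhead."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    overhead = comm_overhead if num_gpus > 1 else 0.0
    return total_gb * (1 + overhead) / num_gpus


for overhead in (0.05, 0.15):
    share = per_gpu_memory_gb(100, 4, overhead)
    print(f"{overhead:.0%} overhead: {share:.2f} GB per GPU")
```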

Does Apple unified memory count as VRAM?

Effectively yes for inference. Apple Silicon uses a single memory pool shared between CPU and GPU, though only part of it is counted as available to the GPU: the hardware list above treats 96 GB of a 128 GB machine as usable. By that ratio, a 192 GB Mac Studio can hold a model that needs up to about 144 GB of 'VRAM'. Throughput is lower than discrete GPUs (especially for prompt processing) but for many local inference use cases it's the most cost-effective way to run large models.

Why doesn't my model fit even though the math says it should?

Inference engines reserve memory for activations during generation, batch buffers, CUDA kernels, and system overhead. Always leave 10-20% headroom on the GPU. If you're at the limit, lower the context length or batch size, switch to Q8 KV cache, or drop one quantization step.

Related tools

LLM API Pricing Calculator

Compare API costs across major models (OpenAI, Anthropic, Google) with prompt caching.

LLM Token Counter

Count tokens in any prompt for GPT, Claude, Gemini, and Llama with exact OpenAI tokenization.

Power Cost Estimator

Estimate annual electricity costs for your PC, Server, or TV.

RAM Latency Calculator

Convert DDR3/DDR4/DDR5 timings (CL, tRCD, tRP, tRAS) into true latency in nanoseconds.
