LLM VRAM Calculator

Estimate the GPU memory needed to run or fine-tune a large language model. Pick a preset or enter your own values.

Estimated VRAM required

52.8 GB

9B params at Q4_K_M, 262,144 token context, batch 1, inference.

Weights: 5.0 GB
KV cache: 42.9 GB
Overhead: 4.8 GB

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.

Hardware that fits

A100 80GB (Datacenter): 80 GB, 66% used
H100 80GB (Datacenter): 80 GB, 66% used
Apple M3 Ultra 128GB (Unified): 96 GB, 55% used

Just barely too small

Apple M3 Max 64GB (Unified): 48 GB, short by 4.8 GB
RTX 6000 Ada (Pro): 48 GB, short by 4.8 GB

About this tool

The LLM VRAM Calculator estimates how much GPU memory you need to run or fine-tune a large language model. Pick a popular model from the dropdown (Llama 3.1, Qwen 2.5, Mistral, Mixtral, Gemma, DeepSeek, Phi-3, and more), choose a quantization format, set the context length and batch size, and the calculator returns total VRAM along with a breakdown of weights, KV cache, and runtime overhead. It also lists which GPUs and unified-memory machines have enough memory.

Use it before downloading a model to confirm it fits, when shopping for hardware to run a specific size class, or when deciding whether quantization (Q4_K_M, Q5_K_M, Q8_0) buys you enough headroom to run a tier larger. The math follows the same weights-plus-KV-cache accounting that llama.cpp and vLLM apply when they allocate memory, so the numbers should land close to what you see when you load the model, within the accuracy ranges noted above.

Formula

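Total VRAM = weights + KV cache + overhead. Weights = params × bytes_per_param (FP16 = 2, Q8 ≈ 1, Q4_K_M ≈ 0.56). KV cache = 2 × layers × kv_heads × head_dim × context × batch × KV_dtype_bytes, where the 2 counts keys and values. For models without grouped-query attention, kv_heads × head_dim equals hidden_size; for GQA models it is several times smaller. Overhead is roughly 10% of weights + KV for inference. For training, that overhead is doubled and a further 4x weights is added to cover Adam optimizer states.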
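
The formula can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the function names are hypothetical, the KV cache term uses the grouped-query form (kv_heads × head_dim in place of hidden_size), and the architecture values (40 layers, 8 KV heads, head dimension 128, FP16 KV cache) are an assumption chosen because they reproduce the 9B example shown at the top of the page.

```python
GB = 1e9  # decimal gigabytes, consistent with 9B x 0.56 = 5.0 GB on this page

# Approximate bytes per parameter, taken from the formula above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q4_K_M": 0.56}


def weights_bytes(params, quant):
    """Memory for the model weights alone."""
    return params * BYTES_PER_PARAM[quant]


def kv_cache_bytes(layers, kv_heads, head_dim, context, batch=1, kv_dtype_bytes=2):
    """KV cache size; the leading 2 counts one key and one value vector."""
    return 2 * layers * kv_heads * head_dim * context * batch * kv_dtype_bytes


def inference_vram_gb(params, quant, layers, kv_heads, head_dim, context,
                      batch=1, kv_dtype_bytes=2):
    """Weights + KV cache + 10% overhead, all reported in GB."""
    w = weights_bytes(params, quant)
    kv = kv_cache_bytes(layers, kv_heads, head_dim, context, batch, kv_dtype_bytes)
    overhead = 0.10 * (w + kv)
    return {
        "weights": w / GB,
        "kv_cache": kv / GB,
        "overhead": overhead / GB,
        "total": (w + kv + overhead) / GB,
    }


# Assumed architecture for the 9B example (not stated on the page).
estimate = inference_vram_gb(9e9, "Q4_K_M", layers=40, kv_heads=8,
                             head_dim=128, context=262_144)
for name, value in estimate.items():
    print(f"{name:>9}: {value:.1f} GB")
```

Under those assumptions the sketch prints 5.0 GB of weights, 42.9 GB of KV cache, 4.8 GB of overhead, and a 52.8 GB total, matching the breakdown above.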

When to use it

Ideal for choosing a model size that fits your existing GPU, planning a multi-GPU build for a specific model, or comparing quantization options before downloading multi-GB GGUF files. Pair with the Power Cost Estimator to factor running cost over the lifetime of an inference rig, and the RAM Latency Calculator if you're considering CPU offload as a fallback.

Quantization explained

Modern quantization (Q4_K_M, Q5_K_M, etc.) doesn't just lower precision uniformly. K-quants store most weights at the target bit width but spend extra bits on per-block scale metadata and keep some important tensors at higher precision (for example Q6 blocks), which is why effective bytes-per-param numbers like 0.56 for Q4_K_M are slightly higher than the raw 4-bit value of 0.5 would suggest. The trade-off is favorable - you get most of the memory savings of pure 4-bit with quality close to Q8_0 or FP16.
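
To make the bytes-per-param figures concrete, the short sketch below converts them to effective bits per weight and to a size ratio against FP16. It uses only the three approximate values quoted in the Formula section; the helper names are hypothetical.

```python
# Approximate bytes per parameter quoted on this page.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q4_K_M": 0.56}


def effective_bits(fmt):
    """Average bits stored per weight, including block metadata."""
    return BYTES_PER_PARAM[fmt] * 8


def compression_vs_fp16(fmt):
    """How many times smaller the weights are than at FP16."""
    return BYTES_PER_PARAM["FP16"] / BYTES_PER_PARAM[fmt]


for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>7}: {effective_bits(fmt):5.2f} bits/weight, "
          f"{compression_vs_fp16(fmt):.2f}x smaller than FP16")

# Extra cost of Q4_K_M relative to a pure 4-bit format (0.5 bytes/param).
q4_overhead = BYTES_PER_PARAM["Q4_K_M"] / 0.5 - 1
print(f"Q4_K_M is about {q4_overhead:.0%} larger than raw 4-bit")
```

With these inputs Q4_K_M works out to about 4.48 bits per weight, roughly 3.57x smaller than FP16 and about 12% larger than a raw 4-bit encoding.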

What we don't model

The calculator gives you a static memory ceiling. It doesn't capture transient memory spikes during long generations, batch scheduling overhead in serving engines, or the small reduction from techniques like prefix caching. As a practical rule, leave 10-20% headroom on top of the calculator's estimate to avoid out-of-memory errors during real workloads.
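
The headroom rule is easy to apply mechanically. The sketch below is one possible way to do it, assuming a hypothetical `fit_report` helper and a default headroom of 15% (the midpoint of the 10-20% range); the capacities are the usable figures from the hardware lists above.

```python
def fit_report(required_gb, usable_gb, headroom=0.15):
    """Compare an estimate against a device, with and without headroom."""
    target_gb = required_gb * (1 + headroom)
    return {
        "used_pct": 100 * required_gb / usable_gb,
        "fits": required_gb <= usable_gb,
        "fits_with_headroom": target_gb <= usable_gb,
        "shortfall_gb": max(0.0, required_gb - usable_gb),
    }


# Usable memory in GB, as listed on this page.
devices = {"A100 80GB": 80, "Apple M3 Ultra 128GB": 96, "RTX 6000 Ada": 48}

for name, usable in devices.items():
    r = fit_report(52.8, usable)
    print(f"{name}: {r['used_pct']:.0f}% used, "
          f"fits with headroom: {r['fits_with_headroom']}, "
          f"short by {r['shortfall_gb']:.1f} GB")
```

For the 52.8 GB example, the two larger devices still fit after adding 15% headroom (about 60.7 GB), while the 48 GB card is short by 4.8 GB before any headroom is considered.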

Pre-computed VRAM estimates for the model and quantization combinations visitors ask about most.

Frequently asked questions

How much VRAM do I need to run Llama 3.1 70B locally?
At Q4_K_M quantization with 8K context, Llama 3.1 70B needs about 46 GB of VRAM. That fits, with little headroom, on a single 48 GB pro card (RTX 6000 Ada) or a Mac with 64 GB+ unified memory. Two 24 GB consumer GPUs are borderline once multi-GPU overhead is counted. FP16 needs roughly 159 GB, which puts it firmly in datacenter territory.
What's the difference between Q4_K_M, Q5_K_M, and Q8_0?
These are GGUF quantization formats with different precision-vs-size trade-offs. Q8_0 stores weights in 8 bits (~1 byte/param) for near-FP16 quality. Q5_K_M and Q4_K_M use roughly 5 and 4 bits respectively with mixed-precision blocks. Q4_K_M is the usual sweet spot - about 3.6x smaller than FP16 (0.56 vs 2 bytes per parameter) with little quality loss for most use cases.
Why does context length affect VRAM so much?
The KV cache stores key and value vectors for every token in context, so it grows linearly with context length. Doubling from 16K to 32K context doubles the KV cache memory. For long-context models (128K+) the cache often exceeds the weight memory, especially when the weights are heavily quantized (Q4 and below) while the cache stays at FP16.
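
The linear growth, and the point where the cache overtakes the weights, can be checked with a short sketch. The architecture here (40 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumed 9B-class configuration used only for illustration, and the function name is hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, kv_dtype_bytes=2):
    """Bytes added to the KV cache by each token in context (batch 1)."""
    return 2 * layers * kv_heads * head_dim * kv_dtype_bytes


# Assumed architecture and weights: 9B params at Q4_K_M (0.56 bytes/param).
per_token = kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128)
weights = 9e9 * 0.56

for context in (16_384, 32_768, 131_072, 262_144):
    print(f"{context:>7} tokens: {per_token * context / 1e9:5.1f} GB KV cache")

# Context length at which the KV cache equals the weight memory.
break_even = weights / per_token
print(f"KV cache exceeds weights beyond ~{break_even:,.0f} tokens")
```

Each doubling of context doubles the cache, and for this assumed configuration the cache passes the 5.0 GB of weights at a little under 31K tokens.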
What is KV cache and can I reduce it?
The KV cache stores the attention keys and values already computed for earlier tokens, so they don't have to be recomputed for every new token. You can reduce it by lowering context length (most effective), switching from FP16 to Q8 KV cache (roughly halves memory), or using attention variants like sliding-window attention if your model supports them.
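
A rough comparison of those options, as a sketch under stated assumptions: the same hypothetical 40-layer, 8-KV-head, 128-dimension model, a Q8 cache costed at 1 byte per value (ignoring block scales), and a sliding window modeled as a simple cap on cached tokens in every layer. Real engines and models may differ, for example when only some layers use a window.

```python
def kv_cache_gb(context, layers=40, kv_heads=8, head_dim=128,
                kv_dtype_bytes=2, window=None):
    """KV cache in GB; `window` caps how many tokens are kept per layer."""
    cached_tokens = context if window is None else min(context, window)
    return 2 * layers * kv_heads * head_dim * cached_tokens * kv_dtype_bytes / 1e9


context = 32_768
fp16 = kv_cache_gb(context)
q8 = kv_cache_gb(context, kv_dtype_bytes=1)
shorter = kv_cache_gb(context // 2)
windowed = kv_cache_gb(context, window=4_096)

print(f"FP16 cache, 32K context:   {fp16:.2f} GB")
print(f"Q8 cache, 32K context:     {q8:.2f} GB")
print(f"FP16 cache, 16K context:   {shorter:.2f} GB")
print(f"FP16 cache, 4K window:     {windowed:.2f} GB")
```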
How much extra VRAM does training or fine-tuning need?
Full fine-tuning with the Adam optimizer needs roughly 4x the memory per parameter of inference - 1x for weights, 1x for gradients, and 2x for Adam's first and second moments. LoRA reduces this to inference + a few GB; QLoRA quantizes the base model on top of LoRA for further savings.
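
The 1x + 1x + 2x accounting can be written out directly. This sketch assumes, as the answer above does, that weights, gradients, and both Adam moments are all held at the same precision; the function name is hypothetical, and activations and KV cache are left out. Mixed-precision setups that keep optimizer state in FP32 will need more than this.

```python
def full_finetune_state_gb(params, bytes_per_value=2.0):
    """Per-parameter training state: weights, gradients, two Adam moments."""
    weights = params * bytes_per_value
    gradients = weights          # one gradient per weight
    adam_moments = 2 * weights   # first and second moment estimates
    total = weights + gradients + adam_moments
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "adam_moments": adam_moments / 1e9,
        "total": total / 1e9,
    }


# Illustrative 7B model with every value stored in 2 bytes.
state = full_finetune_state_gb(7e9)
for name, gb in state.items():
    print(f"{name:>12}: {gb:.0f} GB")
```

For a 7B model at 2 bytes per value this gives 14 GB of weights and 56 GB of total training state, four times the inference weight memory.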
Can I split a model across multiple GPUs?
Yes. Tensor parallelism splits each layer across GPUs (best with NVLink), and pipeline parallelism assigns whole layers to different GPUs (works without NVLink). Most inference engines (vLLM, llama.cpp, exllama) support at least one of these with little configuration. Multi-GPU adds 5-15% memory overhead for communication buffers.
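
A quick way to sanity-check a multi-GPU plan is to spread the estimate, plus the communication overhead, across the cards. The sketch assumes an even split and applies the 5-15% overhead to the whole estimate; both are simplifications, and `per_gpu_gb` is a hypothetical helper.

```python
def per_gpu_gb(total_gb, num_gpus, overhead=0.10):
    """Memory needed on each GPU for an even split with multi-GPU overhead."""
    return total_gb * (1 + overhead) / num_gpus


# A 46 GB estimate (Llama 3.1 70B at Q4_K_M, 8K context) on two GPUs.
low = per_gpu_gb(46, 2, overhead=0.05)
high = per_gpu_gb(46, 2, overhead=0.15)
print(f"Per GPU: {low:.2f} to {high:.2f} GB")
print(f"Fits on 24 GB cards at 5% overhead: {low <= 24}")
```

Under these assumptions each card needs between about 24.2 and 26.5 GB, so a pair of 24 GB GPUs would need a shorter context, a quantized KV cache, or a smaller quantization to hold that model.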
Does Apple unified memory count as VRAM?
Effectively yes for inference. Apple Silicon uses a single memory pool shared between CPU and GPU, but not all of it is available to the GPU: the hardware list above counts about 75% as usable (96 GB on a 128 GB machine). On that basis a 192 GB Mac Studio can hold a model that needs roughly 140 GB of 'VRAM'. Throughput is lower than discrete GPUs (especially for prompt processing), but for many local inference use cases it's the most cost-effective way to run large models.
Why doesn't my model fit even though the math says it should?
Inference engines reserve memory for activations during generation, batch buffers, CUDA kernels, and system overhead. Always leave 10-20% headroom on the GPU. If you're at the limit, lower the context length or batch size, switch to Q8 KV cache, or drop one quantization step.