LLM VRAM Calculator
Estimate the GPU memory needed to run or fine-tune a large language model. Pick a preset or enter your own values.
Estimated VRAM required
52.8 GB
9B params at Q4_K_M, 262,144 token context, batch 1, inference.
Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.
KV cache exceeds model weights: Consider lowering the context length to save on VRAM. Contexts between 8K and 64K are generally more typical for local setups.
About this tool
The LLM VRAM Calculator estimates how much GPU memory you need to run or fine-tune a large language model. Pick a popular model from the dropdown (Llama 3.1, Qwen 2.5, Mistral, Mixtral, Gemma, DeepSeek, Phi-3, and more), choose a quantization format, set the context length and batch size, and the calculator returns total VRAM along with a breakdown of weights, KV cache, and runtime overhead. It also lists which GPUs and unified-memory machines have enough memory.
Use it before downloading a model to confirm it fits, when shopping for hardware to run a specific size class, or when deciding whether quantization (Q4_K_M, Q5_K_M, Q8_0) buys you enough headroom to run a tier larger. The math follows the same weights-plus-KV-cache accounting that llama.cpp and vLLM use internally, so the estimate should land close to what you see when you load the model, within the accuracy ranges noted above.
Formula
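Total VRAM = weights + KV cache + activation overhead.
Weights = params × bytes_per_param (FP16 = 2, Q8 ≈ 1, Q4_K_M ≈ 0.56).
KV cache = 2 × layers × hidden_size × context × batch × KV_dtype_bytes. The leading 2 covers the separate key and value tensors. For GQA models, the per-layer width is kv_heads × head_dim rather than the full hidden_size.
Overhead ≈ 10% of (weights + KV cache) for inference. For training, that overhead is doubled, and a further 4 × weights is added to cover Adam optimizer states.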
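The formula above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calculator's actual source: the function name `estimate_vram_bytes`, its parameter names, and the example model shape (32 layers, hidden size 4096) are hypothetical, and the training branch reads the overhead rule as "twice the inference overhead, plus 4 × weights".

```python
# Approximate bytes per parameter, copied from the formula above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8": 1.0, "Q4_K_M": 0.56}


def estimate_vram_bytes(params, quant, layers, hidden_size, context,
                        batch=1, kv_dtype_bytes=2, training=False):
    """Return a breakdown of estimated VRAM in bytes (hypothetical helper)."""
    weights = params * BYTES_PER_PARAM[quant]
    # Leading 2: one tensor for keys, one for values, per layer.
    kv_cache = 2 * layers * hidden_size * context * batch * kv_dtype_bytes
    # Inference overhead: roughly 10% of weights + KV cache.
    overhead = 0.10 * (weights + kv_cache)
    if training:
        # Assumed reading: double the overhead, then add 4x weights for Adam states.
        overhead = 2 * overhead + 4 * weights
    total = weights + kv_cache + overhead
    return {"weights": weights, "kv_cache": kv_cache,
            "overhead": overhead, "total": total}


if __name__ == "__main__":
    # Hypothetical 7B model shape, FP16 weights and FP16 KV cache, 8K context.
    result = estimate_vram_bytes(7e9, "FP16", layers=32, hidden_size=4096,
                                 context=8192)
    for name, value in result.items():
        print(f"{name}: {value / 1e9:.2f} GB")
```

For a GQA model, passing kv_heads × head_dim in place of `hidden_size` would give the smaller KV cache those models actually need.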
When to use it
Ideal for choosing a model size that fits your existing GPU, planning a multi-GPU build for a specific model, or comparing quantization options before downloading multi-GB GGUF files. Pair it with the Power Cost Estimator to factor in running cost over the lifetime of an inference rig, and with the RAM Latency Calculator if you're considering CPU offload as a fallback.
Quantization explained
Modern quantization (Q4_K_M, Q5_K_M, etc.) doesn't just lower precision uniformly. K-quants store most weights at the target bit width but keep a small fraction in higher precision (FP16 metadata, occasional Q6 blocks for important weights). That is why an effective bytes-per-param figure like 0.56 for Q4_K_M (about 4.5 bits) is slightly higher than the 0.5 bytes that raw 4-bit storage would suggest. The trade-off is favorable: you get most of the memory savings of pure 4-bit with quality very close to Q8_0 or FP16.
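To make the bytes-per-param figure concrete, the short sketch below converts it to effective bits per weight and compares the memory saving against FP16. The helper names are hypothetical, and the 0.56 value is the approximation quoted in this page's formula, not a number read from any GGUF file.

```python
FP16_BYTES = 2.0      # bytes per parameter at FP16
PURE_4BIT_BYTES = 0.5  # bytes per parameter if every weight took exactly 4 bits
Q4_K_M_BYTES = 0.56   # approximate effective value quoted on this page


def effective_bits(bytes_per_param):
    """Convert bytes per parameter to bits per parameter."""
    return bytes_per_param * 8


def saving_vs_fp16(bytes_per_param):
    """Fraction of weight memory saved relative to FP16 storage."""
    return 1 - bytes_per_param / FP16_BYTES


if __name__ == "__main__":
    print(f"Q4_K_M effective bits: {effective_bits(Q4_K_M_BYTES):.2f}")
    print(f"Saving, pure 4-bit:    {saving_vs_fp16(PURE_4BIT_BYTES):.0%}")
    print(f"Saving, Q4_K_M:        {saving_vs_fp16(Q4_K_M_BYTES):.0%}")
```

Under these figures, the higher-precision blocks and metadata cost only a few percentage points of the saving that pure 4-bit storage would give.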
What we don't model
The calculator gives you a static memory ceiling. It doesn't capture transient memory spikes during long generations, batch scheduling overhead in serving engines, or the small reduction from techniques like prefix caching. As a practical rule, leave 10-20% headroom on top of the calculator's estimate to avoid out-of-memory errors during real workloads.
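The headroom rule can be applied mechanically. The following is a small sketch, assuming a hypothetical helper `required_with_headroom` and treating the headroom as a simple multiplier on the estimate; the 52.8 GB input is the example estimate shown at the top of this page, and the 64 GB capacity is only an illustrative figure.

```python
def required_with_headroom(estimate_gb, headroom=0.15):
    """Scale a static VRAM estimate by a safety margin (0.10 to 0.20 suggested)."""
    if not 0 <= headroom <= 1:
        raise ValueError("headroom should be a fraction between 0 and 1")
    return estimate_gb * (1 + headroom)


def fits(estimate_gb, capacity_gb, headroom=0.15):
    """True if the padded estimate fits within the given memory capacity."""
    return required_with_headroom(estimate_gb, headroom) <= capacity_gb


if __name__ == "__main__":
    estimate = 52.8  # GB, example estimate from this page
    for margin in (0.10, 0.20):
        padded = required_with_headroom(estimate, margin)
        print(f"{margin:.0%} headroom: {padded:.1f} GB, "
              f"fits in 64 GB: {fits(estimate, 64, margin)}")
```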
Popular models and configurations
Pre-computed VRAM estimates for the model and quantization combinations visitors ask about most.
Frequently asked questions
How much VRAM do I need to run Llama 3.1 70B locally?
What's the difference between Q4_K_M, Q5_K_M, and Q8_0?
Why does context length affect VRAM so much?
What is KV cache and can I reduce it?
How much extra VRAM does training or fine-tuning need?
Can I split a model across multiple GPUs?
Does Apple unified memory count as VRAM?
Why doesn't my model fit even though the math says it should?
Related tools
RAM Latency Calculator
Convert DDR3/DDR4/DDR5 timings (CL, tRCD, tRP, tRAS) into true latency in nanoseconds.
Power Cost Estimator
Estimate annual electricity costs for your PC, server, or TV.
Data Transfer Calculator
Estimate transfer times for files over USB, WiFi, Ethernet, and more.
Data Read Visualizer
Visualize the massive speed difference between CPU cache, RAM, and storage.