How much VRAM does Llama 3.1 8B need to fine-tune at FP16? Adam optimizer math

Full fine-tuning Llama 3.1 8B at FP16 with native 128K context needs about 167.8 GB of VRAM, demonstrating the massive scaling of Adam and activations at high context.

Total VRAM required
168 GB
Llama 3.1 8B at FP16
Weights
144 GB
Includes Adam optimizer states
KV cache
17.2 GB
128K tokens, FP16 KV

Calculator

Estimated VRAM required

41.1 GB

8B params at FP16, 131,072 token context, batch 1, training (Adam).

Weights
17.3 GB
KV cache
17.2 GB
Overhead
6.6 GB
Doesn't fit on a 32 GB consumer GPU at FP16. Try (25.0 GB) for the smallest quant that fits a single RTX 5090.

Estimate accuracy: Weights within ~2%. KV cache within ~5% for standard GQA models, ~10% for MLA (DeepSeek). Real VRAM may vary with framework (vLLM vs llama.cpp vs Transformers), Flash Attention, and driver overhead.

LoRA fine-tune sizing: Forward weights at FP16, only ~1% of params get optimizer state (FP32 master + grad + AdamW m + v). Real LoRA peak depends on rank and target modules; this is the typical r=16 ceiling.

Hardware that fits

Apple M3 Max 64GB
Unified
48 GB
86% used
RTX 6000 Ada
Pro
48 GB
86% used
A100 80GB
Datacenter
80 GB
51% used

Just barely too small

A100 40GB
Datacenter
40 GB
short by 1.1 GB

How this is calculated

Training weights + gradients + Adam optimizer buffers take 144 GB. The 128K KV cache takes 17.2 GB, and activation overhead adds roughly 6.6 GB, totaling 167.8 GB.

Verdict

Full fine-tuning at FP16 is the worst-case memory configuration. Use it only if you must. LoRA reduces memory to roughly inference + a few GB; QLoRA reduces it further by quantizing the base model weights. For 99% of fine-tuning use cases, QLoRA on a single 24 GB GPU produces results indistinguishable from full fine-tuning.

Frequently asked questions

Why does training need so much more memory than inference?
Adam's optimizer states (two FP32 buffers per parameter) plus gradients triple or quadruple the per-parameter cost vs inference. Activations needed for backprop add another large chunk.
Can I fine-tune Llama 3.1 8B on a single 24 GB GPU?
Not with full fine-tuning at native context. Use LoRA (low-rank adapters) or QLoRA, both of which fit comfortably on 24 GB and produce comparable results for most use cases.