The first question when self-hosting a large language model is deceptively simple: will it fit? The answer is a sum of three parts - the model weights, the KV cache and a little overhead - and getting the middle term right is where most calculators go wrong. This guide explains what actually consumes GPU memory, how precision and concurrency change it, and how a VRAM total becomes a GPU count. Put your own model and workload into the AI/GPU calculator to see the exact figures.
VRAM is three things, not one
Every byte of GPU memory an inference server uses falls into three buckets: the model weights, the KV cache and framework overhead. Weights are fixed - they are the model. The KV cache grows with how much text you process and how many users you serve. Overhead is a small, near-constant tax for the CUDA context and working buffers. Total VRAM is simply their sum, and a model fits when that sum sits inside the usable memory of your GPUs.
The reason sizing feels confusing is that people quote only the weights - the headline figure - and forget the KV cache, which under real concurrency can rival or exceed the weights. The calculator on this page adds all three so you see the true working set, not just the model on disk.
Weights: parameters times precision
Weight memory is the easy part: it is the number of parameters multiplied by the bytes used to store each one. At 16-bit precision that is two bytes per parameter, so a 70-billion-parameter model like Llama 3.3 70B needs about 131 GiB just for weights - far more than a single 80 GB H100 can hold, which is why it spreads across at least two cards.
Precision is the biggest lever you have. Dropping to 8-bit halves the weights; 4-bit quantisation with AWQ or GPTQ quarters them, bringing that same 70B model down to roughly 33 GiB and often onto a single high-capacity GPU with little quality loss for most tasks. The trade is fidelity, so evaluation and fine-tuning usually stay at higher precision while production serving quantises.
- FP16 / BF16: 2 bytes per parameter - full fidelity
- FP8 / INT8: 1 byte per parameter - half the weights
- INT4: 0.5 bytes per parameter - a quarter of the weights
- A 70B model: about 131 GiB at FP16, about 33 GiB at INT4
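The arithmetic behind those figures can be sketched in a few lines of Python. This is an illustration rather than the calculator's own code: the function name and the parameter count of roughly 70.55 billion are assumptions, and the sketch ignores the small extra space that quantised formats need for scales and zero-points.

```python
GIB = 1024 ** 3  # bytes in one gibibyte

# Bytes stored per parameter at each precision, as listed above.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_gib(num_params: float, precision: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / GIB


if __name__ == "__main__":
    # Assumed parameter count for a Llama-class "70B" model.
    params = 70.55e9
    for precision in ("fp16", "int8", "int4"):
        print(f"{precision}: {weight_gib(params, precision):.1f} GiB")
```

With the assumed parameter count this gives roughly 131 GiB at FP16 and roughly 33 GiB at INT4, in line with the list above.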
The KV cache: the term everyone forgets
As the model generates, it caches the key and value vectors for every token it has already seen so it does not recompute them - the KV cache. Its size grows with the context length, the number of concurrent requests and the model architecture, and at scale it is often the memory that decides how many GPUs you need, not the weights.
Modern models keep it in check with grouped-query attention, where many query heads share a handful of key and value heads. Llama 3 70B uses just eight key/value heads for its 64 query heads - an eight-fold reduction versus classic multi-head attention. Ignoring this is the single most common mistake in VRAM calculators: sizing the KV cache from the query-head count overstates it four- to eight-fold on Llama-class models. DeepSeek goes further with Multi-head Latent Attention, caching one small compressed latent per token in place of full per-head keys and values, so even 128k context stays cheap.
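A minimal sketch of the standard KV cache estimate for a grouped-query model shows how much the head count matters. The function name is hypothetical, and the 80 layers, head dimension of 128, 16-bit cache and 32 concurrent requests are assumptions added for illustration; only the 8 key/value heads and 64 query heads come from the text above. Multi-head Latent Attention needs a different formula and is not covered here.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    concurrent_requests: int,
    bytes_per_value: float = 2.0,  # assumes a 16-bit KV cache
) -> float:
    """Estimate KV cache size in GiB for standard or grouped-query attention."""
    # Each layer stores one key and one value vector per KV head per token,
    # hence the leading factor of 2.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_requests / GIB


if __name__ == "__main__":
    # Assumed Llama 3 70B shape: 80 layers, head dimension 128.
    correct = kv_cache_gib(80, 8, 128, context_tokens=8192, concurrent_requests=32)
    # The common mistake: sizing from the 64 query heads instead of the 8 KV heads.
    mistaken = kv_cache_gib(80, 64, 128, context_tokens=8192, concurrent_requests=32)
    print(f"sized from KV heads:    {correct:.0f} GiB")
    print(f"sized from query heads: {mistaken:.0f} GiB")
```

Under these assumptions the cache comes to about 80 GiB when sized from the key/value heads, and eight times that when sized from the query heads by mistake.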
Overhead, and turning VRAM into GPUs
The last slice is overhead: the CUDA runtime context and framework working buffers, a near-constant one to two gigabytes. Serving frameworks such as vLLM also leave headroom - commonly claiming about 90% of each card - so the usable figure per GPU is a little below the nameplate.
Divide the total VRAM by the usable memory per card and round up to a sensible parallel topology - 1, 2, 4 or 8 GPUs, then multiples of eight for multi-node - and you have your GPU count. A 70B model at 16-bit serving thirty-odd users at 8k context lands on four H100s; quantised to 4-bit it fits on two. See it worked through on the Llama 3 70B GPU requirements page, or size your own in the calculator.
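The division and rounding step might look like the sketch below. It is an illustration under stated assumptions, not the calculator's implementation: the function name is hypothetical, the H100 is treated as 80 GiB for simplicity, 90% of each card is taken as usable, and the example totals assume about 131 GiB or 33 GiB of weights, an 80 GiB KV cache for 32 users at 8k context, and 2 GiB of overhead.

```python
import math


def gpu_count(
    total_vram_gib: float,
    gpu_memory_gib: float = 80.0,  # assumed capacity of one card
    utilisation: float = 0.90,     # assumed usable fraction per card
) -> int:
    """Round a VRAM total up to a 1/2/4/8 topology, then multiples of eight."""
    usable_per_gpu = gpu_memory_gib * utilisation
    needed = math.ceil(total_vram_gib / usable_per_gpu)
    # Single-node topologies: pick the smallest size that holds the total.
    for size in (1, 2, 4, 8):
        if needed <= size:
            return size
    # Multi-node: whole 8-GPU nodes.
    return math.ceil(needed / 8) * 8


if __name__ == "__main__":
    fp16_total = 131.4 + 80.0 + 2.0  # weights + KV cache + overhead, in GiB
    int4_total = 32.9 + 80.0 + 2.0
    print("FP16:", gpu_count(fp16_total), "GPUs")
    print("INT4:", gpu_count(int4_total), "GPUs")
```

With these inputs the 16-bit case needs three cards' worth of memory and rounds up to four, while the 4-bit case fits on two, matching the example above.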
From VRAM to a real build
VRAM tells you how many GPUs; a deployment needs the rest of the picture. Those GPUs sit in servers - an 8-GPU HGX or DGX node draws around ten kilowatts - which need power, cooling and rack space, and they cost money to buy or rent. That is why sizing memory is only the start: the same tool carries the GPU count through to a server and rack build, power and cooling, capex and finance, and a cloud-versus-own comparison.
If you are weighing whether to buy that hardware or rent it, see self-hosting LLMs vs cloud GPUs. To spec the servers themselves, our team builds them through the NVIDIA DGX and GPU accelerator ranges.