UK’s trusted IT infrastructure partner since 2003
Servnet

How many GPUs does it take to run an LLM?

There is one formula behind every honest answer — model weights, plus the KV cache, plus overhead, divided by usable VRAM. This page walks the method with a Mixture-of-Experts, latent-attention model as the worked example, then hands you a live calculator for exact figures.

The question sounds simple, but a trustworthy answer comes from one sum, not a rule of thumb. You add the model's weights, the KV cache that grows with context and concurrency, and a slice of overhead, then divide by the usable VRAM on a card. Precision and user load move the total sharply, and Mixture-of-Experts models with latent attention behave differently again. The calculator below runs the maths live for exact GPU counts, VRAM, power, cooling and cost — the copy here explains what it is doing.

| Model | Params | GPUs (H100) | VRAM | From |
|---|---|---|---|---|
| Llama 3.1 8B | 8.03B | 1× | 48.9 GiB | £1,720/mo |
| Mistral Small 3 (24B) | 23.6B | 2× | 86.6 GiB | £2,279/mo |
| Gemma 2 27B | 27.2B | 4× | 146.5 GiB | £3,397/mo |
| Llama 3.3 70B | 70.6B | 4× | 216.7 GiB | £3,397/mo |
| Qwen2.5 72B | 72.7B | 4× | 220.7 GiB | £3,397/mo |
| Mixtral 8x7B | 46.7B | 2× | 122.4 GiB | £2,279/mo |
| DeepSeek-R1 (671B MoE) | 671B | 24× | 1293.3 GiB | £16,512/mo |
Chart: GPUs required by precision (H100 counts as listed for each model above).
Popular models on H100 (FP16, 8k context, 32 concurrent users). Indicative.
Own (financed + power) vs on-demand cloud, £/mo: own it £4,784, cloud £3,889. Cash purchase pays back vs cloud in ~64 months, then you keep the asset.
Own vs on-demand cloud for a Llama-70B-class build at 60% utilisation.

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

Start with the sum: weights + KV cache + overhead

Every GPU count starts with one sum: model weights plus KV cache plus overhead. Weights are the fixed cost — parameters multiplied by the bytes each one occupies. The KV cache is the variable cost, growing with every token of context and every simultaneous request you serve. Overhead covers the CUDA context, activations and framework slack. Add the three, divide by a GPU's usable VRAM, and you have a floor for the number of cards.
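The sum can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names, the 2 GiB overhead, the 32 GiB cache figure and the 90% usable-VRAM fraction are all assumptions chosen for the example.

```python
import math

GIB = 1024 ** 3  # bytes in one GiB


def weights_gib(params: float, bytes_per_param: float) -> float:
    """Fixed cost: parameter count times bytes per parameter, in GiB."""
    return params * bytes_per_param / GIB


def gpus_needed(weights: float, kv_cache: float, overhead: float,
                gpu_vram: float, usable_fraction: float = 0.9):
    """Return (total GiB, minimum card count). All inputs are in GiB.

    usable_fraction is an assumed safety margin, not a vendor figure.
    """
    total = weights + kv_cache + overhead
    usable_per_gpu = gpu_vram * usable_fraction
    return total, math.ceil(total / usable_per_gpu)


# Illustrative inputs: an 8.03B-parameter model at FP16 (2 bytes per parameter),
# an assumed 32 GiB of KV cache and an assumed 2 GiB of overhead on 80 GiB cards.
w = weights_gib(8.03e9, 2)
total, cards = gpus_needed(w, kv_cache=32.0, overhead=2.0, gpu_vram=80.0)
print(f"weights {w:.1f} GiB, total {total:.1f} GiB, cards {cards}")
```

The result is a floor: it says how many cards are needed to hold everything, not how many are needed to hit a throughput target.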

Why a sparse MoE model still fills every card

The weights term is where this workload gets interesting. Its published parameter count is enormous, but it is a Mixture-of-Experts model: only a small fraction of experts fire on any given token. Every expert still has to be resident in VRAM, so the weights term is set by the total parameter count, not the active slice. That is why an MoE model of this scale needs a full multi-GPU node even though each token touches only a sliver of the network.
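A quick calculation shows the gap between what is resident and what each token touches. The sketch below takes the 671B total from the DeepSeek-R1 row in the table above as an illustration; the 37B active-parameter figure and FP16 storage are assumptions made for the example, and the helper name is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one GiB


def weights_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold `params` parameters, in GiB."""
    return params * bytes_per_param / GIB


TOTAL_PARAMS = 671e9   # total parameter count (from the table)
ACTIVE_PARAMS = 37e9   # assumed parameters active per token
BYTES_FP16 = 2

resident = weights_gib(TOTAL_PARAMS, BYTES_FP16)   # must sit in VRAM
touched = weights_gib(ACTIVE_PARAMS, BYTES_FP16)   # used by any one token

print(f"resident weights: {resident:.0f} GiB")
print(f"touched per token: {touched:.0f} GiB")
print(f"ratio: {resident / touched:.1f}x")
```

Sizing from the active figure would understate the weights term by more than an order of magnitude under these assumptions.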

Latent attention keeps the KV cache small

KV cache is normally the term that explodes with long context, but this architecture uses Multi-head Latent Attention. Instead of caching keys and values for every attention head, it stores a single compressed latent per token per layer. Because the model has a large head count, the saving is dramatic, and the cache stays modest even as context runs into six figures. Naive calculators that count query heads overstate the cache badly; the tool below sizes it from the latent instead.
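The difference is easy to see per token. The sketch below compares a naive per-head count with a latent-based count. The layer count, head count and dimensions are illustrative assumptions for a large latent-attention model, not figures taken from this page, and the small positional (RoPE) key that some implementations cache alongside the latent is included as an assumed extra term.

```python
def kv_bytes_per_token_per_head(layers: int, heads: int, head_dim: int,
                                bytes_per_elem: int = 2) -> int:
    """Naive count: one key and one value vector per head, per layer."""
    return 2 * layers * heads * head_dim * bytes_per_elem


def kv_bytes_per_token_latent(layers: int, latent_dim: int, rope_dim: int = 0,
                              bytes_per_elem: int = 2) -> int:
    """Latent count: one compressed vector (plus optional RoPE key) per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_elem


# Assumed, illustrative shape: 61 layers, 128 heads of dimension 128,
# a 512-wide latent plus a 64-wide positional key, FP16 elements.
naive = kv_bytes_per_token_per_head(layers=61, heads=128, head_dim=128)
latent = kv_bytes_per_token_latent(layers=61, latent_dim=512, rope_dim=64)

print(f"naive:  {naive / 1024:.0f} KiB per token")
print(f"latent: {latent / 1024:.1f} KiB per token")
print(f"overstatement: {naive / latent:.0f}x")
```

With these assumed dimensions a head-counting calculator overstates the cache by a factor of roughly 57.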

FAQs

If only a few experts activate per token, why does it need so much VRAM?

Weights are set by the total parameter count because every expert must sit in VRAM even though only a handful activate per token. The KV cache, by contrast, scales with context and concurrency rather than parameter count, and it stays small thanks to Multi-head Latent Attention. So memory is dominated by the full weight set, which is what forces a multi-GPU node regardless of the low active-parameter figure.

Does the long context window blow up the memory?

Not the way you would expect. A compressed latent cache means context length costs far less memory here than on a conventional design that caches keys and values for every head. You can push toward its very large context ceiling without the KV cache dominating VRAM. It still grows with concurrency, so many long-context sessions at once add up. Set your real context and user count in the calculator to see the true figure.
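The concurrency effect is a plain multiplication: bytes per token, times context length, times simultaneous sessions. The sketch below uses a hypothetical 70 KiB per token for a latent cache and a 128k-token context; both numbers are assumptions for illustration, as is the function name.

```python
GIB = 1024 ** 3  # bytes in one GiB


def kv_cache_gib(bytes_per_token: int, context_tokens: int, sessions: int) -> float:
    """Total KV cache in GiB when every session fills its context."""
    return bytes_per_token * context_tokens * sessions / GIB


PER_TOKEN = 70 * 1024   # assumed latent-cache cost per token, in bytes
CONTEXT = 128 * 1024    # assumed context length in tokens

for sessions in (1, 8, 32):
    size = kv_cache_gib(PER_TOKEN, CONTEXT, sessions)
    print(f"{sessions:>2} sessions: {size:.2f} GiB")
```

One long session is cheap under these assumptions, but 32 of them at full context reach hundreds of GiB, which is why the user count matters as much as the context ceiling.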

Can I run it on a couple of cheaper GPUs without NVLink?

Not comfortably. This model's full weight set spans several high-VRAM cards that must act as one, which needs NVLink and NVSwitch bandwidth between them. Cards without NVLink, or with too little VRAM, either cannot hold it or bottleneck on cross-GPU traffic. The calculator steers you to a suitable NVLink node and shows the resulting count, power and cost.
