How do you choose a GPU server for LLM inference?

A UK buyer's guide to self-hosting large language models on your own hardware, sized against a mixture-of-experts model like DeepSeek-R1, where active parameters, latent KV cache and context length produce very different numbers from a dense Llama.

Self-hosting LLM inference is a sizing problem before it is a purchasing one. The wrong assumption about VRAM headroom, GPU interconnect or context length can double your card count and your power bill. This guide walks through the whole decision the way we do it for UK clients, then hands you the live calculator above to estimate GPU counts, VRAM, power, cooling and costs for your model and concurrency.

Model	Params	GPUs (H100)	VRAM	From
Llama 3.1 8B	8.03B	1×	48.9 GiB	£1,720/mo
Mistral Small 3 (24B)	23.6B	2×	86.6 GiB	£2,279/mo
Gemma 2 27B	27.2B	4×	146.5 GiB	£3,397/mo
Llama 3.3 70B	70.6B	4×	216.7 GiB	£3,397/mo
Qwen2.5 72B	72.7B	4×	220.7 GiB	£3,397/mo
Mixtral 8x7B	46.7B	2×	122.4 GiB	£2,279/mo
DeepSeek-R1 (671B MoE)	671B	24×	1293.3 GiB	£16,512/mo

Popular models on H100 (FP16 weights, 8k context, 32 concurrent users); indicative figures.

Own vs on-demand cloud for a Llama-70B-class build at 60% utilisation.


All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

VRAM headroom: weights plus a KV cache that depends on the model's attention design

VRAM is weights plus the key-value (KV) cache plus overhead, and a mixture-of-experts model breaks the usual intuition. DeepSeek-R1 loads all 256 routed experts in each MoE layer into memory even though only 8 of them fire per token, so the weight footprint tracks the 671B total parameters, not the roughly 37B active ones. Its Multi-head Latent Attention then caches one compressed latent vector per token per layer rather than separate keys and values for every attention head, keeping the cache far leaner than the head count suggests. Set your context and concurrency and the calculator sizes both for your workload.
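As a rough illustration of the weights-plus-cache arithmetic, the sketch below compares per-token cache size for a grouped-query model and for latent attention. It is not the calculator's method. The layer counts and dimensions are taken from the published Llama 3.1 8B and DeepSeek-V3/R1 configurations, the function names are hypothetical, and overhead is left as an input because its size is an assumption.

```python
GIB = 1024 ** 3

def kv_bytes_per_token_gqa(layers: int, kv_heads: int, head_dim: int,
                           bytes_per_value: int = 2) -> int:
    """Grouped-query attention: one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

def kv_bytes_per_token_mla(layers: int, latent_dim: int, rope_dim: int,
                           bytes_per_value: int = 2) -> int:
    """Latent attention: one compressed latent plus a small positional key per layer."""
    return layers * (latent_dim + rope_dim) * bytes_per_value

def total_vram_gib(params: float, bytes_per_param: int, kv_per_token: int,
                   context_tokens: int, users: int, overhead_gib: float = 0.0) -> float:
    """Weights + KV cache (+ an assumed overhead), in GiB."""
    weights = params * bytes_per_param
    kv_cache = kv_per_token * context_tokens * users
    return (weights + kv_cache) / GIB + overhead_gib

# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128 (FP16 cache assumed).
llama_kv = kv_bytes_per_token_gqa(layers=32, kv_heads=8, head_dim=128)
# DeepSeek-R1: 61 layers, 512-wide latent plus 64-wide positional key (FP16 cache assumed).
r1_kv = kv_bytes_per_token_mla(layers=61, latent_dim=512, rope_dim=64)

print(f"Llama 3.1 8B KV per token: {llama_kv / 1024:.1f} KiB")
print(f"DeepSeek-R1 KV per token:  {r1_kv / 1024:.1f} KiB")
print(f"Llama 3.1 8B, 8k context, 32 users: "
      f"{total_vram_gib(8.03e9, 2, llama_kv, 8192, 32):.1f} GiB before overhead")
```

Under these assumptions the 8B model's cache at 8k context and 32 users is larger than its weights, and the 671B model stores less cache per token than the 8B one. The pre-overhead total for Llama 3.1 8B lands in the same region as the 48.9 GiB shown in the table, though the calculator's own inputs may differ.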

GPU count, NVLink and platform: HGX/DGX versus PCIe

A 671B model is sharded across GPUs with tensor parallelism, and every generated token forces the shards to exchange activations at each layer. On NVLink and NVSwitch that traffic stays on a high-bandwidth fabric; over PCIe the GPUs spend much of their time waiting on transfers. So a model this size, especially with a very long context, wants an HGX baseboard or a DGX node, where the cards share one NVLink and NVSwitch fabric, rather than discrete PCIe accelerators, which suit smaller or single-card serving. The calculator picks a count from whichever binds first: VRAM or throughput.
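The "whichever binds first" rule can be sketched as taking the larger of two GPU counts and rounding up to a size that a platform can actually be built in. Everything here is an assumption for illustration: the 80 GiB nominal capacity per card, the 90% usable fraction, the per-GPU throughput figure and the rounding rule (1, 2, 4 or 8 cards, then whole 8-GPU nodes) are hypothetical and are not the calculator's stated logic.

```python
import math

def round_to_platform(n: int) -> int:
    """Round up to 1, 2, 4 or 8 cards, then to whole 8-GPU nodes (assumed rule)."""
    if n <= 8:
        return next(size for size in (1, 2, 4, 8) if size >= n)
    return math.ceil(n / 8) * 8

def gpu_count(vram_needed_gib: float,
              target_tokens_per_s: float = 0.0,
              tokens_per_s_per_gpu: float = 1.0,
              gpu_vram_gib: float = 80.0,
              usable_fraction: float = 0.9) -> int:
    """GPU count from whichever constraint binds first: VRAM or throughput."""
    by_vram = math.ceil(vram_needed_gib / (gpu_vram_gib * usable_fraction))
    by_throughput = math.ceil(target_tokens_per_s / tokens_per_s_per_gpu)
    return round_to_platform(max(by_vram, by_throughput, 1))

# VRAM-bound examples using the VRAM column from the table above.
for name, vram in [("Llama 3.1 8B", 48.9), ("Gemma 2 27B", 146.5),
                   ("Llama 3.3 70B", 216.7), ("DeepSeek-R1", 1293.3)]:
    print(f"{name}: {gpu_count(vram)} GPUs")

# Throughput-bound example: a small model with a high (hypothetical) token target.
print(gpu_count(48.9, target_tokens_per_s=5000, tokens_per_s_per_gpu=1500))
```

With these assumed inputs the VRAM-bound counts coincide with the GPU column in the table, but a throughput target can push a model that fits on one card onto several.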

Power, cooling, capex versus finance, and cloud-vs-own

Dense GPU nodes concentrate heat that many UK server rooms and colo halls were never built for, so budget for power draw, cooling capacity in kilowatts and rack density alongside the cards. Then weigh outright capex against finance to smooth cashflow, and owning against renting: at steady, high utilisation, self-hosting a model like DeepSeek-R1 replaces an open-ended per-GPU-hour cloud bill with hardware you own. The calculator shows the break-even so the decision is evidence-led, not a hunch.
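Cooling plant is often quoted in BTU/hr or tons of refrigeration rather than kilowatts, so a small conversion helps when talking to a facilities team. The sketch below uses the standard factors (1 kW is about 3,412 BTU/hr, and 1 ton of refrigeration is 12,000 BTU/hr). The 10 kW node load is a hypothetical input, not a measured figure, and the function name is illustrative.

```python
BTU_PER_HR_PER_KW = 3412.142     # standard conversion factor
BTU_PER_HR_PER_TON = 12000.0     # one ton of refrigeration

def cooling_load(it_load_kw: float) -> tuple[float, float]:
    """Convert an IT load in kW to (BTU/hr, tons), assuming all power becomes heat."""
    btu_per_hr = it_load_kw * BTU_PER_HR_PER_KW
    tons = btu_per_hr / BTU_PER_HR_PER_TON
    return btu_per_hr, tons

# Hypothetical 10 kW GPU node.
btu, tons = cooling_load(10.0)
print(f"{btu:,.0f} BTU/hr, {tons:.2f} tons of cooling")
```

This treats every watt drawn as heat to be removed and ignores the cooling system's own efficiency, so it gives a floor rather than a full plant specification.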

FAQs

Does a mixture-of-experts model like DeepSeek-R1 need less GPU memory than a dense model?

Not for weights. Even though only 8 of its 256 routed experts activate per token, every expert must sit in VRAM, so the memory footprint reflects the full 671B total, not the roughly 37B active parameters. What MoE and its latent attention do save is compute per token and KV-cache size, which helps throughput and long context, not the number of cards needed to hold the weights.
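To make the gap concrete, the sketch below computes the weight footprint from the total parameter count and, for contrast, what it would be if only the active parameters had to be resident. The 671B total and 37B active figures come from the published DeepSeek-R1 model description; the precision choices and the function name are assumptions for illustration.

```python
GIB = 1024 ** 3

def weights_gib(params: float, bytes_per_param: int) -> float:
    """Memory needed to hold a given number of parameters, in GiB."""
    return params * bytes_per_param / GIB

TOTAL_PARAMS = 671e9    # every expert must be resident in VRAM
ACTIVE_PARAMS = 37e9    # parameters actually used per token

print(f"Resident weights at FP16: {weights_gib(TOTAL_PARAMS, 2):.1f} GiB")
print(f"Resident weights at FP8:  {weights_gib(TOTAL_PARAMS, 1):.1f} GiB")
print(f"Active per token at FP16: {weights_gib(ACTIVE_PARAMS, 2):.1f} GiB (compute, not memory)")
```

The active figure drives compute per token, while the total figure drives how many cards are needed just to load the model.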

Do I really need NVLink, or will PCIe GPUs do?

It depends entirely on model size. For small and mid models on a single card, PCIe accelerators are fine. For a 671B model sharded across several GPUs, tensor parallelism moves data between cards on every token — NVLink and NVSwitch keep that on a fast fabric, while PCIe becomes the bottleneck. The calculator flags when your chosen model outgrows a non-NVLink option.

Is self-hosting cheaper than cloud GPU inference in the UK?

At high, sustained utilisation, usually yes — owned hardware plus power and finance beats paying a per-GPU-hour rate indefinitely, and it keeps data on your premises. At low or bursty utilisation, cloud often wins. The break-even depends on your utilisation, term, electricity rate and finance choice, all of which the calculator lets you set to see the crossover for your case.
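A simplified version of that crossover can be sketched by comparing a fixed monthly owned cost with a cloud bill that scales with utilisation. All inputs here are hypothetical placeholders: the finance figure reuses the indicative 4× H100 "From" price in the table, while the cloud rate, power draw and electricity price are assumptions to be replaced with your own numbers.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def owned_monthly(finance_per_month: float, load_kw: float, price_per_kwh: float) -> float:
    """Fixed monthly cost of owning: finance plus power at constant draw (simplification)."""
    return finance_per_month + load_kw * HOURS_PER_MONTH * price_per_kwh

def cloud_monthly(rate_per_gpu_hour: float, gpus: int, utilisation: float) -> float:
    """Cloud bill when paying only for the hours actually used."""
    return rate_per_gpu_hour * gpus * HOURS_PER_MONTH * utilisation

def break_even_utilisation(owned: float, rate_per_gpu_hour: float, gpus: int) -> float:
    """Utilisation above which owning is cheaper than on-demand cloud."""
    return owned / (rate_per_gpu_hour * gpus * HOURS_PER_MONTH)

# Hypothetical inputs: 4 GPUs, £3,397/mo finance, 4 kW draw, £0.25/kWh, £2.50 per GPU-hour.
owned = owned_monthly(finance_per_month=3397.0, load_kw=4.0, price_per_kwh=0.25)
print(f"Owned:            £{owned:,.0f}/mo")
print(f"Cloud at 60%:     £{cloud_monthly(2.50, 4, 0.60):,.0f}/mo")
print(f"Break-even at:    {break_even_utilisation(owned, 2.50, 4):.0%} utilisation")
```

The sketch ignores finance term length, residual value, staffing and cloud commitment discounts, all of which shift the crossover in practice.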
