UK’s trusted IT infrastructure partner since 2003
Servnet

How do you choose a GPU server for LLM inference?

A UK buyer's guide to self-hosting large language models on your own hardware — sized against a mixture-of-experts model like DeepSeek-R1, where active parameters, latent KV cache and context length drive very different numbers to a dense Llama.

Self-hosting LLM inference is a sizing problem before it is a purchasing one. The wrong assumption about VRAM headroom, GPU interconnect or context length can double your card count and your power bill. This guide walks through the whole decision the way we do it for UK clients, then hands you the live calculator above to estimate GPU counts, VRAM, power, cooling and costs for your model and concurrency.

Model                    Params   GPUs (H100)   VRAM          From
Llama 3.1 8B             8.03B    1×            48.9 GiB      £1,720/mo
Mistral Small 3 (24B)    23.6B    2×            86.6 GiB      £2,279/mo
Gemma 2 27B              27.2B    4×            146.5 GiB     £3,397/mo
Llama 3.3 70B            70.6B    4×            216.7 GiB     £3,397/mo
Qwen2.5 72B              72.7B    4×            220.7 GiB     £3,397/mo
Mixtral 8x7B             46.7B    2×            122.4 GiB     £2,279/mo
DeepSeek-R1 (671B MoE)   671B     24×           1293.3 GiB    £16,512/mo
GPUs required by precision: Llama 3.1 8B, 1× H100; Mistral Small 3 (24B), 2× H100; Gemma 2 27B, 4× H100; Llama 3.3 70B, 4× H100; Qwen2.5 72B, 4× H100; Mixtral 8x7B, 2× H100; DeepSeek-R1 (671B MoE), 24× H100.
Popular models on H100 (FP16, 8k, 32 users) — indicative.
Own (financed + power) vs on-demand cloud, £/mo: own it £4,784; cloud £3,889. Cash purchase pays back vs cloud in ~64 months, then you keep the asset.
Own vs on-demand cloud for a Llama-70B-class build at 60% utilisation.

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

VRAM headroom: weights plus a KV cache that depends on the model's attention design

VRAM is weights plus the key-value (KV) cache plus overhead, and a mixture-of-experts model breaks the usual intuition. DeepSeek-R1 loads all 256 experts into memory even though only a handful fire per token, so the weight footprint tracks the 671B total parameters, not the much smaller active count. Its Multi-head Latent Attention then caches one compressed latent vector per token in each layer, rather than separate keys and values for dozens of heads, which keeps the cache far leaner than the head count alone would suggest. Set your context and concurrency and the calculator sizes both for your workload.
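To make the arithmetic concrete, here is a minimal sketch of the weights-plus-KV-cache estimate for the two attention designs. It is not the calculator's own logic. The function names are illustrative, and the layer counts, KV-head counts and latent width are assumptions taken from the models' published configurations as we understand them. Framework overhead is deliberately left out, so the results land slightly under the indicative table figures.

```python
GIB = 1024 ** 3

def weights_gib(total_params, bytes_per_param=2):
    """Weight footprint: every parameter is resident, whether or not it is active."""
    return total_params * bytes_per_param / GIB

def kv_gib_grouped(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Conventional / grouped-query attention: one K and one V vector
    per KV head, per layer, per cached token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens / GIB

def kv_gib_latent(layers, latent_dim, tokens, bytes_per_value=2):
    """Latent attention: one compressed vector per layer, per cached token."""
    per_token = layers * latent_dim * bytes_per_value
    return per_token * tokens / GIB

# Workload from the table caption: 8k context, 32 concurrent users, FP16.
tokens = 8192 * 32

# Assumed config for Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128.
llama_weights = weights_gib(8.03e9)
llama_kv = kv_gib_grouped(layers=32, kv_heads=8, head_dim=128, tokens=tokens)

# Assumed config for DeepSeek-R1: 61 layers, 576-wide cached latent per layer
# (512 compressed KV dimensions plus 64 positional dimensions).
r1_weights = weights_gib(671e9)
r1_kv = kv_gib_latent(layers=61, latent_dim=576, tokens=tokens)

print(f"Llama 3.1 8B: weights {llama_weights:.1f} GiB + KV {llama_kv:.1f} GiB")
print(f"DeepSeek-R1:  weights {r1_weights:.1f} GiB + KV {r1_kv:.1f} GiB")
```

Under these assumptions the 8B model's cache is roughly twice the size of its weights, while the 671B model's latent cache is a small fraction of its weights, which is the contrast the paragraph above describes.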

GPU count, NVLink and platform: HGX/DGX versus PCIe

A 671B model is sharded across GPUs with tensor parallelism, and every generated token forces the shards to exchange activations at each layer. On NVLink and NVSwitch that traffic stays on a high-bandwidth fabric; over PCIe the GPUs spend much of their time waiting on the bus. A model this size, especially with a very long context, therefore wants the pooled memory of an HGX baseboard or a DGX node, where all cards behave as one, rather than discrete PCIe accelerators, which suit smaller or single-card serving. The calculator picks a GPU count from whichever constraint binds first: VRAM or throughput.
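The "whichever binds first" rule can be sketched as follows. This is our illustrative approximation, not the calculator's actual implementation: the 80 GiB per card, the 90% usable fraction, the rounding to power-of-two counts inside an 8-GPU node and the throughput parameters are all assumptions chosen for the example.

```python
import math

def round_to_topology(n, node_size=8):
    """Round a raw GPU count up to a shape that shards cleanly:
    a power of two within one node, whole nodes beyond that (assumption)."""
    if n <= node_size:
        return 1 << (n - 1).bit_length()
    return math.ceil(n / node_size) * node_size

def gpus_needed(vram_gib, gib_per_gpu=80.0, usable_fraction=0.9,
                required_tps=0.0, tps_per_gpu=None):
    """Take the larger of the VRAM-bound and throughput-bound counts."""
    by_vram = math.ceil(vram_gib / (gib_per_gpu * usable_fraction))
    # Throughput figures are hypothetical inputs; skip the check if not supplied.
    by_throughput = math.ceil(required_tps / tps_per_gpu) if tps_per_gpu else 0
    return round_to_topology(max(by_vram, by_throughput, 1))

# VRAM-bound examples using the indicative VRAM figures from the table.
for name, vram in [("Llama 3.1 8B", 48.9), ("Gemma 2 27B", 146.5),
                   ("Llama 3.3 70B", 216.7), ("DeepSeek-R1", 1293.3)]:
    print(name, gpus_needed(vram))

# Throughput-bound example: VRAM fits on one card, but a hypothetical
# 3,000 tokens/s target at 1,000 tokens/s per card needs more.
print(gpus_needed(48.9, required_tps=3000, tps_per_gpu=1000))
```

With these assumed defaults the VRAM-bound counts happen to match the indicative GPU column in the table, but a different headroom or rounding rule would be equally defensible.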

Power, cooling, capex versus finance, and cloud-vs-own

Dense GPU nodes concentrate heat in a way that many UK server rooms and colo halls were never built for, so budget for power draw, kilowatts of cooling and rack density alongside the cards. Then weigh outright capex against finance to smooth cashflow, and owning against renting: at steady, high utilisation, self-hosting a model like DeepSeek-R1 brings in-house the spend that per-GPU-hour cloud pricing would otherwise keep charging for as long as you run the workload. The calculator shows the break-even point, so the decision rests on evidence rather than a hunch.
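A rough sketch of the power, cooling and payback arithmetic is below. Every input (node draw, electricity rate, PUE, capex and cloud bill) is a hypothetical placeholder rather than a quoted figure, and the function names are ours; the point is only to show how the pieces combine.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_power_cost(kw_draw, pence_per_kwh, pue=1.0, hours=HOURS_PER_MONTH):
    """Electricity cost in pounds per month; PUE scales IT load up to
    include cooling and facility overhead."""
    return kw_draw * pue * hours * pence_per_kwh / 100

def cooling_btu_per_hour(kw_draw):
    """Heat to remove: essentially all electrical power ends up as heat
    (1 kW is about 3,412 BTU/h)."""
    return kw_draw * 3412.14

def payback_months(capex, cloud_monthly, own_running_monthly):
    """Months for a cash purchase to pay back against the cloud bill.
    Returns None if owning never catches up."""
    saving = cloud_monthly - own_running_monthly
    if saving <= 0:
        return None
    return capex / saving

# Hypothetical 4-GPU node: 4 kW draw, 25p/kWh, PUE of 1.4.
power = monthly_power_cost(kw_draw=4.0, pence_per_kwh=25.0, pue=1.4)
heat = cooling_btu_per_hour(4.0)

# Hypothetical 120,000 pound capex against a 4,000 pound monthly cloud bill.
months = payback_months(capex=120_000, cloud_monthly=4_000,
                        own_running_monthly=power)

print(f"Power: £{power:,.0f}/mo, heat: {heat:,.0f} BTU/h, payback: {months:.1f} months")
```

Payback is sensitive to the gap between the cloud bill and your running cost, so small changes in electricity rate or utilisation can move it by many months.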

FAQs

Does a mixture-of-experts model like DeepSeek-R1 need less GPU memory than a dense model?

Not for weights. Even though only a few of its 256 experts activate per token, every expert must sit in VRAM, so the memory footprint reflects the full 671B total, not the small active count. What MoE and its latent attention do save is compute per token and KV-cache size — which helps throughput and long context, not the number of cards needed to hold the weights.
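The split between memory and compute can be illustrated with a short sketch. It assumes FP16 weights, the common rule of thumb of about 2 FLOPs per active parameter per generated token (ignoring attention over the context), and the published figure of roughly 37B active parameters for DeepSeek-R1; all three are assumptions for illustration.

```python
GIB = 1024 ** 3

def weight_memory_gib(total_params, bytes_per_param=2):
    """Memory follows TOTAL parameters: every expert must be resident."""
    return total_params * bytes_per_param / GIB

def flops_per_token(active_params):
    """Compute follows ACTIVE parameters (rule-of-thumb approximation)."""
    return 2 * active_params

# Dense comparison point: Llama 3.3 70B, where all 70.6B parameters are active.
dense_mem = weight_memory_gib(70.6e9)
dense_flops = flops_per_token(70.6e9)

# DeepSeek-R1: 671B total, about 37B active per token (assumed figure).
moe_mem = weight_memory_gib(671e9)
moe_flops = flops_per_token(37e9)

print(f"Memory ratio (MoE / dense):  {moe_mem / dense_mem:.1f}x")
print(f"Compute ratio (MoE / dense): {moe_flops / dense_flops:.2f}x")
```

On these assumptions the MoE model needs several times the weight memory of the dense 70B model while doing roughly half the compute per token, which is why it saves throughput rather than cards.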

Do I really need NVLink, or will PCIe GPUs do?

It depends entirely on model size. For small and mid models on a single card, PCIe accelerators are fine. For a 671B model sharded across several GPUs, tensor parallelism moves data between cards on every token — NVLink and NVSwitch keep that on a fast fabric, while PCIe becomes the bottleneck. The calculator flags when your chosen model outgrows a non-NVLink option.

Is self-hosting cheaper than cloud GPU inference in the UK?

At high, sustained utilisation, usually yes — owned hardware plus power and finance beats paying a per-GPU-hour rate indefinitely, and it keeps data on your premises. At low or bursty utilisation, cloud often wins. The break-even depends on your utilisation, term, electricity rate and finance choice, all of which the calculator lets you set to see the crossover for your case.
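The crossover can be sketched with a simple model in which owned hardware costs the same every month while on-demand cloud is billed only for the hours actually used. The hourly rate, GPU count and monthly ownership cost below are hypothetical round numbers, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def cloud_monthly(gpus, rate_per_gpu_hour, utilisation, hours=HOURS_PER_MONTH):
    """On-demand cloud bill: pay per GPU-hour, only while in use."""
    return gpus * rate_per_gpu_hour * hours * utilisation

def break_even_utilisation(own_monthly, gpus, rate_per_gpu_hour,
                           hours=HOURS_PER_MONTH):
    """Utilisation (0 to 1) at which the cloud bill equals the fixed
    monthly cost of owning. Values above 1 mean cloud is always cheaper."""
    return own_monthly / (gpus * rate_per_gpu_hour * hours)

# Hypothetical inputs: 4 GPUs, 2.20 pounds per GPU-hour on demand,
# 4,800 pounds per month all-in to own (finance plus power).
crossover = break_even_utilisation(own_monthly=4_800, gpus=4,
                                   rate_per_gpu_hour=2.20)

print(f"Break-even utilisation: {crossover:.0%}")
for u in (0.3, 0.6, 0.9):
    print(f"  at {u:.0%}: cloud £{cloud_monthly(4, 2.20, u):,.0f}/mo vs own £4,800/mo")
```

This ignores reserved-instance discounts, the residual value of owned hardware and the end of the finance term, each of which shifts the crossover in practice.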
