Meta's Llama 3.1 405B is the largest openly available dense model in the Llama family: 405 billion parameters across 126 transformer layers, with grouped-query attention and a 128K-token context window. Unlike in sparse mixture-of-experts models, every parameter is active on every token, so it is genuinely demanding to host. The calculator on this page turns your precision, context and concurrency choices into indicative hardware figures; the guidance around it explains why 405B behaves the way it does and what to buy.
| Precision | GPUs (H200) | VRAM | Throughput | From |
|---|---|---|---|---|
| FP16 | 8× | 900.6 GiB | 908 tok/s | £6,321/mo |
| FP8 | 8× | 515.1 GiB | 1,817 tok/s | £6,321/mo |
| INT4 | 4× | 322.3 GiB | 1,817 tok/s | £3,741/mo |
All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.
Why 405B needs a whole GPU node
Because Llama 3.1 405B is dense, all of its weights must sit in VRAM at once and every one of them is used on every forward pass; there is no sparse routing to reduce the work per token. In practice it spans a complete HGX or DGX-class node of interconnected accelerators rather than a single card, with larger deployments chaining several nodes together. Spreading a dense model this way means constant tensor traffic between GPUs, so the NVLink and NVSwitch fabric is load-bearing, not optional. The calculator shows whether one node suffices or you are into multi-node territory.
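As a rough illustration of the single-node question, the sketch below turns a total VRAM requirement into a GPU count. It is not the calculator's method: the 90% usable-memory fraction, the rounding up to a power of two (a common constraint for tensor-parallel sharding) and the function names are all assumptions made for this example. The 141 GB per H200 is the published card capacity.

```python
import math

GIB = 1024 ** 3
H200_VRAM_GIB = 141e9 / GIB  # 141 GB of HBM per H200, expressed in GiB
GPUS_PER_NODE = 8            # assumption: an 8-GPU HGX/DGX-class node
USABLE_FRACTION = 0.9        # assumption: 10% headroom for runtime overhead


def min_gpus(required_gib: float) -> int:
    """Smallest power-of-two GPU count whose usable VRAM covers the requirement."""
    usable_per_gpu = H200_VRAM_GIB * USABLE_FRACTION
    needed = max(1, math.ceil(required_gib / usable_per_gpu))
    # Round up to a power of two (assumed tensor-parallel constraint).
    return 1 << (needed - 1).bit_length()


def fits_single_node(required_gib: float) -> bool:
    """True if the requirement fits inside one assumed 8-GPU node."""
    return min_gpus(required_gib) <= GPUS_PER_NODE


if __name__ == "__main__":
    # VRAM totals taken from the table at the top of this page.
    for label, gib in [("FP16", 900.6), ("FP8", 515.1), ("INT4", 322.3)]:
        print(label, min_gpus(gib), fits_single_node(gib))
```

With these assumed values the sketch arrives at 8, 8 and 4 GPUs for the three table rows, which matches the table, but that should not be read as a description of how the calculator works.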
FP8, INT4 and the KV cache at 128K
Serving 405B at full BF16 is rarely practical. FP8 roughly halves the memory footprint with minimal quality loss and suits Hopper and Blackwell tensor cores natively, while INT4 compresses it further for cost-sensitive deployments. Precision is your biggest lever on GPU count and power. Separately, the 128K context window and concurrent users grow the KV cache — kept smaller here by grouped-query attention's eight KV heads, but still significant. The calculator splits weight VRAM from KV cache so you see where headroom goes.
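The arithmetic behind the weight and KV cache split can be sketched in a few lines. The layer count and the eight KV heads come from this page; the head dimension of 128 is an assumption taken from the published model configuration, "128K" is taken to mean 131,072 tokens, and the KV cache is assumed to be stored at 16 bits. Quantisation scales, activations and runtime overhead are ignored, so treat the results as lower bounds rather than calculator output.

```python
GIB = 1024 ** 3

PARAMS = 405e9   # parameter count, from the text above
LAYERS = 126     # transformer layers, from the text above
KV_HEADS = 8     # grouped-query attention KV heads, from the text above
HEAD_DIM = 128   # assumption: per-head dimension from the published model config

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gib(precision: str) -> float:
    """Approximate weight memory in GiB, ignoring quantisation scales and overhead."""
    return PARAMS * BYTES_PER_PARAM[precision] / GIB


def kv_cache_gib(context_tokens: int, concurrent_seqs: int, kv_bytes: float = 2.0) -> float:
    """KV cache in GiB when every sequence fills the given context length."""
    # Keys and values (factor 2), for every layer and every KV head.
    bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * kv_bytes
    return bytes_per_token * context_tokens * concurrent_seqs / GIB


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(precision, round(weight_gib(precision), 1), "GiB of weights")
    print("KV cache, one 128K sequence:", kv_cache_gib(131072, 1), "GiB")
```

Under these assumptions a single sequence at the full context length needs about 63 GiB of KV cache, so a handful of concurrent long-context users can rival the INT4 weights themselves.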
Power, cooling and UK data-centre fit
A full accelerator node running a dense frontier model draws serious power and rejects serious heat, and that shapes where it can live. Many UK colocation halls and on-premise rooms were never provisioned for these rack densities, so power feeds, PDUs and cooling — increasingly direct-liquid — become the real constraint before GPU availability does. The calculator estimates node power and cooling load so you can check it against your facility. Servnet can supply the rack, power and liquid-cooling infrastructure alongside the compute and interconnect.
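A facility check of this kind can be sketched as follows. The figures are illustrative assumptions, not calculator output: 700 W is the commonly quoted maximum board power for an H200 SXM module, and the 4,000 W host allowance for CPUs, switches, NICs, fans and storage is a hypothetical round number. The conversion of roughly 3.412 BTU/hr per watt is standard.

```python
GPU_TDP_W = 700               # assumption: maximum board power per H200 SXM
HOST_OVERHEAD_W = 4000        # hypothetical allowance for the rest of the node
BTU_PER_HOUR_PER_WATT = 3.412 # 1 W of load is about 3.412 BTU/hr of heat


def node_power_w(gpu_count: int) -> int:
    """Estimated peak node power in watts under the assumptions above."""
    return gpu_count * GPU_TDP_W + HOST_OVERHEAD_W


def cooling_btu_per_hour(power_w: float) -> float:
    """Heat to reject, assuming all electrical power ends up as heat."""
    return power_w * BTU_PER_HOUR_PER_WATT


def fits_rack(power_w: float, rack_budget_kw: float) -> bool:
    """True if the node's estimated power fits within the rack's power budget."""
    return power_w / 1000 <= rack_budget_kw


if __name__ == "__main__":
    power = node_power_w(8)
    print(power, "W,", round(cooling_btu_per_hour(power)), "BTU/hr")
    print("Fits a 7 kW rack:", fits_rack(power, 7.0))
```

The point of the check is the comparison rather than the exact wattage: a rack provisioned for a few kilowatts may not take even one such node, whatever the precise overhead figure turns out to be.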
FAQs
Can I run Llama 3.1 405B on a single GPU?
No. As a dense 405-billion-parameter model, its weights need roughly 203 GB even at INT4 (405 billion parameters at half a byte each), which is more than the 141 GB of a single H200, and far more at FP8 or BF16. It is designed to be sharded across a full multi-GPU HGX or DGX-class node using tensor and pipeline parallelism. The calculator above shows the minimum GPU count for your chosen precision and context length.
How much does quantisation reduce the hardware for 405B?
Substantially. Moving from BF16 to FP8 roughly halves the weight memory, and INT4 roughly halves it again, which can drop your GPU count and power draw significantly. FP8 preserves quality well on Hopper and Blackwell hardware; INT4 trades a little accuracy for cost. Use the precision toggle in the calculator to compare footprints side by side.
Does the 128K context window change my GPU requirement?
It can. Long contexts and many concurrent users grow the KV cache, which consumes VRAM on top of the weights. Llama 3.1 405B's grouped-query attention with eight KV heads keeps this smaller than full multi-head attention would, but heavy long-context serving still adds meaningful memory. The calculator breaks out KV cache separately so you can size for your real workload.