DeepSeek-R1 is unusual: it activates only about 37 billion of its 671 billion parameters per token, yet every expert still has to be resident in GPU memory. That makes the weight footprint, not the attention cache, the thing that decides your hardware. Multi-head Latent Attention (MLA) compresses the KV cache so aggressively that even very long reasoning traces stay cheap, leaving quantisation and GPU count as the real levers. Use the calculator above to turn your workload into an indicative, UK-priced build.
| Precision | GPUs (H200) | VRAM | Throughput | From |
|---|---|---|---|---|
| FP16 | 16× | 1293.3 GiB | 19,926 tok/s | £12,449/mo |
| FP8 | 8× | 655.9 GiB | 19,926 tok/s | £6,321/mo |
| INT4 | 4× | 337.2 GiB | 19,926 tok/s | £3,741/mo |
All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.
Why MoE weights, not attention, drive the VRAM
R1 routes each token to just 8 of the 256 routed experts in each MoE layer, so compute per token is modest for a model this size. Memory is a different story: because any expert can be selected next, the entire 671B-parameter set must sit in VRAM at once. That is why R1 is a multi-GPU proposition rather than a single-card one, and why the biggest saving on your build comes from precision. Drop from 16-bit weights to FP8 or a 4-bit quantisation and the weight footprint falls to roughly a half or a quarter, taking the H200 count in the table above from 16 to 8 or 4.
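The arithmetic behind that paragraph can be sketched in a few lines. This is a rough, weights-only estimate rather than the calculator's own method: it assumes 141 GB of memory per H200 and assumes the GPU count is rounded up to a power of two for tensor parallelism. The function names are illustrative only.

```python
import math

GIB = 1024 ** 3
TOTAL_PARAMS = 671e9   # total parameters, as stated in the text
H200_BYTES = 141e9     # assumption: 141 GB of HBM per H200

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_gib(precision: str) -> float:
    """Weights-only footprint in GiB (no KV cache, no runtime overhead)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / GIB


def gpus_needed(precision: str) -> int:
    """Smallest power-of-two GPU count whose combined memory holds the weights.

    Power-of-two rounding is an assumption made for this sketch.
    """
    weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM[precision]
    minimum = math.ceil(weight_bytes / H200_BYTES)
    return 1 << (minimum - 1).bit_length()


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gib(precision):.1f} GiB of weights, "
          f"{gpus_needed(precision)}x H200")
```

Under these assumptions the GPU counts come out at 16, 8 and 4, in line with the table. The weights-only footprints are somewhat below the table's VRAM figures, which also have to cover the KV cache and overhead.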
MLA makes long context almost free
Most large models pay for context in KV cache that balloons with sequence length and concurrency. DeepSeek-R1 uses Multi-head Latent Attention, which projects keys and values into a compact latent space before caching them, shrinking per-token KV memory to a fraction of a conventional attention model's. The practical upshot: R1's long reasoning chains and large context windows add very little to VRAM. In the sizing above you can push context and simultaneous users hard and watch the weights, not the cache, stay firmly in control of the total.
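To put rough numbers on that, here is a minimal sketch comparing per-token cache size under MLA with a hypothetical conventional multi-head attention model of the same shape. The layer count, latent width, rotary width and head dimensions are assumptions taken from the publicly released DeepSeek-V3/R1 configuration and should be checked against the model's config file. The 16-bit cache entries and the example workload are also assumptions.

```python
GIB = 1024 ** 3

# Assumed architecture values; verify against the published model config.
LAYERS = 61
KV_LATENT_DIM = 512    # compressed key/value latent cached per token
ROPE_DIM = 64          # decoupled rotary key cached alongside the latent
HEADS = 128
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # assumption: 16-bit cache entries


def mla_bytes_per_token() -> int:
    """MLA caches one latent vector plus one rotary key per layer."""
    return LAYERS * (KV_LATENT_DIM + ROPE_DIM) * BYTES_PER_VALUE


def mha_bytes_per_token() -> int:
    """Hypothetical baseline: full keys and values for every head, per layer."""
    return LAYERS * 2 * HEADS * HEAD_DIM * BYTES_PER_VALUE


def cache_gib(bytes_per_token: int, context_tokens: int, users: int) -> float:
    """Total cache for `users` concurrent sequences of `context_tokens` each."""
    return bytes_per_token * context_tokens * users / GIB


context, users = 32_768, 16   # illustrative workload, not a recommendation
print(f"MLA: {cache_gib(mla_bytes_per_token(), context, users):.1f} GiB")
print(f"MHA: {cache_gib(mha_bytes_per_token(), context, users):.1f} GiB")
```

With these assumed values, 16 users at 32,768 tokens each need about 34 GiB of cache under MLA, against about 1,952 GiB for the full multi-head baseline. Many dense models use grouped-query attention, which narrows that gap, so treat the baseline as an upper bound for illustration.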
High-VRAM GPUs, NVLink and quantisation
Fitting R1's resident weights favours the largest-memory accelerators — H200-class cards let you span the model across fewer GPUs, which keeps the tensor-parallel shards talking over high-bandwidth NVLink rather than slower fabric. Fewer, fatter GPUs generally beat many small ones for a model that must be sharded whole. Where budget or rack space is tight, aggressive quantisation is the alternative lever, trading a little quality for a dramatically smaller footprint. The calculator shows both routes so you can weigh cards, precision, power and cooling against each other.
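As a quick check on how the builds in the table use each card, the sketch below divides the listed VRAM by the GPU count and compares it with a single H200. It assumes 141 GB per card and a perfectly even split across GPUs, which real tensor-parallel deployments only approximate. The `BUILDS` dictionary simply restates the table.

```python
GIB = 1024 ** 3
H200_GIB = 141e9 / GIB   # assumption: 141 GB per card, about 131.3 GiB

# (GPU count, total VRAM in GiB), copied from the table on this page.
BUILDS = {
    "FP16": (16, 1293.3),
    "FP8": (8, 655.9),
    "INT4": (4, 337.2),
}


def per_gpu_load(precision: str) -> tuple[float, float]:
    """Return (GiB used per GPU, GiB of headroom per GPU), assuming an even split."""
    gpus, total_gib = BUILDS[precision]
    used = total_gib / gpus
    return used, H200_GIB - used


for precision in BUILDS:
    used, headroom = per_gpu_load(precision)
    print(f"{precision}: {used:.1f} GiB used per GPU, {headroom:.1f} GiB headroom")
```

On these figures each card carries roughly 81 to 84 GiB whichever precision you pick, leaving some room per GPU for activations and batching.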
FAQs
How much VRAM does DeepSeek-R1 need?
Enough to hold all 671 billion parameters plus a small MLA KV cache and overhead, so it scales with precision far more than with context. In the table above that is about 1,293 GiB at FP16, 656 GiB at FP8 and 337 GiB at INT4: FP8 roughly halves the 16-bit footprint, and 4-bit quantisation cuts it to about a quarter. The calculator above shows the estimated VRAM and GPU count for your chosen precision, context and concurrency.
Can I run DeepSeek-R1 on a single GPU?
Not at usable precision. Because R1 is a Mixture-of-Experts model, every expert must be resident in VRAM even though only about 37B parameters activate per token, so the full weight set far exceeds any single accelerator. Realistic self-hosting means several high-VRAM GPUs linked over NVLink, or heavy quantisation to shrink the footprint. Enter your requirements above for a concrete build.
Does a long context window make DeepSeek-R1 more expensive to host?
Barely. Multi-head Latent Attention compresses the KV cache into a small latent representation, so R1's memory cost per token of context is far lower than a comparable dense model's. Long reasoning traces and large context windows add little to total VRAM — the model weights dominate. You can raise the context and concurrency in the calculator and see this effect directly in the VRAM breakdown.