The H100 and H200 share the same Hopper architecture, the same FP8 Transformer Engine and the same 700W board power, so per-GPU compute is effectively identical. What changes is memory: in the SXM parts, the H200 replaces the H100's 80 GB of HBM3 (about 3.35 TB/s) with 141 GB of HBM3e (about 4.8 TB/s). For inference, where the token-generation phase is memory-bandwidth bound and the KV cache eats VRAM, that difference decides how many GPUs you need, how long a context you can serve, and what it costs.
| Model | H100 GPUs needed | H200 GPUs needed | Estimated VRAM (FP16) |
|---|---|---|---|
| Llama 3.1 8B | 1× | 1× | 48.9 GiB |
| Mistral Small 3 (24B) | 2× | 1× | 86.6 GiB |
| Llama 3.3 70B | 4× | 2× | 216.7 GiB |
| Qwen2.5 72B | 4× | 2× | 220.7 GiB |
| Mixtral 8x7B | 2× | 1× | 122.4 GiB |
All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.
VRAM headroom: where the H200 pulls ahead
Inference VRAM is model weights plus the KV cache plus overhead, and the KV cache grows with context length and concurrency. The H200's 141 GB of HBM3e, against 80 GB of HBM3 on the H100, gives you roughly three-quarters more room per GPU, so a model that spills across several H100s can often sit on fewer H200s, or the same model can serve much longer prompts and more concurrent users. For long-context RAG, agentic pipelines or MoE models with heavy weight residency, that extra headroom is frequently the deciding factor rather than raw compute.
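The weights-plus-KV-cache-plus-overhead breakdown can be sketched in a few lines of Python. This is a rough planning sketch, not the method behind the table above: the function names, the flat 10% overhead fraction and the option to round GPU counts up to a power of two (a common tensor-parallel layout) are all assumptions made for illustration, and the capacities are the nominal 80 GB and 141 GB figures rather than usable memory.

```python
import math

GIB = 1024 ** 3  # bytes per GiB


def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                   concurrent_seqs, bytes_per_elem=2):
    """KV cache size: keys and values for every layer, token and sequence."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return per_token * context_tokens * concurrent_seqs


def estimate_vram_gib(params_billion, layers, kv_heads, head_dim,
                      context_tokens, concurrent_seqs,
                      bytes_per_elem=2, overhead_fraction=0.10):
    """Weights + KV cache, inflated by an assumed flat overhead fraction."""
    weights = params_billion * 1e9 * bytes_per_elem
    kv = kv_cache_bytes(layers, kv_heads, head_dim, context_tokens,
                        concurrent_seqs, bytes_per_elem)
    return (weights + kv) * (1 + overhead_fraction) / GIB


def gpus_needed(vram_gib, gpu_capacity_gb, power_of_two=True):
    """GPUs required by capacity alone, optionally rounded up to 1, 2, 4, 8..."""
    capacity_gib = gpu_capacity_gb * 1e9 / GIB  # vendor GB -> GiB
    count = max(1, math.ceil(vram_gib / capacity_gib))
    if power_of_two:
        count = 1 << (count - 1).bit_length()
    return count


if __name__ == "__main__":
    # VRAM estimates taken from the table above; 80 and 141 are nominal GB.
    for name, vram in [("Llama 3.3 70B", 216.7), ("Mixtral 8x7B", 122.4)]:
        print(name, "H100:", gpus_needed(vram, 80), "H200:", gpus_needed(vram, 141))
```

Fed the table's VRAM estimates, the capacity-only count with power-of-two rounding lands on the same GPU counts the table lists. The `estimate_vram_gib` helper is there to show how context length and concurrency move the total; its output depends heavily on the assumed overhead and will not match the table's figures exactly.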
Bandwidth and throughput: the decode phase
Token generation is memory-bandwidth limited: each new token re-reads the weights and KV cache, so throughput tracks bandwidth closely. The H200's HBM3e delivers about 4.8 TB/s against about 3.35 TB/s for the H100's HBM3 (SXM parts), roughly 43% more at the same 700W, which lifts tokens-per-second per GPU on decode-heavy serving. In practice that means higher batch throughput or lower latency at a given batch size on the same node, most visible on large dense models and long sequences where the cache read dominates each step.
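A back-of-the-envelope way to see why decode tracks bandwidth: if every decode step must stream the weights once plus each sequence's KV cache, the step rate cannot exceed bandwidth divided by bytes read. The sketch below assumes exactly that simplified model. The function name is hypothetical, the bandwidth constants are nominal datasheet peaks for the SXM parts, and compute time, kernel overhead and interconnect traffic are ignored, so the results are ceilings rather than predictions.

```python
H100_TB_S = 3.35  # nominal HBM3 bandwidth, H100 SXM (assumed datasheet peak)
H200_TB_S = 4.8   # nominal HBM3e bandwidth, H200 SXM (assumed datasheet peak)


def decode_ceiling_tokens_per_sec(bandwidth_tb_s, weight_bytes,
                                  kv_bytes_per_seq=0, batch_size=1):
    """Upper bound on decode tokens/s when memory reads are the only cost.

    Each step reads the weights once and the KV cache of every sequence in
    the batch, and emits one token per sequence.
    """
    bytes_per_step = weight_bytes + kv_bytes_per_seq * batch_size
    steps_per_sec = bandwidth_tb_s * 1e12 / bytes_per_step
    return steps_per_sec * batch_size


if __name__ == "__main__":
    weights = 8e9 * 2  # an 8B-parameter model at 2 bytes per parameter (FP16)
    h100 = decode_ceiling_tokens_per_sec(H100_TB_S, weights)
    h200 = decode_ceiling_tokens_per_sec(H200_TB_S, weights)
    print(f"H100 ceiling: {h100:.0f} tok/s, H200 ceiling: {h200:.0f} tok/s, "
          f"ratio: {h200 / h100:.2f}x")
```

Under this model the H200-to-H100 ratio is simply the bandwidth ratio, whatever the model size. Real uplift is usually smaller, because part of each step is not memory-bound.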
Same node, same power, different economics
Both GPUs ship in the familiar 8-way HGX and DGX form factor at 700W each, so an H200 node draws roughly the same power, occupies the same rack space and needs the same cooling and NVLink fabric as an H100 node. Fewer GPUs for a memory-bound workload can mean fewer nodes, less NVLink pressure and lower facility load. The H200 costs more per card, so the real question is total system cost for your target throughput — which the calculator works out live.
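The "total system cost for your target throughput" comparison reduces to simple arithmetic once you have a GPU count per model replica and a throughput per replica. The sketch below is illustrative only and is not the calculator's logic: the function names are hypothetical, and the throughput and price numbers in the example are placeholder units, not measurements or quotations. Only the 4× and 2× replica sizes come from the table above (Llama 3.3 70B).

```python
import math


def gpus_for_target(target_tps, tps_per_replica, gpus_per_replica):
    """GPUs needed to reach a target tokens/s with whole model replicas."""
    replicas = math.ceil(target_tps / tps_per_replica)
    return replicas * gpus_per_replica


def total_cost(target_tps, tps_per_replica, gpus_per_replica, price_per_gpu):
    """GPU cost only; nodes, networking, power and cooling are left out."""
    return gpus_for_target(target_tps, tps_per_replica,
                           gpus_per_replica) * price_per_gpu


if __name__ == "__main__":
    target = 5000  # placeholder target, tokens/s
    # Placeholder per-replica throughput and relative prices (H100 = 100 units).
    h100 = total_cost(target, tps_per_replica=1000, gpus_per_replica=4,
                      price_per_gpu=100)
    h200 = total_cost(target, tps_per_replica=1000, gpus_per_replica=2,
                      price_per_gpu=130)
    print("H100 cost units:", h100, "H200 cost units:", h200)
```

The point of the structure is that the per-card premium only matters relative to the change in GPU count: if the H200 halves the GPUs per replica, it stays cheaper until its price approaches twice the H100's, and if the count does not change, the premium is pure extra cost.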
FAQs
Is the H200 faster than the H100 for inference?
For memory-bound token generation, yes — the H200's faster HBM3e lifts throughput at the same 700W. But the compute cores are identical Hopper silicon, so compute-bound work (like the prompt prefill phase) sees little gain. The biggest real-world wins come from long context, high concurrency and large models where memory bandwidth and capacity dominate.
When is the H100 still the better buy?
When your models and context lengths fit comfortably in H100 VRAM and you already meet your throughput target, the H100 usually offers better value per token. It's a strong choice for smaller or quantised models, shorter contexts and steady batch serving. Run both through the calculator: if the H200 doesn't reduce your GPU or node count, the H100's lower price often wins.
Does the H200 let me use fewer GPUs?
Often, for memory-bound workloads. Its larger HBM3e pool can hold weights and KV cache that would otherwise force sharding across extra H100s, so a model may fit on fewer H200s or serve longer context and higher concurrency per GPU. Whether that reduces total cost depends on your throughput target — the live tool shows the exact counts and cost either way.