Both the NVIDIA H100 and A100 ship in an 80GB configuration, so raw VRAM capacity rarely decides the winner. What separates them for LLM inference is memory bandwidth, FP8 support and power draw. This page explains where each card earns its keep, then lets you model your own deployment live so the numbers reflect your model, context length and concurrency rather than a marketing benchmark.
| Model | GPUs needed (H100 80GB) | GPUs needed (A100 80GB) | VRAM (FP16) |
|---|---|---|---|
| Llama 3.1 8B | 1× | 1× | 48.9 GiB |
| Mistral Small 3 (24B) | 2× | 2× | 86.6 GiB |
| Llama 3.3 70B | 4× | 4× | 216.7 GiB |
| Qwen2.5 72B | 4× | 4× | 220.7 GiB |
| Mixtral 8x7B | 2× | 2× | 122.4 GiB |
All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.
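The GPU counts in the table are consistent with a simple sizing rule, sketched below: divide the estimated footprint by per-card VRAM, round up, then round up again to a power of two, as tensor-parallel deployments commonly do. This is an assumption about how such counts can be derived, not a description of the calculator on this page. `gpus_needed` is a hypothetical helper, and treating an 80GB card as 80 GiB of usable memory ignores runtime overhead.

```python
import math


def gpus_needed(footprint_gib: float, gpu_vram_gib: float = 80.0) -> int:
    """Smallest power-of-two GPU count whose combined VRAM covers the footprint.

    Assumption: tensor-parallel sharding across 1, 2, 4, 8... cards, with the
    full nominal VRAM of each card available to the model.
    """
    if footprint_gib <= 0 or gpu_vram_gib <= 0:
        raise ValueError("footprint and per-GPU VRAM must be positive")
    minimum = math.ceil(footprint_gib / gpu_vram_gib)
    # Round up to the next power of two (1 -> 1, 2 -> 2, 3 -> 4, 5 -> 8).
    return 1 << (minimum - 1).bit_length()


# FP16 footprints taken from the table above.
footprints_gib = {
    "Llama 3.1 8B": 48.9,
    "Mistral Small 3 (24B)": 86.6,
    "Llama 3.3 70B": 216.7,
    "Qwen2.5 72B": 220.7,
    "Mixtral 8x7B": 122.4,
}

for model, gib in footprints_gib.items():
    print(f"{model}: {gpus_needed(gib)}x 80GB GPU")
```

Because both cards offer 80GB, the rule gives the same count for H100 and A100 at FP16; the difference between them shows up in throughput rather than in card count.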
Why the H100 pulls ahead on token throughput
LLM decode is memory-bandwidth bound: every generated token re-reads the model weights and the growing KV cache from VRAM. The H100 SXM's HBM3 delivers about 3.35 TB/s against roughly 2.0 TB/s for the A100 80GB's HBM2e, and that gap of about 1.6× alone yields materially higher tokens per second on the same model. Add the Hopper Transformer Engine, native FP8 and NVLink 4's higher inter-GPU bandwidth for sharded models, and the H100 typically lands two to three times the A100's inference throughput and, with FP8, can do so on fewer cards.
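The bandwidth argument can be turned into a rough ceiling: if each decode step must read all weights plus the KV cache once, single-stream tokens per second cannot exceed bandwidth divided by bytes read. The sketch below uses NVIDIA's published peak bandwidth figures for the SXM parts and an 8-billion-parameter model at FP16. The function name is hypothetical, and real throughput sits below this bound because of kernel overheads, batching effects and sharding.

```python
# Peak memory bandwidth in GB/s from NVIDIA's public datasheets (SXM parts).
# These are theoretical peaks, not measured serving figures.
PEAK_BANDWIDTH_GB_S = {"H100 SXM": 3350.0, "A100 80GB SXM": 2039.0}


def decode_ceiling_tokens_per_s(
    bandwidth_gb_s: float, weights_gb: float, kv_cache_gb: float = 0.0
) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes every generated token reads the full weights and KV cache once.
    """
    bytes_read_gb = weights_gb + kv_cache_gb
    if bytes_read_gb <= 0:
        raise ValueError("bytes read per token must be positive")
    return bandwidth_gb_s / bytes_read_gb


# 8e9 parameters at 2 bytes each (FP16) is 16 GB of weights.
weights_gb = 8e9 * 2 / 1e9

ceilings = {
    gpu: decode_ceiling_tokens_per_s(bw, weights_gb)
    for gpu, bw in PEAK_BANDWIDTH_GB_S.items()
}
ratio = ceilings["H100 SXM"] / ceilings["A100 80GB SXM"]

for gpu, tps in ceilings.items():
    print(f"{gpu}: at most {tps:.0f} tokens/s per stream")
print(f"Bandwidth-only advantage: {ratio:.2f}x")
```

Bandwidth alone accounts for roughly a 1.6× advantage under these assumptions; any gain beyond that has to come from FP8, the Transformer Engine and faster interconnect.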
FP8, KV cache and larger batches
For floating-point serving the A100 tops out at BF16/FP16; it has no FP8 hardware path. The H100 runs FP8 natively, which halves the bytes per weight and per KV-cache element relative to FP16. In practice that lets you hold a bigger KV cache and push larger batch sizes at a given context length, so concurrency scales further before you add another GPU. For high-throughput chat or RAG endpoints where KV cache is the limiter, that headroom is often the deciding factor. It can also pull cost per million tokens below the A100's despite the higher sticker price.
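To see how halving the bytes per element translates into concurrency, the sketch below computes the KV-cache size per token and the number of full-context sequences that fit in a fixed memory budget. The layer and head counts are the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128); the 20 GiB cache budget and the 8,192-token context are hypothetical values chosen for illustration, and the function names are not from any library.

```python
def kv_bytes_per_token(
    n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_element: int
) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def max_sequences(budget_gib: float, per_token_bytes: int, context_len: int) -> int:
    """Whole sequences of `context_len` tokens that fit in the cache budget."""
    budget_bytes = int(budget_gib * 2**30)
    return budget_bytes // (per_token_bytes * context_len)


# Llama 3.1 8B attention shape (grouped-query attention with 8 KV heads).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Hypothetical deployment: 20 GiB left for KV cache, 8,192-token contexts.
BUDGET_GIB, CONTEXT = 20.0, 8192

for label, width in (("FP16", 2), ("FP8", 1)):
    per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, width)
    fits = max_sequences(BUDGET_GIB, per_token, CONTEXT)
    print(f"{label}: {per_token // 1024} KiB per token, {fits} concurrent sequences")
```

Under these assumptions the same budget holds twice as many full-length sequences at FP8 as at FP16, which is the headroom the paragraph above describes.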
Where the A100 still makes sense
The A100 draws 400W against up to 700W for the H100 SXM, so it is easier to rack in air-cooled UK facilities without liquid cooling or dense-power upgrades. For steady FP16/BF16 serving of mid-sized models, batch or offline inference, or budget-constrained pilots, its lower acquisition and running cost frequently wins on total cost of ownership. If your latency targets are comfortable and you are not chasing FP8 efficiency, the A100 remains a genuinely capable, cost-effective serving card. Size both options against your real traffic in the calculator below.
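As a rough illustration of the power side of that comparison, the sketch below converts the 400W and 700W board-power figures quoted on this page into GPU-only load and annual energy for an 8-GPU node. It assumes the cards run at full rated power all year and ignores CPUs, fans, networking and cooling overhead, so treat it as a lower bound on node consumption; `node_gpu_energy_kwh` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365


def node_gpu_energy_kwh(
    gpu_watts: float, n_gpus: int = 8, utilisation: float = 1.0
) -> float:
    """Annual energy in kWh drawn by the GPUs alone in one node.

    `utilisation` scales rated power down for nodes that are not always busy.
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return gpu_watts * n_gpus * HOURS_PER_YEAR * utilisation / 1000


for gpu, watts in (("A100", 400), ("H100 SXM", 700)):
    load_kw = watts * 8 / 1000
    energy = node_gpu_energy_kwh(watts)
    print(f"{gpu}: {load_kw:.1f} kW GPU load per node, {energy:,.0f} kWh per year")
```

Multiplying the annual figure by your own electricity tariff and facility overhead gives a running-cost estimate to set against each card's throughput.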
FAQs
Is the H100 always worth the premium over the A100?
No. The H100 justifies its cost when you need FP8 efficiency, maximum tokens per second, or the highest concurrency per rack. For relaxed latency targets, mid-sized models or budget pilots, the A100's lower price and 400W draw often deliver better total cost of ownership. The calculator quantifies both so you can decide on your actual workload.
Do H100 and A100 have the same VRAM for LLM inference?
In the 80GB variants, yes, capacity is comparable. The difference is how fast that memory is read and whether FP8 is available. The H100's higher HBM3 bandwidth and native FP8 let it serve larger KV caches and batches efficiently, so equal VRAM does not mean equal serving capacity in practice.
Which GPU is easier to deploy in a UK data centre?
The A100 is generally simpler. At 400W it fits comfortably in air-cooled halls and standard rack power budgets. The H100 SXM draws up to 700W and, in dense 8-GPU nodes, typically expects tight inlet temperatures and often liquid cooling. Factor power and cooling readiness into your choice, not just GPU price.