UK’s trusted IT infrastructure partner since 2003
Servnet

H100 vs A100 for LLM inference: which GPU should you actually buy?

The H100 is roughly two to three times faster at inference and unlocks native FP8, but the A100 is cheaper, cooler and still a very capable FP16/BF16 serving card. Use the calculator below to see estimated GPU counts, VRAM, power and cost for your model.

Both the NVIDIA H100 and A100 ship in an 80GB configuration, so raw VRAM capacity rarely decides the winner. What separates them for LLM inference is memory bandwidth, FP8 support and power draw. This page explains where each card earns its keep, then lets you model your own deployment live so the numbers reflect your model, context length and concurrency rather than a marketing benchmark.

Model                    On H100   On A100   VRAM (FP16)
Llama 3.1 8B             1×        1×        48.9 GiB
Mistral Small 3 (24B)    2×        2×        86.6 GiB
Llama 3.3 70B            4×        4×        216.7 GiB
Qwen2.5 72B              4×        4×        220.7 GiB
Mixtral 8x7B             2×        2×        122.4 GiB
Figure (GPUs required by precision): GPUs to serve each model on H100 (FP16, 8k context, 32 users).
Figure (GPUs required by precision): the same workloads on A100. Both cards carry 80GB, so the GPU counts at FP16 are identical.
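
The page does not state how the calculator turns a VRAM requirement into a GPU count. One simple rule that happens to reproduce the table above is sketched here: divide the requirement by the usable memory per card, then round up to a power of two so the model shards evenly. The 90% usable fraction, the power-of-two rounding and the function name are all assumptions for illustration, not the calculator's actual method.

```python
import math

GPU_VRAM_GIB = 80        # nominal capacity of both the H100 and A100 80GB cards
USABLE_FRACTION = 0.9    # assumption: keep ~10% back for runtime overhead

def gpus_needed(required_gib: float) -> int:
    """Estimate the GPU count for a given total VRAM requirement (hypothetical rule)."""
    usable_gib = GPU_VRAM_GIB * USABLE_FRACTION
    minimum = max(1, math.ceil(required_gib / usable_gib))
    # Assumption: round up to the next power of two for even tensor-parallel sharding.
    return 1 << (minimum - 1).bit_length()

# VRAM figures taken from the table (FP16, 8k context, 32 users).
table_vram_gib = {
    "Llama 3.1 8B": 48.9,
    "Mistral Small 3 (24B)": 86.6,
    "Llama 3.3 70B": 216.7,
    "Qwen2.5 72B": 220.7,
    "Mixtral 8x7B": 122.4,
}

for model, vram in table_vram_gib.items():
    print(f"{model}: {gpus_needed(vram)}x 80GB GPU")
```

Because both cards offer the same capacity, this rule gives the same count for either GPU at FP16; the difference only appears once precision or throughput enters the calculation.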

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

Why the H100 pulls ahead on token throughput

LLM decode is memory-bandwidth bound: every generated token re-reads the model weights and the growing KV cache from VRAM. The H100 SXM's HBM3 delivers about 3.35 TB/s against roughly 2.0 TB/s for the A100 80GB's HBM2e, and that gap alone yields materially higher tokens per second on the same model. Add the Hopper Transformer Engine, native FP8 and NVLink 4's higher inter-GPU bandwidth for sharded models, and the H100 typically delivers two to three times the A100's inference throughput per card.
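
The bandwidth argument can be put into numbers with a back-of-envelope ceiling: if each decode step has to stream the weights once, single-stream tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below is a rough upper bound, not a benchmark. The bandwidth values are NVIDIA's published peak figures for the SXM parts, the 16 GB weight size assumes an 8B-parameter model at 2 bytes per parameter, and the function name is illustrative.

```python
# Rough ceiling on single-stream decode speed for a bandwidth-bound model.
# Assumptions: published peak bandwidth (real sustained bandwidth is lower),
# and each generated token reads all weights plus the current KV cache once.

PEAK_BANDWIDTH_GB_S = {
    "A100 80GB SXM": 2039.0,  # HBM2e
    "H100 SXM": 3350.0,       # HBM3
}

def decode_ceiling_tokens_per_s(bandwidth_gb_s: float,
                                weights_gb: float,
                                kv_cache_gb: float = 0.0) -> float:
    """Tokens/s ceiling = bytes per second / bytes read per token."""
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)

weights_gb = 8e9 * 2 / 1e9  # 8B parameters at FP16/BF16 = 16 GB

for gpu, bandwidth in PEAK_BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tokens_per_s(bandwidth, weights_gb)
    print(f"{gpu}: about {ceiling:.0f} tokens/s ceiling per stream")

ratio = PEAK_BANDWIDTH_GB_S["H100 SXM"] / PEAK_BANDWIDTH_GB_S["A100 80GB SXM"]
print(f"Bandwidth alone: about {ratio:.2f}x")
```

On these peak figures, bandwidth on its own accounts for roughly 1.6×; the remainder of a two-to-three-times gap would have to come from FP8 and the other Hopper features.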

FP8, KV cache and larger batches

The A100 tops out at BF16/FP16 for serving; it has no FP8 hardware path. The H100 runs FP8 natively, which halves the bytes per weight and per KV-cache element relative to FP16. In practice that lets you hold a bigger KV cache and push larger batch sizes at a given context length, so concurrency scales further before you add another GPU. For high-throughput chat or RAG endpoints where KV cache is the limiter, that headroom is often the deciding factor. It can also pull cost per million tokens below the A100's despite the higher sticker price.
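
To see what halving the KV-cache element size means for concurrency, here is a minimal sizing sketch. It uses the standard per-token KV formula (keys and values, for every layer and KV head) with the Llama 3.1 8B architecture (32 layers, 8 KV heads, head dimension 128) and the 8k-context, 32-user scenario from the table captions. The function names are illustrative, and real serving stacks add overhead that this ignores.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # Factor 2 covers the separate key and value tensors stored in each layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def kv_cache_gib(context_tokens: int, users: int, bytes_per_elem: int,
                 layers: int = 32, kv_heads: int = 8, head_dim: int = 128) -> float:
    """Total KV cache if every user fills the full context (defaults: Llama 3.1 8B)."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)
    return per_token * context_tokens * users / 2**30

fp16_gib = kv_cache_gib(context_tokens=8192, users=32, bytes_per_elem=2)
fp8_gib = kv_cache_gib(context_tokens=8192, users=32, bytes_per_elem=1)
print(f"FP16 KV cache: {fp16_gib:.1f} GiB")
print(f"FP8 KV cache:  {fp8_gib:.1f} GiB")

# Users that fit in the same memory budget once each element is one byte.
users_at_fp8 = int(fp16_gib // kv_cache_gib(8192, 1, bytes_per_elem=1))
print(f"Same budget at FP8 holds {users_at_fp8} users instead of 32")
```

Under these assumptions the KV cache drops from 32 GiB to 16 GiB, so the same memory budget holds twice as many full-context users before another GPU is needed.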

Where the A100 still makes sense

The A100 draws up to 400W against the H100 SXM's 700W, so it is easier to rack in air-cooled UK facilities without liquid cooling or dense-power upgrades. For steady FP16/BF16 serving of mid-sized models, batch or offline inference, or budget-constrained pilots, its lower acquisition and running cost frequently wins on total cost of ownership. If your latency targets are comfortable and you are not chasing FP8 efficiency, the A100 remains a genuinely capable, cost-effective serving card. Size both options against your real traffic in the calculator below.
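
The running-cost side of that comparison is simple arithmetic. The sketch below converts the 400W and 700W figures quoted on this page into annual energy for a 4-GPU deployment. It assumes every GPU sits at its full rated draw all year, which is an upper bound; the electricity price is a placeholder to replace with your own tariff, and the PUE parameter (facility overhead multiplier) defaults to 1.0, meaning cooling overhead is ignored.

```python
HOURS_PER_YEAR = 24 * 365       # 8760
PRICE_GBP_PER_KWH = 0.25        # placeholder assumption, not a quoted tariff

def annual_energy_kwh(gpu_watts: float, gpu_count: int = 1, pue: float = 1.0) -> float:
    """Upper-bound yearly energy, assuming constant draw at rated power."""
    return gpu_watts * gpu_count * HOURS_PER_YEAR * pue / 1000

for name, watts in {"A100": 400, "H100 SXM": 700}.items():
    kwh = annual_energy_kwh(watts, gpu_count=4)
    cost = kwh * PRICE_GBP_PER_KWH
    print(f"4x {name}: {kwh:,.0f} kWh/year, about £{cost:,.0f} at the placeholder price")
```

Energy is only one part of total cost of ownership; because the H100 also produces more tokens per second, a fair comparison divides cost by throughput rather than comparing energy bills alone.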

FAQs

Is the H100 always worth the premium over the A100?

No. The H100 justifies its cost when you need FP8 efficiency, maximum tokens per second, or the highest concurrency per rack. For relaxed latency targets, mid-sized models or budget pilots, the A100's lower price and 400W draw often deliver better total cost of ownership. The calculator quantifies both so you can decide on your actual workload.

Do H100 and A100 have the same VRAM for LLM inference?

In the 80GB variants, yes, capacity is comparable. The difference is how fast that memory is read and whether FP8 is available. The H100's higher HBM3 bandwidth and native FP8 let it serve larger KV caches and batches efficiently, so equal VRAM does not mean equal serving capacity in practice.

Which GPU is easier to deploy in a UK data centre?

The A100 is generally simpler. At 400W it fits comfortably in air-cooled halls and standard rack power budgets. The H100 SXM draws up to 700W and, in dense 8-GPU nodes, typically expects tight inlet temperatures and often liquid cooling. Factor power and cooling readiness into your choice, not just GPU price.
