L40S vs H100 for AI inference: which GPU fits your model?

The L40S is the cost- and power-efficient pick for single-card small-to-mid models, while the NVLink-equipped H100 dominates multi-GPU 70B-class serving. Use the calculator below to size both for your exact workload.

Choosing between the L40S and H100 comes down to one question: does your model fit comfortably on a single card, or does it need to be split across several? The L40S pairs 48 GB of GDDR6 with modest power draw at a far lower price, which makes it a strong choice for standalone inference. The H100 brings 80 GB of HBM3, roughly four times the memory bandwidth, and NVLink, so tensor-parallel serving of large models scales cleanly. The interactive calculator reports the GPU count, VRAM headroom, power, cooling and cost for each option against your specific model and concurrency.

Model	GPUs on L40S	GPUs on H100	VRAM required (FP16)
Llama 3.1 8B	2×	1×	48.9 GiB
Mistral Small 3 (24B)	4×	2×	86.6 GiB
Llama 3.3 70B	8×	4×	216.7 GiB
Qwen2.5 72B	8×	4×	220.7 GiB
Mixtral 8x7B	4×	2×	122.4 GiB

GPU counts assume FP16 weights, an 8k-token context and 32 concurrent users.

The H100 has more VRAM per card, so the same workload usually needs fewer GPUs.
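To show how a VRAM requirement turns into a GPU count, here is a minimal Python sketch. It is not the calculator's method. The 90% usable fraction, the nominal 48 and 80 capacities treated as GiB, and the rounding up to a power of two (a common choice for tensor-parallel degrees) are all assumptions for illustration, and `gpus_needed` is a hypothetical helper name.

```python
import math

# Assumed per-card VRAM (nominal capacities, treated as GiB for simplicity).
CARD_VRAM_GIB = {"L40S": 48, "H100": 80}


def gpus_needed(required_gib: float, card: str, usable_fraction: float = 0.9) -> int:
    """Smallest power-of-two GPU count whose usable VRAM covers the requirement."""
    # Assumption: only part of each card is usable for weights and KV cache.
    usable_per_card = CARD_VRAM_GIB[card] * usable_fraction
    minimum = max(1, math.ceil(required_gib / usable_per_card))
    # Round up to the next power of two (1, 2, 4, 8, ...).
    return 1 << (minimum - 1).bit_length()


# VRAM figures copied from the table above (FP16, 8k context, 32 users).
for model, gib in [("Llama 3.1 8B", 48.9), ("Llama 3.3 70B", 216.7)]:
    print(model, gpus_needed(gib, "L40S"), gpus_needed(gib, "H100"))
```

Under these assumptions the sketch returns the same counts as the table for all five models, but a different headroom margin or a non-power-of-two layout would change the answer, so treat it as a first pass only.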

Size it precisely in the calculator →

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

VRAM and memory bandwidth: where the gap really bites

The L40S carries 48 GB of GDDR6, a generous capacity for a single-card design, but it delivers roughly a quarter of the H100's HBM3 memory bandwidth. For inference, bandwidth governs token generation speed because every decode step re-reads the model weights and KV cache from memory. On smaller models the L40S keeps latency acceptable, but as parameter counts and context lengths grow, the H100's much wider memory pipe pulls ahead sharply on throughput per card and sustained tokens per second.
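The bandwidth argument can be put into rough numbers. The sketch below assumes decode is purely memory-bound, so the ceiling on decode steps per second is bandwidth divided by the bytes streamed per step. The bandwidth values are spec-sheet peaks used as assumptions (the H100 figure is for the SXM part), and `decode_ceiling_steps_per_s` is a hypothetical helper, not a benchmark.

```python
# Assumed peak memory bandwidth in GB/s (spec-sheet figures; H100 is the SXM part).
BANDWIDTH_GBPS = {"L40S": 864, "H100": 3350}


def decode_ceiling_steps_per_s(card: str, weight_gb: float, kv_cache_gb: float = 0.0) -> float:
    """Upper bound on decode steps per second if each step streams weights + KV cache once."""
    gb_read_per_step = weight_gb + kv_cache_gb
    return BANDWIDTH_GBPS[card] / gb_read_per_step


# Hypothetical example: an 8B model in FP16 is roughly 16 GB of weights.
for card in BANDWIDTH_GBPS:
    print(card, round(decode_ceiling_steps_per_s(card, weight_gb=16.0), 1))
```

For 16 GB of weights this gives a ceiling of about 54 steps per second on the L40S and about 209 on the H100. Each step yields one token per sequence in the batch, and real throughput sits below the ceiling, but the ratio between the two cards stays close to four whatever the model size.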

NVLink: the deciding factor for 70B-class serving

This is the cleanest dividing line between the two. The L40S has no NVLink and communicates over PCIe Gen4, so when a model is too large for one card, tensor-parallel splits pay a heavy inter-GPU communication tax. The H100 connects its GPUs over NVLink, which offers roughly an order of magnitude more inter-GPU bandwidth than PCIe Gen4, so it distributes a 70B-class model across GPUs with far less overhead. If your serving plan needs multiple GPUs to hold one model, the H100 wins decisively.
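A rough way to see the communication tax is to count the bytes a tensor-parallel decode step has to exchange. The sketch below assumes two all-reduces per transformer layer and a ring all-reduce, and it counts bandwidth only, ignoring per-call latency and any hop through the host. The link speeds are theoretical per-direction peaks used as assumptions, and the 80-layer, 8192-wide shape is an assumed 70B-class model.

```python
# Assumed per-direction link bandwidth in GB/s (theoretical peaks, not measured).
LINK_GBPS = {"PCIe Gen4 x16": 32, "NVLink (H100)": 450}


def allreduce_ms_per_step(link: str, n_gpus: int, layers: int, hidden: int,
                          batch: int, bytes_per_elem: int = 2) -> float:
    """Bandwidth-only time for one decode step's all-reduces, in milliseconds."""
    # Two all-reduces per layer (after attention and after the MLP),
    # each over one activation vector per sequence in the batch.
    payload_bytes = 2 * layers * batch * hidden * bytes_per_elem
    # A ring all-reduce moves 2*(N-1)/N of the payload through each GPU's link.
    traffic_bytes = payload_bytes * 2 * (n_gpus - 1) / n_gpus
    return traffic_bytes / (LINK_GBPS[link] * 1e9) * 1e3


# Assumed 70B-class shape: 80 layers, hidden size 8192, FP16, 32 sequences, 4 GPUs.
for link in LINK_GBPS:
    print(link, round(allreduce_ms_per_step(link, 4, 80, 8192, 32), 2), "ms")
```

With these inputs the transfer time is about 3.9 ms per step over PCIe against about 0.28 ms over NVLink. Measured gaps can differ in either direction because latency, topology and overlap with compute are left out, so read this as an order-of-magnitude illustration rather than a prediction.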

Power, cooling and cost efficiency

Card for card, the L40S draws roughly half the power of an H100, which lightens rack power budgets, simplifies cooling and often lets it slot into existing PCIe servers without exotic thermal design. Combined with its much lower purchase price, that makes it the stronger cost-per-token choice for small-to-mid models and batch or offline inference. The H100 costs and consumes more, but earns it back on large-model latency and dense concurrency where nothing cheaper keeps up.
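To turn board power into a cooling load, the conversion is simple arithmetic: 1 W is about 3.412 BTU/hr, and one ton of cooling is 12,000 BTU/hr. The sketch below assumes 350 W per L40S and 700 W per H100 (SXM) as maximum board power and counts GPUs only, not CPUs, fans or power-supply losses. `gpu_heat_load` is a hypothetical helper name.

```python
WATTS_TO_BTU_PER_HR = 3.412   # 1 W of electrical load is about 3.412 BTU/hr of heat
BTU_PER_HR_PER_TON = 12_000   # 1 ton of cooling is 12,000 BTU/hr

# Assumed maximum board power in watts (H100 figure is for the SXM part).
BOARD_POWER_W = {"L40S": 350, "H100": 700}


def gpu_heat_load(card: str, gpu_count: int) -> dict:
    """GPU-only electrical load and the matching cooling requirement."""
    watts = BOARD_POWER_W[card] * gpu_count
    btu_per_hr = watts * WATTS_TO_BTU_PER_HR
    return {
        "kw": watts / 1000,
        "btu_per_hr": btu_per_hr,
        "tons": btu_per_hr / BTU_PER_HR_PER_TON,
    }


# Hypothetical example: four cards of each type.
for card in BOARD_POWER_W:
    print(card, gpu_heat_load(card, 4))
```

Four L40S cards come to 1.4 kW and about 4,800 BTU/hr, against 2.8 kW and about 9,600 BTU/hr for four H100s. The comparison is per card, so if a workload needs more L40S cards than H100s, compare totals for the actual GPU counts.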

FAQs

Can the L40S run a 70B model for inference?

It can run heavily quantised 70B weights on a single card, but with little room left for KV cache under real concurrency. Because the L40S lacks NVLink, splitting a 70B model across cards is inefficient over PCIe. For production 70B serving with headroom and low latency, the NVLink-equipped H100 is the sensible choice.
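The squeeze on KV cache can be estimated with a short calculation. This sketch assumes 4-bit weights, an FP16 KV cache, a nominal 48 GiB card and a Llama-3-70B-style shape (70e9 parameters, 80 layers, 8 KV heads, head size 128). It ignores activations and runtime overhead, so the real budget is smaller. The function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_token_budget(card_gib: float, params: float, weight_bits: int,
                    layers: int, kv_heads: int, head_dim: int) -> int:
    """Tokens of KV cache that fit once the weights are loaded (no other overhead)."""
    weight_bytes = params * weight_bits / 8
    free_bytes = card_gib * GIB - weight_bytes
    return max(0, int(free_bytes // kv_bytes_per_token(layers, kv_heads, head_dim)))


# Assumed shape for a 70B-class model with grouped-query attention.
tokens = kv_token_budget(48, 70e9, 4, layers=80, kv_heads=8, head_dim=128)
print(tokens, "tokens of KV cache, about", tokens // 8192, "sequences at 8k context")
```

Under these assumptions roughly fifty thousand tokens of KV cache fit beside the weights, which is about six full 8k sequences, well short of the 32 concurrent users used for the sizing table.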

Why does NVLink matter so much for the H100?

NVLink gives H100 GPUs very high direct bandwidth to each other, so a model spread across several cards exchanges activations quickly during tensor-parallel inference. The L40S relies on slower PCIe links instead. When one model must span multiple GPUs, that inter-GPU bandwidth becomes the bottleneck, and NVLink is precisely what keeps large-model serving fast.

When is the L40S the better buy over the H100?

When your model fits on one card. For small-to-mid models, single-GPU deployments, batch inference or many independent workers, the L40S delivers strong cost-per-token at roughly half the power and a much lower price. You only need the H100's HBM3 bandwidth and NVLink once a single model outgrows one GPU's memory.
