What GPU do you need to run Mistral Small 3 (24B)?

Mistral Small 3 is the 24B dense sweet-spot model that delivers near-frontier quality on one or two GPUs. Use the calculator below to size VRAM, GPUs, power and cost for your workload.

Mistral Small 3 (24B) has become a default choice for private, self-hosted assistants because it lands on the right side of nearly every hardware trade-off. It is a dense 23.6-billion-parameter transformer — not a mixture-of-experts — so every parameter is active on every token, which keeps latency predictable and deployment simple. It shines as a private assistant, a RAG back-end or a coding and document-processing engine where data must stay on UK soil. The calculator on this page turns your precision, context length and concurrency choices into exact GPU, VRAM, power and cost figures live, so the guidance below stays qualitative.

Reference build · Mistral Small 3 (24B) · FP16 · 32 users · 8k context

2× H100

86.6 GiB VRAM · 10U · 3.2 kW · 2,725 tok/s

£2,279/mo · £106,000 capex

Precision	GPUs (H100)	VRAM	Throughput	From
FP16	2×	86.6 GiB	2,725 tok/s	£2,279/mo
FP8	1×	64.2 GiB	2,725 tok/s	£1,720/mo
INT4	1×	53 GiB	5,451 tok/s	£1,720/mo

Mistral Small 3 (24B) at FP16, 8k context, 32 concurrent users — indicative.

H100 count by weight precision. Quantising cuts hardware sharply.

Size it precisely in the calculator →

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

Why the 24B dense design changes your VRAM maths

Because Mistral Small 3 is dense, its full weight set must sit in VRAM at once — there is no expert routing to shrink the active footprint, as you would get with a mixture-of-experts model. That makes the choice of numerical precision the single biggest lever on your hardware bill. Full-precision BF16 is the most VRAM-hungry option, while INT8 and 4-bit quantisation shrink the footprint dramatically with only modest quality loss. The calculator shows each tier so you can see exactly where the model tips from two GPUs onto one.

Context length, GQA and the KV cache

Mistral Small 3 uses grouped-query attention, with many query heads sharing a smaller set of key/value heads across its 40 layers. That design deliberately keeps the KV cache compact, which is why the model behaves so well at long context. Its 128K-token window is generous, but every concurrent request and every extra token of context adds to VRAM on top of the weights. If you plan to serve long documents or many simultaneous users, set realistic context and concurrency in the calculator — that KV headroom often decides single- versus dual-GPU.

One GPU or two, NVLink and power in a UK rack

Quantised, Mistral Small 3 fits a single high-VRAM accelerator; full precision or heavy concurrency pushes you to a pair. When you do split across two GPUs, an NVLink bridge helps tensor-parallel throughput far more than PCIe alone. For UK deployments, factor in 230V single- or three-phase supply, per-rack power budgets and the cooling headroom your colocation or on-prem room can actually sustain. The calculator estimates draw and thermal load so you can match cards to what your facility genuinely supports.

FAQs

Is Mistral Small 3 24B a mixture-of-experts model?

No. Mistral Small 3 is a dense 23.6-billion-parameter transformer, so all parameters are active on every token. Unlike a mixture-of-experts model, there is no routing to reduce the active footprint, which means the complete weight set must fit in VRAM. That trait keeps latency consistent and deployment straightforward, and it makes your choice of precision the main driver of GPU count.

Can Mistral Small 3 24B run on a single GPU?

Often, yes — that is a large part of its appeal. Quantised to 8-bit or 4-bit, the 24B weights fit comfortably on one high-VRAM accelerator with room for a useful KV cache. Full BF16 precision, long-context serving or high concurrency can push you to two GPUs. Enter your precision, context and user count in the calculator to see which side of that line your workload falls on.

How much does long context cost in VRAM for this model?

Mistral Small 3 uses grouped-query attention, sharing key/value heads to keep the KV cache small, so it handles its 128K window efficiently. Even so, context is not free: VRAM used by the cache scales with sequence length multiplied by concurrent requests, stacking on top of the weights. Long documents or many simultaneous users can add meaningful headroom needs — the calculator accounts for this when it sizes your GPUs.

AI / GPU Calculator →

Size any model + workload precisely.

NVIDIA DGX systems →

The 8-GPU platforms behind these builds.

GPU accelerators →

H100, H200, L40S and A100 specs.

IT finance calculator →

Finance the cluster — HP, lease or subscription.

Server room cooling →

Turn the kW load into BTU/hr and tons.