UK’s trusted IT infrastructure partner since 2003
Servnet

What GPU do you need to run Mixtral 8x7B?

Mixtral 8x7B is a sparse Mixture-of-Experts model: every expert must live in VRAM, yet only a fraction fires per token. That split makes it deceptively cheap to serve — if you size the hardware correctly. Use the live calculator below to get exact GPU counts, VRAM, power and cost for your deployment.

Mixtral 8x7B is unusually lopsided between memory and compute. It carries 46.7 billion parameters across eight experts, but a router selects just two of them per token, so only about 12.9 billion parameters are active at a time. It therefore runs at roughly the speed and cost of a ~13B dense model while delivering quality well beyond that size. The catch is memory: all eight experts have to be resident in GPU VRAM even though most sit idle at any moment. Get that footprint right and Mixtral is an efficient production model for UK teams; get it wrong and you either overspend or fall back on slow offloading.

Reference build · Mixtral 8x7B · FP16 · 32 users · 8k context
2× H100
122.4 GiB VRAM · 10U · 3.2 kW · 4,986 tok/s
£2,279/mo · £106,000 capex
Precision   GPUs (H100)   VRAM        Throughput     From
FP16        2×            122.4 GiB   4,986 tok/s    £2,279/mo
FP8         2×            78 GiB      9,972 tok/s    £2,279/mo
INT4        1×            55.8 GiB    9,972 tok/s    £1,720/mo
VRAM breakdown (122 GiB total): weights 87.0 GiB · KV cache 32.0 GiB · overhead 3.4 GiB
Mixtral 8x7B at FP16, 8k context, 32 concurrent users — indicative.
GPUs required by precision: FP16 2× H100 · FP8 2× H100 · INT4 1× H100
H100 count by weight precision. In this reference build FP8 still needs two cards; INT4 fits on one.

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

Why the MoE design changes how you size the GPU

With most models, VRAM and compute scale together. Mixtral breaks that link. Because it is a sparse Mixture-of-Experts, you must hold the weights for all eight experts in memory, so the VRAM footprint tracks the full 46.7B parameter count. But only two experts activate per token, so the arithmetic per forward pass matches a far smaller dense model. The practical upshot: sizing Mixtral is driven by memory rather than compute. You buy GPUs for their VRAM capacity and bandwidth first and raw FLOPS second, which is not how you would size a dense model with the same per-token compute.
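The weight term of that footprint is simple arithmetic: total parameters times bytes per weight. The sketch below is a minimal illustration rather than the calculator's actual method; the function name `weight_gib` is made up for this example, and it ignores KV cache and runtime overhead.

```python
# Weight memory follows the TOTAL parameter count, not the active count.
TOTAL_PARAMS = 46.7e9  # all eight experts plus shared layers (figure from the text)
GIB = 2**30            # bytes per GiB

def weight_gib(params: float, bits_per_weight: int) -> float:
    """Memory needed just to hold the weights, in GiB (hypothetical helper)."""
    return params * bits_per_weight / 8 / GIB

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gib(TOTAL_PARAMS, bits):.1f} GiB of weights")
```

At 16 bits this lands on the 87.0 GiB weights figure shown in the VRAM breakdown for the reference build.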

KV cache, GQA and the 32k context window

Mixtral uses grouped-query attention (GQA), in which each key/value head is shared by a group of query heads, so the KV cache is far smaller than it would be with standard multi-head attention. That matters directly at its 32k-token context length, where the cache could otherwise rival the weights for VRAM. GQA is why Mixtral holds long context without a runaway memory bill. When you size a deployment, remember that the calculator accounts for both the resident expert weights and the per-request KV cache. Long-context, high-concurrency workloads grow that cache term quickly, so model it with realistic context lengths and user counts.
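To see how the cache term scales, here is a minimal sketch of the usual KV cache estimate. The architecture values (32 layers, 32 query heads, 8 key/value heads, head dimension 128) are assumptions taken from the published Mixtral 8x7B configuration, not from this page, and the sketch assumes a 16-bit cache with "8k" meaning 8,192 tokens.

```python
# Assumed Mixtral 8x7B architecture values (from the published model config).
N_LAYERS = 32
N_Q_HEADS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
GIB = 2**30

def kv_cache_gib(context_tokens: int, users: int,
                 kv_heads: int = N_KV_HEADS, bytes_per_value: int = 2) -> float:
    """Estimated KV cache in GiB when every user fills the given context."""
    # Factor of 2: one key vector and one value vector per head, per layer.
    per_token_bytes = 2 * N_LAYERS * kv_heads * HEAD_DIM * bytes_per_value
    return per_token_bytes * context_tokens * users / GIB

gqa = kv_cache_gib(8192, 32)                      # grouped-query attention
mha = kv_cache_gib(8192, 32, kv_heads=N_Q_HEADS)  # hypothetical multi-head variant
print(f"GQA: {gqa:.1f} GiB, MHA equivalent: {mha:.1f} GiB")
```

Under these assumptions the GQA figure matches the 32.0 GiB KV cache in the reference build, and the same workload with one key/value head per query head would need four times as much.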

Quantisation, sharding and NVLink

Because the expert weights dominate the footprint, quantisation pays off handsomely on Mixtral: moving from 16-bit to 8-bit or 4-bit weights can bring it onto fewer or smaller cards while preserving most of the quality. If a single GPU cannot hold every expert, you shard across cards, and here interconnect matters. Once the model is split, activations have to cross between GPUs as tokens are routed to experts, so NVLink-connected GPUs generally sustain throughput better than PCIe-only pairs. The calculator reflects these trade-offs, showing how precision and multi-GPU topology change the card count, power draw and cost for your target.
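As a rough check on the card counts in the reference build, the sketch below divides each precision's total VRAM by the capacity of one card and rounds up. It assumes an 80 GB H100 (about 74.5 GiB) and treats capacity as the only constraint; the calculator may apply further rules, and `gpus_needed` is a hypothetical helper.

```python
import math

GIB = 2**30
H100_GIB = 80e9 / GIB  # assumed 80 GB card, roughly 74.5 GiB

# Total VRAM per precision, copied from the reference-build table on this page.
TOTAL_VRAM_GIB = {"FP16": 122.4, "FP8": 78.0, "INT4": 55.8}

def gpus_needed(total_gib: float, per_gpu_gib: float = H100_GIB) -> int:
    """Smallest number of cards whose combined VRAM covers the footprint."""
    return math.ceil(total_gib / per_gpu_gib)

counts = {precision: gpus_needed(gib) for precision, gib in TOTAL_VRAM_GIB.items()}
print(counts)  # {'FP16': 2, 'FP8': 2, 'INT4': 1}
```

On these numbers FP8 only just overflows a single card, which is why it still shows two H100s in the table while INT4 drops to one.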

FAQs

Can I run Mixtral 8x7B on a single GPU?

Often yes, but it depends on precision and load. Because all eight experts must sit in VRAM, 16-bit Mixtral needs more memory than one 80 GB card offers, whereas 4-bit or 8-bit quantisation can bring it onto a single high-memory data-centre card. Add headroom for the KV cache, especially at long context: in the reference build above (32 users, 8k context), INT4 fits on one H100 while FP8 still needs two. The calculator shows which single-GPU options fit your chosen precision and concurrency.

Why does Mixtral need so much VRAM if it only uses 12.9B parameters?

The 12.9B figure is active compute per token, not memory. A router picks two of eight experts for each token, but it cannot predict which experts future tokens will need, so every expert must stay loaded in VRAM. You therefore pay memory for the full 46.7B parameters while paying compute for only a fraction — the defining trade-off of sparse Mixture-of-Experts inference.
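The two headline figures are enough to estimate how the parameters split between experts and shared layers. This is a back-of-the-envelope sketch that assumes all eight experts are the same size and that everything else (attention, embeddings, router) is shared by every token; the resulting numbers are estimates, not published values.

```python
TOTAL = 46.7e9    # shared + 8 experts (from the text)
ACTIVE = 12.9e9   # shared + 2 experts (from the text)
N_EXPERTS = 8
TOP_K = 2         # experts selected per token

# Subtracting the two equations removes the shared term:
#   TOTAL - ACTIVE = (N_EXPERTS - TOP_K) * per_expert
per_expert = (TOTAL - ACTIVE) / (N_EXPERTS - TOP_K)
shared = ACTIVE - TOP_K * per_expert
idle_fraction = (N_EXPERTS - TOP_K) * per_expert / TOTAL

print(f"per expert: {per_expert / 1e9:.2f}B, shared: {shared / 1e9:.2f}B")
print(f"resident but idle for a given token: {idle_fraction:.0%}")
```

Under those assumptions each expert holds roughly 5.6B parameters, and close to three quarters of the resident weights are untouched by any single token.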

Is Mixtral 8x7B cheaper to run than a dense 70B model?

Generally yes on running costs. Mixtral processes tokens at roughly the speed and compute cost of a ~13B dense model, delivering far faster inference than a dense 70B while landing near it on quality. The saving is throughput and power, not memory footprint — you still provision VRAM for all experts. The calculator quantifies the power, cooling and cost difference for your workload.
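A rough way to put numbers on that comparison is the common rule of thumb of about two floating-point operations per active weight per generated token. The sketch below uses it, ignores attention over the cache, and treats "dense 70B" as exactly 70 billion parameters, so read the output as an order-of-magnitude guide only.

```python
MIXTRAL_ACTIVE = 12.9e9  # parameters used per token (from the text)
MIXTRAL_TOTAL = 46.7e9   # parameters resident in VRAM (from the text)
DENSE_70B = 70e9         # assumed size of the dense comparison model

def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per active weight per token."""
    return 2 * active_params

compute_ratio = flops_per_token(DENSE_70B) / flops_per_token(MIXTRAL_ACTIVE)
memory_ratio = DENSE_70B / MIXTRAL_TOTAL

print(f"dense 70B vs Mixtral, compute per token: {compute_ratio:.1f}x")
print(f"dense 70B vs Mixtral, weight memory:     {memory_ratio:.1f}x")
```

The compute gap comes out several times larger than the memory gap, which is the pattern the answer above describes.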
