What GPU do you need to run Gemma 2 27B?

Gemma 2 27B is a dense, mid-size Google model with unusually strong quality per parameter. Use the live calculator below to size GPUs, VRAM, power and cost for your deployment.

Gemma 2 27B is Google's dense 27.2-billion-parameter open model — 46 layers, grouped-query attention with 16 KV heads and an 8k context window. It punches well above its weight, rivalling far larger models, which makes right-sizing the hardware the difference between a cost-effective deployment and idle silicon. The calculator on this page turns your precision, context and concurrency choices into exact GPU counts, VRAM, power and running costs; the guidance below explains what actually drives those numbers.

Reference build · Gemma 2 27B · FP16 · 32 users · 8k context

4× H100

146.5 GiB VRAM · 10U · 5.4 kW · 4,729 tok/s

£3,397/mo · £158,000 capex

Precision	GPUs (H100)	VRAM	Throughput	From
FP16	4×	146.5 GiB	4,729 tok/s	£3,397/mo
FP8	2×	120.7 GiB	4,729 tok/s	£2,279/mo
INT4	2×	107.8 GiB	9,459 tok/s	£2,279/mo

Gemma 2 27B at FP16, 8k context, 32 concurrent users — indicative.

H100 count by weight precision. Quantising cuts hardware sharply.

Size it precisely in the calculator →

All figures on this page are indicative estimates for planning only and are subject to change; hardware and throughput vary with configuration and tuning, and any monthly finance figure is subject to credit approval and is not a quotation.

Why Gemma 2 27B sits comfortably on mainstream data-centre GPUs

At 27.2B dense parameters, Gemma 2 27B lands in the sweet spot between agile small models and heavyweight 70B-plus systems. In half precision the weights are the dominant memory cost, and quantising to 8-bit or 4-bit shrinks that footprint substantially with modest quality loss. The upshot is that it often fits a single high-memory accelerator, avoiding the multi-GPU tensor-parallel complexity larger models force on you. The calculator shows exactly where each precision tips you between one card and several.

KV cache, GQA and the 8k context window

Gemma 2 27B uses grouped-query attention with 16 KV heads across 46 layers and a head dimension of 128, so its per-token KV cache is far leaner than a full multi-head design would demand. Its 8,192-token context also caps how far the cache can grow per request — helpful for predictable sizing. Concurrency is what really inflates KV memory: every simultaneous session keeps its own cache. The tool models this so you can see headroom before batch traffic saturates a card.

Precision, throughput and total cost of ownership

Because Gemma 2 27B delivers quality typical of much larger models, it is frequently deployed at 8-bit or 4-bit to maximise throughput per pound. Lower precision frees VRAM for larger batches and longer effective context, lifting tokens per second on the same silicon. The calculator pairs each configuration with UK-relevant power draw, cooling load and running cost, so you can weigh a fully-utilised single GPU against a redundant pair before committing capital.

FAQs

How much VRAM does Gemma 2 27B need?

It depends on precision and context. In half precision the 27.2B weights dominate memory; 8-bit roughly halves that and 4-bit reduces it further, often bringing the model within a single high-memory GPU. You must also budget for the KV cache, which grows with concurrency and context length. Enter your settings above for an exact VRAM figure.

Can Gemma 2 27B run on one GPU?

Frequently, yes. As a dense mid-size model with grouped-query attention and an 8k context, Gemma 2 27B often fits a single high-memory data-centre GPU at 8-bit or 4-bit precision, sidestepping multi-GPU sharding. Half precision with heavy concurrency may push you to a second card. The calculator shows precisely where that boundary falls for your workload.

Does Gemma 2 27B's GQA reduce hardware requirements?

Yes. Grouped-query attention gives Gemma 2 27B only 16 KV heads rather than a KV head per query head, so its key-value cache is much smaller per token. Combined with the capped 8k context, this keeps memory growth predictable as you scale concurrent users, letting a single card serve more simultaneous sessions than a comparable full multi-head model.

AI / GPU Calculator →

Size any model + workload precisely.

NVIDIA DGX systems →

The 8-GPU platforms behind these builds.

GPU accelerators →

H100, H200, L40S and A100 specs.

IT finance calculator →

Finance the cluster — HP, lease or subscription.

Server room cooling →

Turn the kW load into BTU/hr and tons.