HPC scratch storage: designing a fast parallel tier for compute clusters (UK 2026)

Servnet Editorial · Server Infrastructure Practice · 23 December 2025 · 11 min read

An HPC cluster is only as fast as the storage that feeds it, and the storage that feeds it is not the same storage that keeps it. Scratch is the fast, ephemeral working tier where jobs read inputs, write intermediate state and stage results at the full speed of many nodes hitting it at once. Get it wrong and expensive compute nodes sit idle waiting for I/O; get it right and the cluster runs at the speed you paid for. This is how to design an HPC scratch tier for UK research and engineering clusters, built around aggregate parallel throughput rather than the capacity or durability that home and archive storage need.

Scratch between compute and capacity

Scratch is a different tier with a different job

An HPC storage estate is layered. There is a home and project tier for durable, backed-up data, an archive tier for cold results, and between them and the compute nodes sits scratch: a high-throughput, low-latency working area that jobs hammer while they run and that is deliberately not treated as precious. Scratch exists to keep the compute nodes fed, so its single design goal is sustained aggregate bandwidth to many clients at once, not capacity or long-term safety.

Because scratch is ephemeral by design, you trade durability for speed. Data on scratch is working state that can be regenerated or that has been staged from the durable tier, so it is acceptable to build it leaner on redundancy in exchange for raw performance, and to purge it regularly. Conflating scratch with the home tier, and trying to make it both fast and safe, is how clusters end up with storage that is good at neither.

Parallel filesystems and aggregate bandwidth

The reason scratch is built on a parallel filesystem is that a single server cannot feed a cluster. A parallel filesystem stripes data across many storage targets and servers so that throughput scales with the number of targets, letting hundreds of compute processes read and write simultaneously without one server becoming the choke point. The aggregate bandwidth of the whole scratch tier, not the speed of any single drive, is the number that matters.
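The striping idea can be sketched in a few lines. This is a simplified round-robin layout chosen for illustration, not the layout of any particular parallel filesystem; the function names `locate` and `targets_touched` and the 1 MiB stripe size are assumptions.

```python
def locate(offset: int, stripe_size: int, target_count: int) -> tuple[int, int]:
    """Map a file byte offset to (target index, byte offset inside that target)."""
    stripe_index = offset // stripe_size
    target = stripe_index % target_count
    # Each target stores every target_count-th stripe, packed back to back.
    offset_in_target = (stripe_index // target_count) * stripe_size + offset % stripe_size
    return target, offset_in_target


def targets_touched(offset: int, length: int, stripe_size: int, target_count: int) -> set[int]:
    """Return the set of targets that one contiguous read or write hits."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    if last - first + 1 >= target_count:
        return set(range(target_count))  # the request wraps around every target
    return {stripe % target_count for stripe in range(first, last + 1)}


MIB = 1024 * 1024  # illustrative stripe size, not a recommendation
print(locate(5 * MIB + 10, MIB, 4))
print(sorted(targets_touched(0, 4 * MIB, MIB, 4)))
print(sorted(targets_touched(0, 64 * 1024, MIB, 4)))
```

With four targets, a 4 MiB request spreads across all of them, while a 64 KiB request lands on one. That is why large, well-aligned I/O from many processes is what lets aggregate bandwidth scale with target count.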

That makes scratch a scale-out design at heart: more storage servers and more drives mean more parallel throughput, and the filesystem coordinates them into one namespace that the compute nodes mount. Design it as a building block you can grow, with throughput rising as you add servers, much as other scale-out storage grows capacity and performance together. We build the underlying storage servers on dense platforms from the HPE Apollo range.

• Scratch is ephemeral working state: optimise for aggregate bandwidth, not durability or capacity
• A parallel filesystem stripes across many targets so throughput scales with target count
• It is a scale-out design: add storage servers and drives to add parallel throughput
• Purge scratch regularly and keep durable data on the home and project tier

Drives, fabric and the data path

Scratch lives on fast flash. NVMe drives deliver the low latency and high throughput that parallel jobs need. The data path from compute to scratch must also be wide and low-latency, which is where a fast cluster fabric and, increasingly, NVMe over Fabrics (NVMe-oF) come in: they let compute nodes reach flash across the network at close to local speed. The drives, the host bus adapters and the network must all be sized so that none of them becomes the bottleneck under full parallel load.
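One way to check that balance is to treat each storage server as a chain of stages and take the slowest one. The sketch below assumes a three-stage model (drives, controller path, network); the function name and every figure in the example are illustrative placeholders, not measurements or vendor specifications.

```python
def server_throughput(drive_count, per_drive_gbs, controller_path_gbs, network_gbs):
    """Return (usable GB/s, limiting stage) for one storage server.

    The usable figure is the minimum across the stages in the data path,
    because data has to pass through all of them.
    """
    stages = {
        "drives": drive_count * per_drive_gbs,
        "controller path": controller_path_gbs,
        "network": network_gbs,
    }
    limiting = min(stages, key=stages.get)
    return stages[limiting], limiting


# Illustrative numbers only: 12 drives at 3 GB/s each, a 50 GB/s
# controller path and 25 GB/s of network bandwidth per server.
throughput, limiting = server_throughput(12, 3.0, 50.0, 25.0)
print(f"{throughput} GB/s, limited by {limiting}")
```

In this example the drives could deliver more than the network can carry, so adding drives would add capacity but no throughput. Real figures should come from measurement under parallel load rather than datasheet peaks.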

Build the flash targets from high-performance, write-capable NVMe drives in our SSD and NVMe range, connect them through a clean controller path using parts from our host bus adapters range, and pair them with a fabric fast enough to carry the aggregate. Because scratch endures heavy, sustained writing from real workloads, choose drives with endurance appropriate to the write rate rather than read-optimised media that will wear quickly.
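Endurance can be estimated from the write rate as drive writes per day (DWPD), the figure drive vendors commonly quote. This is a rough sketch: the function name, the `write_amplification` parameter and the example numbers are assumptions, and a real estimate should use write rates measured on the cluster.

```python
SECONDS_PER_DAY = 86_400


def required_dwpd(avg_write_mbs, drive_capacity_tb, write_amplification=1.0):
    """Estimate the drive writes per day a drive must sustain.

    avg_write_mbs is the average (not peak) write rate landing on one
    drive, in MB/s. write_amplification is an assumed multiplier for
    extra internal writes; 1.0 means none.
    """
    bytes_per_day = avg_write_mbs * 1e6 * SECONDS_PER_DAY * write_amplification
    return bytes_per_day / (drive_capacity_tb * 1e12)


# Illustrative: 200 MB/s averaged over the day onto a 3.84 TB drive.
print(round(required_dwpd(200, 3.84), 2))
```

Comparing the result with a drive's rated DWPD over its warranty period shows whether a given drive class suits the workload or would wear early.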

Scratch storage server, top down

Layer	Detail
Cluster fabric	NVMe-oF, wide and low-latency
NIC + HBA path	Clean pass-through, no bottleneck
NVMe targets	Write-capable, endurance-matched
Parallel FS	Stripes across many targets

Sizing scratch against the cluster

Size scratch from the compute it serves, not from a capacity target. The questions are how much aggregate bandwidth the cluster's jobs demand at peak, how big a working set a typical job stages, and how many jobs run concurrently. The bandwidth requirement drives the number of storage servers and drives; the working-set size and job concurrency drive capacity; and the resulting tier is usually far smaller in capacity but far higher in throughput than the durable tiers around it.
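The sizing logic reduces to two small calculations, sketched below. The function name, the `headroom` factor (extra space so scratch does not fill between purges) and all example figures are assumptions for illustration, not recommendations.

```python
import math


def size_scratch(peak_gbs, per_server_gbs, working_set_tb, concurrent_jobs, headroom=1.5):
    """Return (storage server count, usable capacity in TB).

    Server count follows peak aggregate bandwidth; capacity follows the
    working set of a typical job multiplied by job concurrency.
    """
    servers = math.ceil(peak_gbs / per_server_gbs)
    capacity_tb = working_set_tb * concurrent_jobs * headroom
    return servers, capacity_tb


# Illustrative: 200 GB/s peak demand, 25 GB/s usable per server,
# 2 TB staged per job, 50 jobs running at once.
print(size_scratch(200, 25, 2, 50))
```

Here the bandwidth target sets the server count and the working set sets the capacity independently. If the capacity figure implies fewer servers than the bandwidth figure, bandwidth wins.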

Plan the lifecycle too: a purge policy that clears old working data keeps scratch from filling, and a clear separation from the home tier keeps users from treating it as permanent storage and losing work when it is cleared. We size scratch against the cluster's real throughput and concurrency, balancing servers, drives and fabric, in our server configuration service, so the compute nodes you have bought are never left waiting on I/O.
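A purge policy can be as simple as removing files that have not been modified for a set number of days. The sketch below is a minimal illustration using only the Python standard library: the function names, the use of modification time as the age criterion and the dry-run default are all assumptions. On a large parallel filesystem a plain directory walk may be too slow, so a filesystem's own policy tooling may be the better fit.

```python
import os
import time


def find_purge_candidates(root, max_age_days, now=None):
    """Yield paths of files under root not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86_400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file disappeared while scanning
            if info.st_mtime < cutoff:
                yield path


def purge(root, max_age_days, dry_run=True):
    """Delete old files under root; with dry_run=True only report them."""
    selected = []
    for path in find_purge_candidates(root, max_age_days):
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # could not remove; leave it for the next run
        selected.append(path)
    return selected
```

Running `purge` with `dry_run=True` first and publishing the list to users is one way to reinforce that scratch is not permanent storage before anything is deleted.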

Putting a scratch tier together

For most UK HPC clusters the scratch tier lands on several flash-dense storage servers running a parallel filesystem, connected to the compute nodes over a fast low-latency fabric, sized for high aggregate throughput and a modest capacity relative to the durable tiers, with a purge policy keeping it lean. The number of servers and drives follows the cluster's peak bandwidth demand; the capacity follows the working-set size and job concurrency.

We design HPC scratch tiers as a throughput-first, scale-out building block, choosing the storage servers, the flash, the controller path and the fabric to the cluster's I/O profile, in our server configuration service. The home, project and archive tiers around it are sized and protected separately, so scratch can be exactly what it should be: fast, ephemeral and never the thing keeping the cluster waiting.

Key takeaways

✓ Scratch is the fast, ephemeral tier that feeds compute; its goal is aggregate bandwidth, not durability or capacity.
✓ A parallel filesystem stripes across many targets so throughput scales with target count and server count.
✓ Build it on high-performance NVMe over a fast low-latency fabric, with endurance matched to heavy sustained writes.
✓ Size from the cluster's peak bandwidth and job concurrency, not from a capacity target; it is high-throughput, modest-capacity.
✓ Keep scratch separate from the durable home tier and purge it regularly so it stays lean and fast.

Frequently asked

FAQs — HPC scratch storage

What scratch is

What is HPC scratch storage?

Scratch is a fast, ephemeral working tier between compute nodes and durable storage, where jobs read inputs, write intermediate state and stage results at the full speed of many nodes at once. It is optimised for aggregate bandwidth, not durability, and is purged regularly. We design scratch tiers through our server configuration service.

Why does scratch use a parallel filesystem?

A single server cannot feed a cluster. A parallel filesystem stripes data across many storage targets and servers so throughput scales with target count, letting hundreds of processes read and write at once without one server becoming the choke point. It is a scale-out design built on dense platforms like the HPE Apollo range.

Sizing

How do I size an HPC scratch tier?

Size from the cluster's peak aggregate bandwidth, the working-set size of a typical job and how many jobs run concurrently, not from a capacity target. Scratch is usually high-throughput but modest in capacity relative to durable tiers. We balance servers, drives and fabric to the cluster's I/O profile through our server configuration service.
