An HPC cluster is only as fast as the storage that feeds it, and the storage that feeds it is not the same storage that keeps it. Scratch is the fast, ephemeral working tier where jobs read inputs, write intermediate state and stage results at the full speed of many nodes hitting it at once. Get it wrong and expensive compute nodes sit idle waiting for I/O; get it right and the cluster runs at the speed you paid for. This is how to design an HPC scratch tier for UK research and engineering clusters, built around aggregate parallel throughput rather than the capacity or durability that home and archive storage need.
Scratch is a different tier with a different job
An HPC storage estate is layered. There is a home and project tier for durable, backed-up data, an archive tier for cold results, and between them and the compute nodes sits scratch: a high-throughput, low-latency working area that jobs hammer while they run and that is deliberately not treated as precious. Scratch exists to keep the compute nodes fed, so its single design goal is sustained aggregate bandwidth to many clients at once, not capacity or long-term safety.
Because scratch is ephemeral by design, you trade durability for speed. Data on scratch is working state that can be regenerated or that has been staged from the durable tier, so it is acceptable to build it leaner on redundancy in exchange for raw performance, and to purge it regularly. Conflating scratch with the home tier, and trying to make it both fast and safe, is how clusters end up with storage that is good at neither.
Parallel filesystems and aggregate bandwidth
The reason scratch is built on a parallel filesystem is that a single server cannot feed a cluster. A parallel filesystem stripes data across many storage targets and servers so that throughput scales with the number of targets, letting hundreds of compute processes read and write simultaneously without one server becoming the choke point. The aggregate bandwidth of the whole scratch tier, not the speed of any single drive, is the number that matters.
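To make the striping idea concrete, here is a minimal sketch of how a byte range might be spread over storage targets. It assumes plain round-robin striping with a fixed stripe size; real parallel filesystems have their own layout policies, and the function name and parameters are illustrative rather than any filesystem's API.

```python
def stripe_layout(offset, length, stripe_size, target_count):
    """Return bytes landing on each target for a byte range,
    assuming simple round-robin striping (an illustrative model only)."""
    per_target = [0] * target_count
    pos = offset
    end = offset + length
    while pos < end:
        stripe_index = pos // stripe_size          # which stripe this byte falls in
        target = stripe_index % target_count       # round-robin target for that stripe
        chunk_end = min((stripe_index + 1) * stripe_size, end)
        per_target[target] += chunk_end - pos      # bytes of this stripe in the range
        pos = chunk_end
    return per_target


MIB = 1024 * 1024
# An 8 MiB write with a 1 MiB stripe over 4 targets (hypothetical values).
layout = stripe_layout(0, 8 * MIB, 1 * MIB, 4)
print([n // MIB for n in layout])
```

Because the write is split evenly, each target handles a quarter of the bytes, which is why throughput can rise as targets are added.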
That makes scratch a scale-out design at heart: more storage servers and more drives mean more parallel throughput, and the filesystem coordinates them into one namespace the compute nodes mount. Design it like a building block you can grow, with the throughput rising as you add servers, much as other scale-out storage grows capacity and performance together. We build the underlying storage servers on dense platforms from the HPE Apollo range.
- Scratch is ephemeral working state: optimise for aggregate bandwidth, not durability or capacity
- A parallel filesystem stripes across many targets so throughput scales with target count
- It is a scale-out design: add storage servers and drives to add parallel throughput
- Purge scratch regularly and keep durable data on the home and project tier
Drives, fabric and the data path
Scratch lives on fast flash. NVMe drives deliver the low latency and high throughput that parallel jobs need, but the data path from compute to scratch must be just as wide and low-latency. That is where a fast cluster fabric and, increasingly, NVMe over Fabrics come in, letting compute nodes reach flash across the network at close to local speed. The drives, the host bus adapters and the network all have to be sized so that none of them is the bottleneck under full parallel load.
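As a rough illustration of that balancing act, the sketch below treats one storage server's data path as a chain of stages and reports the slowest one. All figures are hypothetical placeholders in GB/s, not measurements or product specifications, and the stage names are assumptions made for the example.

```python
def data_path_bottleneck(stages):
    """Return (stage_name, throughput) for the slowest stage in the path.

    `stages` maps a stage name to its aggregate throughput in GB/s.
    """
    name = min(stages, key=stages.get)
    return name, stages[name]


# Hypothetical per-server figures, for illustration only.
server_path = {
    "drives": 12 * 6.0,      # 12 NVMe drives at an assumed 6 GB/s each
    "controller": 2 * 28.0,  # 2 controller paths at an assumed 28 GB/s each
    "network": 2 * 25.0,     # 2 fabric links at an assumed 25 GB/s each
}

stage, limit = data_path_bottleneck(server_path)
print(f"bottleneck: {stage} at {limit} GB/s")
```

In this made-up case the drives could deliver more than the fabric can carry, so adding drives would not raise delivered throughput until the network is widened.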
Build the flash targets from high-performance NVMe drives rated for sustained writes in our SSD and NVMe range, connect them through a clean controller path using parts from our host bus adapters range, and pair them with a fabric fast enough to carry the aggregate. Because scratch takes heavy, sustained writes from real workloads, choose drives with endurance appropriate to the write rate rather than read-optimised media that will wear quickly.
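One way to reason about endurance is to convert the expected write rate into drive writes per day (DWPD) and compare it with the drive's rating. The sketch below is a simplified estimate under stated assumptions: the input figures are hypothetical, and write amplification inside the drive is ignored.

```python
SECONDS_PER_DAY = 86_400


def required_dwpd(avg_write_gbs, duty_cycle, drive_capacity_gb):
    """Estimate drive writes per day needed for a workload.

    avg_write_gbs:     write rate per drive in GB/s while jobs are writing
    duty_cycle:        fraction of the day spent writing at that rate (0..1)
    drive_capacity_gb: usable drive capacity in GB
    Write amplification is deliberately ignored in this sketch.
    """
    daily_writes_gb = avg_write_gbs * SECONDS_PER_DAY * duty_cycle
    return daily_writes_gb / drive_capacity_gb


# Hypothetical workload: 0.5 GB/s per drive, writing a quarter of the day,
# on a 7.68 TB drive.
needed = required_dwpd(avg_write_gbs=0.5, duty_cycle=0.25, drive_capacity_gb=7680)
print(f"required endurance: about {needed:.2f} DWPD")
```

A result above the rating of a read-optimised drive is a sign that a higher-endurance class is the safer choice for that write rate.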
Sizing scratch against the cluster
Size scratch from the compute it serves, not from a capacity target. The questions are how much aggregate bandwidth the cluster's jobs demand at peak, how big a working set a typical job stages, and how many jobs run concurrently. The bandwidth requirement drives the number of storage servers and drives; the working-set size drives capacity; and the answer is usually far smaller in capacity but far higher in throughput than the durable tiers around it.
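The sizing logic can be sketched as a short calculation. This is a first-pass estimate under assumed inputs, not a substitute for profiling real jobs: the function, its parameters and the 25% headroom factor are all illustrative choices.

```python
import math


def size_scratch(peak_bandwidth_gbs, per_server_gbs,
                 working_set_tb, concurrent_jobs, headroom=1.25):
    """Return (storage_servers, capacity_tb) for a scratch tier.

    Bandwidth demand sets the server count; working-set size and
    concurrency set the capacity. `headroom` is an assumed safety margin.
    """
    servers = math.ceil(peak_bandwidth_gbs * headroom / per_server_gbs)
    capacity_tb = working_set_tb * concurrent_jobs * headroom
    return servers, capacity_tb


# Hypothetical cluster: 200 GB/s peak demand, 40 GB/s delivered per server,
# 2 TB staged per job, 50 jobs running at once.
servers, capacity_tb = size_scratch(
    peak_bandwidth_gbs=200, per_server_gbs=40,
    working_set_tb=2, concurrent_jobs=50,
)
print(f"{servers} storage servers, {capacity_tb} TB of scratch")
```

Note that the two outputs are driven by different inputs, which is why scratch can end up small in capacity yet large in server count.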
Plan the lifecycle too: a purge policy that clears old working data keeps scratch from filling, and a clear separation from the home tier keeps users from treating it as permanent storage and losing work when it is cleared. Through our server configuration service we size scratch against the cluster's real throughput and concurrency, balancing servers, drives and fabric, so the compute nodes you have bought are never left waiting on I/O.
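A purge policy can start as a dry run that only lists what would be removed. The sketch below assumes an age-based policy keyed on modification time; that criterion, the function name and the idea of walking the tree from a client are illustrative assumptions, and a production parallel filesystem would usually rely on its own policy tooling instead.

```python
import os
import time


def find_purge_candidates(root, max_age_days, now=None):
    """List files under `root` not modified within `max_age_days`.

    Dry run only: nothing is deleted. Age is judged by modification
    time, which is an assumption; sites may prefer access time.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86_400
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                candidates.append(path)
    return sorted(candidates)
```

Calling it with a scratch mount point and an age such as 30 days returns the list to review and announce to users before any deletion is scheduled.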
Putting a scratch tier together
For most UK HPC clusters the scratch tier lands on several flash-dense storage servers running a parallel filesystem, connected to the compute nodes over a fast low-latency fabric, sized for high aggregate throughput and a modest capacity relative to the durable tiers, with a purge policy keeping it lean. The number of servers and drives follows the cluster's peak bandwidth demand; the capacity follows the working-set size and job concurrency.
Through our server configuration service we design HPC scratch tiers as a throughput-first, scale-out building block, matching the storage servers, the flash, the controller path and the fabric to the cluster's I/O profile. The home, project and archive tiers around it are sized and protected separately, so scratch can be exactly what it should be: fast, ephemeral and never the thing keeping the cluster waiting.