UK’s trusted IT infrastructure partner since 2003
Servnet
AI Infrastructure · Explainer

Liquid vs immersion cooling and RoCE: decoding AI data-centre terms (UK)

Servnet Editorial · AI Infrastructure Practice · 10 min read

Two clusters of jargon dominate any AI data-centre conversation: how the heat gets out, and how the network stays lossless. Air, direct liquid and immersion cooling sit on one axis; RoCE and RDMA on the other. Buyers nod along and then quietly wonder what they have agreed to. This explainer decodes the cooling options and the lossless-networking terms in plain English, so you can tell which your AI build actually needs and which is someone else's problem.

Air vs direct liquid vs immersion

              Air         Direct liquid   Immersion
Density       Lower       High            Highest
Facility      Standard    Plumbing        Specialised
Complexity    Simple      Moderate        Highest
Best for      Light GPU   Dense GPU       Extreme density

Air, direct liquid and immersion: three ways to remove heat

Air cooling moves heat with fans and chilled air, and it remains perfectly adequate for most general-purpose servers and lighter GPU configurations. It is simple, well understood and needs no special facility plumbing. Its limit is density: beyond a certain power per rack, air can no longer carry the heat away fast enough, which is exactly the wall dense GPU servers hit.

Direct liquid cooling brings coolant to the hottest components through cold plates, carrying far more heat than air and enabling the high power densities that large GPU servers demand. Immersion cooling goes further again, submerging whole servers in a non-conductive fluid so every component is cooled at once, reaching the highest densities but requiring the most specialised facility design. The progression is air to direct liquid to immersion as density rises. Our AI server cooling guide covers the technology choice in depth.
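A rough calculation shows why liquid carries so much more heat than air. The sketch below uses the steady-state heat transport relation P = ρ · V · c_p · ΔT and approximate textbook properties for air and water. The 40 kW rack and the 10 K coolant temperature rise are illustrative assumptions, not figures from any particular build, and water stands in for whatever coolant a real cold-plate loop uses.

```python
# Rough comparison of the coolant flow needed to carry a given heat load.
# Steady state: P = rho * V * c_p * delta_T, so V = P / (rho * c_p * delta_T).

# Approximate fluid properties near room conditions (density, specific heat).
AIR = {"rho": 1.2, "cp": 1005.0}      # kg/m^3, J/(kg*K)
WATER = {"rho": 997.0, "cp": 4186.0}  # kg/m^3, J/(kg*K)


def volumetric_flow_m3_per_s(power_w: float, delta_t_k: float, fluid: dict) -> float:
    """Volume flow needed to remove power_w with a coolant temperature rise of delta_t_k."""
    if power_w < 0 or delta_t_k <= 0:
        raise ValueError("power must be non-negative and delta_t positive")
    return power_w / (fluid["rho"] * fluid["cp"] * delta_t_k)


rack_power_w = 40_000.0  # assumed rack load, for illustration only
delta_t_k = 10.0         # assumed coolant temperature rise, for illustration only

air_flow = volumetric_flow_m3_per_s(rack_power_w, delta_t_k, AIR)
water_flow = volumetric_flow_m3_per_s(rack_power_w, delta_t_k, WATER)

print(f"air:   {air_flow:.2f} m^3/s")
print(f"water: {water_flow * 1000:.2f} L/s")
print(f"air needs about {air_flow / water_flow:.0f}x the volume flow of water")
```

For the same heat load and temperature rise, air needs a few thousand times the volume flow of water, which is the physical reason dense racks run out of road on air.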

Which cooling does your build need?

The honest answer for most buyers is less exotic than the marketing suggests. A handful of inference GPUs in a mainstream server are usually fine on air. Direct liquid becomes relevant when you deploy dense GPU servers whose power per rack exceeds what air can handle, which is the typical trigger for training-class hardware. Immersion is a deliberate, facility-level decision for the very highest densities, not something most organisations reach for first.

The trap is buying cooling complexity you do not need, or specifying dense hardware without checking the room can cool it. Match the cooling to the power density you are actually deploying, and treat immersion as a considered facility strategy rather than a default. The cooling method and the hardware density decision belong together.
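The matching rule can be written down as a small decision sketch. The two thresholds below are hypothetical placeholders, not recommendations: the real limits depend on your facility, your hardware and your vendor's figures, so substitute your own. The function names are also illustrative.

```python
# Sketch of "match cooling to power density". Thresholds are placeholders:
# replace them with figures from your own facility and hardware vendor.
AIR_LIMIT_KW = 20.0             # hypothetical ceiling for air, kW per rack
DIRECT_LIQUID_LIMIT_KW = 100.0  # hypothetical ceiling for direct liquid, kW per rack


def recommend_cooling(rack_power_kw: float) -> str:
    """Return the least complex cooling method that covers the rack power."""
    if rack_power_kw < 0:
        raise ValueError("rack power must be non-negative")
    if rack_power_kw <= AIR_LIMIT_KW:
        return "air"
    if rack_power_kw <= DIRECT_LIQUID_LIMIT_KW:
        return "direct liquid"
    return "immersion"


def room_can_cool(rack_power_kw: float, room_capacity_kw: float) -> bool:
    """Check the planned rack power against what the room can remove per rack."""
    return rack_power_kw <= room_capacity_kw


for planned_kw in (8.0, 60.0, 150.0):
    print(planned_kw, "kW per rack ->", recommend_cooling(planned_kw))

# Dense hardware specified for a room that cannot cool it: the second trap.
print("room can cool 60 kW rack:", room_can_cool(60.0, room_capacity_kw=40.0))
```

Starting from the simplest method and escalating only when the density forces it mirrors the air, direct liquid, immersion progression described above.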

RoCE and RDMA: keeping the AI network lossless

On the networking side, the goal for AI training is moving data between GPUs with very low latency and no loss, because the GPUs spend much of their time exchanging results and stalls are expensive. RDMA, remote direct memory access, lets one machine read and write another's memory directly, bypassing the operating system and the CPU overhead of ordinary TCP/IP networking. That is what keeps inter-GPU communication fast.

RoCE, RDMA over Converged Ethernet, runs RDMA across standard Ethernet, so organisations can get low-latency, lossless behaviour on the Ethernet fabric they already understand rather than building a separate specialised network. The catch is that Ethernet is not lossless by default: the fabric must be configured correctly, typically with Priority Flow Control (PFC) and ECN-based congestion control, or the loss it is meant to avoid creeps back in. Our network card guidance covers the adapters that support it.
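What "configured correctly" means can be sketched as a checklist. The example below assumes the common approach of pairing Priority Flow Control (PFC) with ECN-based congestion control, and it checks a plain dictionary describing one switch port. The field names are hypothetical and do not correspond to any vendor's CLI or API; treat it as a way to structure the questions, not as a configuration tool.

```python
# Sketch of a lossless-Ethernet sanity check for one port carrying RoCE traffic.
# The dictionary keys are hypothetical, chosen for illustration only.


def lossless_config_issues(port: dict) -> list[str]:
    """Return human-readable problems found in a port description."""
    issues = []
    if not port.get("pfc_enabled", False):
        issues.append("PFC is not enabled, so congested queues can drop RoCE packets")
    if not port.get("ecn_enabled", False):
        issues.append("ECN is not enabled, so senders get no early congestion signal")
    roce_priority = port.get("roce_priority")
    if roce_priority is None or roce_priority != port.get("pfc_priority"):
        issues.append("RoCE traffic is not mapped to the PFC-protected priority")
    return issues


good_port = {"pfc_enabled": True, "ecn_enabled": True,
             "pfc_priority": 3, "roce_priority": 3}
bad_port = {"pfc_enabled": True, "ecn_enabled": False,
            "pfc_priority": 3, "roce_priority": 0}

print(lossless_config_issues(good_port))
for issue in lossless_config_issues(bad_port):
    print("-", issue)
```

The point of the sketch is that losslessness is a property of the whole configuration: flow control, congestion signalling and traffic classification all have to agree on every hop.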

  • Air cooling: simple, adequate for general servers and light GPU loads
  • Direct liquid: cold plates on hot components for dense GPU servers
  • Immersion: whole servers in fluid for the highest densities, facility-level decision
  • RDMA: direct memory-to-memory transfer, bypassing CPU and OS overhead
  • RoCE: RDMA over standard Ethernet - lossless if configured correctly

RoCE / RDMA vs TCP/IP path
  4. GPU memory: source / destination
  3. RDMA: direct memory, bypasses CPU/OS
  2. RoCE: RDMA over lossless Ethernet
  1. TCP/IP: CPU/OS overhead, higher latency

Putting the terms to work

Decoded, these terms let you read an AI proposal critically. If someone specifies immersion cooling for a few inference GPUs, question it; if they specify dense training hardware on plain air, question that too. If a training cluster is described without any mention of RDMA or RoCE, ask how the GPUs communicate. Match cooling to density and fabric to the communication pattern, and bring the build to our on-prem AI cluster guide to size both correctly.
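Those three checks can be expressed as a short review sketch. Everything here is illustrative: the `Proposal` fields, the `questions_to_ask` function and the cut-off for "a few" GPUs are assumptions made for the example, not a standard.

```python
from dataclasses import dataclass

SMALL_GPU_COUNT = 8  # hypothetical cut-off for "a few" GPUs


@dataclass
class Proposal:
    workload: str               # "inference" or "training"
    gpu_count: int
    cooling: str                # "air", "direct liquid" or "immersion"
    dense_hardware: bool        # dense, training-class GPU servers
    fabric_mentions_rdma: bool  # proposal names RDMA or RoCE


def questions_to_ask(p: Proposal) -> list[str]:
    """Return the questions a buyer should raise about a proposal."""
    questions = []
    if (p.cooling == "immersion" and p.workload == "inference"
            and p.gpu_count <= SMALL_GPU_COUNT):
        questions.append("Why immersion cooling for a few inference GPUs?")
    if p.cooling == "air" and p.workload == "training" and p.dense_hardware:
        questions.append("Can plain air cool dense training hardware at this power per rack?")
    if p.workload == "training" and p.gpu_count > 1 and not p.fabric_mentions_rdma:
        questions.append("How do the GPUs communicate without RDMA or RoCE?")
    return questions


proposal = Proposal(workload="training", gpu_count=64, cooling="air",
                    dense_hardware=True, fabric_mentions_rdma=False)
for question in questions_to_ask(proposal):
    print("-", question)
```

An empty list does not mean the proposal is right, only that it avoids the three mismatches described above.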

Key takeaways
  • Cooling escalates with density: air, then direct liquid, then immersion.
  • Most light GPU loads run on air; direct liquid is for dense GPU servers air cannot cool.
  • Immersion is a facility-level decision for the highest densities, not a default.
  • RDMA transfers memory to memory directly, bypassing CPU and OS overhead.
  • RoCE runs RDMA over standard Ethernet and is lossless only when configured correctly.

FAQs — Liquid vs immersion cooling and RoCE

Cooling

Do I need liquid or immersion cooling for GPUs?

Not always. A handful of inference GPUs usually run on air. Direct liquid becomes relevant for dense GPU servers whose power per rack exceeds what air can handle, typically training hardware. Immersion is a facility-level choice for the highest densities. See our cooling guide.

Networking

What is the difference between RDMA and RoCE?

RDMA lets one machine read and write another's memory directly, bypassing CPU and OS overhead for low-latency transfer. RoCE is RDMA running over standard Ethernet, giving lossless behaviour on a familiar fabric - provided congestion control is configured correctly. See our NIC guidance.

Why does AI training need a lossless network?

Training GPUs spend much of their time exchanging results, so latency and packet loss directly stall the job and waste expensive GPU time. RDMA and RoCE keep that inter-GPU traffic fast and lossless. Size the fabric with our cluster guide.
