UK’s trusted IT infrastructure partner since 2003
Servnet

RAID reliability: AFR, MTBF, MTTDL & URE explained

Servnet Storage Team · Storage & Data Protection · 8 min read

How likely is a RAID array to actually lose data? Four numbers decide it — AFR, MTBF, URE and rebuild time — combined into MTTDL. Here's what each means and how RAID level changes the odds. See the URE part live in the RAID calculator.

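Reliability by parity level

|                     | Single (5/Z1) | Dual (6/Z2) | Triple (TEC/Z3)  |
|---------------------|---------------|-------------|------------------|
| Survives            | 1 failure     | 2 failures  | 3 failures       |
| URE in rebuild      | Data loss     | Recoverable | Recoverable      |
| Relative MTTDL      | Baseline      | ≫ higher    | ≫≫ higher        |
| Verdict (big disks) | Risky         | Default     | Very wide arrays |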

Drive failure: AFR and MTBF

MTBF (mean time between failures) is the headline reliability figure on a drive datasheet, often 1.2–2.5 million hours. It sounds enormous, but it's a population statistic, not a promise for your drive. The more useful number is AFR (annualised failure rate): roughly the percentage of drives that fail per year. A 1.5M-hour MTBF works out to an AFR of about 0.58% (8,760 hours per year ÷ 1,500,000 hours), though real-world AFR from large fleets is often higher, especially for ageing drives.
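The conversion above can be sketched in a few lines of Python. The function name `afr_from_mtbf` is ours, purely for illustration. The `exact` option assumes a constant failure rate (exponential lifetimes), which is the usual datasheet assumption rather than something a real fleet guarantees.

```python
import math

HOURS_PER_YEAR = 8_760

def afr_from_mtbf(mtbf_hours: float, exact: bool = False) -> float:
    """Return the annualised failure rate as a fraction (0.0058 means 0.58%)."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    ratio = HOURS_PER_YEAR / mtbf_hours
    # Linear form: hours per year / MTBF (the rule of thumb used in the text).
    # Exact form: 1 - exp(-ratio), assuming a constant failure rate.
    return -math.expm1(-ratio) if exact else ratio

for mtbf in (1_200_000, 1_500_000, 2_500_000):
    print(f"MTBF {mtbf:>9,} h -> AFR about {afr_from_mtbf(mtbf):.2%}")
```

For datasheet-sized MTBF values the linear and exact forms agree to within a few thousandths of a percentage point, so the simple division is normally enough.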

The practical point: in an array of many drives, the chance that *some* drive fails this year is much higher than for one drive — which is the whole reason RAID exists.
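To put a rough number on that, here is a minimal sketch that assumes every drive fails independently with the same AFR. That is an idealisation: drives from the same batch, chassis or power feed tend to fail together. The function name `p_any_drive_fails` is hypothetical.

```python
def p_any_drive_fails(n_drives: int, afr: float) -> float:
    """Chance that at least one of n_drives fails within a year.

    Assumes independent failures with an identical AFR (given as a fraction).
    """
    if n_drives < 1:
        raise ValueError("n_drives must be at least 1")
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be a fraction between 0 and 1")
    # Complement of "every drive survives the year".
    return 1.0 - (1.0 - afr) ** n_drives

for n in (1, 4, 12, 24):
    print(f"{n:>2} drives at 0.58% AFR -> {p_any_drive_fails(n, 0.0058):.1%} chance of a failure this year")
```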

Read errors: URE

The second failure mode is the unrecoverable read error (URE): a sector the drive can't read back. Datasheets rate it at about 1 error per 10¹⁴ bits read for consumer HDDs, improving to 1 per 10¹⁶–10¹⁷ for SSDs. UREs matter most during a rebuild, when a single-parity array (RAID 5 / RAIDZ1) has no redundancy left, so a URE means data loss. Our calculator computes the URE-during-rebuild probability for your config, and our "Is RAID 5 dead?" article covers it in depth.

On large modern drives, URE-driven rebuild failure often dominates real-world data-loss risk — more than the idealised double-failure maths below.
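A simplified model of the URE-during-rebuild risk is sketched below. It is not necessarily how the Servnet calculator works. It assumes the rebuild reads every surviving drive in full, that read errors are independent per bit, and that the datasheet rating is the true error rate (in practice the rating is a worst-case bound, so this tends to be pessimistic). The name `p_ure_during_rebuild` and its parameters are ours.

```python
import math

def p_ure_during_rebuild(surviving_drives: int, drive_tb: float,
                         bits_per_ure: float = 1e14) -> float:
    """Chance of hitting at least one URE while reading all surviving drives.

    drive_tb is decimal terabytes (10**12 bytes); bits_per_ure is the
    datasheet rating, e.g. 1e14 for "1 error per 10^14 bits read".
    """
    if surviving_drives < 1 or drive_tb <= 0 or bits_per_ure <= 0:
        raise ValueError("all inputs must be positive")
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # Poisson approximation of 1 - (1 - 1/bits_per_ure) ** bits_read.
    return -math.expm1(-bits_read / bits_per_ure)

# Six-drive single-parity array of 12 TB drives: five drives are read in full.
for rating in (1e14, 1e15, 1e16):
    print(f"1 per {rating:.0e} bits -> {p_ure_during_rebuild(5, 12, rating):.1%}")
```

Under these assumptions the result is driven almost entirely by total bits read versus the rating, which is why drive capacity matters so much here.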

Putting it together: MTTDL

MTTDL (mean time to data loss) combines failure rate and rebuild time to estimate how long, on average, until an array loses data. The classic models capture the intuition: for N drives with mean time to repair MTTR (the rebuild time), single-parity MTTDL scales roughly as MTBF² ÷ (N × (N−1) × MTTR), and dual parity multiplies that by a further factor of roughly MTBF ÷ ((N−2) × MTTR). Because MTBF is millions of hours and MTTR is hours or days, dual parity (RAID 6) is orders of magnitude more durable than single parity, and a shorter rebuild time directly improves durability.
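The classic models can be written out as a short sketch. The single-parity formula is the one quoted above. The dual-parity formula, MTBF³ ÷ (N × (N−1) × (N−2) × MTTR²), is the common textbook form and is our addition. Function names and the example inputs (8 drives, 24-hour rebuild) are illustrative assumptions.

```python
HOURS_PER_YEAR = 8_760

def mttdl_single_parity(mtbf_h: float, n_drives: int, mttr_h: float) -> float:
    """Classic single-parity estimate: MTBF^2 / (N * (N-1) * MTTR), in hours."""
    if n_drives < 2 or mtbf_h <= 0 or mttr_h <= 0:
        raise ValueError("need at least 2 drives and positive MTBF/MTTR")
    return mtbf_h ** 2 / (n_drives * (n_drives - 1) * mttr_h)

def mttdl_dual_parity(mtbf_h: float, n_drives: int, mttr_h: float) -> float:
    """Textbook dual-parity estimate: MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)."""
    if n_drives < 3 or mtbf_h <= 0 or mttr_h <= 0:
        raise ValueError("need at least 3 drives and positive MTBF/MTTR")
    return mtbf_h ** 3 / (n_drives * (n_drives - 1) * (n_drives - 2) * mttr_h ** 2)

single = mttdl_single_parity(1.5e6, 8, 24)
dual = mttdl_dual_parity(1.5e6, 8, 24)
print(f"single parity: {single / HOURS_PER_YEAR:,.0f} years")
print(f"dual parity:   {dual / HOURS_PER_YEAR:,.0f} years")
print(f"dual is about {dual / single:,.0f} times single")
```

The absolute figures are implausibly large, which is the usual sign that the model leaves out UREs and correlated failures. The ratio between levels is the useful output.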

Treat MTTDL as a comparative model, not a guarantee: it assumes independent, exponentially-distributed failures and usually ignores UREs, correlated batch failures and operator error — all of which matter in practice. Use it to rank levels (6 ≫ 5, triple ≫ dual), not to promise a date.
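One way to see why MTTDL is not a date: under the same exponential assumption, it converts to a probability of loss over any period you choose. The sketch below uses a hypothetical helper, `p_data_loss`, and inherits every limitation of the MTTDL figure fed into it.

```python
import math

HOURS_PER_YEAR = 8_760

def p_data_loss(mttdl_hours: float, years: float) -> float:
    """Chance of at least one data-loss event within `years`.

    Assumes data-loss events arrive at a constant rate of 1 / MTTDL.
    """
    if mttdl_hours <= 0 or years < 0:
        raise ValueError("mttdl_hours must be positive and years non-negative")
    return -math.expm1(-years * HOURS_PER_YEAR / mttdl_hours)

# An array with a 100-year MTTDL, looked at over different service lives.
mttdl = 100 * HOURS_PER_YEAR
for years in (1, 5, 100):
    print(f"{years:>3} years -> {p_data_loss(mttdl, years):.1%} chance of data loss")
```

Note that the chance of loss within one full MTTDL period is about 63%, not 50% and certainly not zero.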

[Figure: Improve array reliability. Biggest levers: more parity (RAID 6 / Z2), faster rebuild (distributed + spare), better drives (lower AFR / URE).]

What actually improves reliability

Three levers move the needle: more parity (RAID 6/RAIDZ2 over RAID 5/RAIDZ1 — survives a URE and a second failure mid-rebuild), shorter rebuilds (distributed RAID, smaller groups, hot spares — less exposure window), and better drives (lower AFR, lower URE rate, enterprise over consumer). And none of it replaces a backup — MTTDL covers hardware failure, not deletion, corruption or ransomware.
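The exposure window for the rebuild lever can be estimated with a back-of-envelope sketch. It assumes the rebuild writes one full drive at a constant sustained rate. Both the function name `rebuild_hours` and the example throughput figures are assumptions, and real rebuilds on a busy array usually run slower.

```python
def rebuild_hours(drive_tb: float, sustained_mb_per_s: float) -> float:
    """Best-case hours to rebuild one drive at a constant sustained rate.

    drive_tb is decimal terabytes (10**12 bytes); the rate is in MB/s (10**6 bytes).
    """
    if drive_tb <= 0 or sustained_mb_per_s <= 0:
        raise ValueError("capacity and throughput must be positive")
    seconds = drive_tb * 1e12 / (sustained_mb_per_s * 1e6)
    return seconds / 3600

for tb in (4, 12, 20):
    print(f"{tb:>2} TB at 150 MB/s -> about {rebuild_hours(tb, 150):.0f} h without redundancy to spare")
```

Rebuild time grows linearly with capacity when throughput stays flat, so the window widens with every drive generation unless the rebuild itself gets faster.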

Use the calculator to see fault tolerance, rebuild time and URE risk for any layout, then choose the level whose reliability matches the data's value.

Key takeaways
  • AFR (≈ failures/year) is more useful than the giant MTBF figure; real fleets often exceed datasheet AFR.
  • UREs cause data loss during single-parity rebuilds — often the dominant real-world risk on big drives.
  • MTTDL combines failure rate + rebuild time; dual parity is orders of magnitude more durable than single.
  • Improve reliability with more parity, faster rebuilds and better drives — but it's still not a backup.

FAQs — RAID reliability


What is MTTDL?

Mean Time To Data Loss — an estimate of how long, on average, before a RAID array loses data, combining drive failure rate and rebuild time. It's a comparative model (dual parity ≫ single parity), not a guarantee, and it usually ignores UREs and correlated failures, which matter in practice.

What is the difference between MTBF and AFR?

MTBF (mean time between failures) is a large hours figure on the datasheet; AFR (annualised failure rate) is roughly the percentage of drives that fail per year (≈ 8,760 ÷ MTBF). AFR is the more practical number, and real-world fleet AFR is often higher than the datasheet implies.

Does a higher MTBF mean my drive won't fail?

No. MTBF is a population statistic across many drives, not a lifespan for one drive. Plan for failure with appropriate RAID and backups regardless of the MTBF figure.
