I've been monitoring about 1100 ATA disk drives (IBM and Maxtor) on
two large linux data analysis clusters for the past two years. I've
seen more than forty failures -- SMART has predicted between half and
two-thirds of those. I've also had advance warning of *other* types
of 'non-failure' problems, particularly unreadable (uncorrectable)
disk sectors which *would* have caused unpredictable and unrepeatable
errors with the OS and data analysis.
The bottom line is that SMART cannot and will not predict all
failures. But in many cases (especially if you or the monitoring
software know what to watch for) it will predict a substantial
fraction of them.
Thank you very much for the real-world info.
If you have any breakdown of failure rates by manufacturer and/or model,
that would be even more valuable.
Useful info on reliability is hard to come by.
If your disks were run 24/7, this comes out to an MTBF of under 240khr;
if run 8hr/day, the MTBF is under 57khr.
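For reference, here's the back-of-the-envelope arithmetic, sketched in
Python. The inputs are my own assumptions, not figures from your post:
roughly 1100 drive-years of accumulated service (the quoted MTBF figures
work out if the drives average about a year in service each), 40 failures,
and "8hr/day" taken as a 5-day work week.

    # Rough MTBF estimate: accumulated drive-hours divided by failures.
    # The inputs below are assumptions, not figures from the original post.
    drive_years = 1100
    failures = 40

    hours_24x7 = drive_years * 24 * 365        # ~9.6M drive-hours
    hours_8x5  = drive_years * 8 * 5 * 52      # ~2.3M drive-hours

    print(hours_24x7 / failures)   # ~241,000 hr ("under 240khr" with >40 failures)
    print(hours_8x5 / failures)    # ~57,000 hr  ("under 57khr")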
As far back as 1995 IBM was claiming 1Mhr [New Media June 1995], and I think
it's safe to say it's been quite a while since any maker has claimed less than
500k, so we've got a sizable gap between claims and reality.
Subsequently IBM stated that it does not quote MTBF numbers for its products,
saying the numbers are just confusing; it cites some legitimate problems with
the varying methods, but also makes some lame arguments against even trying
to measure reliability
[http://www-1.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FQ101856].
Likewise Maxtor doesn't bother to provide an MTBF figure for their Diamondmax+9
(www.maxtor.com/en/documentation/data_sheets/diamondmax_plus_9_data_sheet.pdf),
which has the advantage that there is nothing to explain and therefore nothing
to critique, but it leaves me wondering whether this says something about how
seriously they take reliability.
Maxtor says "Historically the field MTBF, which includes all returns
regardless of cause, is typically 50-60% of projected MTBF"
[http://maxtor.custhelp.com/cgi-bin/maxtor.cfg/php/enduser/olh_adp.php?p_faqid=545],
but that statement applies to their Quantum SCSI disks. If the same ratio holds
for their other lines, and if we take the Diamondmax+9 Annualized Return Rate
(ARR) spec of <1%
(www.maxtor.com/en/documentation/data_sheets/diamondmax_plus_9_data_sheet.pdf)
as a projected rather than an actual rate, then their own numbers imply a field
return rate of up to roughly 2%/yr, which matches your observed failure rate of
about 2%/yr. So who knows -- maybe the phantom numbers they produce with their
undisclosed methods are pretty accurate once the field-vs-projected ratio is
taken into account.
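As a sanity check on that reasoning (my own arithmetic, using the 50-60%
figure and the <1% ARR spec from the Maxtor pages cited above):

    # If field MTBF is only 50-60% of projected MTBF, the field failure rate
    # runs roughly 1/0.6 to 1/0.5 times the projected rate.
    projected_arr = 0.01                   # Diamondmax+9 spec: ARR < 1%/yr (projected)
    field_arr_low  = projected_arr / 0.6   # ~1.7%/yr
    field_arr_high = projected_arr / 0.5   # ~2.0%/yr

    # Observed rate from the cluster data quoted at the top of this post:
    observed_arr = 40 / (1100 * 2)         # ~1.8%/yr -- right in that range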
Seagate earns brownie points for actually explaining how they came up with
their MTBF numbers
(www.seagate.com/newsinfo/docs/disc/drive_reliability.pdf).
But it strikes me as suspicious to test for only one month, guaranteeing
they will not see any effects due to aging or gradual wear and tear.
For desktop users, it's also questionable to test only continuous operation
at a constant temperature, which guarantees they will not see effects from
thermal cycling and power-on surges.
While my own sample size is small (3 hard and 3 soft failures), all of them
occurred about a year or more after the drives were placed in service, so
based on my experience, testing for less than 1.5 years is worthless.
Testing for only a month does, however, allow them to replace real data with
impressive-sounding mathematical gymnastics, involving a bunch of fudge
factors, to come up with an extrapolated MTBF that is suitably high.
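For illustration, here is the kind of arithmetic a short demonstration test
boils down to (a generic sketch with made-up numbers, not Seagate's actual
procedure): with enough drives on the bench, even one month accumulates a
huge number of device-hours, and dividing those hours by the handful of
failures observed yields an impressively large MTBF without any drive ever
reaching the wear-out part of its life.

    # Generic reliability-demonstration arithmetic (hypothetical numbers).
    drives = 1000             # assumed test population
    test_hours = 28 * 24      # one month of continuous operation per drive
    failures = 1              # assumed failure count during the test

    device_hours = drives * test_hours        # 672,000 device-hours
    mtbf_estimate = device_hours / failures   # 672,000 hr "demonstrated" MTBF

    # No individual drive has run more than ~672 hours, so this estimate says
    # nothing about failures that show up after a year or more in service.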
The Samsung whitepaper on MTBF also provides some detail but is pathetic.
"SAMSUNG's MTBF for HDDs is 500,000 hours. That means that if you use
your PC for 9 hours every day, your HDD should operate for 152 years."
[www.samsung.com/Products/HardDiskDrive/whitepapers/WhitePaper_05.htm]
They say they test 480 units for 4 hours or 120 units for 72 hours.
It is impossible to measure an MTBF in the 500k hours range from only
480x4 = 1920 hours or 120x72 = 8640 hours of data; these are insufficient
by 2 orders of magnitude.
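To put numbers on that (my arithmetic, using the test sizes quoted in the
whitepaper): if the true MTBF really were 500,000 hours, the expected number
of failures in either test is a tiny fraction of one, so the test is all but
guaranteed to see zero failures and cannot distinguish a 500khr drive from a
far worse one.

    # Expected failures ~= accumulated test hours / MTBF (constant failure rate).
    mtbf = 500000                # claimed MTBF, hours
    test_a = 480 * 4             # 1,920 test hours
    test_b = 120 * 72            # 8,640 test hours

    print(test_a / mtbf)   # ~0.004 expected failures
    print(test_b / mtbf)   # ~0.017 expected failures
    # You would need on the order of 500,000+ accumulated test hours just to
    # expect a single failure -- about two orders of magnitude more data.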
I wasn't able to find anything from Western Digital on MTBF other than
some broken links.
The problem with company statements is that we can't be sure there's any
correlation between the competence of the design and production
engineers and the competence of the tech writers.
The quality of a company's documentation on reliability may indicate how
seriously they take the issue, but then again it's possible a company spends
all its resources on producing the most reliable product it can and doesn't
care how shoddy the documentation is; conversely, a company could hire a
high-powered tech-writing wizard to paper over a lack of attention in design
and production.
But one thing is clear: providing an MTBF number without providing a
detailed explanation of how it was arrived at is a meaningless exercise.
Most Usenet posts are about a single disk failure, which makes them hard to
evaluate as indicators of reliability.
Likewise, a visit to my local store revealed that they know they've had many
more failures in 40GB drives than in 200GB drives, but they sell lots of the
small ones and few of the large ones and don't track percentages, so they
have no idea what the relative failure _rates_ are.
The same phenomenon is going to apply to comparing different manufacturers.
So thanks in advance for any other real-world data you're able to share.