FYI: Google releases a hard drive failure study.

paulmd · Feb 20, 2007

http://216.239.37.132/papers/disk_failures.pdf

Paul · Feb 20, 2007

http://216.239.37.132/papers/disk_failures.pdf

One thing I don't see in the study is the inclusion of
temperature and humidity factors at the same time. A
couple of the drive manufacturers include curves which
show acceptable temperature and humidity conditions.
That may be why their temperature results show little
impact from high drive operating temperature, if the air
was bone dry. Still, if we assume 40% R.H. in their
datacenters, seeing so little effect from temperature is
surprising.

I would have liked to see brand names for the drives too :-)

It would end a lot of arguments.

Paul

kony · Feb 20, 2007

One thing I don't see in the study is the inclusion of
temperature and humidity factors at the same time.

It also doesn't show excessive temperature (though we can't
be sure), and (at least based on manufacturers' claims) a
large % of drives are damaged in handling, yet there is no
tracking of the source, delivery, or installation. It also
lacks any conclusion about whether the drives measure temp
at the same spot on the drive, or whether that reading is
accurate relative to other drives. Suppose, for example,
that all the WD drives (randomly picking on WD for no
particular reason) that ran at 50C had a far higher failure
rate because their reported temp was significantly below
that of some areas on the drive, so the drive itself was on
average significantly hotter than another brand reporting
the same temp.

paulmd · Feb 20, 2007

It also doesn't show excessive temperature (though we can't
be sure), and (at least based on manufacturers' claims) a
large % of drives are damaged in handling, yet there is no
tracking of the source, delivery, or installation. It also
lacks any conclusion about whether the drives measure temp
at the same spot on the drive, or whether that reading is
accurate relative to other drives. Suppose, for example,
that all the WD drives (randomly picking on WD for no
particular reason) that ran at 50C had a far higher failure
rate because their reported temp was significantly below
that of some areas on the drive, so the drive itself was on
average significantly hotter than another brand reporting
the same temp.

"Before being put into production, all disk drives go
through a short burn-in process, which consists of a
combination of read/write stress tests designed to catch
many of the most common assembly, configuration, or
component-level problems. The data shown here do not
include the fall-out from this phase, but instead begin
when the systems are officially commissioned for use.
Therefore our data should be consistent with what a regular
end-user should see, since most equipment manufacturers
put their systems through similar tests before
shipment."

This should address the handling issue.

I get the sense that they left out drive models in the report,
but DID track them internally. But dammit, I wish they'd publish a
lemon list at least.

"3.2 Manufacturers, Models, and Vintages
Failure rates are known to be highly correlated with drive
models, manufacturers and vintages [18]. Our results do
not contradict this fact. For example, Figure 2 changes
significantly when we normalize failure rates per each
drive model. Most age-related results are impacted by
drive vintages. However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data."
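
To illustrate why normalizing per drive model can change an age curve, here is a minimal Python sketch. The fleet below is entirely made up (two hypothetical models, A and B, with invented counts), and nothing here is the paper's data or method; it only shows how a pooled rate can suggest an age effect that disappears once the rates are split per model.

```python
from collections import defaultdict


def failure_rates(records, key):
    """Return {group: failed_drives / total_drives}, grouping records by key(record)."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        if record[2]:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


def make_fleet():
    """Build an invented fleet of (model, age_in_years, failed) records."""
    records = []
    # Hypothetical model A: 5% failures at both ages, mostly young drives.
    records += [("A", 1, i < 45) for i in range(900)]
    records += [("A", 3, i < 5) for i in range(100)]
    # Hypothetical model B: 20% failures at both ages, mostly old drives.
    records += [("B", 1, i < 20) for i in range(100)]
    records += [("B", 3, i < 180) for i in range(900)]
    return records


fleet = make_fleet()
# Pooled view: group by age only, ignoring the model.
pooled_by_age = failure_rates(fleet, key=lambda r: r[1])
# Normalized view: group by (model, age).
per_model_by_age = failure_rates(fleet, key=lambda r: (r[0], r[1]))

for group, rate in sorted(pooled_by_age.items()):
    print("pooled, age", group, "->", round(rate, 3))
for group, rate in sorted(per_model_by_age.items()):
    print("model/age", group, "->", round(rate, 3))
```

With these invented numbers the pooled rate rises from 6.5% at age 1 to 18.5% at age 3, while each model on its own stays flat (5% for A, 20% for B). The apparent age effect comes only from the older drives being mostly model B.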

Synapse Syndrome · Feb 20, 2007

kony said:
It also doesn't show excessive temperature (though we can't
be sure), and (at least based on manufacturers' claims) a
large % of drives are damaged in handling, yet there is no
tracking of the source, delivery, or installation. It also
lacks any conclusion about whether the drives measure temp
at the same spot on the drive, or whether that reading is
accurate relative to other drives. Suppose, for example,
that all the WD drives (randomly picking on WD for no
particular reason) that ran at 50C had a far higher failure
rate because their reported temp was significantly below
that of some areas on the drive, so the drive itself was on
average significantly hotter than another brand reporting
the same temp.

I just read about this on BBC News.

http://news.bbc.co.uk/1/hi/technology/6376021.stm

ss.

Rod Speed · Feb 20, 2007

Paul said:

One thing I don't see in the study is the inclusion of
temperature and humidity factors at the same time. A
couple of the drive manufacturers include curves which
show acceptable temperature and humidity conditions.
That may be why their temperature results show little
impact from high drive operating temperature, if the air
was bone dry.

Unlikely that it would matter reliability-wise with a hard drive.

Still, if we assume 40% R.H. in their datacenters, seeing
so little effect from temperature is surprising.

Yeah, particularly when that's nothing like what everyone else has seen.

I would have liked to see brand names for the drives too

Yeah, particularly when they hint that some did rather poorly.

It would end a lot of arguments.

I doubt it, just like their temperature result won't either.

Rod Speed · Feb 20, 2007

(e-mail address removed) wrote

"Before being put into production, all disk drives go
through a short burn-in process, which consists of a
combination of read/write stress tests designed to catch
many of the most common assembly, configuration, or
component-level problems. The data shown here do not
include the fall-out from this phase, but instead begin
when the systems are officially commissioned for use.
Therefore our data should be consistent with what a regular
end-user should see, since most equipment manufacturers
put their systems through similar tests before shipment."

This should address the handling issue.

Not necessarily; bad handling doesn't necessarily
produce immediately visible effects in use.

I get the sense that they left out drive models
in the report, but DID track them internally. But
dammit, I wish they'd publish a lemon list at least.

Yeah, looks rather like they didn't have the balls to do that.

"3.2 Manufacturers, Models, and Vintages
Failure rates are known to be highly correlated with drive
models, manufacturers and vintages [18]. Our results do
not contradict this fact. For example, Figure 2 changes
significantly when we normalize failure rates per each
drive model. Most age-related results are impacted by
drive vintages. However, in this paper, we do not show a
breakdown of drives per manufacturer, model, or vintage
due to the proprietary nature of these data."

What a pathetic cop-out.

kony · Feb 21, 2007

"Before being put into production, all disk drives go
through a short burn-in process, which consists of a
combination of read/write stress tests designed to catch
many of the most common assembly, configuration, or
component-level problems. The data shown here do not
include the fall-out from this phase, but instead begin
when the systems are officially commissioned for use.
Therefore our data should be consistent with what a regular
end-user should see, since most equipment manufacturers
put their systems through similar tests before
shipment."

This should address the handling issue.

I'm not convinced that a mishandled drive would necessarily
fail before being commissioned for use, even with a bit of
testing first.

8os.8 · Mar 3, 2007

http://news.bbc.co.uk/1/hi/technology/6376021.stm

The report also looked at the impact of scan errors - problems found on the surface of a disc - on hard drive failure.

"We find that the group of drives with scan errors are 10 times more likely to fail than the group with no errors," said the
authors.

They added: "After the first scan error, drives are 39 times more likely to fail within 60 days than drives without scan
errors."

Doesn't this suggest the value of an error-scan utility, with scheduled scans, to report susceptible drives?
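
A rough sketch of what such a report could look like, in Python. The log format (drive id, scan date, error count) and the function name are assumptions made for illustration, not any real utility's output; the 60-day window is the one quoted from the study above.

```python
from datetime import date, timedelta

# Window quoted from the study: the 60 days after a drive's first scan error.
WATCH_WINDOW = timedelta(days=60)


def drives_to_watch(scan_log, today):
    """Map each drive that has had a scan error to (first_error_date, in_window).

    scan_log is an iterable of (drive_id, scan_date, error_count) tuples,
    a hypothetical format chosen for this sketch.
    """
    first_error = {}
    for drive_id, scan_date, error_count in scan_log:
        if error_count > 0:
            earliest = first_error.get(drive_id)
            if earliest is None or scan_date < earliest:
                first_error[drive_id] = scan_date
    return {
        drive_id: (when, today - when <= WATCH_WINDOW)
        for drive_id, when in first_error.items()
    }


# Invented scan history for three drives.
log = [
    ("sda", date(2007, 1, 10), 0),
    ("sda", date(2007, 2, 10), 2),
    ("sdb", date(2006, 11, 1), 1),
    ("sdb", date(2007, 2, 10), 1),
    ("sdc", date(2007, 2, 10), 0),
]
report = drives_to_watch(log, today=date(2007, 3, 3))
for drive_id, (when, in_window) in sorted(report.items()):
    print(drive_id, when.isoformat(), "within 60 days" if in_window else "older")
```

Here sda would be flagged as inside the 60-day window and sdb as having an older first error; sdc never reported an error and is left out of the report.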

Rod Speed · Mar 3, 2007

"Synapse Syndrome" <synapse@NOSPAMgomez404.elitemail.org> in
(e-mail address removed):

The report also looked at the impact of scan errors - problems found
on the surface of a disc - on hard drive failure.

"We find that the group of drives with scan errors are 10 times more
likely to fail than the group with no errors," said the authors.

They added: "After the first scan error, drives are 39 times more
likely to fail within 60 days than drives without scan errors."

Doesn't this suggest the value of an error-scan utility, with
scheduled scans, to report susceptible drives?

Nope, just monitoring the SMART data is all you need to do.
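
For what monitoring the SMART data might look like in practice, here is a small Python sketch that parses the attribute table printed by smartmontools' `smartctl -A`. The sample text is hand-written in that layout rather than captured from a real drive, and the watch list of attribute names is an assumption: which attributes a drive reports, and which ones matter, varies by vendor.

```python
# Attribute names to watch; an illustrative choice, not a list taken from the study.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Hand-written sample in the layout of `smartctl -A` output (not from a real drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   038   045   000    Old_age   Always       -       38
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Parse an attribute table into {attribute_name: raw_value}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw = fields[1], fields[9]
            if raw.isdigit():
                attrs[name] = int(raw)
    return attrs


def flagged_attributes(attrs, watched=WATCHED):
    """Return the watched attributes whose raw value is above zero."""
    return [name for name in watched if attrs.get(name, 0) > 0]


attrs = parse_smart_attributes(SAMPLE)
flagged = flagged_attributes(attrs)
print(flagged)
```

On a live system the text would come from running `smartctl -A` against a device instead of the sample string. Whether SMART data alone is enough to catch failing drives is a separate question that this sketch does not settle.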
