Two Disk Failures a Co-Incidence

Robert Inder · Jun 28, 2005

Two 2U servers in a rack in a machine room. One does the real work,
the other is a "warm" standby. They are about 30 months
old, with SCSI disks.

Just after midnight, one server crashes: nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.

((The disk subsequently tells SMART that it is fine, and has never had
a bad sector, or been out of normal temperature range))

7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

So what is going on?

Was this just a coincidence? Two independent failures that happened to be
on the same day?

Or did something cause both crashes? If so, what could it have been?

We have no reason to suspect any kind of mechanical disturbance. Both
machines have been in the same rack, and on the same
UPS, since they were installed some 30 months ago, with no sign of
problems.

Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.

The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was
upgraded earlier this year.

What ARE the odds of two SCSI disk systems (disks +
controllers) both failing on the same day after 30 months? And is it
a more (or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?
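
One way to put a rough number on the "coincidence" option is sketched below. It assumes the two disk systems fail independently at a constant rate, and it uses a purely hypothetical 3% annualized failure rate; neither the rate nor the function names come from anything measured on these servers. The result is only an approximation that ignores edge effects at the start and end of the service period.

```python
import math


def hazard_rate_per_day(annual_failure_rate):
    """Constant daily failure rate implied by an annualized failure rate."""
    return -math.log(1.0 - annual_failure_rate) / 365.0


def prob_failure_within(days, annual_failure_rate):
    """Probability that one unit fails at least once within `days` days."""
    rate = hazard_rate_per_day(annual_failure_rate)
    return 1.0 - math.exp(-rate * days)


def prob_pair_fails_close_together(service_days, window_days, annual_failure_rate):
    """Approximate chance that unit A fails during the service period AND
    unit B independently fails within `window_days` either side of it."""
    p_a = prob_failure_within(service_days, annual_failure_rate)
    p_b_near = prob_failure_within(2 * window_days, annual_failure_rate)
    return p_a * p_b_near


if __name__ == "__main__":
    ASSUMED_AFR = 0.03        # hypothetical figure, not a measured one
    SERVICE_DAYS = 30 * 30.4  # roughly 30 months
    p = prob_pair_fails_close_together(SERVICE_DAYS, 1, ASSUMED_AFR)
    print(f"P(both fail within a day of each other) ~ {p:.2e}")
```

Under that assumed rate the answer comes out on the order of one in tens of thousands, which would make a shared cause look far more likely than two independent failures. A different assumed rate shifts the number but probably not the conclusion.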

It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?

Robert.

dg · Jun 29, 2005

Perhaps the parts (HDs and/or controllers, etc.) were manufactured in the
same batch and had a defect that kicks in like clockwork.

--Dan

Folkert Rienstra · Jun 29, 2005

Robert Inder said:
Two 2U servers in a rack in a machine room. One does the real work, the
other is a "warm" standby. They are about 30 months old, with SCSI disks.

Just after midnight, one server crashes:

nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

A crash on a single IO error, that in itself is suspicious.

So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.

((The disk subsequently tells SMART that it is fine, and has never had
a bad sector, or been out of normal temperature range))

7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

So what is going on?

Was this just a coincidence? Two independent failures that happened
to be on the same day?

Or did something cause both crashes? If so, what could it have been?

We have no reason to suspect any kind of mechanical disturbance.
Both machines have been in the same rack, and on the same UPS,
since they were installed some 30 months ago, with no sign of problems.

Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.

The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was
upgraded earlier this year.

What ARE the odds of two SCSI disk systems (disks + controllers)
both failing on the same day after 30 months? And is it a more
(or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?

It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?

For the machine to crash, that IO error has to be rather severe.
You have the block numbers, check them and try to find out if
they are in a file of crucial importance for the system to rely on.
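
As a first step towards checking the block numbers, the sector from the console message can be turned into a filesystem block number. The sketch below assumes the reported sector is counted from the start of the whole disk, that sectors are 512 bytes, and that the filesystem uses 4096-byte blocks; the example numbers are made up. On ext2/ext3, the `icheck` and `ncheck` commands in `debugfs` can then map a block to an inode and an inode to a path name.

```python
def sector_to_fs_block(abs_sector, partition_start_sector,
                       fs_block_size=4096, sector_size=512):
    """Convert a whole-disk sector number into a block number relative to
    the start of the filesystem on the partition that contains it.

    The defaults (512-byte sectors, 4096-byte blocks) are assumptions;
    check the real values for the disk and filesystem in question.
    """
    if abs_sector < partition_start_sector:
        raise ValueError("sector lies before the start of this partition")
    offset_bytes = (abs_sector - partition_start_sector) * sector_size
    return offset_bytes // fs_block_size


if __name__ == "__main__":
    # Hypothetical values: a partition starting at sector 63 and an
    # error reported at whole-disk sector 1000063.
    print(sector_to_fs_block(1000063, 63))
```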

Rod Speed · Jun 29, 2005

Perhaps the parts (HDs and/or controllers, etc.) were manufactured in the same
batch and had a defect that kicks in like clockwork.

Very unlikely indeed.

Plato · Jun 29, 2005

Robert said:
7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

We had 2 PCs in the office built the same day with the same parts. Both
hard drives failed on the same day. They were both on 24/7. Upon calling
WD, it turned out that both hard drives were made on the same day.

Plato · Jun 29, 2005

Rod said:
Very unlikely indeed.

It was the day a disgruntled employee took some valve grinding compound
to work.

Rod Speed · Jun 29, 2005

Plato said:
It was the day a disgruntled employee took
some valve grinding compound to work.

Again, very unlikely indeed. That would produce
something visible in the SMART stats.

Rod Speed · Jun 29, 2005

Plato said:
We had 2 PCs in the office built the same day with the same parts. Both
hard drives failed on the same day. They were both on 24/7. Upon calling
WD, it turned out that both hard drives were made on the same day.

But that would have produced SMART data; his didn't.

Mike Tomlinson · Jun 29, 2005

Robert Inder said:
Just after midnight, one server crashes: nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

Is the swap partition at the end of the disk?
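
Presumably the point is that a failed read from swap could bring a machine down without anything reaching the logs, and a large sector number suggests the far end of the disk. A minimal sketch of checking which partition a sector falls in is given below. The partition layout is entirely hypothetical; the real start and end sectors would have to come from the partition table on the servers.

```python
from typing import List, NamedTuple, Optional


class Partition(NamedTuple):
    name: str
    start: int  # first sector, counted from the start of the disk
    end: int    # last sector, inclusive
    kind: str   # free-text description, e.g. "swap"


def find_partition(sector: int, partitions: List[Partition]) -> Optional[Partition]:
    """Return the partition containing `sector`, or None if it falls
    outside every listed partition."""
    for part in partitions:
        if part.start <= sector <= part.end:
            return part
    return None


# Hypothetical layout for illustration only.
EXAMPLE_LAYOUT = [
    Partition("sda1", 63, 208844, "boot"),
    Partition("sda2", 208845, 70011269, "ext3 root"),
    Partition("sda3", 70011270, 71119754, "swap"),
]

if __name__ == "__main__":
    hit = find_partition(70500000, EXAMPLE_LAYOUT)
    print(hit.name if hit else "not in any partition")
```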

Ron Reaugh · Jun 29, 2005

Plato said:
It was the day a disgruntled employee took some valve grinding compound
to work.

That wouldn't affect a HD.

Joeshmo · Jun 30, 2005

Sounds like a brown-out. I had several devices go out in one computer for
literally no reported reason. The last thing I did was test the voltage, and I
found that the socket wasn't supplying enough power. Also, UPSs can do the same
when their battery goes bad.
