Two Disk Failures: a Coincidence?

  • Thread starter: Robert Inder

Robert Inder

Two 2U servers in a rack in a machine room. One does the real work,
the other is a "warm" standby. They are about 30 months
old, with SCSI disks.

Just after midnight, one server crashes: nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.

((The disk subsequently tells SMART that it is fine, and has never had
a bad sector, or been out of normal temperature range))

7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

So what is going on?

Was this just a coincidence? Two independent failures that happened to be
on the same day?

Or did something cause both crashes? If so, what could it have been?

We have no reason to suspect any kind of mechanical disturbance. Both
machines have been in the same rack, and on the same
UPS, since they were installed some 30 months ago, with no sign of
problems.

Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.

The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was upgraded
earlier this year.

What ARE the odds of two SCSI disk systems (disks +
controllers) both failing on the same day after 30 months? And is it
a more (or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?
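
A rough way to put a number on the "pure coincidence" reading is sketched below. It assumes the two disk systems fail independently at a constant rate, and the 5% annual failure rate is an illustrative guess, not a measured figure for these servers. Given that one machine has already failed, it estimates the chance that the other fails within the next 19 hours.

```python
import math

HOURS_PER_YEAR = 24 * 365


def prob_failure_within(window_hours, annual_failure_rate):
    """Chance that one disk system fails within window_hours.

    Assumes a constant failure rate (exponential model), which ignores
    wear-out and shared causes such as power or a common batch defect.
    """
    # Convert the annual failure probability into a per-hour hazard rate.
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * window_hours)


# ASSUMPTION: 5% annual failure rate per disk system (disk + controller).
p = prob_failure_within(window_hours=19, annual_failure_rate=0.05)
print(f"chance of an independent second failure within 19 hours: {p:.6f}")
```

Under those assumptions the chance comes out at roughly 1 in 9,000, so an independent second failure is possible but unlikely, and a shared cause (power, cooling, a common batch) deserves a serious look. A different assumed failure rate shifts the figure roughly in proportion.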

It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?

Robert.
 
Perhaps the parts (HDs and/or controllers, etc.) were manufactured in the
same batch and had a defect that kicks in like clockwork.

--Dan
 
Robert Inder said:
Two 2U servers in a rack in a machine room. One does the real work, the
other is a "warm" standby. They are about 30 months old, with SCSI disks.

Just after midnight, one server crashes:
nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

A crash on a single IO error: that in itself is suspicious.

So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.

((The disk subsequently tells SMART that it is fine, and has never had
a bad sector, or been out of normal temperature range))

7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

So what is going on?

Was this just a coincidence? Two independent failures that happened
to be on the same day?

Or did something cause both crashes? If so, what could it have been?

We have no reason to suspect any kind of mechanical disturbance.
Both machines have been in the same rack, and on the same UPS,
since they were installed some 30 months ago, with no sign of problems.

Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.

The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was upgraded
earlier this year.

What ARE the odds of two SCSI disk systems (disks + controllers)
both failing on the same day after 30 months? And is it a more
(or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?

It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?

For the machine to crash, that IO error has to be rather severe.
You have the block numbers, check them and try to find out if
they are in a file of crucial importance for the system to rely on.
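
As a sketch of that check: the sector number from the console first has to be turned into a filesystem block number. The helper below is hypothetical and assumes the reported sector is relative to the whole disk, 512-byte sectors, and a 4096-byte filesystem block size; the partition start used in the example is made up and would really come from the partition table.

```python
def sector_to_fs_block(sector, partition_start_sector,
                       fs_block_size=4096, sector_size=512):
    """Map a disk sector number to a block number inside a partition.

    ASSUMPTIONS: the sector is counted from the start of the disk, and
    the sector and block sizes match the defaults unless overridden.
    """
    if sector < partition_start_sector:
        raise ValueError("sector lies before the start of this partition")
    # Byte offset of the failing sector from the start of the partition.
    byte_offset = (sector - partition_start_sector) * sector_size
    return byte_offset // fs_block_size


# Illustrative values only: partition starting at sector 63.
block = sector_to_fs_block(sector=1000063, partition_start_sector=63)
print(block)
```

If the filesystem is ext2 or ext3 (an assumption, since the thread does not say), the resulting block number can be given to debugfs: `icheck <block>` reports the inode that owns it, and `ncheck <inode>` reports the path name.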
 
Robert said:
7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

We had 2 PCs in the office built the same day with the same parts. Both
hard drives failed on the same day. They were both on 24/7. Upon calling
WD, it turned out that both hard drives were made on the same day.
 
Plato said:
We had 2 PCs in the office built the same day with the same parts. Both
hard drives failed on the same day. They were both on 24/7. Upon calling
WD, it turned out that both hard drives were made on the same day.

But that would have produced SMART data; his didn't.
 
Sounds like a brown-out. I had several devices go out in one computer for
literally no reported reason. The last thing I tried was testing the voltage,
and I found that the socket wasn't supplying enough power. Also, UPSs can do
the same when their battery goes bad.
 