Robert Inder
Two 2U servers in a rack in a machine room. One does the real work,
the other is a "warm" standby. They are about 30 months
old, with SCSI disks.
Just after midnight, one server crashes: nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.
So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.
(The disk's SMART data subsequently reports that it is fine, and has never
had a bad sector or been out of its normal temperature range.)
At 7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.
So what is going on?
Was this just a coincidence? Two independent failures that happened to be
on the same day?
Or did something cause both crashes? If so, what could it have been?
We have no reason to suspect any kind of mechanical disturbance. Both
machines have been in the same rack, and on the same
UPS, since they were installed some 30 months ago, with no sign of
problems.
Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.
The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was upgraded earlier
this year.
What ARE the odds of two SCSI disk systems (disks +
controllers) both failing on the same day after 30 months? And is it
a more (or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?
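
To put a rough number on the "coincidence" case, here is a minimal sketch of the sum. It assumes the two systems fail independently at a constant rate, and asks: given that one has just failed, what is the chance the other fails within the next 24 hours? The annual failure rates are illustrative guesses, not measured figures for these machines, and the function name is just something made up for the example.

```python
import math

def prob_failure_within(hours: float, annual_failure_rate: float) -> float:
    """Chance that one system fails within `hours`, assuming a constant
    (exponential) failure rate matching the given annual failure rate."""
    hours_per_year = 365.25 * 24
    # Convert the annual failure probability into a constant hourly hazard rate.
    rate_per_hour = -math.log(1.0 - annual_failure_rate) / hours_per_year
    return 1.0 - math.exp(-rate_per_hour * hours)

# ASSUMPTION: illustrative annual failure rates for a disk + controller,
# chosen only to show the order of magnitude.
for afr in (0.02, 0.05, 0.10):
    p = prob_failure_within(24, afr)
    print(f"AFR {afr:.0%}: P(other system fails within 24h) = {p:.6f} "
          f"(about 1 in {1 / p:,.0f})")
```

The sum only covers independent failures. A shared cause (power, temperature, vibration, identical firmware or the same batch of disks) breaks the independence assumption, which is the point of the question.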
It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?
Robert.