Am I endangering my RAID 5 array?


Yeechang Lee

At home I have a Linux desktop and a headless Linux fileserver with a
software RAID 5 array (see
<URL:http://groups.google.ca/group/comp.sys.ibm.pc.hardware.storage/msg/f8479484a5254f5d>
for details).

A few days ago the UPS battery conked out. Besides the battery alarm
ringing every ten seconds and slowly driving me mad, the fileserver is
spontaneously rebooting several times a day; apparently momentary dips
and other irregularities in the power here in downtown San Francisco,
which the UPS had before filtered (and which likely prematurely aged
the battery after replacement only 19 months ago), are causing it to
reboot. (Interestingly, the Linux desktop hasn't hiccuped once;
apparently its power supply is less sensitive.)

The storage array is a pretty straightforward
JFS-on-LVM2-on-software-RAID 5 setup. Each time the server reboots it
usually causes the array to automatically rebuild. Sometimes the
reboots occur during the rebuilding process, causing it to restart.

My question: Am I risking data corruption through the repeated
rebuilds? Should I just shut the server down until the replacement UPS
battery arrives? Or, given that there haven't actually been any
hardware drive failures, are the RAID structure and filesystem robust
enough in the meanwhile?
 
Yeechang Lee wrote
At home I have a Linux desktop and a headless
Linux fileserver with a software RAID 5 array (see
<URL:http://groups.google.ca/group/comp.sys.ibm.pc.hardware.storage/msg/f8479484a5254f5d>
for details).
A few days ago the UPS battery conked out. Besides the battery alarm
ringing every ten seconds and slowly driving me mad, the fileserver is
spontaneously rebooting several times a day; apparently momentary
dips and other irregularities in the power here in downtown San Francisco,
which the UPS had before filtered (and which likely prematurely aged
the battery after replacement only 19 months ago), are causing it to reboot.

It's much more likely the UPS itself is the cause of the reboots.
(Interestingly, the Linux desktop hasn't hiccuped once;
apparently its power supply is less sensitive.)

Yeah, that's not unusual.
The storage array is a pretty straightforward
JFS-on-LVM2-on-software-RAID 5 setup. Each time the server reboots it
usually causes the array to automatically rebuild. Sometimes the
reboots occur during the rebuilding process, causing it to restart.
My question: Am I risking data corruption through the repeated rebuilds?

Yes.

Should I just shut the server down until the replacement UPS battery arrives?

It would be better to just plug it into the mains without the UPS.
Or, given that there haven't actually been any
hardware drive failures, are the RAID structure
and filesystem robust enough in the meanwhile?

You're risking a reboot while writing, and with some drives that
can leave significant turds on the disk.
 
Rod said:
It's much more likely the UPS itself is the cause of the reboots.

Makes sense. Or, to put it more accurately, likely there are
fluctuations in the power which are relatively harmless in real life
but which the UPS dutifully tries to fix up anyway, but of course
can't because the battery is out (it's a nice model, in which the
battery is always providing the power regardless of whether power is
actually available or not, thus eliminating downtime when the power
does go out).
It would be better to just plug it into the mains without the UPS.

I'll make the switch when I get home.

On the other hand, the only thing that's writing on the drive at the
moment is BitTorrent downloads, and that is an inherently
self-correcting mechanism, so I'm not too worried.

Filesystemwise, I'll run a fsck of the entire RAID once I remove the
UPS from the equation. (I'm curious as to how long it'll take on a
2.8TB array!)
 
Yeechang Lee wrote
Rod Speed wrote
Makes sense. Or, to put it more accurately, likely there
are fluctuations in the power which are relatively harmless
in real life but which the UPS dutifully tries to fix up
anyway, but of course can't because the battery is out

It's much more likely that it isn't actually attempting to switch
to the battery due to sags in the mains at that high rate.
(it's a nice model, in which the battery is always providing the
power regardless of whether power is actually available or not,

And that's why it's likely that it's not sags in the mains, just
the UPS not being able to work properly with failing batteries.
It's likely got a shorted cell, and that means that the voltage
available from the battery isn't enough to provide a high enough
UPS output voltage to keep the server power supply happy now.
thus eliminating downtime when the power does go out).

Yeah, always-on UPSes are by far the best approach.

Though they do have that downside if the batteries have gone bad.

I bet the reason the server reboots and the desktop
doesn't is just because the server has a much higher
load on its power supply, and so its internal caps can't
ride through much of a sag in the mains it sees from the UPS.
I'll make the switch when I get home.
On the other hand, the only thing that's writing on
the drive at the moment is BitTorrent downloads,

That's not correct: the rebuilds are writing to the drives too.
and that is an inherently self-correcting
mechanism, so I'm not too worried.

Sure, the main potential problem is that some drives
don't handle a power down while writing very well and
can produce bad sectors on the drive as a result of that.
Filesystemwise, I'll run a fsck of the entire RAID
once I remove the UPS from the equation. (I'm
curious as to how long it'll take on a 2.8TB array!)

Yeah, it will be an interesting test.
 
Yeechang Lee wrote:

....
On the other hand, the only thing that's writing on the drive at the
moment is BitTorrent downloads, and that is an inherently
self-correcting mechanism, so I'm not too worried.

A more insidious problem could be corrupted parity data, which you might
never see until some other failure occurred and you suddenly had to
depend upon it. Validating (or, if that's not an available option, just
forcing a complete rebuild of) the parity data after you've eliminated
the problem of frequent restarts might be prudent (if you're truly
paranoid, you'll back all the data up first).

- bill
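
To illustrate the point about corrupted parity going unnoticed, here is a minimal Python sketch of RAID 5 style XOR parity on a single stripe. It is a toy model, not the Linux md implementation: the four-byte blocks and the helper name xor_blocks are made up for the example, and the "interrupted write" is simulated by simply not updating the parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe, plus the parity block stored alongside them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Healthy case: lose block 1, rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]

# Interrupted write (simulated): block 0 reached the disk,
# but the matching parity update did not.
data[0] = b"ZZZZ"
stale_parity = parity

# All three data blocks still read back fine, so nothing looks wrong,
# until block 1 is lost and has to be rebuilt from the stale parity.
bad_rebuild = xor_blocks([data[0], data[2], stale_parity])
print(bad_rebuild == b"BBBB")  # False: the rebuilt block is silently wrong
```

Normal reads never touch the parity block, which is why the damage only shows up once a drive actually fails.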
 
Bill said:
A more insidious problem could be corrupted parity data, which you
might never see until some other failure occurred and you suddenly
had to depend upon it. Validating (or, if that's not an available
option, just forcing a complete rebuild of) the parity data after
you've eliminated the problem of frequent restarts might be prudent

Good point. However, I'm not aware of a way of forcing a parity
rebuild in Linux software RAID except for marking a drive as failed
then reinserting it into the array. In any case, the resulting resync
shouldn't be any different than the automatic postboot resyncing the
array is doing right now (after indeed having eliminated the
random-restarting problem by bypassing the faulty UPS), right?
(if you're truly paranoid, you'll back all the data up first).

If you know of a cost- and time-effective way of backing up a 2.8TB
storage array being used for personal purposes, please let me
know. I'm not being flippant; if there is such a thing, I'd really
like to know! But I'm pretty sure there isn't one.
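
For what it's worth, the md driver in reasonably recent 2.6 kernels exposes a per-array sync_action file under sysfs: writing "check" to it asks md to read every stripe and count inconsistencies in mismatch_cnt, without failing a drive first. Whether the kernel on this particular server is new enough is an assumption. A rough Python sketch follows; the array name md0 and the helper function names are hypothetical, and the sysfs_root parameter exists only so the functions can be pointed at a scratch directory for testing.

```python
from pathlib import Path

def md_dir(array="md0", sysfs_root="/sys"):
    """Directory holding the md control files for one array."""
    return Path(sysfs_root) / "block" / array / "md"

def start_check(array="md0", sysfs_root="/sys"):
    """Ask md to read every stripe and count mismatches (needs root)."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")

def sync_state(array="md0", sysfs_root="/sys"):
    """Current action, e.g. 'idle', 'check', 'resync', 'recover', 'repair'."""
    return (md_dir(array, sysfs_root) / "sync_action").read_text().strip()

def mismatch_count(array="md0", sysfs_root="/sys"):
    """Sectors md found inconsistent during the last check."""
    return int((md_dir(array, sysfs_root) / "mismatch_cnt").read_text())
```

A typical sequence would be start_check(), polling sync_state() until it returns 'idle' again, then reading mismatch_count(). Writing "repair" instead of "check" tells md to rewrite the parity wherever it disagrees with the data.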
 
Yeechang said:
Good point. However, I'm not aware of a way of forcing a parity
rebuild in Linux software RAID except for marking a drive as failed
then reinserting it into the array. In any case, the resulting resync
shouldn't be any different than the automatic postboot resyncing the
array is doing right now (after indeed having eliminated the
random-restarting problem by bypassing the faulty UPS), right?

I'm not sufficiently familiar with the Linux design to say. If it makes
no attempt to log what it's doing and simply does a brute-force complete
rebuild of *all* the parity information after an interruption, then yes.
If you know of a cost- and time-effective way of backing up a 2.8TB
storage array being used for personal purposes, please let me
know. I'm not being flippant; if there is such a thing, I'd really
like to know! But I'm pretty sure there isn't one.

What is cost- and time-effective really depends upon the relationship
between the value you place on your data and the value you place on
other things. Or, to look at it another way, data you don't back up is
by definition not worth backing up (which means that any data that *is*
worth backing up must be placed on storage which it is feasible to back up).

A solid RAID implementation has its own built-in paranoia and shouldn't
make an already-bad situation worse during a rebuild (e.g., if it finds
a hard-to-read sector it will *really* try to read it rather than
immediately go to the rest of the stripe to rebuild it, just in case
whatever affected the original sector may have left something in the
rest of the stripe - parity being the most obvious possibility, since it
would have been being written at about the same time - inconsistent as
well). How solid the Linux implementation is in this regard I don't know.

- bill
 
In comp.sys.ibm.pc.hardware.storage Yeechang Lee said:
At home I have a Linux desktop and a headless Linux fileserver with a
software RAID 5 array (see
<URL:http://groups.google.ca/group/comp.sys.ibm.pc.hardware.storage/msg/f8479484a5254f5d>
for details).
A few days ago the UPS battery conked out. Besides the battery alarm
ringing every ten seconds and slowly driving me mad, the fileserver is
spontaneously rebooting several times a day; apparently momentary dips
and other irregularities in the power here in downtown San Francisco,
which the UPS had before filtered (and which likely prematurely aged
the battery after replacement only 19 months ago), are causing it to
reboot. (Interestingly, the Linux desktop hasn't hiccuped once;
apparently its power supply is less sensitive.)
The storage array is a pretty straightforward
JFS-on-LVM2-on-software-RAID 5 setup. Each time the server reboots it
usually causes the array to automatically rebuild. Sometimes the
reboots occur during the rebuilding process, causing it to restart.
My question: Am I risking data corruption through the repeated
rebuilds? Should I just shut the server down until the replacement UPS
battery arrives? Or, given that there haven't actually been any
hardware drive failures, are the RAID structure and filesystem robust
enough in the meanwhile?


This is a bit strange. Is this still a 2.4.x or older 2.6.x kernel? The
newer ones only rebuild if the array was dirty.

On the question of risk: Yes, there is some risk, but more that the
JFS gets corrupted (unless you have switched write buffering off on
the disks, which AFAIK cannot be done reliably at the moment) than that
the array itself dies. At least that is my intuition with Linux software
RAID. You also have a pretty high risk of the PSU in that system
dying, so I would take the machine offline.

Arno
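
One way to see whether the array really is resyncing after each reboot is to look at /proc/mdstat, which shows a progress line while a resync or recovery is running. The Python sketch below pulls that out; the function name resync_progress is hypothetical, and the sample text is made up in the usual mdstat layout rather than taken from the poster's machine.

```python
import re

# Matches the progress part of a line such as "resync = 12.6% (...)".
PROGRESS = re.compile(r"(resync|recovery|check|repair)\s*=\s*([\d.]+)%")

def resync_progress(mdstat_text):
    """Map each array name to (action, percent done) if it is syncing."""
    progress, current = {}, None
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            current = header.group(1)  # remember which array we are in
        match = PROGRESS.search(line)
        if match and current:
            progress[current] = (match.group(1), float(match.group(2)))
    return progress

# Illustrative sample only; the numbers are invented.
sample = """\
Personalities : [raid5]
md0 : active raid5 sdd1[3] sdc1[2] sdb1[1] sda1[0]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      [==>..................]  resync = 12.6% (123456/976759936) finish=127.5min speed=33440K/sec

unused devices: <none>
"""
print(resync_progress(sample))
```

On the server itself the input would come from open("/proc/mdstat").read(). An empty result right after a clean reboot would suggest the array was not marked dirty.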
 