Bad sectors/blocks - automating discovery of hard drives 'going bad'


Phil

I'm not sure if this is the right group for this discussion, but I had
a couple questions in relation to bad sectors and the correlation of a
hard drive nearing a point of failure.


We currently use software to monitor, among other things, event log
errors on Windows machines. Windows will write error messages to the
system log when it finds a bad disk block. Sometimes these come in
large numbers (groups of 10+ messages at a time) and/or appear
frequently even after running, say, chkdsk.


My questions primarily reside in the nature of stand-alone IDE or SATA
hard drives, not RAID configurations of any sort, though not sure of
potential SMART status given that I'm thinking in very general terms
with a large amount of different computers & networks. How accurate
are the Windows event log messages in indicating that a hard drive has
a good potential of going bad soon and should be replaced? Is there a
threshold of sorts? Are there better software tools (small Linux-
distro utilities, perhaps) to monitor the actual physical health of a
disk, or to get a better picture of disk health going forward?


In general, I'm looking for a good way to automate disk health
checking in order to accurately tell a client "You need to buy a new
hard drive" before the disk itself is mucked past the point of simple
data backup/recovery operations.
 
Phil said:
I'm not sure if this is the right group for this discussion,

Yes it is.
but I had a couple questions in relation to bad sectors and
the correlation of a hard drive nearing a point of failure.
We currently use software to monitor, among other things,
event log errors on Windows machines. Windows will write
error messages to the system log when it finds a bad disk block.
Sometimes these come in large numbers (groups of 10+ messages
at a time) and/or appear frequently even after running, say, chkdsk.

The drive's own SMART data is a much better record of bad sectors as they show up.

Everest shows that data most readably and you need to
focus on the actual numbers reported, not just the OKs.
http://www.majorgeeks.com/download.php?det=4181
My questions primarily reside in the nature of stand-alone IDE or SATA
hard drives, not RAID configurations of any sort, though not sure of
potential SMART status given that I'm thinking in very general terms
with a large amount of different computers & networks. How accurate
are the Windows event log messages in indicating that a hard drive
has a good potential of going bad soon and should be replaced?

Nowhere near as good as the SMART data.
Is there a threshold of sorts?

Yes, one or two reallocated sectors are nothing to worry about. Many
more than that, with more showing up over time, is an indication that
something is going bad. Not necessarily the hard drive though; it can be just
the drive running at too high a temperature or a power supply going bad.
Are there better software tools (small Linux- distro utilities,
perhaps) to monitor the actual physical health of a disk,
or to get a better picture of disk health going forward?

Yes, Everest or smartctl.
 
Previously Phil said:
I'm not sure if this is the right group for this discussion, but I had
a couple questions in relation to bad sectors and the correlation of a
hard drive nearing a point of failure.

We currently use software to monitor, among other things, event log
errors on Windows machines. Windows will write error messages to the
system log when it finds a bad disk block. Sometimes these come in
large numbers (groups of 10+ messages at a time) and/or appear
frequently even after running, say, chkdsk.

My questions primarily reside in the nature of stand-alone IDE or SATA
hard drives, not RAID configurations of any sort, though not sure of
potential SMART status given that I'm thinking in very general terms
with a large amount of different computers & networks. How accurate
are the Windows event log messages in indicating that a hard drive has
a good potential of going bad soon and should be replaced?

Not very.
Is there a threshold of sorts?

No.

Are there better software tools (small Linux-
distro utilities, perhaps) to monitor the actual physical health of a
disk, or to get a better picture of disk health going forward?

Definitely. For bad sectors, look at the reallocated sector count in the
SMART attributes. It will give you a far more accurate bad sector
estimate than the event log, since marginal sectors are counted there as well.
You can also look for other SMART attributes that have exceeded their
thresholds or look suspicious.
The tool for this is smartmontools: smartd does the automatic monitoring
(actions and thresholds are user-defined) and smartctl is for
direct querying.
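
For anyone who wants to pull the number out programmatically, here is a minimal Python sketch of reading the reallocated sector count from the attribute table that `smartctl -A` prints. The function names and the SAMPLE text are made up for illustration (the values in SAMPLE are invented, not taken from a real drive), and the attribute name `Reallocated_Sector_Ct` is the one smartmontools commonly shows for attribute 5; some drives label it differently, so treat that as an assumption.

```python
import re
import subprocess

# One row of the attribute table printed by `smartctl -A`:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\S+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)

# Illustrative sample only; the numbers are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
194 Temperature_Celsius     0x0022   038   045   000    Old_age   Always       -       38 (0 14 0 0 0)
"""


def parse_attributes(text):
    """Return {attribute_name: fields} for every table row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            ident, name, _flag, value, worst, thresh, typ, _upd, failed, raw = m.groups()
            attrs[name] = {
                "id": int(ident), "value": int(value), "worst": int(worst),
                "thresh": thresh, "type": typ, "when_failed": failed, "raw": raw,
            }
    return attrs


def reallocated_count(text):
    """Return the raw reallocated sector count, or None if it is not listed."""
    row = parse_attributes(text).get("Reallocated_Sector_Ct")
    if row is None:
        return None
    m = re.match(r"\d+", row["raw"])
    return int(m.group()) if m else None


def read_smart(device):
    """Run `smartctl -A <device>` and return its text output.

    smartctl uses its exit status as a bitmask, so a non-zero status is
    not treated as a fatal error here.
    """
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return result.stdout
```

On a real machine this would be used as `reallocated_count(read_smart("/dev/sda"))`, where the device path is just an example and smartctl normally needs administrator rights.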
In general, I'm looking for a good way to automate disk health
checking in order to accurately tell a client "You need to buy a new
hard drive" before the disk itself is mucked past the point of simple
data backup/recovery operations.

What has worked well for me is to monitor the
reallocated sector count for an increase of, say, more than 10 in a
week, and the other attributes for an exceeded threshold. I have smartd send
email when the reallocated count increases. It is also a good idea to
run a full SMART self-test (smartctl -t long <device>) regularly.
I usually run one every 14 days from a cron job (anacron for
machines that are not always on). YMMV.
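
The "more than 10 in a week" rule can be sketched roughly like this in Python, assuming the readings are kept somewhere as (timestamp, count) pairs. The function names, the 7-day window and the limit of 10 are illustrative defaults taken from the rule of thumb above, not anything smartd does internally.

```python
from datetime import datetime, timedelta


def increase_within(history, now, window=timedelta(days=7)):
    """Growth of the count between the oldest and newest sample in the window.

    history is a list of (datetime, count) pairs in any order. Fewer than
    two samples inside the window means no growth can be measured, so 0
    is returned.
    """
    recent = sorted((t, c) for t, c in history if now - window <= t <= now)
    if len(recent) < 2:
        return 0
    return recent[-1][1] - recent[0][1]


def should_alert(history, now, limit=10):
    """True if the reallocated count grew by more than limit in the last week."""
    return increase_within(history, now) > limit
```

For example, with readings of 4, 9 and 16 reallocated sectors spread over the last six days, the growth is 12 and `should_alert` returns True. Older readings are ignored, so a drive that picked up a handful of reallocations long ago and then stayed stable does not trigger anything.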

Arno
 
Thanks for the tips. I'll have to mess around with smartctl & smartd
more to figure out how to enumerate the reallocated sector count (if I
can get enough information from just smartctl, that'd be best, for I
can handle things like scheduling and automated email alerts
elsewhere) and any other pertinent SMART data I would need.
 
Previously Phil said:
Thanks for the tips. I'll have to mess around with smartctl & smartd
more to figure out how to enumerate the reallocated sector count (if I
can get enough information from just smartctl, that'd be best, for I
can handle things like scheduling and automated email alerts
elsewhere) and any other pertinent SMART data I would need.

That is definitely possible. I used to have a cron job that ran
smartctl every hour and evaluated the results with a Perl script against
the stored previous values. It took about a day to write and ran
for several years on 24 PCs without problems.

Arno
 
Did you just run a regex against specific line(s) of the smartctl -a
output? I was thinking something along those lines, or a conditional
on WHEN_FAILED and TYPE = Pre-fail.
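
A rough Python sketch of that conditional, assuming the standard ten-column attribute table from smartctl where WHEN_FAILED shows "-" for attributes that have never crossed their threshold. The function name and the sample rows in the usage note are invented for illustration.

```python
def failing_prefail(text):
    """Return (name, when_failed) for Pre-fail attributes flagged as failed.

    Splits each row into at most ten whitespace-separated fields so the
    RAW_VALUE column keeps any embedded spaces. The header line is skipped
    because its first field ("ID#") is not a number.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            name, attr_type, when_failed = fields[1], fields[6], fields[8]
            if attr_type == "Pre-fail" and when_failed != "-":
                hits.append((name, when_failed))
    return hits
```

Fed a table in which `Reallocated_Sector_Ct` is marked `FAILING_NOW`, this returns that one attribute and ignores Old_age rows. One caveat: WHEN_FAILED is only set once the normalized value has dropped to the vendor threshold, which tends to be late, so a check like this probably works best alongside a watch on the raw reallocated count rather than instead of it.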


I'm not sure which will take more time - getting smartd to run how I'd
want it (I'd like to run smartd selectively, if anything, so the
service wasn't running at all times on the machines...but still have
it able to throw errors to the event log), grinding my teeth through
trying to do regex in VB so I can easily call a cscript <script.vbs>
on Windows client machines, or touching up on my perl and distributing
a small Windows-based perl compiler out to all the managed
workstations so I can run a script.
 
Previously Phil said:
Did you just run a regex against specific line(s) of the smartctl -a
output? I was thinking something along those lines, or a conditional
on WHEN_FAILED and TYPE = Pre-fail.

I basically isolated temperature and reallocated count with regexps.
I'm not sure which will take more time - getting smartd to run how I'd
want it (I'd like to run smartd selectively, if anything, so the
service wasn't running at all times on the machines...but still have
it able to throw errors to the event log), grinding my teeth through
trying to do regex in VB so I can easily call a cscript <script.vbs>
on Windows client machines, or touching up on my perl and distributing
a small Windows-based perl compiler out to all the managed
workstations so I can run a script.

If you have to do this on Windows, I would suggest trying
out smartd first, although I do not know whether it can send
email on Windows. If you write something yourself, it is probably
best to install Perl for Windows, since regexps in Perl
are superior to any other implementation I have seen.

Arno
 