SATA drives acting up, is Nforce to blame?

  • Thread starter: Yousuf Khan

Yousuf Khan

For the second time in a row, my system is acting up, locking up, or
just freezing briefly, sometimes even file corruption. Each time, the
problems seem to start a few days after I install a second SATA hard
drive on the system. It seems to be fine, so long as there is only one
SATA drive installed. The system uses an ASUS M2NPV-VM AM2 motherboard
with the Nvidia Nforce 430 chipset. First time this happened, I RMA'ed
one of the hard drives, but now I'm thinking it's the chipset's fault.
Nvidia chipsets were known to have crappy SATA support:

Nvidia SATA driver bricks Windows - The Inquirer
http://www.theinquirer.net/inquirer/news/546/1000546/nvidia-sata-driver-bricks

Now, I'm not using any of the RAID features on the board, just JBOD. I'm
even getting lock ups whenever I try to read one of the drive's SMART
status using HD Sentinel!

Yousuf Khan
 
Yousuf Khan wrote:
:: For the second time in a row, my system is acting up, locking up,
:: or just freezing briefly, sometimes even file corruption. Each
:: time, the problems seem to start a few days after I install a
:: second SATA hard drive on the system. It seems to be fine, so long
:: as there is only one SATA drive installed. The system uses an ASUS
:: M2NPV-VM AM2 motherboard with the Nvidia Nforce 430 chipset. First
:: time this happened, I RMA'ed one of the hard drives, but now I'm
:: thinking it's the chipset's fault. Nvidia chipsets were known to
:: have crappy SATA support:
::
:: Nvidia SATA driver bricks Windows - The Inquirer
::
http://www.theinquirer.net/inquirer/news/546/1000546/nvidia-sata-driver-bricks
::
:: Now, I'm not using any of the RAID features on the board, just
:: JBOD. I'm even getting lock ups whenever I try to read one of the
:: drive's SMART status using HD Sentinel!
::
Yousuf, interesting post. I've got an Asus M2N here that I built about a
year ago. I installed two Samsung Spinpoint 500GB SATA drives internally,
plus have an external USB drive. Almost off the bat I saw the occasional
file corruption complaints in Event Viewer, and running chkdsk /f confirmed
cross-linked files and all that jazz. I attributed it to DiskKeeper 2008
that, every time I ran "Boot time Defrag" would later end up with file
corruption. Of course I stopped doing that and haven't seen cross-linked
files since but still run chkdsk at least once a week as I'm still paranoid
about the situation.

Now I see your post WRT corruption and also that you're running a very
similar mobo with the same chipset, and I have to say to myself,
"Hmmmm....."

Sorry I can't be of help but just wanted to confirm corruption problems on
my end also. Thanks for the link.

Jack
 
Ato_Zee said:
I've had trouble with JBOD.
P5VD2-X C:\ IDE OS drive, and SATA D:\ data drive works,
can add a second SATA drive on the SATA2 connector
no problems (actually via a wired 1to1 to a 1TB eSATA
using a SATA to eSATA connector converter).
Using the board's eSATA backplate connector, JBOD controller
problems.
So uninstalled the RAID/JBOD support, all runs fine.
Via chipset mobo.
During boot the SATA controller reports
"No any drive found" but the BIOS sees it and it works
so I guess "No any drive found" might be from the JBOD
controller. It appeared after I uninstalled it.

JBOD means Just a Bunch Of Disks, i.e. meaning there is no RAID, just
using them as regular disks.

Yousuf Khan
 
JBOD means Just a Bunch Of Disks, i.e. meaning there is no RAID, just
using them as regular disks.

JBOD has no fixed meaning. It can mean RAID APPEND mode or
non-raided individual disks. The latter is sometimes also called
RAW mode.

Arno
 
For the second time in a row, my system is acting up, locking up, or
just freezing briefly, sometimes even file corruption. Each time, the
problems seem to start a few days after I install a second SATA hard
drive on the system. It seems to be fine, so long as there is only one
SATA drive installed. The system uses an ASUS M2NPV-VM AM2 motherboard
with the Nvidia Nforce 430 chipset. First time this happened, I RMA'ed
one of the hard drives, but now I'm thinking it's the chipset's fault.
Nvidia chipsets were known to have crappy SATA support:

I have possible anecdotal support for that. I was using software RAID 1 on a
Tyan S2895, which uses an nForce 2000 series chipset, and the root FS would go
read-only after a while, which was followed by an eventual system lockup.

Throwing in a $200 RAID card solved the problem.

Could have been just lousy Linux drivers.

On the other hand, my current desktop system has the nForce 590 chipset, and
none of the connected SATA drives have had any corruption problems.
 
Mike Ruskai said:
On or about Wed, 04 Mar 2009 10:56:10 -0500 did Yousuf Khan
dribble thusly:
I have possible anecdotal support for that. I was using software RAID 1 on a
Tyan S2895, which uses an nForce 2000 series chipset, and the root FS would go
read-only after a while, which was followed by an eventual system lockup.
Throwing in a $200 RAID card solved the problem.
Could have been just lousy Linux drivers.

Pretty unlikely. The read-only is from bus errors in the hardware.
Of course it is possible that the driver needs to compensate for a
lot of hardware problems, but I would not call that lousy.

The corruption is also a strong indicator of hardware
problems.

Arno
 
Mike said:
I have possible anecdotal support for that. I was using software RAID 1 on a
Tyan S2895, which uses an nForce 2000 series chipset, and the root FS would go
read-only after a while, which was followed by an eventual system lockup.

Throwing in a $200 RAID card solved the problem.

Could have been just lousy Linux drivers.

On the other hand, my current desktop system has the nForce 590 chipset, and
none of the connected SATA drives have had any corruption problems.

Interesting thing is that it's looking like I'm having none of these
issues when I run it under Linux. Unfortunately, I can't run this system
for too long in Linux because people in the house depend on this system
running in Windows, and I haven't had time to get a virtualization
solution going on it yet. So without some long-term running in Linux, I
can't really say for sure that these things don't happen in Linux, but
so far it seems like it doesn't happen in it.

So I thought that maybe the problem might be with the Windows drivers.
So I replaced the Nvidia SATA drivers with the standard Microsoft
IDE/SATA drivers. I am not getting the full lockups anymore, but I am
occasionally getting stutters and short freezes (lasting seconds rather
than minutes).

The SMART Current Pending Sectors Count went up on the two SATA drives
simultaneously. It went up to 1 on the SATA#1 drive, and it went up to 6
on the SATA#2 drive. The Pending Sectors Count seems to be back down to
zero again now, so I assume that the weak sectors have been remapped.
But the fact that it went up simultaneously on both drives must be
related to a problem further up the stream from the drives.

Another thing I noticed is that the system seems to freeze briefly
whenever I try to read the SMART readings on the SATA#2 drive, under
Windows. I'll go see what the SMART reading behaviour is like under
Linux later.

Yousuf Khan
 
Yousuf said:
Interesting thing is that it's looking like I'm having none of these
issues when I run it under Linux. Unfortunately, I can't run this system
for too long in Linux because people in the house depend on this system
running in Windows, and I haven't had time to get a virtualization
solution going on it yet. So without some long-term running in Linux, I
can't really say for sure that these things don't happen in Linux, but
so far it seems like it doesn't happen in it.

I've now done more extensive testing under Linux, and I can confirm that
the problem is not occurring at all under Linux (Ubuntu 8.10). But it
does see the errors that have already occurred through SMART. I've saved
the SMART reports under Linux and these are the states for the two SATA
drives:

SATA 1:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 168   161   021    Pre-fail Always  -           4600
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           51
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           118
 10 Spin_Retry_Count        0x0032 100   253   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           49
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           48
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           51
194 Temperature_Celsius     0x0022 102   099   000    Old_age  Always  -           45
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           1
198 Offline_Uncorrectable   0x0030 100   253   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 100   253   000    Old_age  Offline -           0
240 Head_Flying_Hours       0x0032 100   100   000    Old_age  Always  -           116
241 Unknown_Attribute       0x0032 200   200   000    Old_age  Always  -           1522680496
242 Unknown_Attribute       0x0032 200   200   000    Old_age  Always  -           502038374

SATA 2:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           873
  3 Spin_Up_Time            0x0027 162   153   021    Pre-fail Always  -           4883
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           114
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   051    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 098   098   000    Old_age  Always  -           1813
 10 Spin_Retry_Count        0x0032 100   100   051    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   100   051    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           114
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           111
193 Load_Cycle_Count        0x0032 200   200   000    Old_age  Always  -           114
194 Temperature_Celsius     0x0022 103   100   000    Old_age  Always  -           44
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           6
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   051    Old_age  Offline -           0

So as you can see SATA #1 has 1 Current Pending Sector, while SATA #2
has 6 of them.
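
For anyone who wants to compare saved reports like these without reading them by eye, here is a minimal Python sketch that pulls the raw values out of a smartctl-style attribute table. The function name `parse_smart_attributes` is made up for illustration, and the sample rows are copied from the SATA 2 report. It assumes every attribute row has the ten whitespace-separated columns shown in the header; rows are keyed by ID because attribute names can repeat (SATA 1 lists two `Unknown_Attribute` rows).

```python
def parse_smart_attributes(report):
    """Map attribute ID -> (name, raw value) for a smartctl-style attribute table."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10 columns;
        # the header row and blank lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        raw = fields[9]
        if raw.isdigit():
            attributes[int(fields[0])] = (fields[1], int(raw))
    return attributes


# Sample rows copied from the SATA 2 report.
sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 873
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 6
"""

attrs = parse_smart_attributes(sample)
name, pending = attrs[197]
print(name, pending)
```

Running the same function over both saved reports and comparing `attrs[197]` between runs would show whether the pending count moves on both drives at the same time.
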
So I thought that maybe the problem might be with the Windows drivers.
So I replaced the Nvidia SATA drivers with the standard Microsoft
IDE/SATA drivers. I am not getting the full lockups anymore, but I am
occasionally getting stutters and short freezes (lasting seconds rather
than minutes).

Going to the Microsoft drivers has helped tremendously, but the problems
are still occurring. They are maybe 10% as severe as they were under the
Nvidia drivers, not bad but still noticeable.
The SMART Current Pending Sectors Count went up on the two SATA drives
simultaneously. It went up to 1 on the SATA#1 drive, and it went up to 6
on the SATA#2 drive. The Pending Sectors Count seems to be back down to
zero again now, so I assume that the weak sectors have been remapped.
But the fact that it went up simultaneously on both drives must be
related to a problem further up the stream from the drives.

Now I'd like to get the Current Pending Sectors Count to go down again.
How would I go about getting them to go away? Should I schedule a
surface scan of the two disks, so that they can get read and written to
again?
Another thing I noticed is that the system seems to freeze briefly
whenever I try to read the SMART readings on the SATA#2 drive, under
Windows. I'll go see what the SMART reading behaviour is like under
Linux later.


I can confirm, the system freeze is still a problem under Windows, but
there's no such problem under Linux.

Yousuf Khan
 
I've now done more extensive testing under Linux, and I can confirm that
the problem is not occurring at all under Linux (Ubuntu 8.10). But it
does see the errors that have already occurred through SMART. I've saved
the SMART reports under Linux and these are the states for the two SATA
drives:
So as you can see SATA #1 has 1 Current Pending Sector, while SATA #2
has 6 of them.

Software (also drivers) should not be able to cause pending sectors.
My guess would be you have a problem with the disks or PSU and the
drivers are just more or less able to cope with them.
Going to the Microsoft drivers has helped tremendously, but the problems
are still occurring. They are maybe 10% as severe as they were under the
Nvidia drivers, not bad but still noticeable.
Now I'd like to get the Current Pending Sectors Count to go down again.
How would I go about getting them to go away? Should I schedule a
surface scan of the two disks, so that they can get read and written to
again?

The only thing that works reliably is writing to the sectors in
question. You can do a long SMART selftest and hope the disk
recovers the sector contents, but that is more up to luck than
not.
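
As a rough illustration of the self-test route under Linux, the sketch below builds the smartmontools commands involved. It assumes `smartctl` is installed and that the drive is `/dev/sda`, which is a placeholder, not a path taken from this thread. The helper names are hypothetical, and by default the script only prints the commands instead of running them.

```python
import subprocess


def selftest_commands(device):
    """Return smartctl invocations for a long self-test and for checking on it."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended self-test
        ["smartctl", "-l", "selftest", device],  # show the self-test log
        ["smartctl", "-A", device],              # show attributes, incl. pending sectors
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # smartctl uses its exit status as a bitmask, so a nonzero
            # status is not treated as a fatal error here.
            subprocess.run(argv, check=False)


# "/dev/sda" is an assumed device path.
run(selftest_commands("/dev/sda"))
```

The long test runs in the background on the drive itself, so the log and attribute commands only show a result once it has finished.
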
I can confirm, the system freeze is still a problem under Windows, but
there's no such problem under Linux.

I expect that the Linux driver just deals better with what the
real problem is. As both disks have pending sectors, and
the first has one after only 5 days of operation, I would
suspect an external influence, such as bad PSU, mechanical
shock or vibration.

It _is_ possible that the pending sectors are just due to
the drives being new. It is a good idea to do a long
SMART selftest on any new drive to prevent this and to
re-run the long SMART selftest every 2-4 weeks as preventative
maintenance.
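
A small sketch of how that reminder could be automated: the function below reports whether a long self-test is due again. The name `selftest_due`, the 21-day default (picked from inside the 2-4 week range) and the example dates are all assumptions for illustration.

```python
from datetime import date, timedelta


def selftest_due(last_run, today, interval_days=21):
    """True once at least interval_days have passed since the last long self-test."""
    return today - last_run >= timedelta(days=interval_days)


# Illustrative dates only.
print(selftest_due(date(2009, 3, 1), date(2009, 3, 10)))  # 9 days later
print(selftest_due(date(2009, 3, 1), date(2009, 3, 29)))  # 28 days later
```
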

However with two disks having the same issue, I would
first look at external influences.

Arno
 
Arno said:
Software (also drivers) should not be able to cause pending sectors.
My guess would be you have a problem with the disks or PSU and the
drivers are just more or less able to cope with them.

Or the Nvidia chipset is screwing them up. The original post showed a
link saying people have noticed their SATA drives getting fried when
used with an Nvidia chipset. Nvidia has a reputation for poor quality
thermals on their chipsets, and a lot of pins get unseated simply due to
the heat their chip puts out.
The only thing that works reliably is writing to the sectors in
question. You can do a long SMART selftest and hope the disk
recovers the sector contents, but that is more up to luck than
not.

The SMART self-tests now fail with a "Read element failure". Both long
and short ones.
I expect that the Linux driver just deals better with what the
real problem is. As both disks have pending sectors, and
the first has one after only 5 days of operation, I would
suspect an external influence, such as bad PSU, mechanical
shock or vibration.

None of those are likely. Why does it always start to happen when there
are two SATA drives plugged in?
It _is_ possible that the pending sectors are just due to
the drives being new. It is a good idea to do a long
SMART selftest on any new drive to prevent this and to
re-run the long SMART selftest every 2-4 weeks as preventative
maintenance.

However with two disks having the same issue, I would
first look at external influences.


An external influence like an Nvidia chipset?

Yousuf Khan
 
Or the Nvidia chipset is screwing them up. The original post showed a
link saying people have noticed their SATA drives getting fried when
used with an Nvidia chipset. Nvidia has a reputation for poor quality
thermals on their chipsets, and a lot of pins get unseated simply due to
the heat their chip puts out.

Ok, let me rephrase that: Besides bad power and bad physical
conditions (heat, vibration,...) nothing external should be
able to create bad sectors on a drive.
The SMART self-tests now fail with a "Read element failure". Both long
and short ones.

That sounds like dead or dying hardware to me.
None of those are likely. Why does it always start to happen when there
are two SATA drives plugged in?

Load on the PSU too high? Drives borderline due to mechanical
shock and the added vibration of the second one sends them over
the edge?
An external influence like an Nvidia chipset?

I highly doubt that. Of course I cannot rule out a bad design
mistake in the HDDs. But anything that comes over the data cable
should not be able to cause bad sectors.

Arno
 
Bad PSU's can be unpredictable, if the reservoir capacitor isn't
big enough, no problem with light loads, voltage stays up,
sustained heavy load, like writing large files, voltage sags,
causing corruption.

I agree. Switching regulators are inherently unstable and
getting a stable design can be a bit tricky. If a component
then degrades or fails, all sorts of bizarre
behaviour can result. Incidentally, that is one of the reasons
why high quality PSUs (e.g. Enermax) have so many protection
circuits.
Can happen with PC's as well, cheap generic PSU sold
with case, occasional BSOD's, put in a quality branded
PSU, problem solved.

Indeed.

Arno
 
Ok, let me rephrase that: Besides bad power and bad physical
conditions (heat, vibration,...) nothing external should be
able to create bad sectors on a drive.

Another possibility is a bad chipset that misses data coming over the
bus and interprets it as a bad sector.
That sounds like dead or dying hardware to me.

Yup, and unfortunately it looks like WD is going to have to bear the
brunt of the cost of the RMA again, even though it's not really its
fault. Try convincing Nvidia that their crap chipsets are destroying
hard drives, even though they already have a history of doing this.
Load on the PSU too high? Drives borderline due to mechanical
shock and the added vibration of the second one sends them over
the edge?

Previously the power supply had been handling two DVD burners, two IDE
HDDs, one SATA HDD, and an 8600GT PCIe video card with no problems.
Why would one additional SATA HDD drive it over the edge? If it was
already borderline, it would've exhibited problems even before now,
such as at startup when component loads more than double, especially
the CPU and hard drives. Even the GPU is nothing special, just a
middle of the road GPU, it's no SLI or X-fire power drainer. Also the
IDE burners and HDDs remain unchanged, working like before.
I highly doubt that. Of course I cannot rule out a bad design
mistake in the HDDs. But anything that comes over the data cable
should not be able to cause bad sectors.

Who can tell how the hardware and drivers interact with each other?
Maybe once the chipset misses some incoming data, it tries to
compensate by driving up voltages or currents going into the drives?

Yousuf Khan
 
Yousuf said:
Or the Nvidia chipset is screwing them up. The original post showed a
link saying people have noticed their SATA drives getting fried when
used with an Nvidia chipset. Nvidia has a reputation for poor quality
thermals on their chipsets, and a lot of pins get unseated simply due
to the heat their chip puts out.

Even that shouldn't produce pending sectors in a drive.

The drive itself sees nothing of the chipset.
The SMART self-tests now fail with a "Read element failure". Both long and short ones.

Then the drive is dying.
None of those are likely.

It isn't that clear-cut.
Why does it always start to happen when there are two SATA drives plugged in?

Could be that the power supply is marginal and can't
deliver the specs when both drives are plugged in.
An external influence like an Nvidia chipset?

Can't see how that can produce a pending sector or a read element failure.

Of course it might conceivably produce a read element failure if that isn't the drive SMART data.
 
YKhan wrote
Another possibility is a bad chipset that misses data
coming over the bus and interprets it as a bad sector.

Nope, that can't produce a pending sector in the drive SMART report.

That's something that the drive itself determined, that it had a problem
reading that sector and it plans to spare it on the next write to that sector.

It doesn't spare it immediately so you can try to get the data from it.
Yup, and unfortunately it looks like WD is going to have to bear the
brunt of the cost of the RMA again, even though it's not really its fault.

You don't know that last.
Try convincing Nvidia that their crap chipsets are destroying hard drives,

It isn't even possible for the chipset to produce
pending sectors, let alone a read element failure.
even though they already have a history of doing this.

No they don't.
Previously the power supply had been handling two DVD burners, two IDE
HDDs, one SATA HDD, and an 8600GT PCIe video card with no problems.

Power supplies go bad over time.
Why would one additional SATA HDD drive it over the edge?

Because it was marginal with the previous load and
that pushed it over the edge into visible symptoms.
If it was already borderline, it would've exhibited problems even before
now, such as at startup when component loads more than double,

Not necessarily. If the noise level on the 12V rail is higher
than spec, that may not interfere with the startup phase.
especially the CPU and hard drives.

Not if the hard drive isn't even being read while it's spinning up.

The CPU load doesn't vary significantly in the startup phase.
Even the GPU is nothing special, just a middle of
the road GPU, it's no SLI or X-fire power drainer.

It's the total load that matters.
Also the IDE burners and HDDs remain unchanged, working like before.

The drive could be more sensitive to out of spec rails.
Who can tell how the hardware and drivers interact with each other?

Anyone who understands what pending sectors are and the fact that
the drive determines that for itself when attempting to read the sector.
The chipset isn't even involved at all in that operation. It just deals with
the contents of the sector once the drive has got that from the platter.
Maybe once the chipset misses some incoming data, it tries to
compensate by driving up voltages or currents going into the drives?

Nope, chipsets don't work like that.
 
Now I'd like to get the Current Pending Sectors Count to go down again.
How would I go about getting them to go away? Should I schedule a
surface scan of the two disks, so that they can get read and written to
again?

A defrag will probably rewrite the unreliable sectors, whereas
Scandisk's surface scan will not write to sectors that are in use.
AIUI, if the drive retests the pending sectors and finds that they are
OK, then they will be returned to service. If they fail again, then
they will be reallocated.
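
The bookkeeping described here can be sketched as a toy model in Python. This is only an illustration of the pending/retest/reallocate idea as stated in this thread, not how any real drive firmware works; the class `ToyDrive` and its "weak" and "dead" sector sets are invented for the example.

```python
class ToyDrive:
    """Toy model of pending-sector bookkeeping; not real firmware behaviour."""

    def __init__(self, weak=(), dead=()):
        self.weak = set(weak)      # fail on read, but pass a retest after a rewrite
        self.dead = set(dead)      # fail on read and fail the retest too
        self.pending = set()       # what SMART 197 (Current_Pending_Sector) counts
        self.reallocated = set()   # what SMART 5 (Reallocated_Sector_Ct) counts

    def read(self, lba):
        """Return True if the read succeeds; a failed read marks the sector pending."""
        if lba in self.weak or lba in self.dead:
            self.pending.add(lba)
            return False
        return True

    def write(self, lba):
        """A write lets the drive retest the sector and settle its status."""
        self.pending.discard(lba)
        if lba in self.dead:
            # Retest failed: remap the sector to a spare.
            self.dead.discard(lba)
            self.reallocated.add(lba)
        else:
            # Retest passed (or the sector was fine): back in normal service.
            self.weak.discard(lba)


drive = ToyDrive(weak={100}, dead={200})
drive.read(100)
drive.read(200)
print(len(drive.pending), len(drive.reallocated))  # 2 0

drive.write(100)
drive.write(200)
print(len(drive.pending), len(drive.reallocated))  # 0 1
```

In this model a read alone never clears a pending sector; only a write does, which is why a scan that skips in-use sectors would leave the count unchanged.
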

BTW I see that the Raw Read Error Rate for SATA #2 is 873. I don't
know what this means for a WD drive, but perhaps it is significant ???
Maybe you could research other SMART reports for the same model.

FWIW, I notice that SATA #1 has 1 pending sector, indicating a read
problem, but the read error rate is 0.

- Franc Zabkar
 
Another possibility is a bad chipset that misses data coming over the
bus and interprets it as a bad sector.
Yup, and unfortunately it looks like WD is going to have to bear the
brunt of the cost of the RMA again, even though it's not really its
fault. Try convincing Nvidia that their crap chipsets are destroying
hard drives, even though they already have a history of doing this.

I find this highly unlikely, unless there is something fundamentally
wrong with the WD drives. Chipsets cannot destroy well-designed HDDs.
Previously the power supply had been handling two DVD burners, two IDE
HDDs, one SATA HDD, and an 8600GT PCIe video card with no problems.
Why would one additional SATA HDD drive it over the edge?

And why not? That is up to 20W with a power-pattern that is
very hard on the PSU. In addition, the PSU is older.
If it was
already borderline, it would've exhibited problems even before now,
such as at startup when component loads more than double, especially
the CPU and hard drives.

Not necessarily. At startup, components do not expect clean power.
Also, a defective PSU does not need to have weak voltage output.
It can fail in several other modes.
Even the GPU is nothing special, just a
middle of the road GPU, it's no SLI or X-fire power drainer. Also the
IDE burners and HDDs remain unchanged, working like before.
Who can tell how the hardware and drivers interact with each other?
Maybe once the chipset misses some incoming data, it tries to
compensate by driving up voltages or currents going into the drives?

The chipset cannot do such a thing. There simply is no circuitry
for it.

Arno
 
Franc said:
A defrag will probably rewrite the unreliable sectors, whereas
Scandisk's surface scan will not write to sectors that are in use.
AIUI, if the drive retests the pending sectors and finds that they are
OK, then they will be returned to service. If they fail again, then
they will be reallocated.

BTW I see that the Raw Read Error Rate for SATA #2 is 873. I don't
know what this means for a WD drive, but perhaps it is significant ???
Maybe you could research other SMART reports for the same model.

Which is why I suspect it's a chipset issue. This same drive was error
free prior to the addition of the second drive. The two drives are
actually the exact same model of 640GB WD drive, although the SATA#1 is
a revision or two newer.
FWIW, I notice that SATA #1 has 1 pending sector, indicating a read
problem, but the read error rate is 0.


Both drives are now unable to complete SMART reads, so I'm going to RMA
them both. But first I'm gonna get the data off of both of them, so I
bought a third drive, a Hitachi 1TB to hold their data. After I RMA
them, I'll sell them both off without even opening their anti-static bags.

Yousuf Khan
 