Drives that drop off line every couple of days

  • Thread starter: CJT

CJT

I've got a couple of WD 1200AB drives in a server as master and slave
on the same cable (not that I think it matters, but for completeness).
I'm running Solaris 9. The machine is not heavily loaded. At random
intervals of 2-3 days, one or the other (usually the slave) will audibly
click a couple of times and drop off line. The log file will reflect a
series of timeouts followed by an indication that an error has occurred,
described as --

Sense key: aborted command
error code 0x3

Once it happens, I haven't found a way to get the disk to respond again
except hard rebooting, after which everything will appear fine for a
few days. When the slave goes down, the master stays up, and vice
versa.

They're not hot. The cables are good quality. There are two other
(Maxtor) drives on the other channel that don't exhibit the problem.

The motherboard is a Shuttle AK12A with an underclocked Athlon running
at 900 MHz. It seems pretty solid otherwise.

The disks in question are not used by the system (which has its own
SCSI disk) -- they're loaded with user files being served via Samba
and NFS shares.

Any thoughts?

--
After being targeted with gigabytes of trash by the
"SWEN" worm, I have concluded we must conceal our
e-mail address. Our true address is the mirror image
of what you see before the "@" symbol. It's a shame
such steps are necessary.

Charlie
 
What size power supply?
 
Rod said:
What size power supply?

350 W

Interesting thought.

Charlie
 
CJT said:
350 W

That should be OK. It would be more likely to be the problem
if it were only, say, a 200 W supply with an Athlon and 4 hard
drives, especially with those older Athlon CPUs, which can
be a bit power-hungry and behave oddly if the
power supply is marginal.
 
CJT said:
I've got a couple of WD 1200AB drives in a server as master and slave
on the same cable (not that I think it matters, but for completeness).
I'm running Solaris 9. The machine is not heavily loaded. At random
intervals of 2-3 days, one or the other (usually the slave) will audibly
click a couple of times and drop off line. The log file will reflect a
series of timeouts followed by an indication that an error has occurred,
described as --

Sense key: aborted command
error code 0x3

That sounds like a SCSI error except there isn't a sense key of
aborted command: http://www.t10.org/lists/asc-num.htm#ASC_03
03/00 PERIPHERAL DEVICE WRITE FAULT

--
After being targeted with gigabytes of trash by the
"SWEN" worm, I have concluded we must conceal our
e-mail address. Our true address is the mirror image
of what you see before the "@" symbol. It's a shame
such steps are necessary.

That doesn't help. You have to get a new email address.
 
Folkert said:
That sounds like a SCSI error except there isn't a sense key of
aborted command: http://www.t10.org/lists/asc-num.htm#ASC_03
03/00 PERIPHERAL DEVICE WRITE FAULT

I should have made clear that these are ATA drives, not SCSI.

I think the Solaris drivers use similar terminology to SCSI, though.

Charlie
 
Mike said:
How many amps on the 12v rail? You're running five drives (4 IDE and
one SCSI), judging by your description.

Checked power management isn't spinning them down?

That is an interesting line of inquiry, because I also have some hefty
12 volt fans. I'll have to check again that I'm ok on that score. When
I first put it together, I monitored the 12V line and it appeared to be
actually slightly overvoltage, if anything, so I thought I was ok. But
it is possible that it sags occasionally, which is all it would take to
generate my symptoms. Unfortunately, I don't really know how much 12V
the motherboard takes for things like serial ports, so it's hard to do
a precise accounting.

I've always assumed that if the 12V were weak, the most likely time to
see it would be at startup rather than after a day or two of uptime, but
that could be not entirely true. I do know my 120VAC is good, because
it's regulated.

Power management shouldn't be spinning them down according to its
settings, and it would be weird if for some reason it were spinning down
only the WD drives, which are considerably less power hungry than their
Maxtor neighbors.

Thanks for your comments.

Charlie
 
CJT said:
That is an interesting line of inquiry, because I also have some hefty
12 volt fans. I'll have to check again that I'm ok on that score. When
I first put it together, I monitored the 12V line and it appeared to be
actually slightly overvoltage, if anything, so I thought I was ok.

It's the ability of the supply to deal with sudden surges in demand that
is the issue (read on.)
But it is possible that it sags occasionally, which is all it would take to
generate my symptoms. Unfortunately, I don't really know how much 12V
the motherboard takes for things like serial ports

Very little for serial ports, but some Athlon motherboards generate the
processor Vcore from the 12v line. This can cause sudden current
demands from the 12v line when the CPU starts working hard. On the
other hand, you said earlier you were running an Athlon 900, so the
current draw from this will not be as great as that for an XP. Does the
PC ever crash inexplicably?
I've always assumed that if the 12V were weak, the most likely time to
see it would be at startup rather than after a day or two of uptime, but
that could be not entirely true.

It can result in intermittent symptoms. You could try jury-rigging a
second power supply - connect it up to your 4 IDE drives and leave the
original PSU connected to the system SCSI drive, fans and motherboard.
Run the system for a while and see if the symptoms change.
 
That is an interesting line of inquiry, because I also have some
hefty 12 volt fans. I'll have to check again that I'm ok on that
score. When I first put it together, I monitored the 12V line
and it appeared to be actually slightly overvoltage, if anything,
so I thought I was ok. But it is possible that it sags occasionally,
which is all it would take to generate my symptoms.
Yep.

Unfortunately, I don't really know how much 12V
the motherboard takes for things like serial ports,

That stuff is completely trivial. The only things that
matter are the drives, the fans and the motherboard,
if it uses the 12V rail. It likely doesn't, given that it's an Athlon.
so it's hard to do a precise accounting.

That's not necessary; just count the major loads.
I've always assumed that if the 12V were weak, the most likely time
to see it would be at startup rather than after a day or two of uptime,

Correct, particularly with the IDE drives all spinning up at once.
but that could be not entirely true.

It likely still is, particularly with an Athlon that doesn't
use the 12V rail the way a P4 or a recent Celeron does.
 
CJT said:
I should have made clear that these are ATA drives, not SCSI.

Yes, I know.
I think the Solaris drivers use similar terminology to SCSI, though.

And that's why I posted the link anyway.
 
Mike said:
It's the ability of the supply to deal with sudden surges in demand that
is the issue (read on.)




Very little for serial ports, but some Athlon motherboards generate the
processor Vcore from the 12v line. This can cause sudden current
demands from the 12v line when the CPU starts working hard. On the
other hand, you said earlier you were running an Athlon 900, so the
current draw from this will not be as great as that for an XP. Does the
PC ever crash inexplicably?

No, it's quite solid otherwise.
It can result in intermittent symptoms. You could try jury-rigging a
second power supply - connect it up to your 4 IDE drives and leave the
original PSU connected to the system SCSI drive, fans and motherboard.
Run the system for a while and see if the symptoms change.

Thanks again to you and others for all your comments. I'm going to
focus on the 12V for a while. It shouldn't be hard to set up something
to monitor it continuously. Or, I might just try another, heftier PS,
and see if that makes a difference.
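
A minimal sketch of that kind of continuous check, in Python. It assumes the readings arrive as plain floating-point volts from some source not shown here (a multimeter with a serial output, a sensor chip, or values typed in by hand). `out_of_spec` and the sample readings are hypothetical names and made-up numbers, and the ±5% band is the usual ATX tolerance for the +12V rail rather than anything stated in this thread.

```python
NOMINAL_VOLTS = 12.0
TOLERANCE = 0.05  # assumption: the usual ATX +/-5% band on the +12V rail


def out_of_spec(samples, nominal=NOMINAL_VOLTS, tolerance=TOLERANCE):
    """Return (index, volts) pairs for readings outside the allowed band."""
    low = nominal * (1 - tolerance)
    high = nominal * (1 + tolerance)
    return [(i, v) for i, v in enumerate(samples) if not (low <= v <= high)]


# Made-up readings for illustration only; real ones would come from a meter.
readings = [12.21, 12.18, 11.32, 12.20, 12.65]

for index, volts in out_of_spec(readings):
    print(f"sample {index}: {volts:.2f} V is outside the +/-5% band")
```

A slow sampling loop like this will miss very short sags, so a clean log does not rule out brief dips.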

One thing I've thought of, but assume would be a bad idea because of
the startup surge load it would put on the PS, is the addition of some
capacitance out at the end of the power cables -- like what car audio
freaks do (I think they call them "stiffeners").


Charlie
 
Folkert Rienstra wrote:

That doesn't help. You have to get a new email address.

I've read that it is harvesting from the local usenet caches,
which will eventually expire, so I have some hope it might
work. I'd prefer to not change email address, for obvious
reasons. In fact, I'm only getting about half a gigabyte
per day of SWEN now (down from about double that), so perhaps
it's making a difference.


Charlie
 
CJT said:
One thing I've thought of, but assume would be a bad idea because of
the startup surge load it would put on the PS, is the addition of some
capacitance out at the end of the power cables -- like what car audio
freaks do (I think they call them "stiffeners").

Not a good idea at all. If there is a problem with the
12V rail, the best approach is a better power supply.
 
I've read that it is harvesting from the local usenet caches,
which will eventually expire, so I have some hope it might
work. I'd prefer to not change email address, for obvious
reasons. In fact, I'm only getting about half a gigabyte
per day of SWEN now (down from about double that), so perhaps
it's making a difference.

It's just starting to die out as part of its natural cycle.
You really don't need a new email addy, but you can easily munge your
addy like I have. It does cut the crap.
 
Mike said:
How many amps on the 12v rail? You're running five drives (4 IDE and
one SCSI), judging by your description.

Checked power management isn't spinning them down?

As an additional followup, I've gone through and added up the 12V
current draws for all the disk drives and fans, using the disk drive
numbers that apply while they're active (rather than startup or idle),
and I get 8.03 amps. The sticker on the power supply says it is capable
of 14 amps. That seems like a reasonable margin of safety unless the
motherboard is drawing a few amps on its own (short of measuring it, I
don't know how to determine that, and to measure it properly I'll need
to buy some parts). I haven't yet tried monitoring voltages at the
drives but that's next, along with measuring the MB 12V current.

Of course, it's also possible that the power supply isn't up to spec.
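
For anyone repeating this accounting, here is a rough sketch in Python. The 8.03 A and 14 A figures are the ones quoted above. The motherboard draws are hypothetical placeholders, since that current has not been measured, and `rail_headroom` is just an illustrative helper name.

```python
def rail_headroom(rated_amps, known_load_amps, unknown_load_amps=0.0):
    """Return (headroom in amps, fraction of the rail's rating in use)."""
    total = known_load_amps + unknown_load_amps
    return rated_amps - total, total / rated_amps


RATED_12V_AMPS = 14.0         # from the power supply sticker
DRIVES_AND_FANS_AMPS = 8.03   # active-draw total for drives and fans

# Hypothetical motherboard 12V draws; the real value is unknown here.
for mb_amps in (0.0, 1.0, 2.0, 3.0):
    headroom, used = rail_headroom(RATED_12V_AMPS, DRIVES_AND_FANS_AMPS, mb_amps)
    print(f"motherboard {mb_amps:.1f} A: headroom {headroom:.2f} A, "
          f"{used:.0%} of rating")
```

Note that this uses active-draw figures only. Spin-up current is higher and is not covered here.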

Charlie
 
I've read that it is harvesting from the local usenet caches,
which will eventually expire, so I have some hope it might
work. I'd prefer to not change email address, for obvious
reasons. In fact, I'm only getting about half a gigabyte
per day of SWEN now (down from about double that), so perhaps
it's making a difference.

Depends. If your email address gets onto one of the "confirmed" lists
that are bought and sold among spammers, you're screwed. Years ago
when the Internet was young, I was foolish enough to respond to a spam
with the suggested "Remove me" header. Oh oh...
 
FWIW, and for those who might follow in my footsteps, while I was
adjusting the wiring to allow close monitoring of the 12V I reseated
the power harness to the affected drives. It was apparently not making
a good connection, because since doing so over a week ago I haven't
experienced any problems. Monitoring indicates the 12V is precisely on
spec.

Charlie
 