OCZ SSD drives destroyed by hard reset or power off?

  • Thread starter: John Doe

John Doe

I am getting the unpleasant feeling that my SSD is dying, again.

I have done a button-style power off and one or two hard resets
recently, and now I am getting a CHKDSK error.

In any case, if the indication is correct, this will be the second
OCZ SSD that has gone belly up in my system.
 
John Doe said:
I am getting the unpleasant feeling that my SSD is dying, again.

I have done a button-style power off and one or two hard resets
recently, and now I am getting a CHKDSK error.

In any case, if the indication is correct, this will be the second OCZ
SSD that has gone belly up in my system.

Even solid-state disks are not immune to failure. I recently had one
that admittedly is not in an office environment (it is in an amateur
radio repeater system) go bad after 3 years of 24/7 operation. The
failure mode was interesting... the system was still running (CentOS
Linux) but I could no longer write to any of the filesystems. Reads
worked fine, and the kernel was still running (without logging anything,
though). Once I rebooted it, it refused to start back up. We replaced
it, and restored the system with a backup. You do have a current backup,
don't you??
 
david said:
Even solid-state disks are not immune to failure. I recently
had one that admittedly is not in an office environment (it is
in an amateur radio repeater system) go bad after 3 years of
24/7 operation. The failure mode was interesting... the system
was still running (CentOS Linux) but I could no longer write to
any of the filesystems. Reads worked fine, and the kernel was
still running (without logging anything, though). Once I
rebooted it, it refused to start back up. We replaced it, and
restored the system with a backup. You do have a current
backup, don't you??

Lots. I would be off-line for at least 30 minutes...

I suppose a diagnostics utility from OCZ might be useful, but I
haven't looked (yet).
 
John Doe said:
I am getting the unpleasant feeling that my SSD is dying, again.

I have done a button-style power off and one or two hard resets
recently, and now I am getting a CHKDSK error.

In any case, if the indication is correct, this will be the second
OCZ SSD that has gone belly up in my system.

I recently (last week) had my OCZ SSD start acting badly. It was pretty
much unusable. I contacted OCZ to go through the procedure to send it
back. Jeff, a tech guy there, emailed me to update the firmware on my
SSD. I did, and it fixed the drive. I did have to erase the drive
(sometimes you don't) but I had a backup image on another drive, so
there was no loss.

Charlie
 
david said:
Even solid-state disks are not immune to failure. I recently had one
that admittedly is not in an office environment (it is in an amateur
radio repeater system) go bad after 3 years of 24/7 operation. The
failure mode was interesting... the system was still running (CentOS
Linux) but I could no longer write to any of the filesystems. Reads
worked fine, and the kernel was still running (without logging anything,
though). Once I rebooted it, it refused to start back up. We replaced
it, and restored the system with a backup. You do have a current backup,
don't you??

What you're seeing isn't a failure per se, but rather the drive
reached the end of the expected write life. There were no sectors
left that were listed as ok to write to so the drive went read-only.
 
Loren Pechtel said:
What you're seeing isn't a failure per se, but rather the drive
reached the end of the expected write life. There were no
sectors left that were listed as ok to write to so the drive
went read-only.

Then the manufacturer's claims of 1.5 million hours mean time
between failures (MTBF) were all bogus. In that case, hopefully
they have stopped making those silly claims.
 
In message <[email protected]> someone
claiming to be John Doe said:
Then the manufacturer's claims of 1.5 million hours mean time
between failures (MTBF) were all bogus. In that case, hopefully
they have stopped making those silly claims.

I don't think MTBF means what you think it means. A sample size of 1
isn't sufficient to determine whether it's bogus or not.
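
For illustration only, here is a quick sketch of that sample-size point. It assumes purely random failures with an exponential lifetime distribution and takes the advertised 1.5 million hour figure at face value; none of the numbers come from an OCZ datasheet, and wearout is ignored entirely.

# Illustrative sketch: why a single dead drive can't confirm or refute
# an MTBF claim. Assumes random failures with an exponential lifetime
# distribution and takes the advertised 1.5 million hours at face value.
import random

MTBF_HOURS = 1_500_000
THREE_YEARS_HOURS = 3 * 24 * 365

random.seed(1)
lifetimes = [random.expovariate(1 / MTBF_HOURS) for _ in range(100_000)]
early = sum(t < THREE_YEARS_HOURS for t in lifetimes) / len(lifetimes)
print(f"Drives failing within 3 years: {early:.1%}")   # roughly 1.7%
# Even with an honest 1.5M hour MTBF, a couple of drives per hundred
# die inside 3 years, so one early failure proves nothing either way.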
 
DevilsPGD said:
I don't think MTBF means what you think it means. A sample size of 1
isn't sufficient to determine whether it's bogus or not.

Are you drunk?

 
Loren said:
What you're seeing isn't a failure per se, but rather the drive
reached the end of the expected write life. There were no sectors
left that were listed as ok to write to so the drive went read-only.

It's more likely to be a firmware issue. SSDs have their own
processor inside, and a firmware load. And the SSD is "busy"
internally, even when you aren't using it. It's pretty hard to
test that firmware and remove all the bugs from it. The
firmware has to be "correct by design", because lab testing
simply won't uncover all the bugs. (This is something I learned
from my computer design days: lab testing gets the error rate
in a design down to a certain level, but if your design
techniques suck, it shows. And that's what modern
SSDs look like to me - inadequate designs, rushed out the door.)

I thought there was some SMART stat, which kept track of write life.
Maybe people should be eyeballing that, once in a while. Or perhaps,
simply recording the SMART stats every day. Then, when the device
fails, go back and look at the stats, and see if there is any reason
to suspect it was actually media.

http://en.wikipedia.org/wiki/S.M.A.R.T.

"233 0xE9 Media Wearout Indicator

Intel SSD reports a normalized value of 100 (when the SSD
is new) and declines to a minimum value of 1. It decreases
while the NAND erase cycles increase from 0 to the
maximum-rated cycles."

So there is a way to track it. Too bad there isn't much adherence
to standards out there. You can't really rely on all drives
doing SMART exactly the same way. And that was true even with
hard drives. The important parameters may be there, but many of
the others differ from design to design.

You really need an article about SMART that compares what the
various manufacturers and controller designs are doing before
deciding whether indicators like that actually mean something.
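
Here is a minimal sketch of the "record the SMART stats every day" idea, assuming smartmontools is installed and the drive reports attribute 233 the way the Intel description above says. Per the caveat about vendors differing, many controllers use other attribute numbers, and the device node and log path are placeholders.

#!/usr/bin/env python3
# Append today's normalized Media Wearout Indicator (attribute 233)
# to a CSV, so there is a history to look back on after a failure.
# Device node and log path are placeholders.
import csv
import datetime
import subprocess

DEVICE = "/dev/sda"
LOGFILE = "smart_wear.csv"

def read_attributes(device):
    """Parse 'smartctl -A' output into {attribute_id: normalized_value}."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows begin with a numeric ID; the 4th column is the
        # normalized VALUE (starts near 100, counts down toward 1).
        if fields and fields[0].isdigit():
            attrs[int(fields[0])] = int(fields[3])
    return attrs

if __name__ == "__main__":
    wear = read_attributes(DEVICE).get(233, "n/a")
    with open(LOGFILE, "a", newline="") as f:
        csv.writer(f).writerow([datetime.date.today().isoformat(), wear])

Run it once a day from cron; if the drive dies, the log shows whether the wear indicator was anywhere near the bottom, or whether something else (firmware, controller) is the more likely culprit.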

Paul
 
John Doe said:
Then the manufacturer's claims of 1.5 million hours mean time
between failures (MTBF) were all bogus. In that case, hopefully
they have stopped making those silly claims.

That's not a failure, that's exhausting the media life. The drive
worked as designed.
 
Loren Pechtel said:
That's not a failure, that's exhausting the media life. The drive
worked as designed.

Apparently you don't know how long 1.5 million hours is. It's
approximately 171 years.
 
Loren said:
That's not a failure, that's exhausting the media life. The drive
worked as designed.

MTBF is intended to help predict how many additional items
should be kept in the stockroom. It's not a statement
that "this drive will last for 57 years".

http://en.wikipedia.org/wiki/Annualized_failure_rate

You should not "consume" an MTBF number without knowing
the assumptions that went into it. When you see a figure
like that on an SSD datasheet, just ignore it. (If they
won't tell you how they arrived at the number, then the
number is meaningless.)

*******

Since wearout is a known factor, it should not be
included in the MTBF number. A person wishing to know
how many spare SSDs to keep in the stockroom works out
the rate they're writing data to the pool of drives, and
buys extra drives (per year) to account for that. Then
the MTBF number is used to supplement that purchase number
by a few extra drives that account for random failure of
supporting electronics components on the SSD PCB
controller card.

If I was Google, perhaps I'd buy 100,000 SSD drives for
my server room. Based on the terabytes per day average
write rate to the drives, I buy an extra 2000 drives per
year, to account for wearout due to flash write-life.
And using the MTBF number, I may end up buying an extra
100 drives per year, to account for random failures of
things like power regulator chips on the SSD PCB or
failures of the flash controller chip, or failures of
the copper tracks on the SSD PCB. So I end up buying 2100
spares per year, and the MTBF number made a small contribution
to my purchase order. So in terms of a maintenance
budget, I need to budget for 2100 drives every year, in
addition to the original purchase of 100,000 units.
That's how you'd use that information.
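
A back-of-the-envelope version of that spares math, with every input invented for illustration: the write rate, endurance rating, and MTBF below are not OCZ figures, and the output doesn't try to reproduce the exact numbers above, only the shape of the calculation.

# Rough spares estimate in the spirit of the example above.
# All inputs are invented for illustration.
HOURS_PER_YEAR = 24 * 365            # 8760; 1,500,000 / 8760 is ~171 "years"

fleet_size = 100_000                 # drives in the server room
tb_written_per_drive_year = 10       # average write rate (TB/year/drive)
rated_endurance_tb = 500             # rated write endurance (TB/drive)
mtbf_hours = 1_500_000               # advertised MTBF for the electronics

# Wearout is predictable: a drive lasts (endurance / write rate) years,
# so a fixed slice of the fleet is consumed every year.
drive_life_years = rated_endurance_tb / tb_written_per_drive_year    # 50
wearout_spares = fleet_size / drive_life_years                       # 2000

# The MTBF figure only covers random failures (controller, regulators,
# PCB), approximated here by the annualized failure rate hours/MTBF.
afr = HOURS_PER_YEAR / mtbf_hours                                    # ~0.58%
random_failure_spares = fleet_size * afr                             # ~584

print(f"wearout replacements per year : {wearout_spares:.0f}")
print(f"random-failure spares per year: {random_failure_spares:.0f}")
print(f"total to budget per year      : {wearout_spares + random_failure_spares:.0f}")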

Paul
 
Paul said:
MTBF is intended to help predict how many additional items
should be kept in the stockroom. It's not a statement that "this
drive will last for 57 years".

You should be telling the manufacturers that. They are using, or
were using, those numbers to sell their products.

By the way, 1.5 million hours is 171 years at 24/7. Hopefully you
are not suggesting that any ordinary consumer would know that an
advertised "1.5 million MBTF" could mean a life expectancy of only
three years.
 
John said:
You should be telling the manufacturers that. They are using, or
were using, those numbers to sell their products.

By the way, 1.5 million hours is 171 years at 24/7. Hopefully you
are not suggesting that any ordinary consumer would know that an
advertised "1.5 million MBTF" could mean a life expectancy of only
three years.

Well, I wouldn't know what the number meant, without a list of
assumptions.

The method for working this stuff out starts with a MIL-spec.
For example, the page below mentions MIL-HDBK-217F. We had a
department that did nothing but this stuff.

http://www.halthass.co.nz/reliability/services/mtbf-calc.htm

But once you get past the guideline stage, it's pretty
much open-ended as to how you abuse that spec. If you choose
to throw certain parts of the design out of the analysis,
you can get very good numbers. I suspect that's how the
SSD numbers are done.

Paul
 
John Doe said:
Apparently you don't know how long 1.5 million hours is. It's
approximately 171 years.

This isn't a drive failure. It's operation as designed at the end of
its write life.
 