Do SSD drives really fail a lot ?

  • Thread starter: Lynn McGuire

http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html

"… I feel ethically and morally obligated to let you in on a
dirty little secret I've discovered in the last two years of
full time SSD ownership. Solid state hard drives fail. A lot.
And not just any fail. I'm talking about catastrophic,
oh-my-God-what-just-happened-to-all-my-data instant gigafail.
It's not pretty. "

LM omitted from the next page:
"Solid state hard drives are so freaking amazing performance wise, and the
experience you will have with them is so transformative, that I don't even
care if they fail every 12 months on average! I can't imagine using a
computer without a SSD any more; it'd be like going back to dial-up internet
. . . "
 
Lynn McGuire said:
"… I feel ethically and morally obligated to let you in on a
dirty little secret I've discovered in the last two years of
full time SSD ownership. Solid state hard drives fail. A lot.
And not just any fail. I'm talking about catastrophic,
oh-my-God-what-just-happened-to-all-my-data instant gigafail.
It's not pretty. "
Lynn

It depends on your usage pattern and the SSD. Failure rate is
a designed feature with SSDs, i.e. the manufacturers know pretty
well how much writing an SSD can take. Through wear-leveling
and spare capacity, they can design for a specific write load
that kills a drive. In the beginning this process is shaky,
though, and whole drive series can have worse reliability.

The typical reliability design goal is a 5% failure rate
per year for an average usage pattern. Consumers are willing
to tolerate that. That is a real failure rate, but it is
not "all the time". There are people who think that because
SSDs are not susceptible to mechanical damage, they can do
without backups. Those people will lose their data, no matter
what storage medium it is on, until some day no more money can
be saved by aiming for that 5% and reliability slowly goes up.
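To put the 5% per year figure into perspective, here is a minimal sketch in Python. It assumes a constant failure rate that is independent from year to year, which is a simplification (real drives tend to fail more often when very new or very old). The function name is made up for illustration.

```python
def survival_probability(annual_failure_rate: float, years: int) -> float:
    """Chance that one drive is still alive after `years` years,
    assuming the same independent failure rate every year."""
    return (1.0 - annual_failure_rate) ** years


if __name__ == "__main__":
    for years in (1, 3, 5):
        alive = survival_probability(0.05, years)
        print(f"after {years} year(s): {alive:.1%} of drives expected to survive")
```

Under these assumptions about 23% of drives would be gone after five years, which is a real loss rate but still far from "all the time".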

That said, I think the Coding Horror author (who has some
pretty nice things about coding in his blog) has a sample of
mostly early models. These, like any new technology, have
increased failure rates, as the manufacturers try to aim
for that 5%/year but make mistakes in the process. It could
also just be a statistical anomaly.
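One way to judge whether a cluster of failures could be a statistical anomaly is a binomial tail probability. The sketch below assumes every drive fails independently with the same probability, which would not hold if the drives shared a bad PSU or a hot case. The sample size of 10 is a made-up example, not a figure from the blog post.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail: chance that k or more of n drives fail,
    each failing independently with probability p."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p_year = 0.05   # the 5%/year design goal mentioned above
    n_drives = 10   # hypothetical sample size
    for k in (1, 2, 3):
        chance = prob_at_least(k, n_drives, p_year)
        print(f"{k} or more of {n_drives} failing within a year: {chance:.4f}")
```

The further the observed count sits out in the tail, the less plausible "just bad luck" becomes compared to the other explanations.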

There is one additional thing: SSDs are susceptible to
heat, just like any other electronics, and to bad power.
It is possible that the guy with the 8 of 8 dead drives
just killed them by overheating or by voltage spikes
from a cheap/bad PSU. For heat, the rule of thumb for
semiconductors is half the lifetime for every 10C increase,
and this works pretty well. I have seen it several times now,
once on a 22-unit network card sample. As SSDs contain power
circuitry, some parts of them run much hotter (step-up
regulators for converting 5V to the write voltage needed), and
a lifetime of 5 years is typically calculated at 40C
environmental temperature. Run them at 60C and you get 1.25
years average lifetime. Another example: memory and logic chips
have something like 30 years at 25C (figure from a very old
Intel databook). Run them at 65C and you get around 2 years
lifetime. That means you get the first failures (depending on
sample size) after 1-1.5 years and after 3 years most are
dead. This incidentally was my initial measurement and
prediction for the 22 network cards, and it is what then
happened. Note that high-performance CPUs are different, as
they are designed more like power semiconductors. But chipsets
are not. I have seen several fail from inadequate cooling
in 1-3 years.
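The halving rule of thumb above can be written down as a short Python sketch. The function and parameter names are my own, and the rule is only an approximation, not a datasheet formula.

```python
def derated_lifetime_years(rated_years, rated_temp_c, actual_temp_c, halving_step_c=10.0):
    """Rule of thumb: lifetime halves for every `halving_step_c` degrees
    Celsius above the temperature the rated lifetime was calculated for."""
    halvings = (actual_temp_c - rated_temp_c) / halving_step_c
    return rated_years * 0.5 ** halvings


if __name__ == "__main__":
    # SSD example: 5 years at 40C, run at 60C (two halvings)
    print(derated_lifetime_years(5, 40, 60))
    # Memory/logic example: 30 years at 25C, run at 65C (four halvings)
    print(derated_lifetime_years(30, 25, 65))
```

The two calls reproduce the figures in the post: 1.25 years for the SSD case and 1.875 years (the "around 2 years") for the memory and logic chips.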

There is one other effect at work here: A lot of people
expected SSDs to be much more reliable than HDDs.
They are not in general, see above. This can lead
to disappointments causing overstatement of the problem.

Altogether, I don't believe we are seeing more than
early-adopter problems, and they are always the same.
Also, there are certainly cheap SSDs and better
SSDs, just like always, and it is possible to treat SSDs
well or badly.

Arno
 

The most common reason for failure (90%) in flash drives appears to be
translator corruption (damaged lookup tables), especially if the power
fails while the translator is being updated. Afterwards the drive
powers up in safe mode with a very small capacity.

What are the Flash drives' typical failures [Public Forum]:
http://www.salvationdata.com/forum/topic1873.html

I suspect that SSDs may be similarly affected. Perhaps that's why some
newer models have large super capacitors for power backup.

- Franc Zabkar
 

Be wary of the new Intel SSD 320 series. Currently, there's a bug in the
controller that can cause the device to revert to 8MB during a power
failure. AFAIK they have not yet publicly announced it, and won't have a
firmware fix ready for release until the end of July.

We had an SSD 320 600GB 2.5" SATA drive in for evaluation from our Intel
rep. I was able to kill it in two or three hours by power cycling it.
Apparently (according to the Intel rep) when the power failure is
happening, the SSD device tries to reconnect with the SATA port instead of
initiating a proper shutdown. Something to do with interrupt priority
being higher for reconnection rather than a proper shutdown.

I was able to kill their 80GB device as well. We've sent both drives back
to Intel and they're going to give us their pre-release firmware for
testing.
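For anyone wanting to repeat this kind of test, the check after each power cycle could look roughly like the Python sketch below. It assumes a Linux host, where sysfs reports the device size in 512-byte units, and it only covers detecting the collapsed capacity. Cutting and restoring power needs external hardware and is not shown. The function names and the 50% tolerance are illustrative choices, not part of the test platform described here.

```python
from pathlib import Path

SECTOR_BYTES = 512  # Linux sysfs reports /sys/block/<dev>/size in 512-byte units


def reported_capacity_bytes(device: str, sysfs_root: str = "/sys/block") -> int:
    """Read the capacity the kernel currently reports for a block device."""
    sectors = int((Path(sysfs_root) / device / "size").read_text().strip())
    return sectors * SECTOR_BYTES


def looks_reverted(capacity_bytes: int, expected_bytes: int, tolerance: float = 0.5) -> bool:
    """True if the drive reports far less capacity than it should,
    e.g. 8MB instead of the full size."""
    return capacity_bytes < expected_bytes * tolerance


def check_after_power_cycle(device: str, expected_bytes: int, sysfs_root: str = "/sys/block") -> str:
    """Return a one-line verdict to log after each power cycle."""
    capacity = reported_capacity_bytes(device, sysfs_root)
    state = "REVERTED" if looks_reverted(capacity, expected_bytes) else "ok"
    return f"{device}: {capacity} bytes reported, {state}"
```

A call such as `check_after_power_cycle("sdb", 600 * 10**9)` would be run once the drive is back online, with `sdb` standing in for whatever name the drive under test gets.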
 
JW said:

Be wary of the new Intel SSD 320 series. Currently, there's a bug in the
controller that can cause the device to revert to 8MB during a power
failure. AFAIK they have not yet publicly announced it, and won't have a
firmware fix ready for release until the end of July.
We had an SSD 320 600GB 2.5" SATA drive in for evaluation from our Intel
rep. I was able to kill it in two or three hours by power cycling it.
Apparently (according to the Intel rep) when the power failure is
happening, the SSD device tries to reconnect with the SATA port instead of
initiating a proper shutdown. Something to do with interrupt priority
being higher for reconnection rather than a proper shutdown.
I was able to kill their 80GB device as well. We've sent both drives back
to Intel and they're going to give us their pre-release firmware for
testing.

Interesting. Goes to show that firmware development is apparently
not done any better than other software development. I am tempted
to run my next SSD through similar tests before using it.

Arno
 

The Pre-release firmware also had the problem. I ended up supplying Intel
SSD engineering with my test platform and they reproduced the problem and
have a fix pending. See:
http://communities.intel.com/thread/24121?tstart=0

The firmware is not yet released however.

Looks like this Usenet thread caused quite a bit of commotion on their
forum:
http://communities.intel.com/thread/22227?tstart=0

:)
 
JW said:
> The Pre-release firmware also had the problem. I ended up supplying Intel
> SSD engineering with my test platform and they reproduced the problem and
> have a fix pending. See:
> http://communities.intel.com/thread/24121?tstart=0

This is rather pathetic on their side (not so at all on your side,
obviously).

> The firmware is not yet released however.
>
> Looks like this Usenet thread caused quite a bit of commotion on their
> forum:
> http://communities.intel.com/thread/22227?tstart=0

Understandable. The conclusion can only be to stay away from
Intel SSDs for the next few years, until they have
demonstrated that they have their QA under control and have
started to take the data safety of their customers seriously.

It also underlines something I have been saying for a while,
namely that SSDs should be regarded as less reliable than HDDs at
this time, because of engineering screw-ups like this one.

My SSDs are either in a RAID with non-SSDs (with "write mostly"
that gives SSD read-speeds under Linux software RAID) or do
not have critical data on them.

Arno
 
>> I don't even care if they fail every 12 months
if you get an ssd to last 12 months that is a miracle!

this is the lifespan of all the ssds i've installed:
ocz solid - 47hrs
ocz vertex - 3 months
ocz agility - 11 months

compare that to the mechanical drives i've installed, about 40 over the last 15 years: only one developed a corrupt sector, which warranted backing up all my precious data and then reformatting to fix it. when ssds fail, they fail big time with no warning and you are left with a brick. files - gone, emails - gone, windows - gone - all in a flash.

however it's my own fault. the limitations of flash memory are well known. you can rewrite flash memory cells only about 3k (cheap stuff, 40p/gb) to 100k (expensive stuff, $10/gb) times before they freeze up and can never be written again. of course my ssds are all going to fail - that's the nature of flash memory, which is what an ssd is!
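For a rough idea of what those rewrite limits mean in calendar time, here is a back-of-the-envelope sketch in Python. It assumes ideal wear leveling that spreads writes evenly over the whole drive. The drive size, daily write volume and write amplification factor are made-up example values, not measurements from the drives listed above.

```python
def wearout_years(capacity_gb, pe_cycles, host_writes_gb_per_day, write_amplification=1.0):
    """Years until every cell has used up its rated rewrite cycles,
    assuming writes are spread evenly across the whole drive."""
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_writes_gb / (host_writes_gb_per_day * 365.0)


if __name__ == "__main__":
    # Hypothetical drive: 60 GB of 3k-cycle flash, 20 GB written per day,
    # with each host write costing two flash writes internally.
    print(f"{wearout_years(60, 3000, 20, 2.0):.1f} years")
```

The estimate scales linearly with capacity and cycle count and inversely with daily writes, so any of the assumed values can be swapped for real ones.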

"Lynn McGuire" wrote:

> Do SSD drives really fail a lot ?
>
> http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
>
> "… I feel ethically and morally obligated to let you in on a
> dirty little secret I've discovered in the last two years of
> full time SSD ownership. Solid state hard drives fail. A lot.
> And not just any fail. I'm talking about catastrophic,
> oh-my-God-what-just-happened-to-all-my-data instant gigafail.
> It's not pretty. "


LM omitted from the next page:
"Solid state hard drives are so freaking amazing performance wise, and the
experience you will have with them is so transformative, that I don't even
care if they fail every 12 months on average! I can't imagine using a
computer without a SSD any more; it'd be like going back to dial-up internet
. . . "


--
Don Phillipson
Carlsbad Springs
(Ottawa, Canada)
 