Utility to test IDE cable connections?

  • Thread starter: David R
If people simply refuse to buy systems without ECC memory that
silliness will disappear. The point is that non-ECC memory
systems are completely vulnerable to such things as Cosmic Rays,
without any immediate warnings, and that a complete cure is
available for very moderate cost.

Nonsense, actually paying for ECC memory is quite unnecessary. Simply
relocate to a cave with at least 20m of rock above you (check it for
radioactivity, though).

By the way, the inherent cost of ECC (or parity memory) is an increase
in price of at least 1/8 -- for every 8 data chips you need one
redundant data chip to recover from errors. (Unless techniques have
changed since I looked). I don't know if there are any other significant
per-machine expenses once the ECC algorithms are integrated into chips.
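
For what it's worth, the 1/8 figure falls out of the check-bit arithmetic for the single-error-correct, double-error-detect (SECDED) codes that typical ECC modules use over a 64-bit word. A minimal sketch (my own illustration, not from anyone in this thread):

def secded_check_bits(data_bits):
    # Smallest r with 2**r >= data_bits + r + 1 gives single-error correction;
    # one extra overall parity bit adds double-error detection (SECDED).
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

for n in (8, 32, 64):
    r = secded_check_bits(n)
    print(f"{n} data bits -> {r} check bits ({r / n:.1%} overhead)")

For a 64-bit word that gives 8 check bits, 12.5%, i.e. one extra chip per eight -- the same ratio as plain byte parity, which can only detect errors, not correct them.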

Though, of course, you'll find ECC memory designed for your file server
(and covered by its warranty) to cost quite a bit more.

Best wishes,
 
David said:
I think he means that the odds of the stated symptom being caused by
cosmic rays is much less than more conventional sources. So much less
that it falls into the 'mystical' category.



That rate comes to 1 per 19 years, assuming 24/7 operation.

That's how many such errors per hour, taken over all the PCs in use?


Hey, if
you're lucky it was off when that one came through ;) Or you were doing
any of the 90% of the time non critical things people normally use a
home PC for.
Is there any reason not to use ECC besides some cost and a very small
loss of performance?


Two good reasons.
I suppose this comes down to what a "home computer" is. Some may be
used to play games and write letters; others may archive a lifetime's
worth of work.


It comes down to more than that. The odds of it happening and the
consequences if it does (which is an entire probability set of its own)
vs the cost of taking preventative measures.

Once in 19 years is a rather rare event and even if it happened that
doesn't mean you automatically lose 'important' data. It would have to
occur at a particular time that affected a particular thing in a
particular manner. Ok, so maybe I lost a 'pixel' in a picture of the pooch
or it blew a character in one of those wonderful SPAM emails that come with
garbled text to begin with. Odds are the real impact [pun intended]
would be 'erp', unexplained program error, a few curse words about
'microsoft software', and restart [as if THAT never happens even without
the help of cosmic rays].

For the typical home user, the odds of losing EVERY thing from a hard
drive failure, combined with the traditionally lousy backup regimen, or
some other failure that causes the system to go 'nuts' is much, much,
higher than worrying about cosmic rays. The odds are higher it'll get
bumped at an inopportune time, or that a component will fail, or that a
connector will work loose from thermal creep, or any number of things.
Hell, the odds of the user screwing his data up HIMSELF is a thousand
times higher.

And we didn't even touch on getting a virus.
If not ECC memory, there is advantage to using parity-checked memory;
a memory error should cause the computer to halt with a warning,
rather than corrupting files.
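
As a toy illustration (mine, not from the thread) of why parity can only halt rather than repair: one parity bit per byte tells the memory controller that some bit flipped, but not which one, so the best it can do is flag the error and stop.

def even_parity(byte):
    # parity bit stored alongside each byte at write time
    return bin(byte).count("1") & 1

data = 0b10110010
stored_parity = even_parity(data)

corrupted = data ^ 0b00000100          # a single bit flips in memory
if even_parity(corrupted) != stored_parity:
    raise SystemError("memory parity error: detected, but cannot be corrected")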


I agree, if one is using it to calculate warp drive trajectories and an
'oops' may put you inside a sun somewhere. But then I'd be recommending
multiple redundant systems too.
 
CJT said:
That's how many such errors per hour, taken over all the PCs in use?

Why are you worried about someone else's PC, much less the 'collective'?
You Borg?
Hey, if
you're lucky it was off when that one came through ;) Or you were
doing any of the 90% of the time non critical things people normally
use a home PC for.
Nor do I agree that ECC should be used for an at home desktop.




Is there any reason not to use ECC besides some cost and a very small
loss of performance?



Two good reasons.
I suppose this comes down to what a "home computer" is. Some may be
used to play games and write letters; others may archive a lifetime's
worth of work.



It comes down to more than that. The odds of it happening and the
consequences if it does (which is an entire probability set of its
own) vs the cost of taking preventative measures.

Once in 19 years is a rather rare event and even if it happened that
doesn't mean you automatically lose 'important' data. It would have to
occur at a particular time that affected a particular thing in a
particular manner. Ok, so maybe I lost a 'pixel' in a picture of the pooch
or it blew a character in one of those wonderful SPAM emails that come
with garbled text to begin with. Odds are the real impact [pun
intended] would be 'erp', unexplained program error, a few curse words
about 'microsoft software', and restart [as if THAT never happens even
without the help of cosmic rays].

For the typical home user, the odds of losing EVERY thing from a hard
drive failure, combined with the traditionally lousy backup regimen,
or some other failure that causes the system to go 'nuts' is much,
much, higher than worrying about cosmic rays. The odds are higher
it'll get bumped at an inopportune time, or that a component will
fail, or that a connector will work loose from thermal creep, or any
number of things. Hell, the odds of the user screwing his data up
HIMSELF is a thousand times higher.

And we didn't even touch on getting a virus.
If not ECC memory, there is advantage to using parity-checked memory;
a memory error should cause the computer to halt with a warning,
rather than corrupting files.



I agree, if one is using it to calculate warp drive trajectories and
an 'oops' may put you inside a sun somewhere. But then I'd be
recommending multiple redundant systems too.
 
David said:
Why are you worried about someone else's PC, much less the 'collective'?
You Borg?

I'm not. I was trying to highlight that with a shift of focus, it's not
so much how long it takes to encounter an error as whether or not you're
one of the unlucky ones this hour.
Hey, if
you're lucky it was off when that one came through ;) Or you were
doing any of the 90% of the time non critical things people normally
use a home PC for.

Nor do I agree that ECC should be used for an at home desktop.





Is there any reason not to use ECC besides some cost and a very
small loss of performance?




Two good reasons.

I suppose this comes down to what a "home computer" is. Some may be
used to play games and write letters; others may archive a
lifetime's worth of work.




It comes down to more than that. The odds of it happening and the
consequences if it does (which is an entire probability set of its
own) vs the cost of taking preventative measures.

Once in 19 years is a rather rare event and even if it happened that
doesn't mean you automatically lose 'important' data. It would have
to occur at a particular time that affected a particular thing in a
particular manner. Ok, so maybe I lost a 'pixel' in a picture of the
pooch or it blew a character in one of those wonderful SPAM emails that
come with garbled text to begin with. Odds are the real impact [pun
intended] would be 'erp', unexplained program error, a few curse
words about 'microsoft software', and restart [as if THAT never
happens even without the help of cosmic rays].

For the typical home user, the odds of losing EVERY thing from a hard
drive failure, combined with the traditionally lousy backup regimen,
or some other failure that causes the system to go 'nuts' is much,
much, higher than worrying about cosmic rays. The odds are higher
it'll get bumped at an inopportune time, or that a component will
fail, or that a connector will work loose from thermal creep, or any
number of things. Hell, the odds of the user screwing his data up
HIMSELF is a thousand times higher.

And we didn't even touch on getting a virus.

If not ECC memory, there is advantage to using parity-checked
memory; a memory error should cause the computer to halt with a
warning, rather than corrupting files.




I agree, if one is using it to calculate warp drive trajectories and
an 'oops' may put you inside a sun somewhere. But then I'd be
recommending multiple redundant systems too.
 
Completely pointless to use ECC on systems that get booted every day.
Yes. A white paper I've recently found that was published last
January indicates that a PC with 512MB of memory running
24 hours a day will sustain a memory error on an average of
about every 10 days. See:
http://www.tezzaron.com/about/papers/Soft Errors 1_1 secure.pdf ,
Appendix B, Calculations, on page 6.
More interesting is the table on page 2. It shows error rates for 1-2GB DRAM
of between 1/week and 2-4/year. The median is about 1/month then.

Windows simply does not write out memory that is untouched for a month. Dirty
pages get written to disk in seconds. The only way you see disk corruption is
with persistent memory or data-path errors.

Most likely the error is in a code page, which never gets written back. All you
get is an application that won't run.

PS, they say cosmic rays are neutrons/protons. The former do not affect
memory. Only gamma rays can reach the chip and cause errors.
 
Michael said:
What's "mystical" about cosmic rays? They do reach earth. Cosmic rays
and local radioactive decay can and do do cause computer memory errors
(IBM Journal of Research and Development, Volume 40, Number 1).

A test made by IBM on a 4Mbit DRAM found a soft error rate of about 6000
in a billion chip hours. A similar test in a vault under 20 tons of rock
produced no errors.
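
For scale, that rate works out (my arithmetic, assuming a single chip powered 24/7) to roughly one soft error per 19 years per chip, which looks like the source of the "1 per 19 years" figure quoted earlier:

errors_per_billion_chip_hours = 6000
hours_between_errors = 1e9 / errors_per_billion_chip_hours   # about 166,667 hours
print(hours_between_errors / (24 * 365.25))                   # about 19 years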


Is there any reason not to use ECC besides some cost and a very small
loss of performance?

I suppose this comes down to what a "home computer" is. Some may be used
to play games and write letters; others may archive a lifetime's worth
of work.

If not ECC memory, there is advantage to using parity-checked memory; a
memory error should cause the computer to halt with a warning, rather
than corrupting files.

With a 32 bit machine there is no cost advantage to using parity over
ECC--the reason ECC capability is common in 32 bit machines is that it can
be achieved with the 36 bits that you get in 4 bytes each with parity.

Personally I don't see why there's such strenuous objection to ECC by some
folks--anybody who has gotten burned by a bad bit in RAM once becomes a
believer.
 
Eric said:
Completely pointless to use ECC on systems that get booted every day.

Why? Booting does not do anything about hard errors.
More interesting is the table on page 2. It shows error rates for 1-2GB
DRAM of between 1/week and 2-4/year. The median is about 1/month then.

Windows simply does not write out memory that is untouched for a month.
Dirty pages get written to disk in seconds. The only way you see disk
corruption is with persistent memory or data-path errors.

It's amazing what happens when there's a stuck bit in the region from which
those dirty pages get written.
Most likely the error is in a code page, which never gets written back. All
you get is an application that won't run.

Or corrupted data.
PS, they say cosmic rays are neutrons/protons. The former do not affect
memory. Only gamma rays can reach the chip and cause errors.

Cosmic rays are high energy particles that seldom reach the surface--what
reaches the surface is a particle cascade that results when the cosmic ray
strikes an atom of oxygen, nitrogen, or one of the other substances in the
atmosphere and causes that atom to fission (and now you're going to get off
on some half-baked physics tangent that ignores energy levels). Those
particles end up with a shitload of momentum picked up from the cosmic
ray--they're slow compared to the cosmic ray but they're still highly
energetic, with energies in the millions of electron volts. And they may
make their own cascades. When a particle from one of those cascades hits a
solid object, it usually cascades further--what hits the junction in the
chip is the product of one of those cascades, which may be any of a number
of subatomic particles and may still have quite high energy. While a
neutron spontaneously emitted from a decaying atom might not have any
effect on a memory location, one with an energy in the hundreds of
thousands of electron volts is a different story.
 
Eric Gisin said:
Completely pointless to use ECC on systems that get booted every day.

More interesting is the table on page 2. It shows error rates for 1-2GB DRAM
of between 1/week and 2-4/year. The median is about 1/month then.

The author of the paper summarizes that table by saying, "Judging
from these reports, 1000 to 5000 FIT [Failures In Time] per Mbit
seems to be a reasonable SER [Soft Error Rate] for modern
memory." Personally, I think the author is more credible than you.
1000 FIT per Mbit works out to an average of about one error
per 10 days for a computer running 24 hours a day. For 5000
FIT per Mbit, it works out to about an error every 2 days.
Thus, the author's estimate on page 6 that I quoted is rather
conservative.
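
A quick check of that arithmetic (mine, not quoted from the paper; a FIT is one failure per 10^9 device-hours):

MBIT = 512 * 8                        # 512MB of DRAM expressed in Mbit
for fit_per_mbit in (1000, 5000):
    failures_per_hour = fit_per_mbit * MBIT / 1e9
    days_between_errors = 1 / failures_per_hour / 24
    print(f"{fit_per_mbit} FIT/Mbit -> one error every {days_between_errors:.1f} days")

which comes out to roughly one error every 10 days at 1000 FIT/Mbit and every 2 days at 5000 FIT/Mbit, matching the figures above.
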
Windows simply does not write out memory that is untouched for a month. Dirty
pages get written to disk in seconds. The only way you see disk corruption is
with persistent memory or data-path errors.
Nonsense.


Most likely the error is in a code page, which never gets written back. All you
get is an application that won't run.

More nonsense. Even when that is true, a memory error
could modify a machine code instruction, which in turn could
cause hard drive or registry corruption.

Also, a prime time for hard drive corruption to occur due
to memory errors is when defragmenting a hard drive.
PS, they say cosmic rays are neutrons/protons. The former do not affect
memory. Only gamma rays can reach the chip and cause errors.

So?

-- Bob Day
 
David said:
.... snip ...

Yes. Soft error rates for one type of RAM don't necessarily
extrapolate directly, either per chip-hour/MB or across different chip
technologies/densities.

As a case in point, these folks say they've licked it (almost)
entirely, for SRAMs anyway, using, amusingly enough, DRAM
technology.

http://neasia.nikkeibp.com/wcs/leaf/CID/onair/asabt/news/315057

Since you were asking about it I did a quick google for some
more recent numbers on DRAM and found this rather interesting
article, which includes some analysis on the things I mentioned,
like 'what is it doing when' and whether it's fatal, etc.

http://www.eecg.toronto.edu/~lie/papers/hp-softerrors-ieeetocs.pdf

It's interesting to note that early on they mention an error
rate for 'modern' 64MB ram chips in a 1 gig memory and while the
"300 reboots resulting from soft errors on 10000 machines in 1
year" sounds like a large number, because they're looking at how
'the industry' is affected, it translates to 1 per 33 years for
a 'user' on his one machine. (unfortunately they don't make it
clear if that's errors that get THROUGH the ECC or if they're
using the ECC to count the errors [my assumption])
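
(For the arithmetic: 10,000 machines running for a year is 10,000 machine-years, and 300 reboots spread over that is one reboot per 33 machine-years, i.e. about once in 33 years for a single always-on machine.)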

Btw, CPUs have the same susceptibility.

Not really. The storage on a CPU is much more likely to be
static, rather than a dynamic cell which depends on the charge on
a minuscule capacitor, holding a countable number of electrons.
On chip caches though are a vulnerable point. At any rate, just
the physical area devoted to on chip memory, compared with main
memory, makes the probability of problems much smaller on the
CPU. All these soft-errors are probabilistic things.

Thanks for the references. Will get them later.
 
CBFalconer said:
Alex Fraser wrote: [snip]
The fact that the majority of chipsets on motherboards used for
desktop systems can't take advantage of ECC makes it largely a
moot point anyway.

If people simply refuse to buy systems without ECC memory that
silliness will disappear.

True, but I don't see it happening, because I think that most people will
see no tangible benefit. Regardless of what actually is the case, it feels
to me - and I guess most people - like there is a huge discrepancy between
papers such as the one posted elsethread and reality: just how do all these
soft errors go unnoticed?
The point is that non-ECC memory
systems are completely vulnerable to such things as Cosmic Rays,
without any immediate warnings, and that a complete cure is
available for very moderate cost.

Not complete. But quite a bit better ;).

Alex
 
Eric said:
Completely pointless to use ECC on systems that get booted every day.

nonsense


More interesting is the table on page 2. It shows error rates for 1-2GB DRAM
of between 1/week and 2-4/year. The median is about 1/month then.

Windows simply does not write out memory that is untouched for a month. Dirty
pages get written to disk in seconds. The only way you see disk corruption is
with persistent memory or data-path errors.

Most likely the error is in a code page, which never gets written back. All you
get is an application that won't run.

PS, they say cosmic rays are neutrons/protons. The former do not affect
memory. Only gamma rays can reach the chip and cause errors.

wrong again
 
Alex said:
Alex Fraser wrote:
[snip]
The fact that the majority of chipsets on motherboards used for
desktop systems can't take advantage of ECC makes it largely a
moot point anyway.

If people simply refuse to buy systems without ECC memory that
silliness will disappear.


True, but I don't see it happening, because I think that most people will
see no tangible benefit. Regardless of what actually is the case, it feels
to me - and I guess most people - like there is a huge discrepancy between
papers such as the one posted elsethread and reality: just how do all these
soft errors go unnoticed?

I think most people just blame what they see on Windows (a plausible
assumption) and shrug. Then disaster hits, and they blame it on
themselves, well-trained lemmings that they are.
 
Bob said:
Eric Gisin said:
.... snip ...

More interesting is the table on page 2. It shows error rates
for 1-2GB DRAM of between 1/week and 2-4/year. The median is
about 1/month then.

The author of the paper summarizes that table by saying, "Judging
from these reports, 1000 to 5000 FIT [Failures In Time] per Mbit
seems to be a reasonable SER [Soft Error Rate] for modern
memory." Personally, I think the author is more credible than you.
1000 FIT per Mbit works out to an average of about one error
per 10 days for a computer running 24 hours a day. For 5000
FIT per Mbit, it works out to about an error every 2 days.
Thus, the author's estimate on page 6 that I quoted is rather
conservative.
Windows simply does not write out memory that is untouched for
a month. Dirty pages get written to disk in seconds. The only
way you see disk corruption is with persistent memory or data-path errors.

Nonsense.

The point is that the memory need not be quiescent for any
particular period - the critical time is between writing and
reading. We all agree that many, if not most, of the soft errors
that occur are likely to do no permanent damage. But I don't like
the remainder.

Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.
More nonsense. Even when that is true, a memory error
could modify a machine code instruction, which in turn could
cause hard drive or registry corruption.

Also, a prime time for hard drive corruption to occur due to
memory errors is when defragmenting a hard drive.

I think that is probably the most vulnerable and insidious
possible error. I have taken to recommending avoiding
defragmentation on systems without ECC.

An interesting thing about your reference, repeated below, is the
plethora of further references it gives, complete with links.


<http://www.tezzaron.com/about/papers/Soft Errors 1_1 secure.pdf>

Finally my repeated jabs at this subject have sparked an
intelligent discussion. This is one time I am not going to
complain about cross-posting. In a way this situation reminds me
of the low reliability of American cars about 30 years ago. The
Japanese improvements in quality led to the great influence of
their products, and general improvement as the public realized the
cost of poor quality. Here we not only know about the problem, we
also know about the cure, and that the cost is fairly trivial.
 
CJT said:
I'm not. I was trying to highlight that with a shift of focus, it's not
so much how long it takes to encounter an error as whether or not you're
one of the unlucky ones this hour.

Actually, it is 'how long' because no matter what you do there *will* be an
error sooner or later from one source or another. Be it a 'cosmic ray',
component failure, power surge, 'act of god', or a random vehicle going out
of control, smashing through the front door and running over the computer
and you.

The question is just how much effort you're willing to spend, and what
kind of effort (as there's more than one means of mitigating the problem), to
protect against each one, versus the odds of it happening and the consequences
if it does.
Hey, if

you're lucky it was off when that one came through ;) Or you were
doing any of the 90% of the time non critical things people normally
use a home PC for.

Nor do I agree that ECC should be used for an at home desktop.


Is there any reason not to use ECC besides some cost and a very
small loss of performance?

Two good reasons.

I suppose this comes down to what a "home computer" is. Some may be
used to play games and write letters; others may archive a
lifetime's worth of work.

It comes down to more than that. The odds of it happening and the
consequences if it does (which is an entire probability set of its
own) vs the cost of taking preventative measures.

Once in 19 years is a rather rare event and even if it happened that
doesn't mean you automatically lose 'important' data. It would have
to occur at a particular time that affected a particular thing in a
particular manner. Ok, so maybe I lost a 'pixel' in a picture of the
pooch or it blew a character in one of those wonderful SPAM emails that
come with garbled text to begin with. Odds are the real impact [pun
intended] would be 'erp', unexplained program error, a few curse
words about 'microsoft software', and restart [as if THAT never
happens even without the help of cosmic rays].

For the typical home user, the odds of losing EVERY thing from a
hard drive failure, combined with the traditionally lousy backup
regimen, or some other failure that causes the system to go 'nuts'
is much, much, higher than worrying about cosmic rays. The odds are
higher it'll get bumped at an inopportune time, or that a component
will fail, or that a connector will work loose from thermal creep, or
any number of things. Hell, the odds of the user screwing his data
up HIMSELF is a thousand times higher.

And we didn't even touch on getting a virus.

If not ECC memory, there is advantage to using parity-checked
memory; a memory error should cause the computer to halt with a
warning, rather than corrupting files.

I agree, if one is using it to calculate warp drive trajectories and
an 'oops' may put you inside a sun somewhere. But then I'd be
recommending multiple redundant systems too.
 
CBFalconer said:
David Maynard wrote:

... snip ...
Yes. Soft error rates for one type of RAM don't necessarily
extrapolate directly, either per chip-hour/MB or across different chip
technologies/densities.

As a case in point, these folks say they've licked it (almost)
entirely, for SRAMs anyway, using, amusingly enough, DRAM
technology.

http://neasia.nikkeibp.com/wcs/leaf/CID/onair/asabt/news/315057

Since you were asking about it I did a quick google for some
more recent numbers on DRAM and found this rather interesting
article, which includes some analysis on the things I mentioned,
like 'what is it doing when' and whether it's fatal, etc.

http://www.eecg.toronto.edu/~lie/papers/hp-softerrors-ieeetocs.pdf

It's interesting to note that early on they mention an error
rate for 'modern' 64MB ram chips in a 1 gig memory and while the
"300 reboots resulting from soft errors on 10000 machines in 1
year" sounds like a large number, because they're looking at how
'the industry' is affected, it translates to 1 per 33 years for
a 'user' on his one machine. (unfortunately they don't make it
clear if that's errors that get THROUGH the ECC or if they're
using the ECC to count the errors [my assumption])

Btw, CPUs have the same susceptibility.


Not really.

Yes, really.
The storage on a CPU is much more likely to be
static, rather than a dynamic cell which depends on the charge on
a minuscule capacitor, holding a countable number of electrons.

If what you think is the situation were correct then no one would be
worrying about SRAMs as they're 'static' devices too. But they are.
On chip caches though are a vulnerable point. At any rate, just
the physical area devoted to on chip memory, compared with main
memory, makes the probability of problems much smaller on the
CPU.

The amount of exposed, susceptible, circuitry is certainly one aspect of
it, yes.
All these soft-errors are probabilistic things.

Yep. And so are the consequences of a hit: depends on what cell is
affected, whether it's being used, and what is using it if it were.
 
[snip]
Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.

Are most hard errors non-critical?

Alex
 
Alex said:
[snip]
Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.


Are most hard errors non-critical?

Alex

A hard error means it's broke.
 
Does, or can, Windows log corrected ECC errors anywhere? I'd be
interested in the answer for all versions, though XP & 2003 are the most
relevant these days.

Best wishes,
 
David Maynard said:
Alex said:
[snip]
Using the numbers above, (one error every 2 days etc.) it is
obvious that the odds of that error being critical must be fairly
small, else practically nobody would have a functional system.

Are most hard errors non-critical?

A hard error means it's broke.

What I mean is, for example, a stuck bit in a DRAM. That is, a hard memory
error as opposed to a soft memory error.

Alex
 