The case for ECC.

Alan Walpool

Hi,

I know this issue has pretty much been run into the ground, but the
last year has changed my mind on it.

Two cases.

1) Very old Pentium Pro system with ECC memory used for a
firewall/bridge. Immediately had a memory parity error show on the
screen and the system halted. Checked the memory with memtest and
sure enough it was bad. Replaced the memory and everything was back
to normal. It took a day to correct the problem, and data was
intact. Memory was no-name with no warranty.

2) Had a 2 year old AMD Athlon system with non-ECC memory and the
system started locking up. One of the disks was corrupted. I
started trying to track the problem down, and continued to have
random system lockups. It got so bad the system was not booting.
Removed all cards but the video card, and still lockups. Finally
checked the memory with memtest, and sure enough the memory was
bad. System was never overclocked and did not have any heat
related problems. Well, after data corruption and 4-5 days of
pulling my hair out, I figured it out. Memory was name brand with a
lifetime warranty, so I sent for an RMA on it.

Long story short, I prefer case #1 over case #2. At least it is easier
to diagnose the problem with ECC memory. I thought that memory was so
good now that home users did not need ECC memory, and that is what
many regular posters in this newsgroup have said over and over.

The next system I purchase will have ECC memory. My time is well worth
the minor difference in price. Since I don't overclock it is not an
issue. Heck, that fancy overclocking memory costs way more than ECC
memory.

Whatever,

Alan
 
I'm not sure which regulars have said that here. I have ECC memory
even on my K6-III system. Memory has never been "so good" that it never
fails. My only "issue" with ECC is that I can't test whether it's really
working (why have I never seen an error?). How do I know that any errors
are actually getting reported somewhere so I can take corrective action?

Alan said:
The next system I purchase will have ECC memory. My time is well worth
the minor difference in price. Since I don't overclock it is not an
issue. Heck, that fancy overclocking memory costs way more than ECC
memory.

ECC memory prices dropped down to the 11% overhead number a long time ago.
Memory for the K6-III was cheap enough in '99 that I figured, "why not?"
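
For reference, the 11% figure lines up with the share of an ECC module's bits that go to error correction. A quick sketch of the arithmetic, assuming the common organization of 64 data bits plus 8 check bits per word (the constants and the function name are illustrative, not taken from this thread):

```python
DATA_BITS = 64   # assumed data width per memory word
CHECK_BITS = 8   # assumed extra ECC check bits per word

def ecc_overhead(data_bits=DATA_BITS, check_bits=CHECK_BITS):
    """Return (ECC share of all bits, extra bits relative to the data bits)."""
    total = data_bits + check_bits
    return check_bits / total, check_bits / data_bits

share_of_total, extra_over_data = ecc_overhead()
print(f"ECC share of the module: {share_of_total:.1%}")   # 11.1%
print(f"Extra bits vs non-ECC:   {extra_over_data:.1%}")  # 12.5%
```

Which of the two numbers applies depends on whether the overhead is measured against the whole module or against the non-ECC part of it.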
 
Alan said:
1) Very old pentium pro system with ecc memory used for a
firewall/bridge. Immediately had a memory parity error show on the
screen and the system halted. Checked the memory with memtest and
sure enough it was bad. Replaced the memory and everything was back
to normal. It took a day to correct the problem, and data was
intact. Memory was noname and no warranty.

Did the motherboard/BIOS support ECC RAM? The thing about ECC RAM is
that it should transparently fix 1-bit memory errors.

Or do you think that the RAM has been going bad and the system has been
fixing 1-bit errors and then finally got to the point where it
encountered a 2-bit error?

Either way, I do agree that ECC is nice to have in a system. My current
system has it, and I think it was a nice investment. The only thing I
think would be nice is if my motherboard had some sort of DMI logging
mechanism for memory errors. That way I'd be able to see if the ECC
has done its job at any point during the time I've owned it.
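
The "fix 1-bit, halt on 2-bit" behaviour described above is what a single-error-correct, double-error-detect code gives you. Below is a toy sketch using an extended Hamming(8,4) code on a 4-bit value. Real ECC memory typically uses a wider code over 64 data bits, and the function names here are made up for illustration, so treat this as a demonstration of the principle rather than what any chipset actually implements.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    for p in (1, 2, 4):                 # parity bit p covers positions with bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2         # overall parity allows double-error detection
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = code[:]                      # work on a copy
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i               # XOR of set positions points at a single flip
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # one flip; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None    # even parity but nonzero syndrome: two flips
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble

word = encode(0b1011)
word[5] ^= 1                            # simulate one bad bit
print(decode(word))                     # ('corrected', 11)
word[6] ^= 1                            # a second bad bit in the same word
print(decode(word))                     # ('uncorrectable', None)
```

The second case is the one where a system has no choice but to report the error and stop, since it can tell the word is bad but not which bits to flip back.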
 
keith> I'm not sure what regulars have said such here. I have ECC
keith> memory even on my K6-III system. Memory has never been "so
keith> good" that it never fails. My only "issue" with ECC is that I
keith> can't test whether it's really working (why have I never seen
keith> an error?). How do I know that any errors are actually getting
keith> reported somewhere so I can take corrective action?

My old Pentium Pro motherboard has a memory error count in the BIOS.
At least on the BIOS I have you can monitor ECC corrections there. If
it gets really bad it will cause a parity error and shut down the
system.

Depends on the BIOS and motherboard.

Interesting.

Alan
 
Will> Did the motherboard/BIOS support ECC RAM? The thing about ECC
Will> ram is that it should transparently fix 1-bit memory errors.

Will> Or do you think that the RAM has been going bad and the system
Will> has been fixing 1-bit errors and then finally got to the point
Will> where it encountered a 2-bit error?

Will> Either way, I do agree that ECC is nice to have in a system. My
Will> current system has it, and I think it was a nice investment.
Will> The only thing I think would be nice is if my motherboard had
Will> some sort of DMI logging mechanism for memory errors. That way
Will> I'd be able to see if the ECC has done its job at any point
Will> during the time I've owned it.

The motherboard BIOS detected and reported the ECC memory fine. The
BIOS in that old Pentium Pro motherboard logs ECC errors. It was
reporting some errors at first but nothing that it could not handle. I
guess it became so bad that it gave up trying to correct the memory
error and halted the system completely with a message saying memory
error. Didn't write down the exact error message.

I guess this all depends on the BIOS, whether it reports errors or not.

I have not checked lately but I seriously doubt desktop PCs have any
logging for ECC errors. That old Pentium Pro system I have was
really a server motherboard at one time.

Interesting.

Alan
 
Alan said:
I have not checked lately but I seriously doubt desktop PC's have any
logging for ECC errors. Really that old pentium pro system I have was
really a server motherboard at one time.

Yes, that seems to be the case. The only machines I've used that log
ECC errors are SGI workstations and Dell servers. Nothing desktop-wise,
which is a shame.

There is a Linux kernel module that supposedly monitors and reports ECC
errors, but I haven't been able to get it to compile on my Gentoo (2.6
kernel) system.

http://www.anime.net/~goemon/linux-ecc/

Would be nice if Windows had some sort of equivalent functionality...
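
For what it's worth, later Linux kernels expose ECC counters through the EDAC subsystem in sysfs, so no out-of-tree module is needed there. The sketch below is not from this thread and rests on assumptions: a kernel with an EDAC driver loaded for the memory controller, and the usual `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/mc*/`. `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path

def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Map each memory controller name to its corrected/uncorrected error counts."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts                   # no EDAC driver loaded, or not Linux
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue                    # skip controllers we cannot read
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts

if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found")
    for name, c in found.items():
        print(f"{name}: {c['corrected']} corrected, {c['uncorrected']} uncorrected")
```

An empty result only means nothing is being reported through EDAC; it does not prove ECC is inactive, since the board or BIOS may simply not expose it.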
 
Yes, that seems to be the case. The only machines I've used that log
ECC errors are SGI workstations and Dell servers. Nothing desktop-wise,
which is a shame.

The hardware hooks are there for the 925X series chipset.
I haven't looked very hard, but IIRC, they're pretty much
in all the older "high end" desktop chipsets as well, something
like the 875P.

Whether software uses those hooks and logs (correctable) 1-bit ECC
errors or not is another story.
 
Man, you just made a case for Socket 940. The board is cheaper than
939, and the CPU (Opteron) goes for roughly the same price as the
equivalent 939 part (A64FX). The only complaint usually is that the
registered ECC RAM it uses is somewhat slower and more expensive. But
you want ECC, so 940 is the way to go, unless you are willing to pay an
arm and a leg for a slower Xeon.
 
Man, you just made a case for socket 940 - the board is cheaper than
939, the CPU (Opteron) goes for roughly the same price as equivalent
939 (A64FX), the only complaint usually is that registered ECC RAM it
uses is somewhat slower and more expensive. But you want ECC, so 940
is the way to go, unless you are willing to pay an arm and a leg for
slower Xeon.

Registered (also known as buffered) and ECC are two separate features.
You can buy unbuffered ECC DDR SDRAM DIMMs, for example from Kingston:

http://www.ec.kingston.com/ecom/configurator/PartsInfo.asp?ktcpartno=KVR400X72C3A/512

Click on search to see a list of compatible motherboards.

There are Socket 754 and Socket 939 motherboards which support ECC
memory modules. For example:

http://www.asus.com/prog/spec.asp?m=K8N-E Deluxe
 
Alan said:
2) Had a 2 year old amd athlon system with non-ecc memory and the
system started locking up. One of the disks was corrupted. I
started trying to track the problem down, and continued to have
random system lockups. It got so bad the system was not booting.
Removed all cards but the video card, and still lockups. Finally
checked the memory with memtest, and sure enough the memory was
bad. System was never overclocked and did not have any heat
related problems. Well after data corruption, and 4-5 days of
pulling my hair out, I figured it out. Memory was name brand with a
lifetime warranty, I sent for a RMA on the memory.


I feel your pain. It's wise to consider ECC.

When a system has disk corruption, crashes, or blue screens, I reach for
MEMTEST first (Disk Doctor second). You can screw up memory by fiddling
with hardware; it can fail on its own or due to a power spike; it can
even glitch when a cosmic ray hits it (at least that used to be a
worry); it can be running under marginal and deteriorating conditions,
etc.

However, for non-critical use, if you buy "reasonable quality" (power
supply, motherboard, memory, cooling), operate within manufacturer's
parameters, and perform burn-in testing, you'll be OK. Keep MEMTEST
handy. In my experience memory failure hasn't been an issue for years
and years. If in doubt, ask your local hardware shop what they think of
current configurations. I respect the expertise of good local shops,
especially if they warrant what they sell.
 
Man, you just made a case for socket 940 - the board is cheaper than
939, the CPU (Opteron) goes for roughly the same price as equivalent
939 (A64FX), the only complaint usually is that registered ECC RAM it
uses is somewhat slower and more expensive. But you want ECC, so 940
is the way to go, unless you are willing to pay an arm and a leg for
slower Xeon.

Perhaps more to the point, he just made a point for integrating a
memory controller onto your CPU that supports ECC, as is done in BOTH
the Opteron and the Athlon64.

You don't need Socket 940 at all to use ECC, ALL Athlon64 boards
support it (unless the BIOS goes to lengths to intentionally disable
this feature). It's all built into the processor, and all
Athlon64/Opteron chips, whether they be Socket 754, Socket 939 or
Socket 940, support it.

Unregistered (aka unbuffered) ECC modules actually add only a small cost
over standard unregistered/non-ECC memory; they do not carry as large
a price premium as registered (buffered) memory. For example, if
you check Crucial's prices:

http://www.crucial.com/store/listmodule.asp?module=DDR+PC3200&Attrib=Package&cat=RAM

For 512MB, the unbuffered/non-ECC memory costs $81, unbuffered ECC
costs $106 and buffered ECC memory costs $123. The only problem here
is that Crucial doesn't sell unbuffered ECC memory at all sizes (and
they don't sell buffered non-ECC at any size, though the demand for
such a setup is pretty small), i.e. for 1GB modules they only sell
unbuffered non-ECC and buffered ECC.
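
To put those 512MB prices in percentage terms, here is a small sketch. The three prices are the ones quoted above; the dictionary and function names are just for illustration.

```python
# 512MB DDR PC3200 prices in USD, as quoted in the post above
PRICES_512MB = {
    "unbuffered non-ECC": 81,
    "unbuffered ECC": 106,
    "buffered ECC": 123,
}

def premium_over(baseline, prices):
    """Return each other module type's price premium relative to the baseline."""
    base = prices[baseline]
    return {name: (p - base) / base for name, p in prices.items() if name != baseline}

for name, pct in premium_over("unbuffered non-ECC", PRICES_512MB).items():
    print(f"{name}: +{pct:.0%}")
# unbuffered ECC: +31%
# buffered ECC: +52%
```

In absolute terms that is $25 extra for unbuffered ECC and $42 extra for buffered ECC per 512MB module.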
 
On Mon, 13 Dec 2004 04:15:31 GMT, "(e-mail address removed)"
You don't need Socket 940 at all to use ECC, ALL Athlon64 boards
support it (unless the BIOS goes to lengths to intentionally disable
this feature). It's all built into the processor, and all
Athlon64/Opteron chips, whether they be Socket 754, Socket 939 or
Socket 940, support it.

Unregistered (aka unbuffered) ECC chips actually only add a small cost
over standard unregistered/non-ECC memory, they do not carry as large
of a price premium as registered (buffered) memory. For example, if
you check Crucial's prices:

http://www.crucial.com/store/listmodule.asp?module=DDR+PC3200&Attrib=Package&cat=RAM

For 512MB, the unbuffered/non-ECC memory costs $81, unbuffered ECC
costs $106 and buffered ECC memory costs $123. The only problem here
is that Crucial doesn't sell unbuffered ECC memory at all sizes (and
they don't sell buffered non-ECC at any size, though the demand for
such a setup is pretty small), ie for 1GB modules they only sell
unbuffered non-ECC and buffered ECC.

Maybe it was my luck, but when I was building my current system, I
found DDR 3200 ECC reg. 512 MB modules for just over $100. The
cheapest ECC unbuffered modules at that moment were priced even
higher, as was buffered non-ECC. Yes, I bought them not from the
likes of Crucial, but rather from one of the pricewatch bottom-feeders,
and these vendors probably don't stock the not-so-common varieties. As
you mentioned, the choice was between unbuffered non-ECC and buffered
ECC. Since I wanted SMP, the choice was between Opteron and Xeon, not
between 940 and 939. Obviously Xeon looked like a loser in both the
performance and price departments ;-) but that is a whole
different topic. But back to the memory: if you are already set on
ECC, and it quite likely will be registered as a side effect, 940 only
makes sense because both CPU and motherboard would likely come a tad
cheaper (though I have not checked prices for a few months, so things
could have changed since).
 
keith> I'm not sure what regulars have said such here. I have ECC
keith> memory even on my K6-III system. Memory has never been "so
keith> good" that it never fails. My only "issue" with ECC is that I
keith> can't test whether it's really working (why have I never seen
keith> an error?). How do I know that any errors are actually getting
keith> reported somewhere so I can take corrective action?

Alan said:
My old pentium pro motherboard has a memory error count in the bios.
At least on the bios I have you can monitor ECC corrections there. If
it gets really bad it will cause a parity error and shutdown the
system.

That's a server system. I'm *quite* sure IBM's Z-Series logs memory
errors and reports them to the mothership too. That doesn't give me a
wonderful feeling with my commodity desktop system.

Alan said:
Depends on the bios and motherboard.

Obviously. That's the point! How does one *know*? BTW, IMO BIOS
reporting isn't good enough. I want something that I can query (indeed
be prompted with) from the OS, perhaps as root if security demands it.

Alan said:
Interesting.

Long ago in a galaxy far-far away I proposed to test for ECC function on
motherboards that said they supported it. I couldn't figure out a
reliable way of doing it, so that idea went west.
 
On 13 Dec 2004 12:34:14 -0800 said:
When a system has disk corruption, crashes, or blue screens I reach for
MEMTEST first (Disk Doctor second).

http://cquirke.mvps.org/9x/bthink.htm :-)
In my experience memory failure hasn't been an issue for years

Oh, it's common in the context of PCs that just don't work properly.

I see more bad HDs than bad RAM, but it's close, and more bad RAM than
bad motherboards or SVGA cards. Bad PSUs are common too, but they
usually present in less ambiguous ways.


-------------------- ----- ---- --- -- - - - -
Running Windows-based av to kill active malware is like striking
a match to see if what you are standing in is water or petrol.
 