System files corrupted because of a DIMM memory failure

Thread starter: for.fun

Hi all,

Last month, my PC began to do strange things:

Programs that were authorized to get through my firewall were suddenly
considered as new programs and needed to be re-authorized.
All my DVD burns failed.
Some of my applications' licenses expired and I was asked to type the
license again.
When I copied a file from one disk to another, the file was changed.
Finally, when I ran an MD5 checking tool, it gave me a
different MD5 for the same file each time.
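
That last symptom is easy to reproduce with a script. The following is a minimal sketch, assuming Python's standard `hashlib`; the names `file_md5` and `repeated_md5` and the default of 5 runs are illustrative and not taken from any tool mentioned in this thread. On healthy hardware an unchanged file always yields a single digest, so more than one distinct digest points at the read path (RAM, cable, controller) rather than at the file itself.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_md5(path, runs=5):
    """Hash the same file `runs` times and return the set of distinct digests."""
    return {file_md5(path) for _ in range(runs)}
```

A result set with more than one entry means the reads are not reproducible. A single entry is weaker evidence: repeated reads may be served from the operating system's file cache, so a clean result does not prove the hardware is sound.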

I scanned my computer for viruses and trojans but found nothing.
I checked my memory using MemTest86 but found nothing.
I swapped my IDE cables and found the IDE1 to IDE0 failed 4 times out
of 5.

So I bought 2 IDE cables and a new 512 MB DIMM (I had 256 MB
installed).
Changing the IDE cables did not change anything, but replacing the 256
MB DIMM with my new 512 MB DIMM solved all my problems (I plugged the
new DIMM into the same socket as the old one).

It proved that the 256 MB DIMM was bad.

I do not understand why Windows XP did not alert me: I know that
HD controllers and even RAM include CRC checks, parity checks and
probably other integrity checks.
Instead, my system let me copy and consequently corrupt many
of my files.
My system was so unstable that I had to reinstall it from
scratch. Moreover, I can no longer trust the files that are stored
on my disk.


=> Could you tell me how this could happen?

=> Why did CRC/parity not alert me that something was going wrong?

=> Does Windows XP implement data controls?

=> Finally, is there a way to strengthen data control under Windows
XP so I avoid this problem?


Thanks in advance for your replies.


My config is the following one:

OS: Windows XP Pro SP2
CPU: AMD Athlon, 1400 MHz (10.5 x 133)
MB: MSI K7T266 Pro (MS-6380) / MS-6380LE (5 PCI, 1 AGP, 1 CNR, 3
DIMM, Audio)
RAM: 512 MB (PC2100 DDR SDRAM)
GA: ATI Radeon 9550 (RV350)
BIOS: American Megatrends Inc. v062710 (MS-6380)

IDE HD1: IBM IC35L040AVER07-0 (40 GB, 7200 RPM, Ultra-ATA/100)
IDE HD2: IBM IC35L060AVV207-0 (60 GB, 7200 RPM, Ultra-ATA/100)
 
.... snip long woeful tale ...

=> Finally, is there a way to strengthen data control under Windows
XP so I avoid this problem ?

Yes. And under other OSs too, such as Linux. Just get ECC memory,
after ensuring that your MB can handle it.
 

http://www.crucial.com/store/listparts.aspx?model=MS-6380 (K7T266 Pro)

Q: Does my computer support ECC memory?
A: No. Your system does not support ECC.

ECC is the memory feature you are seeking. To work, both the motherboard
(chipset) and the DIMMs have to support it. A non-ECC DIMM might have eight
chips on one side, while the ECC version has nine chips. The difference is that
the non-ECC memory has a 64-bit data interface, while the ECC one has a
72-bit data interface.
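
The 8 extra bits are not an arbitrary number. As a rough sketch, assuming the common single-error-correct, double-error-detect (SECDED) Hamming scheme (whether a particular chipset uses exactly this code is an assumption, and the function name `secded_check_bits` is made up for illustration), the number of check bits needed for a 64-bit word works out as follows:

```python
def secded_check_bits(data_bits):
    """Check bits needed for a SECDED Hamming code over `data_bits` data bits."""
    r = 0
    # Hamming condition: r check bits must be able to point at every data bit,
    # every check bit, and the "no error" case, i.e. 2**r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit turns single-error correction into
    # single-error correction plus double-error detection.
    return r + 1


print(secded_check_bits(64))  # 8, giving 64 + 8 = 72 bits
```

So a ninth 8-bit-wide chip supplies exactly the storage such a code needs for each 64-bit word.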

The chipset on the motherboard is the other issue. Some chipsets only have
the 64 bit interface, so if an ECC DIMM is used, the extra 8 lines float
and are not connected to anything.

As an example, on recent Intel Core2 Duo motherboards, the Intel 975X chipset
has ECC support. The P965 doesn't. And most people buy the P965, so the
majority of people have no possibility of ECC protection, even if they
bought the right DIMMs for it.

Motherboards used in servers take this more seriously. Virtually all
servers have ECC. And the memory used in the server, whether registered
ECC or fully buffered DIMMs, has ECC coverage as well.

It is just desktops where the majority of shipping product has no protection.
You must be a careful shopper to get protection. For someone building a
Core2 Duo system, not only would they need to buy a 975X based motherboard,
they'd also have to hunt around, to find DDR2 ECC equipped DIMMs. They are
not that easy to find. (And looking yesterday, for the new DDR3 DIMMs, I
couldn't find any with ECC. They'll show up, eventually.)

So even if you bought an unbuffered ECC DIMM right now, it won't help, because
your motherboard won't use the ECC bits on it. You'd need to change motherboards
as well.

Paul
 

The simple option is to always run memtest overnight before a defrag, and
back up regularly. Tighter measures, i.e. ECC, will add cost and effort.


NT
 
Thanks to all for the explanation of ECC memory: I had heard of it but did
not know exactly what it was. Now I do!

In fact, I ran "memtest" (not "memtest86", just "memtest": the
version which runs under Windows, not from DOS) overnight (for 8
hours) but it did not find anything. That's why I lost some time
looking for a virus or an HD problem (I forgot to mention that I checked
my IBM/Hitachi HD using the IBM tools running from DOS and all was OK).
In fact, it was the memory, because since I changed the DIMM,
everything has been OK.
 
You need to use the self-booting memtest86, which tests 99% of the
RAM, which is as good as it gets.


NT
 
It proved that the 256 MB DIMM was bad.

Not necessarily. While it is theoretically possible for
memory to fail, presuming this was a system that had been
stable (had you ever checked it for memory errors up until
this point with memtest86+, not a tester that runs on a
large OS consuming a lot of memory?), more likely something
new had happened to cause the memory subsystem instability.

It is important to note whether the memory addresses with
errors are seemingly random, or always (and only) the same
ones over and over again. A program like memtest86+ will
show you this.

If it is always the same addresses and none seem to appear
and disappear from an error state, it is most likely a
physical problem with the memory module. If the addresses
vary at all, it is more likely the motherboard has
become unstable to a small (perhaps progressively
worsening) extent. When swapping in a different module
brings relief, it tends to be one of two things:
either the memory-to-slot contact was poor and is now
improved, or the new memory has a larger stability margin
than the old one did. In the latter case, if the
memory subsystem is progressively getting worse, you may
eventually find the new module is similarly unstable.
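
The rule of thumb above can be written down as a small sketch. It assumes you have noted by hand the failing addresses that memtest86+ displayed on each pass; the function name `classify_failures` and the input format (one set of addresses per pass) are hypothetical, since memtest86+ itself produces no such data structure.

```python
def classify_failures(passes):
    """Classify memory test results.

    `passes` is a list with one set of failing addresses per test pass
    (an empty set means that pass was clean).
    """
    failing = [addresses for addresses in passes if addresses]
    if not failing:
        return "no errors seen"
    common = set.intersection(*failing)
    seen = set.union(*failing)
    # Same addresses in every single pass, none appearing or disappearing.
    if len(failing) == len(passes) and common == seen:
        return "same addresses every pass: suspect the module"
    return "addresses vary between passes: suspect timing, contacts or board"
```

For example, `classify_failures([{0x1F3A40}, {0x1F3A40}])` points at the module, while `classify_failures([{0x1F3A40}, set(), {0x0042C8}])` points at a marginal memory subsystem.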


I do not understand why Windows XP OS did not alert me : I know that
HD controllers and even RAM include CRC check, parity check and
probably other security algorithms.

For the OS to do this, it would have to read back everything
written, or you'd need ECC memory.
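
Nothing stops an application from doing that read-back itself. Here is a minimal sketch, assuming Python's standard `hashlib` and `shutil`; the helper names `sha256_of` and `copy_and_verify` are illustrative only.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy `src` to `dst`, then re-read both files and compare digests."""
    shutil.copyfile(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError(f"verification failed: {dst} differs from {src}")
    return dst
```

This only detects a bad copy, it cannot repair one, and it is not a guarantee: the read-back may be served from the OS file cache rather than from the disk, and the hashing itself runs in the same possibly faulty RAM.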

Instead, my system let me copy and consequently corrupt many
of my files.
My system was so unstable that I had to reinstall it from
scratch. Moreover, I can no longer trust the files that are stored
on my disk.

Agreed, BUT if these files were written a fair amount of
time ago, when the system had not yet exhibited any signs of
instability, the odds are fair that the files are mostly, if
not entirely intact. Only you can know the applications and
importance of minor errors... in some documents it would be
a minor problem, while in others or in applications it could
be more problematic. Until you are certain the system seems
100% stable including passing a 24 hour memtest86+ test, I
suggest that if you need access to the files that you pull
the hard drive out and copy them off onto another media on
another, known stable computer.

Above all, do not defrag your hard drive again until you
have some confidence the system is remaining 100% stable.
And if the situation I briefly described
above is true (the system is slowly becoming less and
less stable in its memory subsystem and the memory
module swap is only a temporary improvement), then you will
need to periodically retest the memory. Frankly, on any
critical system without ECC memory this should be a
regularly scheduled event.

=> Could you tell me how this could happen ?

=> Why CRC/parity did not alert me something was going wrong ?

=> Does Windows XP OS implements data controls ?

Windows is definitely not fault tolerant. Remember that
even if it were, it still has to run on hardware that must
be stable for an assurance of the integrity and proper
function of any potential "data controls". Any application
you are running that generates data, can corrupt that data
long before it is even written to the drive.


=> Finally, is there a way to strengthen data control under Windows
XP so I avoid this problem ?

Use ECC memory, periodically check the memory subsystem, and
periodically check the CPU with a stress test like Prime95's
Torture Test (again needing to run several hours if the
system is important; a less thorough but faster check
would be to run the torture test's "large in place
FFTs" setting). In other words, if the CPU produces errors,
having a stable main memory subsystem won't necessarily
guarantee data integrity.


See the following page, on which there is mention of
significant difference in memory stability from use of
different timings based on a resistor on certain board
versions.
http://www.xbitlabs.com/articles/mainboards/display/msi-k7t266-pro.html

On a related note, relaxing the memory timings in the bios
to higher numbers may improve (regain) stability, and/or
provide a larger stability margin in cases where the
stability is declining over time.

Finally, I can't know about your particular specimen of this
model, but around this era I had an MSI board (also Skt.
462/A) that had a barely stable memory bus, due to MSI
omitting capacitors on the board where there were empty
capacitor positions. I had initially wondered why a _very_
slight overclock had so quickly introduced instability with
memory that had shown it could run quite a bit faster at the
same timings on an equivalent motherboard of a different make
and model. Since I had a fair stock of capacitors from
other board failures/repairs during this era, I decided to
see if adding a couple helped. It did improve stability to
at least several MHz higher, but this was long enough ago
that I don't recall the exact numbers, except that I do
recall the default 133 MHz clocked (DDR266) memory rate was
originally unstable a mere 3 MHz higher. At the time I had
not seen the webpage linked above, and IIRC my board was a
revision 2 that was red in color so I'm not even sure the
resistor issue was applicable to this different board.

In summary, since data integrity seems to be of high
importance to you, it may be time to think about replacing
the motherboard with one supporting ECC memory, and of
course some ECC memory. At this late date I would not try
to reuse the processor, it would be better to now upgrade
the entire platform to something modern like Athlon 64
(budget build) or Core2Duo (or quad core).
 
(e-mail address removed) wrote:
.... snip ...


You need to use the self-booting memtest86, which tests 99% of the
RAM, which is as good as it gets.

However, memtest (of any variety) is not as good as ECC. ECC will
continuously watch for anything in the memory system that loses
data for any reason, such as a cosmic ray, and restore the accurate
data. These are errors that are not due to the memory itself, but
to uncontrollable external events. At the same time an ECC system
can keep a record of these corrections, so that a bad chip can be
found and replaced. That last depends on the software available.
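
To make "restore the accurate data" concrete, here is a toy sketch of a Hamming code with an added overall parity bit, the same family of code as the SECDED schemes commonly used for ECC memory. It works on a short list of bits rather than a real 64-bit memory word, and the function names `encode` and `decode` are made up for illustration; actual ECC is done in hardware by the memory controller.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits as Hamming code plus overall parity."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:
        r += 1
    n = m + r
    code = [0] * (n + 1)  # index 0: overall parity; 1..n: Hamming positions
    source = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(source)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):  # set check bits so the syndrome becomes zero
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code) % 2  # make the total number of 1 bits even
    return code


def decode(code):
    """Return (status, code) where status is ok, corrected or uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        return "ok", code
    if odd_parity and syndrome < len(code):
        code[syndrome] ^= 1  # single error: the syndrome is its position
        return "corrected", code
    return "uncorrectable", code  # e.g. two flipped bits: detected only


word = [1, 0, 1, 1, 0, 0, 1, 0]
stored = encode(word)
damaged = list(stored)
damaged[5] ^= 1  # simulate one bit flipped by a cosmic ray
status, repaired = decode(damaged)
print(status, repaired == stored)  # corrected True
```

The returned status is what a logging layer would count: a steady stream of "corrected" events from the same region is the record that lets a bad chip be found.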
 
First of all, thanks a lot for taking time to write this very
interesting and instructive article.

Not necessarily. While it is theoretically possible for
memory to fail, presuming this was a system that had been
stable (had you ever checked it for memory errors up until
this point with memtest86+, not a tester that runs on a
large OS consuming a lot of memory?), more likely something
new had happened to cause the memory subsystem instability.

I agree: testing while Windows XP was running was not a good idea and
the test probably did not cover all the memory.

For the OS to do this, it would have to read back everything
written, or you'd need ECC memory.

I do not think I am going to buy ECC: first of all, I think it is
very expensive and slower than standard memory.
Second, I change my computer every 7 years.
Third, I think many other problems can happen in a 7-year period
(like a power failure, hard disk failure, motherboard failure) and ECC
will not help in these cases.

Agreed, BUT if these files were written a fair amount of
time ago, when the system had not yet exhibited any signs of
instability, the odds are fair that the files are mostly, if
not entirely intact. Only you can know the applications and
importance of minor errors... in some documents it would be
a minor problem, while in others or in applications it could
be more problematic. Until you are certain the system seems
100% stable including passing a 24 hour memtest86+ test, I
suggest that if you need access to the files that you pull
the hard drive out and copy them off onto another media on
another, known stable computer.

I copied all my files to a new computer I bought a few days ago.
In fact, the image, audio and video files are still readable even if
they contain some errors.
Unfortunately, binary files like software installers are now
unusable (or CRC errors occur while uncompressing packages).

Above all, do not defrag your hard drive again until you
have some confidence the system is remaining 100% stable,

Too late. I noticed the problem while burning a DVD: all my burned
DVDs contained errors upon checking.
I thought it was because my HD was fragmented and caused
buffer underruns, so I defragged my HD (using JkDefrag) and after this
my system became very unstable!

In summary, since data integrity seems to be of high
importance to you, it may be time to think about replacing
the motherboard with one supporting ECC memory, and of
course some ECC memory. At this late date I would not try
to reuse the processor, it would be better to now upgrade
the entire platform to something modern like Athlon 64
(budget build) or Core2Duo (or quad core).

In fact, data integrity was not so important to me before I had this
problem!
I just bought a new Intel quad-core-based configuration including SATA
II hard drives and it is a real pleasure to work with (it is so fast
compared to my old PC).
All my files are now safe on this new computer.
I removed the 256 MB DIMM from the old computer and replaced it with a
512 MB one and nothing bad has happened since (that was 3 weeks ago):
I have had no system file restore alerts, no strange errors (like
unpredictable crashes of apps which used to work fine), I have
not had a single failed DVD burn, and my firewall has not reported that the
binary images of my applications changed in memory, so I am quite
confident in the motherboard (I forgot to mention that after upgrading
the memory, I formatted my HD again and reinstalled the OS from
scratch).

Thanks for your help.
 
.... snip ...


I do not think I am going to buy ECC: first of all, I think it is
very expensive and slower than standard memory.
Second, I change my computer every 7 years. Third, I think many
other problems can happen in a 7-year period (like a power
failure, hard disk failure, motherboard failure) and ECC will not
help in these cases.

You are mistaken about the price. ECC will cost you about 20%
more. It should be less than a 10% premium, but the volume gets in
the way. The extra 20 to 50 dollars (or so) can easily save your
complete system. If your machine will handle it, price the two
varieties at Crucial. Remember that, without ECC, memory is the
one component on your machine that has no validity checks
whatsoever, and that a single dropped bit can leave something to
appear years later.
 