Disk to disk copying with overclocked memory

  • Thread starter: Mark M

Mark M

I use a partition copier which boots off a floppy disk before any
other OS is launched.

If I copy a partition from one hard drive to another, then is there
any risk of data corruption if the BIOS has been changed to
aggressively speed up the memory settings?

For example the BIOS might set the memory to CAS=2 rather than
CAS=3. Or other memory timing intervals might also be set to be
shorter than is normal.

I am thinking that maybe the IDE cable and drive controllers handle
data fairly independently of the memory on the motherboard. So
maybe data just flows up and down the IDE cable and maybe the
motherboard is not involved except for sync pulses.

There are three scenarios I am thinking about:

(1) Copying a partition from one hard drive on one IDE cable to
another hard drive on a different IDE cable.

(2) Copying a partition from one hard drive to another which is on
the same IDE cable.

(3) Copying one partition to another on the same hard drive.

How much effect would "over-set" memory have on these situations?

Do the answers to any of the above three scenarios change if the
copying of large amounts of data files is done from within WinXP?
Personally, I would guess that it is more likely that motherboard
memory comes into play if Windows is involved.
 
1. All copies go through memory, using at least a block-sized buffer of RAM.
Buffers at least large enough to hold an entire track will be used,
probably larger for more efficiency. Data is always copied from a drive to
a memory buffer first. It might be done directly, using DMA (the M is for
memory), but it will be to and from memory. Which part of memory is used will
vary depending on the program and whether you are running it under Windows,
but a single-bit error in the wrong place in memory can be a major problem.
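
To make that concrete, here is a minimal sketch in C (POSIX-style calls and
made-up device names; the OP's floppy-booted copier obviously does this with
its own low-level routines) of what any sector copier boils down to - every
chunk of data parks in a RAM buffer between the read and the write:

/* Minimal sketch (POSIX-style C, hypothetical device names) of what any
 * sector copier ultimately does: read a chunk of the source into a RAM
 * buffer, then write that buffer out to the destination.  A bit that
 * flips in buf while it sits in memory ends up on the target disk,
 * whether the transfer used PIO or DMA. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)              /* 1 MiB buffer, bigger than a track */

int main(void)
{
    int src = open("/dev/source_disk", O_RDONLY);   /* placeholder names */
    int dst = open("/dev/target_disk", O_WRONLY);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    unsigned char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    ssize_t n;
    while ((n = read(src, buf, CHUNK)) > 0)         /* disk -> RAM */
        if (write(dst, buf, n) != n) {              /* RAM -> disk */
            perror("write");
            return 1;
        }

    free(buf);
    close(src);
    close(dst);
    return 0;
}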

2. If your memory timing is aggressive enough that errors are likely, then
there are a number of things that could go wrong. There could be an error
in the data that gets copied. You could also have the wrong disk address
stored in RAM, so the data goes to the wrong place. It could be the wrong
instruction, so the program crashes. It could be any one of hundreds of
possible single-bit failures that might go unnoticed. ECC would help here
(it would catch most possible memory errors). If you want reliability in
anything (not just copying disks), then don't push your memory (or other
components) to the edge.

JT
 
If I can add a bit to JT's reply...

If you are overclocking your memory you risk getting more errors than the
guys who built the memory planned on. If the memory is not ECC memory then
you may get more single bit errors which will cause your machine to stop
when they occur. ECC memory can correct single bit errors but non-ECC memory
can only detect them and when that happens windows will blue screen. Most
home PCs have non-ECC memory because it's cheaper.

Overclocking could also cause the occasional double bit error which non-ECC
memory cannot detect. This would be bad. As JT indicates, this could cause
all sorts of mayhem. If you're lucky, windows could execute a broken
instruction or reference a memory address in outer space and then blue
screen. If you are unlucky it could blunder on using bad data and do
something nasty to your file system (or it could harmlessly stick an umlaut
onto the screen somewhere.) Hard to predict.

cp
 
I use a partition copier which boots off a floppy disk before any
other OS is launched.

If I copy a partition from one hard drive to another, then is there
any risk of data corruption if the BIOS has been changed to
aggressively speed up the memory settings?

Yes, a relatively high risk.

For example the BIOS might set the memory to CAS=2 rather than
CAS=3. Or other memory timing intervals might also be set to be
shorter than is normal.

Yes, that'll _potentially_ cause errors and corrupt the data.

I am thinking that maybe the IDE cable and drive controllers handle
data fairly independently of the memory on the motherboard. So
maybe data just flows up and down the IDE cable and maybe the
motherboard is not involved except for sync pulses.

It's involved. Hint: Consider what "DMA" stands for.

There are three scenarios I am thinking about:

(1) Copying a partition from one hard drive on one IDE cable to
another hard drive on a different IDE cable.

(2) Copying a partition from one hard drive to another which is on
the same IDE cable.

(3) Copying one partition to another on the same hard drive.

How much effect would "over-set" memory have on these situations?

It has the same effect on all of them: IF the memory is set incorrectly or
is defective (or there is a motherboard issue, etc.), then any errors that
occur put all of the above scenarios at risk.

Do the answers to any of the above three scenarios change if the
copying of large amounts of data files is done from within WinXP?
Personally, I would guess that it is more likely that motherboard
memory comes into play if Windows is involved.

It's the same risk, but with more memory in use there's an even greater
chance of errors, not necessarily all occurring in the data transfer but
ALSO in the OS, so both the backup AND the OS would potentially be using
corrupt data. Never boot to the OS if there's any question of memory
instability, or else be prepared to reinstall everything unless you can
restore or recreate every file written during that interval of operation.
 
Colin said:
If I can add a bit to JT's reply...

If you are overclocking your memory you risk getting more errors
than the guys who built the memory planned on. If the memory is
not ECC memory then you may get more single bit errors which will
cause your machine to stop when they occur. ECC memory can
correct single bit errors but non-ECC memory can only detect them
and when that happens windows will blue screen. Most home PCs
have non-ECC memory because it's cheaper.

Correction here - non ECC memory won't even detect any errors, it
will just use the wrong value. Sometimes that MAY cause the OS to
crash. Unfortunately the rest of the thread is lost due to
top-posting.
 
CBFalconer said:
Correction here - non ECC memory won't even detect any errors, it
will just use the wrong value. Sometimes that MAY cause the OS to
crash. Unfortunately the rest of the thread is lost due to
top-posting.
You seem to have confused ECC and parity. ECC means error checking
and correcting, which involves more redundancy than simple single bit
parity error checking.
 
You seem to have confused ECC and parity.

Or you have. Hardly any RAM is parity anymore.
ECC means error checking and correcting, which involves
more redundancy than simple single bit parity error checking.

Which isn't seen much anymore.
 
CJT said:
You seem to have confused ECC and parity. ECC means error checking
and correcting, which involves more redundancy than simple single bit
parity error checking.

Nothing uses parity checking today - that requires writing
individual 9-bit bytes. Expanded to a 64-bit-wide word (for the
various Pentia etc.), the parity or ECC bits both fit in an extra 8
bits, i.e. a 72-bit-wide word. If today's systems have no ECC, they
have no checking of any form. ECC is actually no harder to handle
on wide words.

Memory configurations that can use parity can use ECC; the reverse
is not true.

Exception - some embedded systems with smaller memory paths may
use parity.
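
As a toy illustration of the difference in redundancy (just a sketch in C;
real boards compute this in the memory controller, not in software), here is
byte-wise even parity over a 64-bit word - 8 extra bits, the same count an
ECC DIMM carries, but only able to flag that some byte changed:

/* Toy sketch of plain parity: one even-parity bit per 8-bit byte, so a
 * 64-bit word carries 8 extra bits - the same 72-bit width an ECC DIMM
 * uses, but with far less power.  Parity can only say "some byte
 * changed"; it cannot say which bit, and a double flip inside one byte
 * cancels out entirely. */
#include <stdint.h>
#include <stdio.h>

static unsigned parity8(uint8_t b)          /* even parity of one byte */
{
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return b & 1;
}

static uint8_t parity_of_word(uint64_t w)   /* 8 parity bits, one per byte */
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p |= (uint8_t)(parity8((uint8_t)(w >> (8 * i))) << i);
    return p;
}

int main(void)
{
    uint64_t word = 0x0123456789ABCDEFull;
    uint8_t  good = parity_of_word(word);

    uint64_t one_flip = word ^ (1ull << 13);                /* single-bit error */
    printf("single flip detected: %s\n",
           parity_of_word(one_flip) != good ? "yes" : "no");

    uint64_t two_flips = word ^ (1ull << 13) ^ (1ull << 9); /* both flips in byte 1 */
    printf("double flip in one byte detected: %s\n",
           parity_of_word(two_flips) != good ? "yes" : "no");
    return 0;
}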
 
CBFalconer said:
Nothing uses parity checking today - that requires writing
individual 9-bit bytes. Expanded to a 64-bit-wide word (for
the various Pentia etc.), the parity or ECC bits both fit in an
extra 8 bits, i.e. a 72-bit-wide word. If today's systems have
no ECC, they have no checking of any form. ECC is actually no
harder to handle on wide words.

Memory configurations that can use parity can use ECC; the
reverse is not true.

Exception - some embedded systems with smaller memory paths
may use parity.


Does the motherboard have to support ECC?

Or can you always put a stick of ECC memory where there had been
non-ECC memory before?
 
Does the motherboard have to support ECC?

Or can you always put a stick of ECC memory where there had been
non-ECC memory before?

For ECC to work, the motherboard has to support it. If you put ECC memory in
a motherboard that doesn't support ECC, it will usually operate just fine,
but with no error correction.

JT
 
Mark said:
Does the motherboard have to support ECC?

Or can you always put a stick of ECC memory where there had been
non-ECC memory before?

The motherboard has to support it, otherwise at best the additional bits
that support ECC will be ignored.

I find it highly annoying that Intel etc. have decided that "consumers" don't
need this and so have not implemented it in most of the available chipsets.
 
Mark said:
Does the motherboard have to support ECC?
Yes


Or can you always put a stick of ECC memory where there had been
non-ECC memory before?

No. Even if the MB supports ECC, you have to have ALL the memory
ECC-capable before it will function. Otherwise the system can't
tell a massive failure from a non-ECC area.
 
Correction here - non ECC memory won't even detect any errors, it
will just use the wrong value. Sometimes that MAY cause the OS to
crash. Unfortunately the rest of the thread is lost due to
top-posting.

Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noted that he sometimes got a corrupted archive (it was a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.
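
For what it's worth, the kind of check that exposes this is trivial: read the
same big file twice and compare checksums. A rough sketch in C (the filename
is a placeholder; FNV-1a is used only because it needs no libraries; note the
second pass may be served from the OS cache, so use a file larger than RAM if
you want both passes to actually hit the disk):

/* Read a large file twice and compare a checksum of each pass.  On
 * healthy hardware the two sums always match; with RAM that flips a
 * bit every couple of gigabytes they will eventually differ. */
#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 0; }

    uint64_t h = 0xcbf29ce484222325ull;     /* FNV-1a offset basis */
    unsigned char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 0x100000001b3ull;          /* FNV-1a prime */
        }
    fclose(f);
    return h;
}

int main(void)
{
    const char *path = "large-archive.bin";        /* placeholder */
    uint64_t first  = fnv1a_file(path);
    uint64_t second = fnv1a_file(path);

    printf("pass 1: %016llx\npass 2: %016llx\n%s\n",
           (unsigned long long)first, (unsigned long long)second,
           first == second ? "consistent"
                           : "MISMATCH - suspect RAM or controller");
    return 0;
}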

Note on ECC: If you get very few single-bit errors without
ECC active, ECC will likely solve your problem. If you get a lot of
single-bit errors, or even only a very few multiple-bit errors, then
ECC will not really help and will let errors through. For my scenario
(a single random bit every 2GB), ECC would have done fine.

Arno
 
Arno said:
Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noted that he sometimes got a corrupted archive (it was a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

Note on ECC: If you get very few single-bit errors without
ECC active, ECC will likely solve your problem. If you get a lot of
single-bit errors, or even only a very few multiple-bit errors, then
ECC will not really help and will let errors through. For my scenario
(a single random bit every 2GB), ECC would have done fine.

The ECC implemented on PCs can typically correct 1-bit errors and detect
2-bit errors.

One machine I worked with came up with a parity error one day. It was about
a week old at the time, so I sent it back to the distributor, who, being one
of these little hole-in-the-wall places and not Tech Data or the like,
instead of swapping the machine or the board, had one of his
high-school-dropout techs "fix" it. The machine came back sans parity
error. It ran fine for a while, then I started getting complaints of data
corruption. I finally tracked it down to a bad bit in the memory. Sure
enough, the guy had "fixed" it by disabling parity. Should have sued.

This is one of the pernicious notions surrounding the testing of PCs--the
notion that the only possible failure mode is a hang, totally ignoring the
possibility that there will be data corruption that does not cause a hang,
at least not of the machine, although it may cause the tech to be hung by
the users.

But if you're getting regular errors then regardless of the kind of memory
you're using something is broken. Even with ECC if you're getting errors
reported in the log you should find out why and fix the problem rather than
just trusting the ECC--ECC is like RAID--it lets you run a busted machine
without losing data--doesn't mean that the machine isn't busted and doesn't
need fixing.
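
To illustrate the correct-one/detect-two behaviour mentioned at the top of
this reply, here is a toy SECDED code in C: Hamming(7,4) plus an overall
parity bit, so 4 data bits protected by 4 check bits. Real DIMMs protect a
64-bit word with 8 check bits on the same principle, just wider, and do it
in hardware; this is only a sketch of the idea, not anyone's actual chipset
logic.

/* Toy SECDED: a single flipped bit is located and corrected, a double
 * flip is flagged as uncorrectable. */
#include <stdio.h>

/* Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7),
 * then add an overall even-parity bit in position 0. */
static unsigned char encode(unsigned char data)     /* data: 0..15 */
{
    unsigned d1 = (data >> 0) & 1;
    unsigned d2 = (data >> 1) & 1;
    unsigned d3 = (data >> 2) & 1;
    unsigned d4 = (data >> 3) & 1;

    unsigned p1 = d1 ^ d2 ^ d4;    /* covers positions 1,3,5,7 */
    unsigned p2 = d1 ^ d3 ^ d4;    /* covers positions 2,3,6,7 */
    unsigned p4 = d2 ^ d3 ^ d4;    /* covers positions 4,5,6,7 */

    unsigned char w = (unsigned char)((p1 << 1) | (p2 << 2) | (d1 << 3) |
                                      (p4 << 4) | (d2 << 5) | (d3 << 6) |
                                      (d4 << 7));
    unsigned p0 = 0;               /* overall parity over bits 1..7 */
    for (int i = 1; i <= 7; i++)
        p0 ^= (w >> i) & 1;
    return (unsigned char)(w | p0);  /* bit 0 = overall parity */
}

/* Decode: returns 0 = ok, 1 = single error corrected, 2 = double error detected. */
static int decode(unsigned char w, unsigned char *data)
{
    unsigned s = 0;                          /* Hamming syndrome */
    for (int i = 1; i <= 7; i++)
        if ((w >> i) & 1)
            s ^= (unsigned)i;

    unsigned overall = 0;                    /* parity over all 8 bits */
    for (int i = 0; i <= 7; i++)
        overall ^= (w >> i) & 1;

    int status = 0;
    if (s != 0 && overall != 0) {            /* single-bit error: fix it */
        w ^= (unsigned char)(1u << s);
        status = 1;
    } else if (s != 0 && overall == 0) {     /* double-bit error: give up */
        status = 2;
    } else if (s == 0 && overall != 0) {     /* error in the parity bit itself */
        w ^= 1u;
        status = 1;
    }

    *data = (unsigned char)(((w >> 3) & 1) | (((w >> 5) & 1) << 1) |
                            (((w >> 6) & 1) << 2) | (((w >> 7) & 1) << 3));
    return status;
}

int main(void)
{
    unsigned char d;
    unsigned char w = encode(0xB);                   /* 4 data bits: 1011 */

    int s1 = decode(w ^ (1u << 5), &d);              /* flip one bit */
    printf("single flip: decoded 0x%X, status %d (corrected)\n", (unsigned)d, s1);

    int s2 = decode(w ^ (1u << 5) ^ (1u << 2), &d);  /* flip two bits */
    printf("double flip: status %d (detected, not correctable)\n", s2);
    return 0;
}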
 
In comp.sys.ibm.pc.hardware.storage J. Clarke said:
Arno Wagner wrote:
In comp.sys.ibm.pc.hardware.storage CBFalconer <[email protected]>
wrote: [...]
Crashes are not your worst enemy. Undetected data corruption is.

I once debugged a fileserver that flipped one bit on average per
2GB read or written. This thing had been used in this condition for
several months by several people on a daily basis. Then one person
noted that he sometimes got a corrupted archive (it was a large file)
when reading it, and sometimes not. There were likely quite
a few changed files on disk at that time. If you have files that
react badly to changed bits, that is a disaster.

The solution was just to set the memory timing more conservatively.
I made it two steps slower, without noticeable impact on performance.

Note on ECC: If you get very few single-bit errors without
ECC active, ECC will likely solve your problem. If you get a lot of
single-bit errors, or even only a very few multiple-bit errors, then
ECC will not really help and will let errors through. For my scenario
(a single random bit every 2GB), ECC would have done fine.
The ECC implemented on PCs can typically correct 1-bit errors and detect
2-bit errors.

Yes, I know. There is also a second mode where it will not correct,
but detect up to 3 bit-errors.
One machine I worked with came up with a parity error one day. It was about
a week old at the time, so I sent it back to the distributor, who, being one
of these little hole-in-the-wall places and not Tech Data or the like,
instead of swapping the machine or the board, had one of his
high-school-dropout techs "fix" it. The machine came back sans parity
error. It ran fine for a while, then I started getting complaints of data
corruption. I finally tracked it down to a bad bit in the memory. Sure
enough, the guy had "fixed" it by disabling parity. Should have sued.

Nice. Possibly on the paradigm that the customer is stupid and
complains without need...
This is one of the pernicious notions surrounding the testing of PCs--the
notion that the only possible failure mode is a hang, totally ignoring the
possibility that there will be data corruption that does not cause a hang,
at least not of the machine, although it may cause the tech to be hung by
the users.

Yes, indeed. Usually things do not hang with defective bits. I had
another machine run for some weeks before a user found a hard error
in memory, i.e. a bit that was always constant. This was an Infineon
original module. Since then I always run memtest86 for at least a day
on new memory.
But if you're getting regular errors then regardless of the kind of memory
you're using something is broken. Even with ECC if you're getting errors
reported in the log you should find out why and fix the problem rather than
just trusting the ECC--ECC is like RAID--it lets you run a busted machine
without losing data--doesn't mean that the machine isn't busted and doesn't
need fixing.

Agreed. But there is 'frequent' and 'seldom'. With the setup I described,
one bit per 2GB was perhaps once a week. To get a double or triple-bit
error within a single word is very unlikely under those conditions. However,
I completely agree with you. I did not express myself clearly enough: I
meant to say "ECC would have corrected these" and not "ECC would have
been a fix".

The real problem is that it is hard to diagnose these faults. What
if they get more frequent? What if they are not random, but frequently
get past ECC?

ECC is a fix only for very rare bit-faults that are not due to a
defect. These are soft errors that happen with a probability of
somewhere around once every 1-100 years per gigabyte of memory and are
due to outside influences like a charged particle from space hitting a
memory cell. In servers people often care about these, so many
server boards support or even require ECC memory. I have a dual
Athlon and a dual Opteron board operational that only work with ECC
RAM. With ECC these errors go from "seldom" to "forget about
them".

For more frequent (i.e. repeatable to some degree) failures, tracking
down and fixing the problem is the only way to go.

ECC will
a) give you some time to fix the problem if it is not too bad
b) help you find it on the first occurrence and not when something
goes wrong with your software/data

ECC will not fix the problem!

Arno
 
I've had an MB which occasionally corrupted bit 0x80000000, but only during
disk I/O! And the corrupted bit position was unrelated to the I/O buffers! Of
course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads. In
that mode, the failure was detected within a minute. Had to dump the MB;
replacing the memory and CPU didn't help.
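
That approach can be sketched roughly like this (POSIX threads purely for
illustration; the actual test was a Windows program, and the device path,
buffer size and pass count are placeholders): one thread keeps the disk and
DMA path busy while another repeatedly scans a patterned buffer for flipped
bits, which is exactly the kind of corruption an idle memory test never sees.

/* build: cc scan.c -lpthread */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PATTERN 0x5A5A5A5A5A5A5A5Aull
#define WORDS   ((64u << 20) / sizeof(uint64_t))   /* 64 MiB test buffer */

static volatile int stop = 0;          /* crude flag, good enough for a sketch */

static void *disk_load(void *arg)      /* keep the IDE/DMA path busy */
{
    (void)arg;
    int fd = open("/dev/source_disk", O_RDONLY);   /* placeholder device */
    char buf[1 << 16];
    while (!stop && fd >= 0) {
        if (read(fd, buf, sizeof buf) <= 0)
            lseek(fd, 0, SEEK_SET);                /* wrap around and keep reading */
    }
    if (fd >= 0) close(fd);
    return NULL;
}

int main(void)
{
    volatile uint64_t *mem = malloc(WORDS * sizeof *mem);
    if (!mem) { perror("malloc"); return 1; }
    for (size_t i = 0; i < WORDS; i++)
        mem[i] = PATTERN;                          /* fill with known pattern */

    pthread_t t;
    pthread_create(&t, NULL, disk_load, NULL);

    for (int pass = 0; pass < 1000; pass++)        /* scan while I/O runs */
        for (size_t i = 0; i < WORDS; i++)
            if (mem[i] != PATTERN) {
                printf("pass %d: word %zu corrupted to %016llx\n",
                       pass, i, (unsigned long long)mem[i]);
                mem[i] = PATTERN;                  /* rewrite and keep going */
            }

    stop = 1;
    pthread_join(t, NULL);
    free((void *)mem);
    return 0;
}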
 
J. Clarke said:
The ECC implemented on PCs can typically correct 1-bit errors and
detect 2-bit errors.

One machine I worked with came up with a parity error one day. It
was about a week old at the time so I sent it back to the distributer,
who, being one of these little hole in the wall places and not Tech
Data or the like, instead of swapping the machine or the board,
instead had one of his high-school dropout techs "fix" it. The
machine came back sans parity error. Ran fine for a while, then
started getting complaints of data corruption. Tracked it down
finally to a bad bit in the memory. Sure enough the guy had "fixed"
it by disabling parity. Should have sued.

This is one of the pernicious notions surrounding the testing of
PCs--the notion that the only possible failure mode is a hang,
totally ignoring the possibility that there will be data corruption
that does not cause a hang, at least not of the machine, although
it may cause the tech to be hung by the users.

But if you're getting regular errors then regardless of the kind of
memory you're using something is broken. Even with ECC if you're
getting errors reported in the log you should find out why and fix
the problem rather than just trusting the ECC--ECC is like RAID--it
lets you run a busted machine without losing data--doesn't mean
that the machine isn't busted and doesn't need fixing.

Well, this is somewhat refreshing. Usually when I get on my high horse
about having ECC memory I am greeted with a chorus of pooh-poohs
and denials about sneaky soft failures, cosmic rays, useless
backups, etc. etc. In fact, walk into most computer stores and
start talking about ECC and you will be greeted with blank stares.
 
CBFalconer said:
Correction here - non ECC memory won't even detect any errors,
it will just use the wrong value. Sometimes that MAY cause the OS to
crash.
Unfortunately the rest of the thread is lost due to top-posting.

Bash top-posters for top-posting, not for your bad choice of news client
or your failure to set it up properly.
 
In comp.sys.ibm.pc.hardware.storage Alexander Grigoriev said:
I've had an MB which occasionally corrupted bit 0x80000000, but only during
disk I/O! And the corrupted bit position was unrelated to the I/O buffers! Of
course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads. In
that mode, the failure was detected within a minute. Had to dump the MB;
replacing the memory and CPU didn't help.

Really nasty. Shows that these things have gotten far too complex...

Arno
 
Alexander Grigoriev said:
I've had an MB which occasionally corrupted bit 0x80000000, but only
during disk I/O!
And the corrupted bit position was unrelated to the I/O buffers!
Meaning?

Of course, a standalone memory test didn't find anything. I had to modify the
test to make it run under Windows and also run parallel disk I/O threads.

What happened to that memory test? The last time I heard about it was when c't complained about you not supporting it anymore.
 