Anybody testing the reliability of computers?

  • Thread starter: Skybuck Flying

Skybuck Flying

Hello,

I sometimes worry about cosmic interference, electromagnetic
interference, or just mobile phone signal interference with computers.

Is anybody measuring the "rate of computer crashes"?

It will be hard to do because the computer industry changes all the time.

New software, new hardware; according to Ballmer the only constant is change.

If nobody is measuring the reliability of computer systems then I can
recommend the following:

1. Build a computer

2. Install software.

3. Let it run forever.

4. Don't change a thing, except when hardware breaks.

Then during all of this:

Try to measure two things:

1. Rate of failure/system crashes.

2. Rate of freezes/hangs.
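
A minimal sketch of how the two rates above could be computed, assuming somebody keeps a log of timestamped crash and freeze events for the machine. The function name `event_rates`, the log format, and the sample dates are all made up for illustration.

```python
from datetime import datetime, timedelta


def event_rates(start, end, events):
    """Return events per 1000 hours of observation, keyed by event kind.

    `events` is an iterable of (timestamp, kind) pairs; this log format
    is an assumption, not something any real tool produces.
    """
    hours = (end - start).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("observation window must be positive")
    counts = {}
    for when, kind in events:
        # Only count events that fall inside the observation window.
        if start <= when <= end:
            counts[kind] = counts.get(kind, 0) + 1
    return {kind: 1000.0 * n / hours for kind, n in counts.items()}


# Hypothetical log: 2000 hours of uptime with two crashes and one freeze.
start = datetime(2012, 1, 1)
end = start + timedelta(hours=2000)
log = [
    (start + timedelta(hours=100), "crash"),
    (start + timedelta(hours=900), "freeze"),
    (start + timedelta(hours=1500), "crash"),
]
rates = event_rates(start, end, log)
print(rates)
```

Comparing the rates from two different observation windows would be one simple way to look for a change over time, although with so few events any difference could easily be chance.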

Perhaps it's already too late to start this measurement, but better late
than never.

Nowadays the world is full of mobile phones and mobile signals.

It would be interesting to see if there is a sudden change in the rate of
failures or freezes.

For Intel, AMD, or any other processor manufacturer, and also for hardware
producers, it could be interesting to measure these failure rates.

And perhaps try and design computer chips which will fail less and be more
tolerant towards these kinds of external sources of interference.

To have a competitive edge over competitors.

It may even save your life one day. Perhaps you will be on board a
flight, and if its computers fail you may die; if not, you may live
another day.

Bye,
Skybuck.
 
Computer memory buses are getting better with time. The
signal quality on the bus is now better than it was under
some of the previous standards.

If your computer has ECC, it can be used to count
single-bit errors. Single-bit errors are corrected
by ECC, and then have no impact on the computer.
AMD is better at supporting ECC than Intel is (my
Intel system is ECC capable and had ECC DIMMs
installed, but the ECC did not work: it is turned
off due to a design defect in the Intel chipset).
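
As a rough sketch of what counting corrected errors can look like: on Linux, when an EDAC driver is loaded for the memory controller, the kernel exposes per-controller counters as `ce_count` (corrected) and `ue_count` (uncorrected) under `/sys/devices/system/edac/mc/`. The helper names below are invented for illustration, and the fake directory tree is only there so the example runs on machines without ECC or without Linux.

```python
import tempfile
from pathlib import Path


def read_ecc_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC counters.

    Returns an empty dict when the directory is missing, e.g. no EDAC
    driver is loaded or the machine is not running Linux.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def make_fake_edac_tree(root, counters):
    """Build a stand-in directory tree (hypothetical helper for the demo)."""
    for name, (ce, ue) in counters.items():
        mc = Path(root) / name
        mc.mkdir(parents=True)
        (mc / "ce_count").write_text(f"{ce}\n")
        (mc / "ue_count").write_text(f"{ue}\n")


# Demo against made-up counters, so it works without ECC hardware.
with tempfile.TemporaryDirectory() as tmp:
    make_fake_edac_tree(tmp, {"mc0": (3, 0), "mc1": (0, 0)})
    demo_counts = read_ecc_counts(tmp)

print("demo:", demo_counts)
print("this machine:", read_ecc_counts())
```

Logging these counters periodically would give a corrected-error rate over time, which is the kind of number a non-ECC system cannot provide.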

The rate of memory errors is constantly studied.
Intel studies this every time the size of a
DIMM doubles. Intel needs to study this to determine
whether ECC needs to be added to *all* computer systems,
and not just a few. And surprisingly, the answer is that
desktops can still be run without ECC. The majority of
Intel desktops are not ECC capable.

I don't think my current system has ever crashed. I suspect
it has thrown memory errors, but without ECC, I have no way
to count the errors. My system is memtest86+ and Prime95
clean. I had memory errors on this system at one time; increasing
Vnb (the Northbridge voltage) by 0.1V or so fixed the problem,
and it hasn't come back.

Paul
 
Is that memory rated for a faster speed than its chips are?
I ask because Ocaholic.ch and XbitLabs.com have removed heatsinks
from modules rated for over 2000 MHz to reveal 1600 MHz or even
1333 MHz chips underneath. My failure rate for such modules has
been about 10%, compared to almost zilch for modules made from
major-brand chips that are not overclocked.

I've never had MemTest86+ find an error with any RAM that
passed the computer's boot-up test, which is weird because the
very similar MemTest86 has often reported errors on the same
memory running in the same system. So I use MemTest86 and
www.GoldMemory.cz, and a few people said Gold Memory found
bad bits when nothing else did.
 
Actually, I have a counter-example for your last paragraph.
On my Asus A7N8X Deluxe board, one of the DIMMs had a completely
dead chip on it (it emitted random bytes), and the BIOS memory test
didn't even blink. Meanwhile, the errors just scrolled off the screen
in memtest86+. I think the memory configuration was interleaved,
and it looked like the BIOS memory test wasn't testing all the
locations. That's all I could figure. By moving DIMMs around,
I think I could get BIOS beep patterns for some install
cases, but not for others. And for the cases where the
bad memory allowed booting to continue, memtest86+ didn't
have a problem detecting all the stuck-at faults.
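
To illustrate the guess above, here is a toy sketch of a write/read-back pattern test run against simulated memory with stuck-at bits. It is not how memtest86+ or any BIOS actually works; the `FaultyMemory` class, the fault locations, and the "quick" address stride are all invented for the example. It only shows that a test which skips locations can miss a fault that a full sweep finds.

```python
class FaultyMemory:
    """Simulated byte memory with optional stuck-at bits (toy model)."""

    def __init__(self, size, stuck_at_one=None, stuck_at_zero=None):
        self.cells = bytearray(size)
        self.stuck_at_one = stuck_at_one or {}    # address -> bits forced to 1
        self.stuck_at_zero = stuck_at_zero or {}  # address -> bits forced to 0

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        value |= self.stuck_at_one.get(addr, 0)
        value &= ~self.stuck_at_zero.get(addr, 0) & 0xFF
        return value


def pattern_test(mem, addresses, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern to the given addresses and read it back.

    Returns a list of (address, expected, actual) mismatches.
    """
    addresses = list(addresses)
    errors = []
    for pattern in patterns:
        for addr in addresses:
            mem.write(addr, pattern)
        for addr in addresses:
            actual = mem.read(addr)
            if actual != pattern:
                errors.append((addr, pattern, actual))
    return errors


# Made-up faults: bit 2 stuck high at address 10, bit 7 stuck low at 33.
mem = FaultyMemory(64, stuck_at_one={10: 0x04}, stuck_at_zero={33: 0x80})

full_errors = pattern_test(mem, range(64))          # every location
quick_errors = pattern_test(mem, range(0, 64, 4))   # every fourth location

print("full sweep found faults at:", sorted({e[0] for e in full_errors}))
print("quick sweep found faults at:", sorted({e[0] for e in quick_errors}))
```

Using both all-zeros and all-ones patterns matters here: a bit stuck high is invisible while the expected value already has that bit set, and likewise for a bit stuck low.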

As for the memory error problem, it was rather strange.
The system was error free. Then one evening, all of a sudden,
things went downhill. I had a program exit and throw
an error. I immediately rebooted, loaded my memtest86+ floppy
(it sits beside the monitor), and yes, there were errors present.
Maybe two or three in a pass. But the errors were random, and
the fault addresses didn't repeat. On a whim, I looked at the
available settings in the BIOS, noticed a Northbridge voltage
adjustment, and just gave it a tiny adjustment. And things
returned to normal again. The problem has not recurred. The
chipset is X48, I think, and has a huge copper heatsink on it
that is mildly warm to the touch.
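
A small sketch of the distinction drawn above between fault addresses that repeat and ones that don't, assuming you have written down the failing addresses from each test pass. The function name and the sample addresses are invented. As a rule of thumb only, an address that fails in several passes points toward a bad cell, while addresses that never repeat are more consistent with a marginal timing or voltage problem like the one described here.

```python
from collections import Counter


def classify_fault_addresses(passes):
    """Split failing addresses into repeating and one-off groups.

    `passes` is a list with one list of failing addresses per test pass.
    Returns (repeating, one_off), each sorted in ascending order.
    """
    seen = Counter()
    for failing in passes:
        # Count each address at most once per pass.
        seen.update(set(failing))
    repeating = sorted(a for a, n in seen.items() if n > 1)
    one_off = sorted(a for a, n in seen.items() if n == 1)
    return repeating, one_off


# Hypothetical notes from three passes.
passes = [
    [0x1A2B3C, 0x4000],
    [0x4000, 0x77F0],
    [0x4000],
]
repeating, one_off = classify_fault_addresses(passes)
print("repeating:", [hex(a) for a in repeating])
print("one-off:  ", [hex(a) for a in one_off])
```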

The memory was Kingston, and not particularly an enthusiast
grade of memory. It could have been made from stock speed-bin
chips at the factory (they're within JEDEC range).

Paul
 