MCE - Non fatal, correctable incident occurred on CPU 0

Will Dormann

Hello,

My Gentoo box has recently started spewing out Machine Check Exception
errors to my log files. They're correctable, and the machine appears to
be running OK, but I'm just wondering if this is a foreshadowing of
impending doom.

I get four repeating MCE errors, from the moment the system starts up.
I've run memtest86 for hours and it shows no error. I'm having a hard
time figuring out what exactly the error is. Nothing is overclocked
and the system is not overheating as far as I can tell. It's a 2.0 GHz
Celeron in an Asus Pundit with the latest BIOS.

Here are the errors, followed by the parsemce output. Any ideas?

----

MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 0: cc00003820040189

../parsemce -e 1 -b 0 -s cc00003820040189 -a 0
Status: (1) Restart IP valid.
parsebank(0): cc00003820040189 @ 0
External tag parity error
Address in addr register valid
MISC register information valid
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : Reserved


MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 1: c000000000000135

../parsemce -e 1 -b 1 -s c000000000000135 -a 0
Status: (1) Restart IP valid.
parsebank(1): c000000000000135 @ 0
External tag parity error
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Data
Memory/IO : Reserved


MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 2: 9000000000000153

../parsemce -e 1 -b 2 -s 9000000000000153 -a 0
Status: (1) Restart IP valid.
parsebank(2): 9000000000000153 @ 0
External tag parity error
Error enabled in control register
Memory heirarchy error
Request: Generic error
Transaction type : Instruction
Memory/IO : Other


MCE: The hardware reports a non fatal, correctable incident occurred on
CPU 0.
Bank 2: d000000000000153

../parsemce -e 1 -b 2 -s d000000000000153 -a 0
Status: (1) Restart IP valid.
parsebank(2): d000000000000153 @ 0
External tag parity error
Error enabled in control register
Error overflow
Memory heirarchy error
Request: Generic error
Transaction type : Instruction
Memory/IO : Other
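
For anyone who wants to cross-check the parsemce output above by hand, here is a minimal Python sketch that pulls the architectural flag bits out of a 64-bit bank status value. It assumes the usual IA32_MCi_STATUS layout (flags in bits 63 to 57, model-specific code in bits 31 to 16, MCA error code in bits 15 to 0) and the "0000 0001 RRRR TTLL" compound form for memory hierarchy errors. The function name `decode_status` and the dictionary keys are made up for illustration and are not part of parsemce. It does not try to reproduce every line parsemce prints.

```python
# Architectural flag bits in a machine-check bank status register
# (assumed IA32_MCi_STATUS layout; bit number, short name, meaning).
FLAGS = [
    (63, "VAL", "status register valid"),
    (62, "OVER", "error overflow"),
    (61, "UC", "uncorrected error"),
    (60, "EN", "error enabled in control register"),
    (59, "MISCV", "MISC register information valid"),
    (58, "ADDRV", "address in addr register valid"),
    (57, "PCC", "processor context corrupt"),
]

# TT field of a memory hierarchy error code (bits 3..2 of the MCA code).
TRANSACTION_TYPES = {0b00: "Instruction", 0b01: "Data", 0b10: "Generic"}


def decode_status(hex_status):
    """Decode a bank status given as a hex string (hypothetical helper)."""
    value = int(hex_status, 16)
    mca_code = value & 0xFFFF
    result = {
        "flags": [name for bit, name, _ in FLAGS if (value >> bit) & 1],
        "mca_code": mca_code,
        "model_specific_code": (value >> 16) & 0xFFFF,
        "transaction_type": None,
    }
    # High byte 0x01 marks the memory hierarchy compound form.
    if (mca_code >> 8) == 0x01:
        tt = (mca_code >> 2) & 0b11
        result["transaction_type"] = TRANSACTION_TYPES.get(tt, "Reserved")
    return result


# The four status values reported in this post.
for status in ("cc00003820040189", "c000000000000135",
               "9000000000000153", "d000000000000153"):
    info = decode_status(status)
    print(status, info["flags"], hex(info["mca_code"]),
          info["transaction_type"])
```

The flags and transaction types it reports can be compared line by line with the parsemce listings for each bank.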
 
Try replacing your processor.
 
Alex said:
Try replacing your processor.


That could be it. I guess I'll have to see if this is something that's
covered by the warranty, as it's within the 3-year period.

Is it true that all MCE codes indicate an error that's internal to the
CPU? Or could something external to the CPU trigger an MCE?


Thanks
-WD
 
Normally I find them to be errors from the cache memory on the cpu.
 
Will Dormann said:
That could be it. I guess I'll have to see if this is something
that's covered by the warranty, as it's within the 3-year period.

Is it true that all MCE codes indicate an error that's internal to the
CPU? Or could something external to the CPU trigger an MCE?

Something external can trigger one: a badly seated CPU, overheating
motherboard components on the FSB, or a few other things. But these
errors would *usually* mean a CPU problem. In your case, it *may* be
the memory controller, since you see:

External tag parity error
...
Memory heirarchy error


The tag cache is part of the internal cache that's set aside to index
external banks of memory. If the memory controller has a problem, you
could presumably get this error. Of course, you might also see this if
the cache on the CPU is bad (or overheated).

I'd first try reseating the CPU and RAM and blow away any dust that
might have accumulated on the motherboard or in the CPU HSF assembly.
It might not help, but it's free and worth a try.

Oh, and send an email to the maintainer of parsemce and tell him he
means hierarchy and not heirarchy. The latter would be a society ruled
by the eldest child of still living parents... :-)

Regards,
 
Arthur said:
I'd first try reseating the CPU and RAM and blow away any dust that
might have accumulated on the motherboard or in the CPU HSF assembly.
It might not help, but it's free and worth a try.


Thanks for the follow-up. Earlier today I did exactly the above, but
it didn't have any effect on the MCE errors.

I tried running Prime95 for a few hours, and it ran without error.
Although I feel like I'm doing the equivalent of ignoring the "check
engine" light on my car, I might just live with it until I actually see
symptoms other than the MCE.



-WD
 
In comp.sys.ibm.pc.hardware.chips Will Dormann said:
I tried running Prime95 for a few hours, and it ran without
error. Although I feel like I'm doing the equivalent of ignoring
the "check engine" light on my car, I might just live with it
until I actually see symptoms other than the MCE.

You can try running my `burnMMX` with a fairly low memory
parameter like `E` or `H` to exercise your cache ECC.

-- Robert, author of `cpuburn` http://pages.sbcglobal.net/redelm
 
.... or "Intel warranty fun"
I tried running Prime95 for a few hours, and it ran without error.
Although I feel like I'm doing the equivalent of ignoring the "check
engine" light on my car, I might just live with it until I actually see
symptoms other than the MCE.


Well, I'm finally seeing symptoms of instability. The MCE errors have
continued, with increased frequency, and now I can't compile MythTV
anymore. The compilation itself crashes at various stages (never at
the same spot).
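
A rough way to put numbers on "increased frequency" is to count the bank/status pairs in the log. The sketch below is only an illustration: it assumes the log lines look like the "Bank 0: cc00003820040189" lines shown earlier in the thread, and the function name `tally_mce` and the idea of passing the log path on the command line are my own assumptions, not anything the kernel or parsemce provides.

```python
import re
import sys
from collections import Counter

# Matches lines such as "Bank 0: cc00003820040189" (format taken from
# the log excerpts in this thread; other kernels may log differently).
BANK_LINE = re.compile(r"Bank (\d+): ([0-9a-fA-F]{16})")


def tally_mce(lines):
    """Count occurrences of each (bank, status) pair in an iterable of lines."""
    counts = Counter()
    for line in lines:
        match = BANK_LINE.search(line)
        if match:
            bank = int(match.group(1))
            status = match.group(2).lower()
            counts[(bank, status)] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the user, e.g. a syslog file.
    with open(sys.argv[1], errors="replace") as log:
        for (bank, status), n in sorted(tally_mce(log).items()):
            print(f"bank {bank} status {status}: {n}")
```

Running it against the same log on different days gives a crude trend for each status value.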

Prime95 fails within a few minutes with a math error.

Now I get to deal with the Intel warranty process...

I call the number, and am transferred to an offshore call center with a
bad connection. I explain the above and why I would like a replacement
processor. Then I get disconnected.

I call again, go through the same steps explaining the problem to a
different person. I explain the Machine Check Exception errors, the
failed compilation, the Prime95 failure. The processor temp is under
50C and Memtest86 passes without error.

His answer: I must take the CPU to a "local computer store" and have
them test the processor before I can get a replacement.

(( ASIDE: What's so special about a "local computer store" that allows
them to determine if I can get an RMA or not? Do they possess some
magical trait that lets them see if a processor is bad or not, which a
mere mortal such as myself couldn't dream of having? Would a tech at a
"local computer store" hook up the CPU to a system that can verify
processor MCE codes? Or would they plug in the chip, turn it on, and
say "it's OK" when they see it POST? ))

Then I get disconnected again.

I call back for the third time, and I get a recording saying that
customer service is closed.

It's great that this chip has a 3-year warranty and all, but who knows
if I'll actually be able to take advantage of it! I guess by Monday I
might know, assuming I don't have an aneurysm by then. :)


-WD
 
:/
... or "Intel warranty fun"



Well, I'm finally seeing symptoms of instability now. The MCE errors
have been continuing, but with increased frequency now. But now I
can't compile MythTV anymore. The compilation itself crashes at
various stages. (Never at the same spot)

Prime95 fails within a few minutes with a math error.

Now I get to deal with the Intel warranty process...

I call the number, and am transferred to an offshore call center with
a bad connection. I explain the above and why I would like a
replacement processor. Then I get disconnected.

I call again, go through the same steps explaining the problem to a
different person. I explain the Machine Check Exception errors, the
failed compilation, the Prime95 failure. The processor temp is under
50C and Memtest86 passes without error.

His answer: I must take the CPU to a "local computer store" and have
them test the processor before I can get a replacement.

(( ASIDE: What's so special about a "local computer store" that allows
them to determine if I can get an RMA or not? Do they possess some
magical trait that lets them see if a processor is bad or not, which a
mere mortal such as myself couldn't dream of having? Would a tech at
a "local computer store" hook up the CPU to a system that can verify
processor MCE codes? Or would they plug in the chip, turn it on, and
say "it's OK" when they see it POST? ))

My guess is that they need to work through an authorized reseller for
the RMA procedure. This is more of an administrative matter, as an
authorized reseller is supposed to be qualified to unmount a CPU from
the motherboard and package it in such a way that the CPU arrives back
at the tech department without any additional damage, which would void
your warranty.

A second possibility is that some - but not all - resellers have
specialized hardware test cards that analyze every component in your
system.

Then I get disconnected again.

I call back for the third time, and I get a recording saying that
customer service is closed.

It's great that this chip has a 3-year warranty and all, but who knows
if I'll actually be able to take advantage of it! I guess by Monday I
might know, assuming I don't have an aneurysm by then. :)

I'd be surprised actually... Intel is a reputable company. I have,
however, had to deal with Chaintech - and this was _through_ an
authorized reseller - and they stonewalled the whole procedure for so
long that the warranty eventually expired.

I never got that new motherboard they promised, nor did I get a
refund... :-/
 