Another interesting experience with my Gigabyte GA-MA78GM-S2HP mobo
and AMD Phenom X3 8650:
The system has locked up twice in the last four days, and I am
contemplating trying some accelerated testing to find out what is
going on.
So far I have tried several programs that are on Ultimate Boot CD
(UBCD), such as memtest86+, cpuburn and mersenneprime, and all have
passed several hours of testing (7+ hours in the memtest86+ case).
Nevertheless, the system has locked up again, so I am wondering what I
might do to provoke the failure again for debug purposes.
In chip testing, it is common to "margin" the chip by essentially
turning down the supply voltage until the chip starts failing in an
obvious and frequent manner.
My question is: What is the collective experience with such tests at
the system and motherboard levels? One of the problems I run into is
that the BIOS only seems to permit turning the voltages (cpu,
memory, ...) UP rather than down.
I suppose I could also try to stress the system by overclocking it,
but somehow I'd feel more convinced if I could do some voltage margin
testing.
Any ideas or experiences that pertain to this matter?
Creating the voltage versus frequency curve is what
overclockers (or underclockers) do. For example, on
my latest purchase, I know that an extra 0.1V on Vcore
allows a 33% overclock. By proceeding in small steps of
frequency, and adjusting Vcore for the "same level of
stability" at each test point, you can produce your
own voltage versus frequency curve. On an older
processor (Northwood), I got to see the "brick wall
pattern", where beyond a certain point, all the extra
(safe) voltage that could be applied didn't allow
any higher overclock.
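
To make the bookkeeping concrete, here is a minimal Python sketch of
logging test points and spotting the brick wall. The numbers are
made up for illustration (they are not measurements from any real
processor), and the function names are my own.

```python
# Hypothetical test log: (frequency in MHz, lowest Vcore in mV that
# passed the stability test). None means no safe voltage was stable.
TEST_POINTS = [
    (2400, 1200),
    (2600, 1225),
    (2800, 1250),
    (3000, 1300),
    (3200, 1375),
    (3300, None),  # the "brick wall": nothing safe worked here
]


def highest_stable(points):
    """Return the highest frequency that was stable at some safe voltage."""
    stable = [freq for freq, vcore in points if vcore is not None]
    return max(stable) if stable else None


def marginal_cost(points):
    """Extra mV needed per extra 100 MHz between consecutive stable points."""
    stable = sorted((f, v) for f, v in points if v is not None)
    costs = []
    for (f0, v0), (f1, v1) in zip(stable, stable[1:]):
        costs.append((f1, (v1 - v0) / ((f1 - f0) / 100)))
    return costs


if __name__ == "__main__":
    print("highest stable frequency:", highest_stable(TEST_POINTS), "MHz")
    for freq, cost in marginal_cost(TEST_POINTS):
        print(f"up to {freq} MHz: {cost} mV per 100 MHz")
```

A rising cost per 100 MHz is the warning that the wall is close. The
same table works for margining downward: hold the frequency fixed and
log the lowest voltage that still passes.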
In terms of features, AMD has Cool'n'Quiet (CNQ) and Intel
has Enhanced SpeedStep (EIST). Depending on OS loading,
if these features are enabled, the voltage and frequency are
changed dynamically, at up to 30 times per second. The
multiplier might vary between 6X and 9X say, with some
small difference in Vcore applied to those two conditions,
according to the manufacturer's declaration of what is
enough to make it work.
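
As a rough illustration of why the two multipliers are such different
test conditions: core clock is the bus (reference) clock times the
multiplier. The 333 MHz bus clock below is an assumed example value,
not taken from any particular system in this thread.

```python
def core_clock_mhz(base_clock_mhz, multiplier):
    """Core clock is the bus (reference) clock times the multiplier."""
    return base_clock_mhz * multiplier


# Assumed example bus clock; substitute the value for your own board.
ASSUMED_BASE_CLOCK_MHZ = 333

if __name__ == "__main__":
    for multiplier in (6, 9):
        clock = core_clock_mhz(ASSUMED_BASE_CLOCK_MHZ, multiplier)
        print(f"{multiplier}X -> {clock} MHz")
```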
So if you are having stability issues, your first step is to
disable CNQ or EIST. The purpose of doing that is not to
blame those features for the stability issue (as they're not
likely to be the problem), but to make the test conditions
a stable, known quantity. You want just one frequency involved
when doing a test case, as you're attempting to do
characterization.
On my processor, I believe the Vcore setting is policed by the
processor. My Core2 has VID bits to drive the Vcore regulator.
Judging by tools that can control the multiplier setting and
drive out new Vcore values while the system is running, the
processor seems to have an upper limit on the bit pattern
it will allow to be passed on the VID bits. That
prevents any useful level of overvolting on my newest system.
Previous generations of systems used things like overclock
controller chips to allow "in-band" VID changes.
On some motherboards, you may notice the nomenclature "+0.1V"
for a Vcore setting, rather than a more direct "1.300V" setting
in the BIOS. I interpret this to mean the motherboard design has
a feature to bump Vcore, independent of the VID bits. So the
"+0.1V" thing is meant to imply an offset applied in the
Vcore regulator. I had to do something similar to my motherboard
with a soldering iron. I now have a socket, where I can fit
a 1/4W resistor, and by varying the value, I get a voltage boost.
My motherboard is unlike some other brands, in not offering
any out-of-band voltage boost feature. So I had to implement
my own, using instructions from other users who did the
analysis before me. You likely won't have to go through this.
I'm explaining this, in case you cannot reconcile what is
happening while you're testing (setting says one thing,
measured value is some other value). If the set value and
the measured value don't match, part of that difference is
due to "droop", and part can be because of a boost which is
applied independent of the VID bits.
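
If you want to keep track of that while testing, a small sketch like
the following splits the difference between the set and the measured
value into the known boost and whatever is left over (droop, plus
measurement error). This is a simplified model of my own, and the
example values are hypothetical, in millivolts.

```python
def split_difference(set_mv, measured_mv, boost_mv=0):
    """Split (measured - set) into the known boost and the remainder.

    A negative remainder means the rail sits below set + boost, which
    is what droop under load would look like.
    """
    total = measured_mv - set_mv
    remainder = total - boost_mv
    return {"total_mv": total, "boost_mv": boost_mv, "remainder_mv": remainder}


if __name__ == "__main__":
    # Hypothetical readings: BIOS says 1.300V, a +0.1V boost is
    # applied, and the meter or hardware monitor shows 1.360V.
    result = split_difference(set_mv=1300, measured_mv=1360, boost_mv=100)
    print(result)
```

In that made-up case the reading is 60 mV above the setting, but once
the 100 mV boost is accounted for, the rail is actually 40 mV low.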
As Kony says, a driver could be responsible for the problem.
The Mersenne Prime95 test is pretty good at finding bad
RAM, and since you've run that for a few hours, that
helps to eliminate bad memory. Prime95 can only test the
memory which is separate from the portion used by the OS,
so it is possible there are still some areas of the RAM
that have not been tested as thoroughly.
Other things that might cause a freeze include a misadjusted
bus multiplier, like the one used for HyperTransport between
processor and Northbridge, or a SATA or IDE clock which
is too far from nominal. So clock signals to other
hardware parts in your system could give a freezing
symptom. Data or some transaction to the processor
could be frozen, and the processor might still be
running.
Another comment - I've noticed on my older overclocking
test projects, that the processor would crash on an
error. My current Core2 system tends to freeze, rather than
giving an old-fashioned blue screen. So there can be
some differences from one generation to another, as
to what part of the processor is failing, and whether
the system runs long enough to splatter something
across the screen.
Paul