Voltage stress and margin test of system stability

reikred

Another interesting experience with my Gigabyte GA-MA78GM-S2HP mobo
and AMD Phenom X3 8650:

The system has locked up twice in the last four days, and I am
contemplating trying some accelerated testing to find out what is
going on.

So far I have tried several programs that are on the Ultimate Boot CD
(UBCD), such as memtest86+, cpuburn, and the Mersenne prime test
(Prime95), and all have passed several hours of testing (7+ hours in
the memtest86+ case).

Nevertheless, the system has locked up again, so I am wondering what I
might do to provoke the failure again for debug purposes.

In chip testing, it is common to "margin" the chip by essentially
turning down the supply voltage until the chip starts failing in an
obvious and frequent manner.

My question is: What is the collective experience with such tests at
the system and motherboard levels? One of the problems I run into is
that the BIOS only seems to permit turning the voltages (cpu,
memory, ...) UP rather than down.

I suppose I could also try and stress the system by overclocking it,
but somehow I'd feel more convinced if I could do some voltage margin
testing.

Any ideas or experiences that pertain to this matter?
 

Creating the voltage versus frequency curve is what
overclockers (or underclockers) do. For example, on
my latest purchase, I know that an extra 0.1V on Vcore
allows a 33% overclock. By proceeding in small steps of
frequency, and adjusting Vcore for the "same level of
stability" at each test point, you can produce your
own voltage versus frequency curve. On an older
processor (Northwood), I got to see the "brick wall"
pattern, where at a certain point, all the extra
(safe) voltage that could be applied didn't allow
any higher overclock.
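
As a rough illustration, here is a minimal Python sketch of the
bookkeeping behind such a curve. The numbers are invented, not
measurements from any real board:

    # Each entry: (CPU frequency in MHz, Vcore in volts, passed stress test?)
    # Invented data, for illustration only.
    test_points = [
        (2300, 1.250, True),
        (2400, 1.250, True),
        (2500, 1.250, False),
        (2500, 1.300, True),
        (2600, 1.300, False),
        (2600, 1.350, True),
        (2700, 1.350, False),
        (2700, 1.400, False),  # the "brick wall": more voltage no longer helps
    ]

    # For each frequency, keep the lowest Vcore that passed.
    curve = {}
    for freq, vcore, passed in test_points:
        if passed and (freq not in curve or vcore < curve[freq]):
            curve[freq] = vcore

    print("MHz    min stable Vcore")
    for freq in sorted(curve):
        print(f"{freq}   {curve[freq]:.3f} V")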

In terms of features, AMD and Intel have Cool'n'Quiet (CNQ)
and Enhanced Intel SpeedStep (EIST). Depending on OS loading,
if these features are enabled, the voltage and frequency are
changed dynamically, at up to 30 times per second. The
multiplier might vary between, say, 6X and 9X, with some
small difference in Vcore applied to those two conditions,
according to the manufacturer's declaration of what is
enough to make it work.

So if you are having stability issues, your first step is to
disable CNQ or EIST. The purpose of doing that is not to
blame those features for the stability issue (as they're not
likely to be the problem), but to make the test conditions
a stable, known quantity. You want just one frequency involved
when running a test case, since you're attempting to do
characterization.

On my processor, I believe the Vcore setting is policed by the
processor. My Core2 has VID bits to drive the Vcore regulator.
Using tools that can control the multiplier setting and
drive out new Vcore values while the system is running, I've found
the processor seems to have an upper limit on what bit
pattern it will allow to be passed on the VID bits. That
prevents any useful level of overvolting on my newest system.
Previous generations of systems used things like overclock
controller chips to allow "in-band" VID changes.

On some motherboards, you may notice the nomenclature "+0.1V"
for a Vcore setting, rather than a more direct "1.300V" setting
in the BIOS. I interpret this to mean the motherboard design has
a feature to bump Vcore independent of the VID bits; the
"+0.1V" is meant to imply an offset applied in the
Vcore regulator. I had to do something similar to my motherboard
with a soldering iron. I now have a socket where I can fit
a 1/4W resistor, and by varying the value, I get a voltage boost.
My motherboard, unlike some other brands, does not offer
any out-of-band voltage boost feature, so I had to implement
my own, using instructions from other users who did the
analysis before me. You likely won't have to go through this.
I'm explaining it in case you cannot reconcile what is
happening while you're testing (the setting says one thing,
the measured value is something else). If the set value and
the measured value don't match, part of the difference is
due to "droop", and part can be due to a boost
applied independent of the VID bits.

As Kony says, a driver could be responsible for the problem.
The Mersenne Prime95 test is pretty good at finding bad
RAM, and since you've run that for a few hours, that
helps to eliminate bad memory. Note that Prime95 can only test
the memory not occupied by the OS, so it is possible there are
still some areas of the RAM that have not been tested as
thoroughly.

Other things that might freeze the system include a misadjusted
bus multiplier, like the one used for HyperTransport between
processor and Northbridge, or a SATA or IDE clock that
is too far from nominal. So clock signals to other
hardware parts in your system could give a freezing
symptom: data or some transaction to the processor
could be frozen, while the processor itself is still
running.

Another comment - I've noticed on my older overclocking
test projects that the processor would crash on an
error. My current Core2 system tends to freeze, rather than
giving an old-fashioned blue screen. So there can be
some differences from one generation to another, as
to what part of the processor is failing, and whether
the system runs long enough to splatter something
across the screen.

Paul
 

Just a quick follow-up to some of the questions and comments.

--it is a Linux system

--when locked, the machine as a whole is locked, not just the window
system. For example, the machine does not respond to a ping.

--no log records the error, AFAICT

--the UBCD test programs also run from a CD that boots a custom Linux
kernel. Not sure whether the dynamic frequency scaling module
(cpufreq_ondemand) is enabled or not, whether all cores get exercised
simultaneously, etc. This needs to be investigated (see the sketch
after this list).

--To be certain, I'll disable Cool'n'Quiet next time I boot.

--Overclocking versus voltage margining: Doing both (two dimensions)
generates what is called a "shmoo plot" in chip parlance. I was
hoping to do voltage margining, but it looks like instead I may have to
do frequency margining. I also have only +0.1V steps available, which
is rather coarse.

--As a rule, if a chip is spec'ed at X volts, there is generally a +/-Y%
margin also specified, because no system can guarantee an exact
voltage. Chips quite often are spec'ed at +/-5% or +/-10%, although
processors may have tighter specs; I do not know.
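
A minimal sketch (assuming the standard Linux cpufreq sysfs layout) of
how to check whether frequency scaling is active, and what frequency
each core is currently running at:

    # Check the cpufreq governor and current frequency per core, via sysfs.
    # Paths follow the standard Linux cpufreq layout; adjust if your
    # kernel differs.
    import glob

    for gov_path in sorted(glob.glob(
            "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor")):
        cpu = gov_path.split("/")[5]            # e.g. "cpu0"
        with open(gov_path) as f:
            governor = f.read().strip()         # e.g. "ondemand"
        freq_path = gov_path.replace("scaling_governor", "scaling_cur_freq")
        with open(freq_path) as f:
            cur_khz = int(f.read().strip())
        print(f"{cpu}: governor={governor}, current={cur_khz / 1000:.0f} MHz")

For characterization you would want the "performance" governor, which
pins each core at one fixed frequency.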

I will follow some of the suggestions and see what I can find. My main
problem is that it can take days to provoke the failure, hence my
desire for additional and fine-grained stress.
 

Okay, I did some frequency margin tests by stepping up the FSB
frequency from 200 MHz in 5 MHz increments. The CPU frequency was 11.5x
the FSB and the memory frequency was 3+1/3 x the FSB. All voltages were
at nominal levels.

POST passed up to 245 MHz
BOOT (Linux) passed up to 240 MHz
Memtest is currently running at 245 MHz with no errors in 27 minutes

CNQ is *off*.

It may be the graphics driver, then, as Kony was saying. At least I
feel more comfortable about the hardware at this point, having seen a
20% frequency margin before any large-scale failures (at nominal
voltage settings).
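
For reference, a small Python sketch of the derived clocks for this
sweep, using the multipliers stated above (CPU = 11.5x FSB, memory =
3+1/3 x FSB):

    NOMINAL_FSB = 200  # MHz

    for fsb in range(200, 250, 5):
        cpu_mhz = 11.5 * fsb          # CPU multiplier from the post
        mem_mhz = (10 / 3) * fsb      # memory runs at 3+1/3 x FSB
        margin = (fsb - NOMINAL_FSB) / NOMINAL_FSB * 100
        print(f"FSB {fsb} MHz: CPU {cpu_mhz:.0f} MHz, "
              f"memory {mem_mhz:.1f} MHz, margin {margin:+.1f}%")

At 240 MHz the margin is +20%; the 245 MHz point still under test would
be +22.5%.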
 
I've seen many memory modules pass MemTest86+ but fail MemTest86.
Similarly, many modules passed GoldMemory ver. 6.92 but failed ver.
5.07. OTOH every module I've tested that failed GM ver. 5.07
eventually failed MT86 ver. 3.xx, and vice-versa.
 

That is interesting information. I have a somewhat superficial
knowledge of memory testing, pattern sensitivities, and such. I wonder
what the difference between the programs is, especially considering
that the NEWER versions appear to be less stressful than the older
versions in some cases.

On a related note, memory errors are sometimes (perhaps often?)
transient. Do any of the programs keep and save a bad-address list so
that one can go back and retest the specific addresses (or regions)
where the failure occurred? At least some of the programs run very
much standalone, with little OS support...
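
The bad-address-list idea itself is simple; here is a toy Python sketch
(illustration only, not how memtest86 works -- a userspace test sees
only virtual addresses and cannot target specific physical RAM the way
a standalone tester can):

    SIZE = 4 * 1024 * 1024               # small, so the Python loop stays quick
    PATTERNS = [0x00, 0xFF, 0xAA, 0x55]  # classic alternating-bit patterns

    def scan(buf, offsets=None):
        """Write each pattern, read it back, return the failing offsets.
        If offsets is given, retest only those; otherwise test everything."""
        targets = offsets if offsets is not None else range(len(buf))
        failures = set()
        for pat in PATTERNS:
            for off in targets:
                buf[off] = pat
            failures.update(off for off in targets if buf[off] != pat)
        return sorted(failures)

    buf = bytearray(SIZE)
    bad = scan(buf)                      # full pass; save this list
    print(f"{len(bad)} failing offsets")
    if bad:
        retest = scan(buf, offsets=bad)  # later: retest just the suspects
        print(f"{len(retest)} still failing on retest")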
 

A wise mentor once told me the secret to debugging. I'll pass it on
to you if you promise to keep it a secret.

Every time you type a question mark, stop and ponder the question.
EXACTLY what are you going to do with the answer?
Plot yourself a decision tree, if only in your head.
If the answer is yes, I'm gonna do this.
If it's no, I'll do that.
If it's > 3.4, I'm gonna do the other thing.

After you've done this for a while, it will become obvious
that most questions (tests done) don't need to be asked.
If you're gonna do the same thing no matter what the answer,
skip it and move on.

Another thing that happens is that most of the branches lead
nowhere. If you can't hypothesize a set of results leading to
something you can actually fix, you need a new plan.
If a set of answers leads nowhere, you don't need any of the
intermediate results.

Pondering the range of possible answers to your question
leads you to much more efficient debugging.
This is a process that will give you better than average
debugging results...but you won't find the cure for cancer with this
strategy.

So, back to your question...
You turn down the volts and it fails.
Now what?
Can you be sure it's the same failure?
How much lower is enough lower?
And what are you going to do to fix it?

Some questions can't be answered with technology you can afford.
Even if it is a voltage problem, you won't be able to
measure it with a voltmeter. You'll need a VERY fast digital
storage scope and a set of probes you'll have to mortgage your
house to buy. It'll be a voltage droop during a DMA
transfer while the disk is seeking and the video memory
crosses a certain memory address and all the address lines
change at once...on Tuesday when the moon is full.

You do this kind of debugging on prototypes and subsystems. For failed
customer units, you throw them away.
 

Point taken; in my case I was trying to determine whether it really
was bad hardware or (as some suggested) a software bug. Right now I am
leaning toward the latter, as the system has not frozen since I
upgraded the video driver (knock on wood).
 

Well, the knock on wood was not enough. It froze twice, Friday night
and later overnight, and has been running OK since then.
 