opinions: A8V-E vs. A8N-SLI


H.W. Stockman

I would like a 64FX board to build an XP-compatible system. The computer
will be primarily for memory-intensive calculations. While it will be nice
to use the system for general word processing, Excel, etc., it will mainly
be a compilation and calculation platform on a network with at least 5 other
systems. I won't play any demanding games, etc. I would like at least 2GB of
fairly fast memory, but don't require a super-duper graphics card.

I clearly don't need all the extras on these boards... but ASUS doesn't seem
to offer too many choices right now in dual-channel, fast-memory systems.

Which of these boards would offer the greatest stream-type memory
performance? (The Sandra memory benchmark is an OK substitute for stream.)

Thanks.
 

To whet your appetite, take a look at this:
http://www.anandtech.com/mb/showdoc.aspx?i=2337&p=4

"SiSoft Sandra 2004 standard memory bandwidth was 8,300 MB/s.
The Sandra unbuffered memory bandwidth was at 4000 MB/s."

DFI LANParty nF4 SLI-DR 318x9 (2862MHz) (Auto HT, 2.5-4-3-7, 2.9V)
(1:1 Memory, 1T, 2 DIMMs in DC mode) (using Athlon64 4000+
and OCZ PC3200 EL Platinum Rev. 2 )
http://www.anandtech.com/mb/showdoc.aspx?i=2358&p=7

That is running the memory at DDR636, when the memory is rated
at DDR400.
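
The numbers in that DFI result follow from simple clock arithmetic. Here is a minimal sketch of it; the function names are mine, for illustration only. The 318MHz reference clock and 9x multiplier come from the review line above, and the factor of two reflects DDR transferring data on both clock edges.

```python
def core_clock_mhz(ref_clock_mhz: int, multiplier: int) -> int:
    """CPU core clock = reference ("FSB") clock x CPU multiplier."""
    return ref_clock_mhz * multiplier


def ddr_rating(mem_clock_mhz: int) -> int:
    """DDR moves data on both clock edges, so the rating is twice the clock."""
    return 2 * mem_clock_mhz


# DFI result: 318 x 9, memory running 1:1 with the reference clock
print(core_clock_mhz(318, 9))  # 2862
print(ddr_rating(318))         # 636
# Rated speed of PC3200 modules (200MHz clock), for comparison
print(ddr_rating(200))         # 400
```

The same arithmetic reproduces the other review settings further down: 255x11 gives 2805 and 289x9 gives 2601.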

There is a difference between running two and four sticks of
memory on a dual channel board. Due to signal integrity issues,
four sticks cannot be run nearly as fast. Setting the command
rate option to 2T, makes it possible to run four sticks, but
the memory bandwidth drops by 20%.

It is possible to find 1GB modules, and I posted just yesterday,
a couple of products that I hadn't seen before. They might
allow you to do reasonably well in your quest for 2GB.

If you plan on going further than 2GB, something to watch for
is the amount of main memory that is lost when the BIOS doles
out address space for the video card, PCI bus and so on. If
you put 4GB in the A8N-SLI, you get to use 2.75GB when using
one video card, and 2.25GB if using two video cards. Based on
that, if I were you, I might just select an AGP-based system,
as at least with those you can dial down the aperture and
free up a little more address space. You should be able to
get to the 3+ GB level with an AGP board (like an A8V rev 2).
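
To put the A8N-SLI figures another way, here is a small sketch of the address-space arithmetic. It assumes a 4GB address map with no remapping of memory above 4GB, and the reserved amounts are simply inferred from the usable figures quoted above, not measured. The function names are hypothetical.

```python
ADDRESS_SPACE_GB = 4.0  # assumption: 32-bit address map, no remapping above 4GB


def reserved_gb(usable: float, installed: float = 4.0) -> float:
    """Address space claimed by video, PCI and so on, inferred from usable RAM."""
    return min(installed, ADDRESS_SPACE_GB) - usable


def usable_gb(installed: float, reserved: float) -> float:
    """RAM still visible once the reserved window is carved out of the map."""
    return min(installed, ADDRESS_SPACE_GB - reserved)


one_card = reserved_gb(2.75)   # 1.25 GB reserved with one video card
two_cards = reserved_gb(2.25)  # 1.75 GB reserved with two video cards
print(one_card, two_cards)

# With only 2GB installed, the reserved window does not cost any RAM
print(usable_gb(2.0, one_card), usable_gb(2.0, two_cards))
```

Under those assumptions, the loss only starts to bite once the installed memory plus the reserved window exceeds 4GB.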

Here are some other sample benchmarks.

A8V review : 289x9 (2601) at 1:1 Memory (using FX53 and
2 x 512MB Mushkin PC3500 Level II OR
2 x 512MB OCZ PC3500 Platinum Ltd )
http://www.anandtech.com/mb/showdoc.aspx?i=2128&p=5

A8N-SLI review 255x11 (2805MHz) (4X HT, 2.5-3-3-7, 2.7V)
(1:1 Memory, 1T, 2 DIMMs in DC mode) (using Athlon64 4000+
and OCZ PC3200 EL Platinum Rev. 2 )
http://www.anandtech.com/mb/showdoc.aspx?i=2358&p=5

Since the A8N-SLI and the DFI board used the same processor
and memory, you can see how much of a difference there is
between the boards. It is possible the FX53 is helping
the A8V a bit.

The A8V-E is very new, and I haven't seen too much feedback
about it yet. Being new, give it three months until the BIOS
bugs are fixed.

http://www.xbitlabs.com/articles/mainboards/display/asus-a8ve-deluxe.html

"The thing is that this mainboard always sets 1T/2T Memory
Timings to 2T, which reduces the mainboard performance a way
below what it should be."

"when we installed Corsair TWINX1024-3200XLPRO or Corsair
TWINX1024-3200XL memory modules based on popular Samsung TCCD
chips into our ASUS A8V-E Deluxe based platform, it lost its
stability"

"In other words, the overclocking potential of ASUS A8V-E
Deluxe mainboard is lower than that of most NVIDIA nForce4
Ultra/SLI based solutions."

So, the trick will be finding an Ultra-D for sale:
http://www.abxzone.com/forums/showthread.php?t=88699

Put together a rig like the one at the bottom of this page :-)
FX-55 overclocked to 3.9GHz.
http://www.abxzone.com/forums/showthread.php?t=88699&page=76&pp=15

Now, you need some memory. To get 2GB, the best config is
2x1GB sticks, as you can run command rate 1T. If you use
4x512MB, you have to run command rate 2T. Too bad, though,
as there are some really nice 512MB sticks out there. Like
this 512MB stick, which does CAS2 up to DDR500:

http://www.ocztechnology.com/products/memory/ocz_el_ddr_pc_4000_dual_channel_gold_vx

Here is PC3200 2-3-2-5 2x1GB kit $488
http://www.ocztechnology.com/products/memory/ocz_el_ddr_pc_3200_dual_channel_platinum
With 1GB modules, you don't know how they'll do when overclocked.
People take TCCD and get massive overclocks with that stuff, but
the chips on a 1GB module are completely different animals.

PC4000 3-4-4-8 $280x2 = $560
http://www.crucial.com/ballistix/store/PartSpecs.asp?imodule=BL12864Z503&cat=
The CAS is not too tight on these, but at least they go to DDR500.

PC3700 3-4-4-8 1GB stick KHX3700/1G 2x$350=$700 (fastest 1GB)
http://kingston.com/hyperx/thelines/default.asp?type=khxu

XMS3200 2.5-3-3-6 2x1GB TWINX2048-3200C2PT
http://corsairmicro.com/corsair/xms.html
Corsair has some catching up to do.

I haven't seen any comments on the PC4000 Ballistix, so cannot
say whether that is the best stuff. How far the PC3200 1GB
sticks can be pushed, is anyone's guess. But that is all part
of the fun :-)

I hope the flavor of the information coming across here
is that great things are possible with S939 processors
and boards, but at great personal sacrifice of time and money.
And 2x1GB is much less popular than 2x512MB. It is too bad
there is such a hit when running 4x512MB. If you do decide
to use 2x1GB, post back with your test results.

Paul
 
Thanks very much for all your comments; it will take some time to digest all
you've said. I'd prefer to get good memory performance without extreme
overclocking.
 
In which case, any S939 board, with 2x1GB modules having low CAS,
will do the trick. That means the A8N-SLI (PCI Express video)
or A8V Deluxe rev2 (AGP video) and some PC3200 CAS2 memory, would
be good for nominal settings of all operating frequencies. The
2x1GB modules will preserve the possibility of selecting Command
Rate 1T memory setting.

(Command Rate is not a DIMM parameter but is a memory controller
option - it sets how many clock cycles the address sits stable
on the memory bus, before the memory uses the info. The 2T
setting gives more setup time for the memory, but eats up 20%
memory bus performance. If you leave the motherboard at auto
settings, the BIOS might use DDR333 and Command Rate 2T, which
will reduce your memory performance. At the very least, you
should intervene, enter the BIOS, set the memory to at least
its rated DDR400, set Command Rate 1T, and leave the other
parameters at their auto settings. Nothing is being overclocked
with those choices, settings are merely optimized.)

I would not add more memory to your system, like 2x1GB
and 2x512MB, because you will have to use Command Rate 2T
to run with that much memory. So, consider your machine
to be 2GB max, if you wish to preserve good memory performance.

Given you never plan on overclocking the memory, these two
products give reasonable low CAS. (Lower CAS is better.)

PC3200 2-3-2-5 2x1GB kit $488
http://www.ocztechnology.com/products/memory/ocz_el_ddr_pc_3200_dual_channel_platinum

XMS3200 2.5-3-3-6 2x1GB TWINX2048-3200C2PT $387-$500-$550 (variable)
http://corsairmicro.com/corsair/xms.html

One of the oddities about memory, is you will find some product,
where the product is rated CAS2 with an Intel board, and CAS2.5
with an AMD board. The CAS2.5 comment may apply to the AthlonXP,
but I haven't seen any comments whether that derating still
applies to the Athlon64/Opteron family or not. Since the memory
controller is now in the processor, it would be a different
set of interface conditions. This peculiar detail with
CAS, seems to apply to Hynix D43 memory chips, and so
that CAS rating can be used to infer Hynix D43 is being
used. D43 is available as ordinary PC3200 chips, and is
also speed binned for faster memory products. (The Hynix
web site even advertises use of the chips at faster than
PC3200, while most conservative memory manufacturers never
admit to selling anything at faster than JEDEC standards
committee approved speeds.)

One of my purposes in quoting the overclock info, was to
demonstrate that raising the FSB and dropping the multiplier
(thus keeping a constant core speed), using a 1:1 relationship
between FSB and memory bus, allows the memory to run at maximum
speed. Thus, under that set of conditions, it is possible to
run the processor core at stock speed, and overclock just
the memory bus.
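
A rough sketch of that trade-off, for illustration only. It assumes whole-number multipliers (half steps are ignored), a 1:1 memory ratio, and uses the Athlon64 4000+ stock setting of 200x12 as the example. The function name is made up.

```python
def max_multiplier(stock_core_mhz: int, ref_clock_mhz: int) -> int:
    """Largest whole multiplier that keeps the core at or below stock speed."""
    return stock_core_mhz // ref_clock_mhz


STOCK_CORE_MHZ = 200 * 12  # Athlon64 4000+ at stock: 2400 MHz

for ref_clock in (200, 240, 300):
    mult = max_multiplier(STOCK_CORE_MHZ, ref_clock)
    core = ref_clock * mult
    # 1:1 memory means the memory clock equals the reference clock
    print(f"{ref_clock} x {mult} = {core} MHz core, memory at DDR{2 * ref_clock}")
```

With these inputs the core stays at 2400MHz in all three cases while the memory goes from DDR400 to DDR480 to DDR600. Whether a given set of modules will actually run at those speeds is a separate question.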

If you wish to overclock the core as well, then raising the
multiplier allows you to see how far the processor
will go with only air cooling. Once higher than stock speed
(and with Cool'n'Quiet disabled, as overclocking and CnQ don't
work together), you might find a little more Vcore helps.

I understand you want computing results you can count on, and
you can verify the overclock by using Prime95 (mersenne.org).
That is a reasonable test for memory and processor stability.
Once a certain level of overclock has been demonstrated, you
could drop the processor multiplier by 1 and leave the Vcore,
and retest with Prime95. That should give you a little margin
against future changes. Once you pass Prime95, you can also
run your application and compare results computed on two
different computers.

As I understand it, the FX-55 multiplier can be raised or
lowered, while the Athlon64 (4000+) multiplier can only be
lowered. That is another reason people raise the FSB, so
they have a range of multipliers to play with when using
either of those processors.

Both increasing the core frequency and the memory frequency
will give better memory bandwidth. If you raise the FSB
and drop the multiplier, and run the core at its normal
frequency, then only the I/O pads are running faster than
nominal. I consider that to be a reasonable compromise
in terms of electromigration theory and processor life,
if that is a concern.

Another parameter to set is the HyperTransport bus
frequency. If the FSB is 300MHz, and the product supports
HT of 1000MHz, then an HT multiplier of 3X causes the
actual HT to be 900MHz. In other words, with a given
FSB choice, and a max HT of 800 or 1000MHz, you set
that multiplier so the result is less than the maximum.
That bus is the connection between the processor and
chipset, and in your case is not an issue, as that
bus is only used if you are doing screen updates or
other I/O. Since you are interested mainly in computing
prowess, that bus setting only needs to be selected
such that the bus is stable, and doesn't need to be pushed
past its limits. If you were playing video games, you
might choose another philosophy.
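
The multiplier choice described above can be sketched like this. I am reading the 800 and 1000 limits as HT clock figures in MHz, and the 5X ceiling on the multiplier is an assumption (it matches 200MHz x 5 = 1000MHz at stock). The function name is hypothetical.

```python
def ht_multiplier(ref_clock_mhz: int, ht_limit_mhz: int, max_mult: int = 5) -> int:
    """Largest HT multiplier whose resulting HT clock stays within the limit."""
    for mult in range(max_mult, 0, -1):
        if ref_clock_mhz * mult <= ht_limit_mhz:
            return mult
    raise ValueError("reference clock alone exceeds the HT limit")


print(ht_multiplier(300, 1000))  # 3, giving 900MHz
print(ht_multiplier(200, 1000))  # 5, giving 1000MHz (stock)
print(ht_multiplier(300, 800))   # 2, giving 600MHz
```

Since this bus is not the bottleneck for a compute-bound job, picking the conservative multiplier costs essentially nothing.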

HTH,
Paul
 
Paul said:
There is a difference between running two and four sticks of
memory on a dual channel board. Due to signal integrity issues,
four sticks cannot be run nearly as fast. Setting the command
rate option to 2T, makes it possible to run four sticks, but
the memory bandwidth drops by 20%.

Is this a general issue for all extant dual-channel architectures? I have a
1.4 GHz P4, in which I originally had 2 x 256 MB RAMBUS modules, then went
to 4 x 256MB modules. I don't recall any real change in memory performance.
I realize RAMBUS is very different from DDR; but I was wondering if a modern
P4 with dual channel DDR would suffer the exact same problems as the AMD64
system.
 
i865/i875 were not affected by this, in fact they were (slightly) faster
with 4 dimms (due to possibility to have more open pages at a time
IIRC). I have not seen extensive reviews of ram performance with i915
based systems, depending on how many dimms etc., but it's probably the
same (for both ddr1 and ddr2 based systems).

And while true that memory bandwidth drops quite a bit when increasing
command rate from 1T to 2T on the a64 socket 939 systems, in real-world
benchmarks that only amounts to 1-3% maximum hit. There is some hope (or
call it rumour...) increasing command rate with 4 dimms may no longer be
necessary on the new a64 cores (E3/E4 step, San Diego/Venice).

Roland
 
H.W. Stockman said:
Is this a general issue for all extant dual-channel architectures? I have a
1.4 GHz P4, in which I originally had 2 x 256 MB RAMBUS modules, then went
to 4 x 256MB modules. I don't recall any real change in memory performance.
I realize RAMBUS is very different from DDR; but I was wondering if a modern
P4 with dual channel DDR would suffer the exact same problems as the AMD64
system.

Here is a Sandra result for DDR memory. This would be similar
to my test, only using Sandra as the test tool. The loss is
not as great here, with four sticks.

http://www.abxzone.com/forums/showthread.php?t=90280&highlight=four+sticks
"Sandra and CPUZ show PAT enabled with 4x512 on the
P4C800-E Deluxe.

Sandra 2x512 4861/4876
Sandra 4x512 4798/4787"

Here are some Sandra results for (Intel) DDR2 memory:

Two sticks = ~4900 at 533 (can use DDR2 533 or DDR2 667 memory)
a.k.a (PC2 4300 or PC2 5400)
http://www.anandtech.com/memory/showdoc.aspx?i=2112&p=16

http://www.xbitlabs.com/articles/memory/display/ddr2-ddr_10.html
Sandra 4964/4967 Corsair CM2X512-5300C4PRO

In this article, four DDR2 sticks required increasing CAS by
one (from CAS3 to CAS4). So, there should still be an effect
caused by using more memory. DDR2 uses a different termination
scheme, and has some kind of calibration procedure, but I
don't think that affects how the address bus works.

http://www.anandtech.com/mb/showdoc.aspx?i=2293&p=10
http://www.anandtech.com/mb/showdoc.aspx?i=2288&p=3

The Intel DDR2 only really surpasses DDR, at above 533.
DDR2 400 is slower than DDR400.

Here are some results for two sticks in dual channel on
Athlon64. I cannot find Sandra results with four sticks.

http://www.amdzone.com/modules.php?...ns&file=index&req=viewarticle&artid=52&page=7
Sandra 6203/6148 DDR400 but no precise conditions stated. Going
by the picture of the BIOS screen, Command Rate 1T, CAS=2.
So, some good memory is required to replicate this.
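
For context on how good that result is, here is a quick comparison against the theoretical peak. The peak formula (transfer rate x 8 bytes per 64-bit channel x number of channels) is standard for DDR; the function name is mine, and the measured values are the two Sandra figures quoted above.

```python
def peak_bandwidth_mb_s(ddr_rating: int, channels: int = 2, bus_bytes: int = 8) -> int:
    """Theoretical peak: transfers per second x bytes per transfer x channels."""
    return ddr_rating * bus_bytes * channels


peak = peak_bandwidth_mb_s(400)  # dual channel DDR400
for measured in (6203, 6148):
    print(f"{measured} MB/s is {measured / peak:.1%} of the {peak} MB/s peak")
```

So the 6203/6148 result is within a few percent of the 6400MB/sec ceiling for dual channel DDR400, which is why anything much higher has to come from running the memory faster than DDR400.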

A64 FX-53 but settings not stated, Sandra 7387/7293.
Likely the memory is overclocked. See Post #18.
http://abxzone.com/forums/attachment.php?attachmentid=17120
http://abxzone.com/forums/showthread.php?t=86956&page=2&pp=15&highlight=sandra

I think Athlon64/FX dual channel offers the best platform
to work on memory bandwidth. I still think using a high
performance memory, at higher than DDR400, will give the
best results, and you don't have to run the core of the
processor out of spec. All that is required, is a lowering
of the multiplier. Vcore can stay the same.

I reran your 3D0 benchmark, just to see how memory bound it
is.

P4C800-E Deluxe, 2.8C P4, FSB800, DDR400, 2-2-2-6 dual channel
2x512MB ==> 12.45 MUPS (memtest86+ 1.4 bandwidth ==> 2955MB/sec)

P4C800-E Deluxe, 2.8C P4, FSB800, DDR400, 3-3-3-8 dual channel
2x512MB ==> 10.59 MUPS (memtest86+ 1.4 bandwidth ==> 2549MB/sec)

A7N8X-E Deluxe, 3200+, 200x11, DDR400, 2-2-2-6 dual channel
2x512MB ==> 7.67 MUPS (memtest86+ 1.4 bandwidth ==> 1485MB/sec)

No question, you need some good dual channel performance, as
your app scales pretty well with memory bandwidth. Even though
AMD thinks the XP 3200+ is the equal of a P4 3.2GHz processor,
it isn't true for your application.
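
To make the scaling concrete, this sketch takes the three runs above and expresses each as a ratio against the 3-3-3-8 P4 run. The data are exactly the figures listed; the choice of baseline and the names in the code are mine.

```python
# (MUPS, memtest86+ bandwidth in MB/sec) for the three runs listed above
runs = {
    "P4 2.8C, 2-2-2-6": (12.45, 2955),
    "P4 2.8C, 3-3-3-8": (10.59, 2549),
    "XP 3200+, 2-2-2-6": (7.67, 1485),
}


def ratio_to_baseline(name: str, baseline: str = "P4 2.8C, 3-3-3-8"):
    """Return (MUPS ratio, bandwidth ratio) of a run relative to the baseline."""
    mups, bandwidth = runs[name]
    base_mups, base_bandwidth = runs[baseline]
    return mups / base_mups, bandwidth / base_bandwidth


for name in runs:
    speed, bandwidth = ratio_to_baseline(name)
    print(f"{name}: {speed:.2f}x MUPS for {bandwidth:.2f}x bandwidth")
```

On the same P4, about 16% more bandwidth bought about 18% more MUPS. The AthlonXP comparison mixes a different processor into the picture, so it is only a loose indication.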

Paul
 
Thanks very much -- I just bought an A8V system last night, if I'd seen your
post, I would have waited to get a new P4! Oh well, I guess I was finally
worn down by all the AMD enthusiasts.
 

I benched the AthlonXP, the 32 bit processor. You have the 64 bit
processor, and more important to you, dual channel DDR. I think
the A8V will make a fine choice, just crank up that memory.
A Sandra memory bench of 6000 should serve you well, certainly
much better than the three benches quoted above.

I hope you have purchased the Revision 2 board. The revision number
is printed on the silk screen and should be near the "A8V"
model number. The difference between Revision 1 and Revision 2,
is Revision 2 has a working AGP/PCI lock. The lock maintains
66 and 33MHz clocks for the AGP and PCI bus respectively,
which is handy when adjusting other clocks in the system.
The Revision 2 doesn't have wireless, while Revision 1
had wireless. And to further confuse matters, one advert I
looked at, bundled a Rev.2 with a separate wireless card :-(

And, to give you something to aim for, here is someone getting
9000MB/sec from some Samsung TCCD memory at DDR668, using an
Athlon64 3000+ 90nm Winchester:
http://www.xtremesystems.org/forums/showthread.php?t=53051

Paul
 
Paul said:
I benched the AthlonXP, the 32 bit processor. You have the 64 bit
processor, and more important to you, dual channel DDR. I think
the A8V will make a fine choice, just crank up that memory.
A Sandra memory bench of 6000 should serve you well, certainly
much better than the three benches quoted above.

That's what I was hoping you'd say. ;^) I've seen the ~6000 Sandra
benches, though the reporting has been a little vague (unbuffered vs.
buffered, etc.).

I was originally going to get 4GB, but readjusted my wants. Parts of the
real (full) code are now multiplication-intensive, so we'll see if I can win
something there with the Athlon.

Paul said:
I hope you have purchased the Revision 2 board. The revision number
is printed on the silk screen and should be near the "A8V"
model number. The difference between Revision 1 and Revision 2,
is Revision 2 has a working AGP/PCI lock. The lock maintains
66 and 33MHz clocks for the AGP and PCI bus respectively,
which is handy when adjusting other clocks in the system.
The Revision 2 doesn't have wireless, while Revision 1
had wireless. And to further confuse matters, one advert I
looked at, bundled a Rev.2 with a separate wireless card :-(

I'm hoping not to overclock at all, but time will tell.
 
Paul said:
I benched the AthlonXP, the 32 bit processor. You have the 64 bit
processor, and more important to you, dual channel DDR. I think
the A8V will make a fine choice, just crank up that memory.
A Sandra memory bench of 6000 should serve you well, certainly
much better than the three benches quoted above.

So far, the new system, with 2 GB as 2x1GB XMS 3200, "3800+" 64-bit, is
benching about 1.87x as fast as the P4 1.4 GHz, PC800 system I bought over 4
years ago -- for less money. I guess I'll learn to be happy with that, but
I was frankly expecting more improvement. I should have gone for a P4
system.
 
The zip file below has instructions for one more bench -- if you could run it on
your fast P4, I'd appreciate that effort.
http://hwstock.org/ppn/ppn.zip

The readme.txt in the zip file tells how to run the test. This is a console
version, certainly not meant to be a general benchmark. This is the real
program, much more complex than 3D0; the calculation is doing precipitation and
dissolution under flow in a complex geometry. There is no MUPS printout, but all
the info I need is in the lb_data.txt file produced at the end of the run.

I was hoping the math-intensive chemistry routines would allow the AMD64 to
improve more on the 1.4 GHz P4, but I see a factor 1.88 improvement only, almost
the same as in the most memory-intensive programs.

If other folks want to run this benchmark, great; but it isn't meant to be
something for bragging rights at OCworkbench or anandtech.

I might be able to improve the memory timings a bit for the A8V system. There
are 2 x 1024MB DDR Corsair PC3200 XMS modules, and I'm currently at the
recommended settings.
 

I've noticed tonight that the BIOS has a mind of its own.
Doesn't matter, as long as you refer back to my memory
bandwidth measurements, as results should be proportional
to measured bandwidth, no matter what the BIOS is doing.
(I just don't like it, when the BIOS timing settings result
in something different as seen by a Windows util.)

These two runs are using the same two settings used for the
other benchmarks with 3D0. First run is with best bandwidth,
second run intended to reflect the purchase of "commodity"
memory. Everything stock, no overclocking. First run 48
seconds, second run 53 seconds.

P4 2.8GHz, 2x512MB memory, FSB800, DDR400, PAT enabled.

BIOS Memory timings 2-2-2-6 (actual 2-2-2-5 as measured by CPUZ)

********************** 2-2-2-5 ****************************
START::::::: dp 03/25/2005 01:11:06 :::::::START
Current working directory is:
Version: 1.03m Date of version: 03-24-05
Computer: User:
Geometry file is: gppn.txt
.............
solids and open space on U control planes:
xSolid1=838, xOpen1=762, ySolid1=1024, yOpen1=576
..............
maxstep=2048, clampstep=256, radius=0, solids=89001
rho0effective1=1.06889
------------
STEP Ux txBody txBody_corr
288 -1.938629e-006 0.000000e+000 0.000000e+000
320 -1.558804e-006 0.000000e+000 0.000000e+000
352 -1.341729e-006 0.000000e+000 0.000000e+000
384 -1.125751e-006 0.000000e+000 0.000000e+000
416 -9.443948e-007 0.000000e+000 0.000000e+000
448 -7.937841e-007 0.000000e+000 0.000000e+000
480 -6.688104e-007 0.000000e+000 0.000000e+000
512 -5.654746e-007 0.000000e+000 0.000000e+000
544 -4.819961e-007 0.000000e+000 0.000000e+000
576 -4.131421e-007 0.000000e+000 0.000000e+000
608 -3.566196e-007 0.000000e+000 0.000000e+000
640 -3.096196e-007 0.000000e+000 0.000000e+000
672 -2.705808e-007 0.000000e+000 0.000000e+000
704 -2.396400e-007 0.000000e+000 0.000000e+000
736 -2.144511e-007 0.000000e+000 0.000000e+000
768 -1.928805e-007 0.000000e+000 0.000000e+000
800 -1.758781e-007 0.000000e+000 0.000000e+000
832 -1.592800e-007 0.000000e+000 0.000000e+000
864 -1.456146e-007 0.000000e+000 0.000000e+000
896 -1.389413e-007 0.000000e+000 0.000000e+000
928 -1.323898e-007 0.000000e+000 0.000000e+000
960 -1.266565e-007 0.000000e+000 0.000000e+000
992 -1.222470e-007 0.000000e+000 0.000000e+000
1024 -1.176777e-007 0.000000e+000 0.000000e+000
1056 -1.139793e-007 0.000000e+000 0.000000e+000
1088 -1.097629e-007 0.000000e+000 0.000000e+000
1120 -1.076828e-007 0.000000e+000 0.000000e+000
1152 -1.068908e-007 0.000000e+000 0.000000e+000
1184 -1.033071e-007 0.000000e+000 0.000000e+000
1216 -1.022397e-007 0.000000e+000 0.000000e+000
1248 -1.009435e-007 0.000000e+000 0.000000e+000
1280 -1.010386e-007 0.000000e+000 0.000000e+000
1312 -9.977029e-008 0.000000e+000 0.000000e+000
1344 -9.922373e-008 0.000000e+000 0.000000e+000
1376 -9.888766e-008 0.000000e+000 0.000000e+000
1408 -9.782917e-008 0.000000e+000 0.000000e+000
1440 -9.774550e-008 0.000000e+000 0.000000e+000
1472 -9.776144e-008 0.000000e+000 0.000000e+000
1504 -9.687410e-008 0.000000e+000 0.000000e+000
1536 -9.795440e-008 0.000000e+000 0.000000e+000
1568 -9.582627e-008 0.000000e+000 0.000000e+000
1600 -9.660255e-008 0.000000e+000 0.000000e+000
1632 -9.705410e-008 0.000000e+000 0.000000e+000
1664 -9.715911e-008 0.000000e+000 0.000000e+000
1696 -9.690212e-008 0.000000e+000 0.000000e+000
1728 -9.569668e-008 0.000000e+000 0.000000e+000
1760 -9.673209e-008 0.000000e+000 0.000000e+000
1792 -9.612396e-008 0.000000e+000 0.000000e+000
1824 -9.685042e-008 0.000000e+000 0.000000e+000
1856 -9.713350e-008 0.000000e+000 0.000000e+000
1888 -9.562302e-008 0.000000e+000 0.000000e+000
1920 -9.546941e-008 0.000000e+000 0.000000e+000
1952 -9.661963e-008 0.000000e+000 0.000000e+000
1984 -9.638408e-008 0.000000e+000 0.000000e+000
2016 -9.677484e-008 0.000000e+000 0.000000e+000
STEP=2048, NCOL=100, NROW=100, NLAY=16, xBody=0.000000, yBody=0.000500
Tau[0]=0.950000 Tau[1]=0.500300
Clamped xBody= 0.000000e+000, final ux/xBody=1.000000e+030
END=====dp 03/25/2005 01:11:54 ========

******************** end 2-2-2-5 **************************

BIOS Memory timings 3-3-3-8 (actual 2.5-3-3-6 as measured by CPUZ)

********************** 3-3-3-8 ****************************
START::::::: dp 03/25/2005 01:29:38 :::::::START
Current working directory is:
Version: 1.03m Date of version: 03-24-05
Computer: User:
Geometry file is: gppn.txt
.............
solids and open space on U control planes:
xSolid1=838, xOpen1=762, ySolid1=1024, yOpen1=576
..............
maxstep=2048, clampstep=256, radius=0, solids=89001
rho0effective1=1.06889
------------
STEP Ux txBody txBody_corr
288 -1.938629e-006 0.000000e+000 0.000000e+000
320 -1.558804e-006 0.000000e+000 0.000000e+000
352 -1.341729e-006 0.000000e+000 0.000000e+000
384 -1.125751e-006 0.000000e+000 0.000000e+000
416 -9.443948e-007 0.000000e+000 0.000000e+000
448 -7.937841e-007 0.000000e+000 0.000000e+000
480 -6.688104e-007 0.000000e+000 0.000000e+000
512 -5.654746e-007 0.000000e+000 0.000000e+000
544 -4.819961e-007 0.000000e+000 0.000000e+000
576 -4.131421e-007 0.000000e+000 0.000000e+000
608 -3.566196e-007 0.000000e+000 0.000000e+000
640 -3.096196e-007 0.000000e+000 0.000000e+000
672 -2.705808e-007 0.000000e+000 0.000000e+000
704 -2.396400e-007 0.000000e+000 0.000000e+000
736 -2.144511e-007 0.000000e+000 0.000000e+000
768 -1.928805e-007 0.000000e+000 0.000000e+000
800 -1.758781e-007 0.000000e+000 0.000000e+000
832 -1.592800e-007 0.000000e+000 0.000000e+000
864 -1.456146e-007 0.000000e+000 0.000000e+000
896 -1.389413e-007 0.000000e+000 0.000000e+000
928 -1.323898e-007 0.000000e+000 0.000000e+000
960 -1.266565e-007 0.000000e+000 0.000000e+000
992 -1.222470e-007 0.000000e+000 0.000000e+000
1024 -1.176777e-007 0.000000e+000 0.000000e+000
1056 -1.139793e-007 0.000000e+000 0.000000e+000
1088 -1.097629e-007 0.000000e+000 0.000000e+000
1120 -1.076828e-007 0.000000e+000 0.000000e+000
1152 -1.068908e-007 0.000000e+000 0.000000e+000
1184 -1.033071e-007 0.000000e+000 0.000000e+000
1216 -1.022397e-007 0.000000e+000 0.000000e+000
1248 -1.009435e-007 0.000000e+000 0.000000e+000
1280 -1.010386e-007 0.000000e+000 0.000000e+000
1312 -9.977029e-008 0.000000e+000 0.000000e+000
1344 -9.922373e-008 0.000000e+000 0.000000e+000
1376 -9.888766e-008 0.000000e+000 0.000000e+000
1408 -9.782917e-008 0.000000e+000 0.000000e+000
1440 -9.774550e-008 0.000000e+000 0.000000e+000
1472 -9.776144e-008 0.000000e+000 0.000000e+000
1504 -9.687410e-008 0.000000e+000 0.000000e+000
1536 -9.795440e-008 0.000000e+000 0.000000e+000
1568 -9.582627e-008 0.000000e+000 0.000000e+000
1600 -9.660255e-008 0.000000e+000 0.000000e+000
1632 -9.705410e-008 0.000000e+000 0.000000e+000
1664 -9.715911e-008 0.000000e+000 0.000000e+000
1696 -9.690212e-008 0.000000e+000 0.000000e+000
1728 -9.569668e-008 0.000000e+000 0.000000e+000
1760 -9.673209e-008 0.000000e+000 0.000000e+000
1792 -9.612396e-008 0.000000e+000 0.000000e+000
1824 -9.685042e-008 0.000000e+000 0.000000e+000
1856 -9.713350e-008 0.000000e+000 0.000000e+000
1888 -9.562302e-008 0.000000e+000 0.000000e+000
1920 -9.546941e-008 0.000000e+000 0.000000e+000
1952 -9.661963e-008 0.000000e+000 0.000000e+000
1984 -9.638408e-008 0.000000e+000 0.000000e+000
2016 -9.677484e-008 0.000000e+000 0.000000e+000
STEP=2048, NCOL=100, NROW=100, NLAY=16, xBody=0.000000, yBody=0.000500
Tau[0]=0.950000 Tau[1]=0.500300
Clamped xBody= 0.000000e+000, final ux/xBody=1.000000e+030
END=====dp 03/25/2005 01:30:31 ========

******************** end 3-3-3-8 **************************

I trust you have a way of subtracting the time
spent drawing character maps on the screen, from
the measured start and stop times in the files
above.

HTH,
Paul
 
Paul said:
"H.W. said:
The zip file below has instructions for one more bench --
if you could run it on
your fast P4, I'd appreciate that effort.
http://hwstock.org/ppn/ppn.zip
[...]
I trust you have a way of subtracting the time
spent drawing character maps on the screen, from
the measured start and stop times in the files
above.


Thanks!

The time taken in screen draws is actually quite small -- less than the
resolution of the timer (if I eliminate the screen redraws, there is no
difference in the time for 2048 steps, to one second). Your system is ~1.13
times as fast as mine for this benchmark. Que sera, sera (apologies to the
French and the "Barefoot Contessa").
 
"H.W. said:
So far, the new system, with 2 GB as 2x1GB XMS 3200, "3800+" 64-bit, is
benching about 1.87x as fast as the P4 1.4 GHz, PC800 system I bought over 4
years ago -- for less money. I guess I'll learn to be happy with that, but
I was frankly expecting more improvement. I should have gone for a P4
system.

Hmm. And the real clock rate of the 3800+ is 2.4GHz.

Have a look at the floating point benchmark results:

http://www.tomshardware.com/cpu/20041221/cpu_charts-23.html

I would have thought the Athlon64 would fare better there.
My 3200+ Barton does almost as well as several of the
Athlon64 processors in that list ?

http://groups.google.ca/[email protected]

"The Athlon cores have an IPC of around 1.5x that of the P4, on
average. With fully-parallel FP SIMD code it drops close to 1.0x,
while for non-SIMD FP or integer-heavy code it can rise to 2.0x."

Does that make any sense ? Perhaps a different optimization is needed ?
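
One crude way to read that quote is to treat throughput as IPC times clock. The sketch below does only that arithmetic. The 2.4 GHz figure is the 3800+ clock mentioned above and 2.8 GHz is the P4 2.8C in this thread; the function name is made up for illustration, and the model ignores memory bandwidth entirely, so it is at best a sanity check for compute-bound code.

```python
def relative_speed(ipc_ratio, athlon_ghz, p4_ghz):
    """Athlon throughput relative to the P4, assuming throughput = IPC * clock.

    ipc_ratio is the Athlon IPC divided by the P4 IPC (1.0, 1.5 or 2.0 in the quote).
    Memory stalls are ignored, which is a big assumption for memory-bound code.
    """
    return ipc_ratio * athlon_ghz / p4_ghz


# The three cases named in the quote: SIMD FP, average, non-SIMD/integer.
for ipc_ratio in (1.0, 1.5, 2.0):
    speed = relative_speed(ipc_ratio, athlon_ghz=2.4, p4_ghz=2.8)
    print(f"IPC ratio {ipc_ratio:.1f}x -> Athlon at {speed:.2f}x the P4")
```

With an IPC ratio near 1.0x (the SIMD case) the lower clock alone would put the Athlon somewhat behind the P4 under this model.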

Paul
 
Paul said:
Hmm. And the real clock rate of the 3800+ is 2.4GHz.

Have a look at the floating point benchmark results:

http://www.tomshardware.com/cpu/20041221/cpu_charts-23.html

I would have thought the Athlon64 would fare better there.
My 3200+ Barton does almost as well as several of the
Athlon64 processors in that list ?

I have used SIMD heavily in my code, via hand optimizations using intrinsics --
probably similar to the optimizations in multimedia benchmarks. In the past,
while those optimizations always helped both Athlon XP and P4, they helped the
latter more. But again, the best predictor I saw was the old (true) stream
benchmark, as I have pared away all excess math operations, and parallelized the
rest, to a point where the code is memory-bound. The memory set is so large
that a big cache is negligible help. I haven't been able to get a straight
answer on what the "unbuffered" vs. "buffered" Sandra stream benchmarks do, but
everyone seems to use the latter nowadays; until ~2001, there was only one
Sandra "stream", which correlated rather well with the true stream (except it
had a much larger random variation).

http://groups.google.ca/[email protected]

"The Athlon cores have an IPC of around 1.5x that of the P4, on
average. With fully-parallel FP SIMD code it drops close to 1.0x,
while for non-SIMD FP or integer-heavy code it can rise to 2.0x."

Does that make any sense ? Perhaps a different optimization is needed ?

Perhaps the only thing I could do right now would be to monkey with the hand
prefetch optimizations; but I suspect that might just buy a few percentage points
here and there.
 
"H.W. said:
Paul said:
"H.W. said:
The zip file below has instructions for one more bench --
if you could run it on
your fast P4, I'd appreciate that effort.
http://hwstock.org/ppn/ppn.zip
[...]
I trust you have a way of subtracting the time
spent drawing character maps on the screen, from
the measured start and stop times in the files
above.


Thanks!

The time taken in screen draws is actually quite small -- less than the
resolution of the timer (if I eliminate the screen redraws, there is no
difference in the time for 2048 steps, to one second). Your system is ~1.13
times as fast as mine for this benchmark. Que sera, sera (apologies to the
French and the "Barefoot Contessa").

Here is the A7N8X-E AthlonXP 3200+ 32bit processor
200x11=2200MHz core, DDR400 dual channel 2-2-2-6 memory.

************* A7N8X-E *******************************
START::::::: dp 03/25/2005 17:07:59 :::::::START
Current working directory is:
Version: 1.03m Date of version: 03-24-05
Computer: User:
Geometry file is: gppn.txt
.............
solids and open space on U control planes:
xSolid1=838, xOpen1=762, ySolid1=1024, yOpen1=576
..............
maxstep=2048, clampstep=256, radius=0, solids=89001
rho0effective1=1.06889
------------
STEP Ux txBody txBody_corr
288 -1.938629e-006 0.000000e+000 0.000000e+000
320 -1.558804e-006 0.000000e+000 0.000000e+000
352 -1.341729e-006 0.000000e+000 0.000000e+000
384 -1.125751e-006 0.000000e+000 0.000000e+000
416 -9.443948e-007 0.000000e+000 0.000000e+000
448 -7.937841e-007 0.000000e+000 0.000000e+000
480 -6.688104e-007 0.000000e+000 0.000000e+000
512 -5.654746e-007 0.000000e+000 0.000000e+000
544 -4.819961e-007 0.000000e+000 0.000000e+000
576 -4.131421e-007 0.000000e+000 0.000000e+000
608 -3.566196e-007 0.000000e+000 0.000000e+000
640 -3.096196e-007 0.000000e+000 0.000000e+000
672 -2.705808e-007 0.000000e+000 0.000000e+000
704 -2.396400e-007 0.000000e+000 0.000000e+000
736 -2.144511e-007 0.000000e+000 0.000000e+000
768 -1.928805e-007 0.000000e+000 0.000000e+000
800 -1.758781e-007 0.000000e+000 0.000000e+000
832 -1.592800e-007 0.000000e+000 0.000000e+000
864 -1.456146e-007 0.000000e+000 0.000000e+000
896 -1.389413e-007 0.000000e+000 0.000000e+000
928 -1.323898e-007 0.000000e+000 0.000000e+000
960 -1.266565e-007 0.000000e+000 0.000000e+000
992 -1.222470e-007 0.000000e+000 0.000000e+000
1024 -1.176777e-007 0.000000e+000 0.000000e+000
1056 -1.139793e-007 0.000000e+000 0.000000e+000
1088 -1.097629e-007 0.000000e+000 0.000000e+000
1120 -1.076828e-007 0.000000e+000 0.000000e+000
1152 -1.068908e-007 0.000000e+000 0.000000e+000
1184 -1.033071e-007 0.000000e+000 0.000000e+000
1216 -1.022397e-007 0.000000e+000 0.000000e+000
1248 -1.009435e-007 0.000000e+000 0.000000e+000
1280 -1.010386e-007 0.000000e+000 0.000000e+000
1312 -9.977029e-008 0.000000e+000 0.000000e+000
1344 -9.922373e-008 0.000000e+000 0.000000e+000
1376 -9.888766e-008 0.000000e+000 0.000000e+000
1408 -9.782917e-008 0.000000e+000 0.000000e+000
1440 -9.774550e-008 0.000000e+000 0.000000e+000
1472 -9.776144e-008 0.000000e+000 0.000000e+000
1504 -9.687410e-008 0.000000e+000 0.000000e+000
1536 -9.795440e-008 0.000000e+000 0.000000e+000
1568 -9.582627e-008 0.000000e+000 0.000000e+000
1600 -9.660255e-008 0.000000e+000 0.000000e+000
1632 -9.705410e-008 0.000000e+000 0.000000e+000
1664 -9.715911e-008 0.000000e+000 0.000000e+000
1696 -9.690212e-008 0.000000e+000 0.000000e+000
1728 -9.569668e-008 0.000000e+000 0.000000e+000
1760 -9.673209e-008 0.000000e+000 0.000000e+000
1792 -9.612396e-008 0.000000e+000 0.000000e+000
1824 -9.685042e-008 0.000000e+000 0.000000e+000
1856 -9.713350e-008 0.000000e+000 0.000000e+000
1888 -9.562302e-008 0.000000e+000 0.000000e+000
1920 -9.546941e-008 0.000000e+000 0.000000e+000
1952 -9.661963e-008 0.000000e+000 0.000000e+000
1984 -9.638408e-008 0.000000e+000 0.000000e+000
2016 -9.677484e-008 0.000000e+000 0.000000e+000
STEP=2048, NCOL=100, NROW=100, NLAY=16, xBody=0.000000, yBody=0.000500
Tau[0]=0.950000 Tau[1]=0.500300
Clamped xBody= 0.000000e+000, final ux/xBody=1.000000e+030
END=====dp 03/25/2005 17:09:11 ========
************* end A7N8X-E *******************************
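
For anyone who wants to pull the run time out of a log like the one above, here is a minimal sketch. It assumes the START and END lines carry month/day/year stamps with one-second resolution, as they appear here; the function name is hypothetical and not part of the benchmark.

```python
from datetime import datetime

# Timestamp layout as it appears on the START/END lines of the log (assumed MM/DD/YYYY).
STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def elapsed_seconds(start_stamp, end_stamp):
    """Return whole seconds between two log timestamps."""
    start = datetime.strptime(start_stamp, STAMP_FORMAT)
    end = datetime.strptime(end_stamp, STAMP_FORMAT)
    return int((end - start).total_seconds())


# Stamps copied from the A7N8X-E log above.
print(elapsed_seconds("03/25/2005 17:07:59", "03/25/2005 17:09:11"))
```

Since the stamps only resolve to one second, the result is good to about a second either way.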

P4C800-E Deluxe, 2.8C P4, FSB800, DDR400, 2-2-2-6 dual channel
2x512MB ==> 12.45 MUPS (memtest86+ 1.4 bandwidth ==> 2955MB/sec)
new bench = 48 seconds = 6.83 MUPS ?

P4C800-E Deluxe, 2.8C P4, FSB800, DDR400, 3-3-3-8 dual channel
2x512MB ==> 10.59 MUPS (memtest86+ 1.4 bandwidth ==> 2549MB/sec)
new bench = 53 seconds = 6.18 MUPS ?

A7N8X-E Deluxe, 3200+, 200x11=2200MHz, DDR400, 2-2-2-6 dual channel
2x512MB ==> 7.67 MUPS (memtest86+ 1.4 bandwidth ==> 1485MB/sec)
new bench = 72 seconds = 4.55 MUPS ?
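
The "new bench" numbers above can be reproduced with the sketch below. It assumes MUPS means million site updates per second, with one update per lattice site per step, and takes NCOL=100, NROW=100, NLAY=16 and 2048 steps from the log's final lines. That definition is a guess, but it does match the three figures quoted.

```python
def mups(ncol, nrow, nlay, steps, seconds):
    """Million site updates per second, assuming one update per site per step."""
    site_updates = ncol * nrow * nlay * steps
    return site_updates / seconds / 1e6


# Wall-clock times quoted for the three configurations above.
runs = [("P4 2.8C, 2-2-2-6", 48), ("P4 2.8C, 3-3-3-8", 53), ("A7N8X-E 3200+", 72)]
for label, seconds in runs:
    print(f"{label}: {mups(100, 100, 16, 2048, seconds):.2f} MUPS")
```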

At the risk of "bringing coal to Newcastle", I did a search on
"optimizing athlon64" and one of the first hits:

http://www.moskalyuk.com/links/cpp.htm -->
"Optimizing Your C/C++ Applications, Part 2"
http://www.devx.com/amd/Article/21545?trk=DXRSS_LATEST -->
"Software optimization guide Athlon64"

http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_25112.pdf

Perhaps there are some clever optimizations out there.

HTH,
Paul
 
Paul said:
"optimizing athlon64" and one of the first hits:

http://www.moskalyuk.com/links/cpp.htm -->
"Optimizing Your C/C++ Applications, Part 2"
http://www.devx.com/amd/Article/21545?trk=DXRSS_LATEST -->
"Software optimization guide Athlon64"

http://www.amd.com/us-en/assets/content_type/DownloadableAssets/dwamd_25112.pdf

Perhaps there are some clever optimizations out there.


Thanks again for your time --

I've pretty much done all those optimizations, which are not really too
processor specific. I did the subexpression elimination and manual
dereferencing of pointers about 7 years ago, and the conversion to SSE (SIMD
instructions) about 4 years back, plus a lot more. I'm afraid it is now a
question of memory speed. Rambus 4-channel has missed the bus, and for the
vast majority of people, that's fine. Vanilla rambus could rarely compete
with vanilla DDR in price; but it is ironic that people were so willing to
pay mucho extra bucks for really fast DDR.

My 2 twinx xms 1GB modules are "only" 3-3-3-8 (or 6, depending on who you
believe).
 
Paul said:
Thanks!

The time taken in screen draws is actually quite small -- less than the
resolution of the timer (if I eliminate the screen redraws, there is no
difference in the time for 2048 steps, to one second). Your system is ~1.13
times as fast as mine for this benchmark. Que sera, sera (apologies to the
French and the "Barefoot Contessa").
[...]
Perhaps there are some clever optimizations out there.


I just got a 15% improvement, from an optimization that wasn't too clever -
and involved no coding!!

I noticed that the 2T setting in the BIOS was on "auto". I just changed it
to "disable" and got the 15% improvement in my application, and about 18% in
Sandra 2005 stream (now ~6000 MB/s). I guess this should have been a "duh",
but I somehow thought the BIOS would see that there were only two DDR
modules, and would pick the faster setting (2T disabled, i.e. 1T command
rate).

I hope there are no down sides to disabling 2T -- like having the computer
burst into flames, a fall in the Dow-Jones, an earthquake in China, etc.
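
To put the ~6000 MB/s Sandra figure in context, here is a small sketch of the theoretical peak for dual-channel DDR400: 400 million transfers per second, 8 bytes per transfer, two channels. The "before" figure is only back-calculated from the quoted 18% gain, not measured, and the function name is made up for illustration.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bytes_per_transfer, channels):
    """Theoretical peak memory bandwidth in MB/s."""
    return mega_transfers_per_s * bytes_per_transfer * channels


peak = peak_bandwidth_mb_s(400, 8, 2)   # dual-channel DDR400
measured_after = 6000                   # Sandra figure quoted above, MB/s
measured_before = measured_after / 1.18 # implied by the ~18% gain (an estimate)

print(f"peak: {peak} MB/s")
print(f"after disabling 2T: {measured_after / peak:.1%} of peak")
print(f"implied before: about {measured_before:.0f} MB/s")
```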
 
"H.W. said:
Paul said:
Thanks!

The time taken in screen draws is actually quite small -- less than the
resolution of the timer (if I eliminate the screen redraws, there is no
difference in the time for 2048 steps, to one second). Your system is ~1.13
times as fast as mine for this benchmark. Que sera, sera (apologies to the
French and the "Barefoot Contessa").
[...]
Perhaps there are some clever optimizations out there.


I just got a 15% improvement, from an optimization that wasn't too clever -
and involved no coding!!

I noticed that the 2T setting in the BIOS was on "auto". I just changed it
to "disable" and got the 15% improvement in my application, and about 18% in
Sandra 2005 stream (now ~6000 MB/s). I guess this should have been a "duh",
but I somehow thought the BIOS would see that there were only two DDR
modules, and would pick the faster setting (2T disabled, i.e. 1T command
rate).

I hope there are no down sides to disabling 2T -- like having the computer
burst into flames, a fall in the Dow-Jones, an earthquake in China, etc.

Before you know it, you'll be overclocking :-)))

Paul
 