"Carl Daniel [VC++ MVP]" <
[email protected]>
wrote in message | | > "Carl Daniel [VC++ MVP]"
<
[email protected]>
| > Are you sure you compiled non optimized (/Od and /o-)? As I said, the
C++
| > compiler will hoist the loop when optimization is on (O&, O2 or
whatever).
|
| Quite certain - I used the exact command lines given in your posting
| (optimization if off by default as well, so specifying nothing is
equivalent
| to /Od).
|
| > It's not the resolution it's the interval which is the cullprit.
|
| We're talking about the same thing - 15ms precision is quite sufficient
for
| measuring intervals of 500ms or more and certainly won't account for a 50%
| measurement error for such intervals - only 3% or so.
Yes, but not for a loop of 10.000.000 (as in Don's code), which takes only
takes 37 msecs. to complete. And as I said on SMP systems this interval can
be as large as 60 msecs. (as I have measured here on a Compaq Proliant 8 way
system).
|
| > That's very surprising, QueryPerformanceFrequency (StopWatch.Frequency)
| > should not be that high, notice that this Frequency is not the CPU clock
| > frequency, it's the output of a CPU clock divider, it's frequency is
much
| > lower, on my System it's 3579545MHz (try with:
| > Console::WriteLine(System:
iagnostics::Stopwatch::Frequency)
| > If on your system it's much higher than 1GHz, you might have an issue
with
| > your system.
|
| (You made a typo - on your system it's 3579545Hz, not MHz)
|
Right, sorry for that.
| If your machine uses the MP HAL (which mine does), then QPC uses the RDTSC
| instruction which does report actual CPU core clocks. If your system
| doesn't use the MP HAL, then QPC uses the system board timer, which
| generally has a clock speed of 1X or 0.5X the NTSC color burst frequency
of
| 3.57954545 Mhz. Note that this rate has absolutely nothing to do with
your
| CPU clock - it's a completely independent crystal oscillator on the MB.
|
True MP HAL uses the externam CPU clock (yours runs at 3.052420000 GHz), but
the 3.57954545 Mhz clock is derived from a divider or otherwise stated, the
CPU clock (internal) is always a multiple of this 3.57954545 MHz, for
instance an Intel PIII 1GHz steping 5 clocks at 3.57954545 Mhz * 278 =
995MHz. The stepping number is important here, as it may change the dividers
value.
No my current test machine is not a MP or HT, so it doesn't use an MP HAL,
and you didn't specify that either in your previous reply, it's quite
important as I know about the MP HAL.
| > | With that change, I get a time of 15.8us for the C++ code and 42.3us
| > for
| > | the C# code - about the same difference I saw with GetTickCount.
| > |
| >
| > Hmmm , 15.8 ùsec. for 10000000 loops in which you execute 6 instructions
| > [1]per loop, that would mean 60000000 instructions in 15.8µsec or
0.000263
| > nanosecs/instruction, or ~4.000.000.000.000 instructions/sec.- not
| > possible
| > really, looks like the loop is hoisted or your clock is broken ;-).
|
| I agree - it doesn't add up. I'm quite sure that I did unoptimized
builds,
| and the results are 100% reproducible. But see below.
|
| > | It seems that there's something significantly different about your
| > machine
| > | as compared to mine & Don's when it comes to the performance of this
| > code -
| > | and that is very interesting!
| >
| > Looks like you have to investigate the Frequency value returned first,
and
| > inspect your code.
|
| Well, it's your code - not mine. The Frequency value is right on for this
| machine.
|
Well ..., it's Don's code. What do you mean with the Frequency value is
right? The Frequency is also right on mine
.
| I'm at my office right now, on a different computer. This one's a 3GHz
| Pentium D. I modified the samples as before to make nanosecPerTick double
| instead of Int64 and added code to print the value of Stopwatch.Frequency
| and the raw Ticks and nanosecPerTick. Here are the results:
|
| C:\Dev\Misc\fortest>fortest0312cs
| Stopwatch frequency=3052420000
| 0.327608913583321 ns/tick
| 240117 ticks
| 78664.4695028862 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cpp
| Stopwatch frequency=3052420000
| 0.327608913583321 ns/tick
| 49225 ticks
| 16126.548771139 nanoseconds
|
That's for 10000000 loops I assume.
| Increasing the loop count by a factor of 10 increases the times by a
factor
| of 10. Decreasing by a factor of 10 decreases the times by a factor of
10.
| Clearly the loop has not been optimized out, but that still doesn't
explain
| the apparent execution speed of more than 200 adds per clock cycle (I know
| modern CPUs are somewhat super-scalar, but 200 adds/clock? I don't think
| so!)
|
That's not possible, Intel Pentium IV CPU's fetches and executes 2
instruction per cycle.
The AMD Athlon 64 fetches and executes a max. of 3 instructions per cycle,
(mine clocks at 2.2GHz)
These are the results on PIV 3GHz not HT running W2K3 R2.
C#
Frequency = 3579545
46632867 nanoseconds
C++
Frequency = 3579545
40659177 nanoseconds
Notice the difference between C++ and C#, looks like the X86 JIT'd code is
not exactly the same, have to check this.
Remember the results on AMD 64 bit (XP SP2) - 37368702 nanoseconds, that
means that the AMD the Intel 3GHz show comparable results, as expected.
| I don't know what's going on here, but two things seem to be true:
|
| 1. The C++ code is faster on these machines. If I increase the loop count
| to 1,000,000,000 I can clearly see the difference in execution time with
my
| eyes.
Assumed the timings are correct, it's simply not possible to execute that
number instructions during that time, so there must be something going on
here.
| 2. The Stopwatch class doesn't appear to work correctly on these
machines -
| it's measuring times that are orders of magnitude too short, yet still
| proportional to the actual time spent.
|
| Working on the assumpting that #2 is true, I modified the code to call
| QueryPerformanceCounter/QueryPerformanceFrequency directly. Here are the
| results:
|
| C:\Dev\Misc\fortest>fortest0312cpp
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 22388910 ticks
| 7334806.48141475 nanoseconds
|
| C:\Dev\Misc\fortest>fortest0312cs
| QPC frequency=3052420000
| 0.327608913583321 ns/tick
| 58980368 ticks
| 19322494.2832245 nanoseconds
|
How many loops here?
| The times are now much more reasonable - Stopwatch apparently doesn't work
| correctly with such a high value from QPF (it's apparently off by a factor
| of 1000).
This is really strange as Stopwatch uses the same QueryPerformanceCounter
and Frequency under the hood.
The ratio of times remains about equal though- the C++ code is
| still nearly 2X faster on this machine (despite the fact that that makes
no
| sense at all, it seems to be true).
|
Time to expect the Stopwatch code, and I'll try to prepare a multicore or HT
box to do some more tests.
wd.