|
| "Carl Daniel [VC++ MVP]" <[email protected]> wrote in message
||
|| > "Carl Daniel [VC++ MVP]" <[email protected]>
|| > | If your machine uses the MP HAL (which mine does), then QPC uses the
|| > | RDTSC instruction, which does report actual CPU core clocks. If your
|| > | system doesn't use the MP HAL, then QPC uses the system board timer,
|| > | which generally has a clock speed of 1X or 0.5X the NTSC color burst
|| > | frequency of 3.57954545 MHz. Note that this rate has absolutely
|| > | nothing to do with your CPU clock - it's a completely independent
|| > | crystal oscillator on the MB.
|| > |
|| > True, the MP HAL uses the external CPU clock (yours runs at
|| > 3.052420000 GHz), but the 3.57954545 MHz clock is derived from a
|| > divider; or, otherwise stated, the CPU clock (internal) is always a
|| > multiple of this 3.57954545 MHz. For instance, an Intel PIII 1 GHz
|| > stepping 5 clocks at 3.57954545 MHz * 278 = 995 MHz. The stepping
|| > number is important here, as it may change the divider's value.
||
|| Not (necessarily) true. For example, this Pentium D machine uses a BCLK
|| frequency of 200 MHz with a multiplier of 15. There's no requirement
|| (imposed by the CPU or MCH) that the CPU clock be related to the color
|| burst frequency at all.
||
| Carl, I'm not saying this is the case for all types of CPUs and
| motherboards; I'm only saying that it's true for Pentiums up to the III.
| Things are different for other types of CPUs. See, AMD clocks at 200 MHz
| with a multiplier of 11 or 12 depending on the type (and CPU ID). This
| 200 MHz clock can be adjusted (overclocked or underclocked), yet the
| frequency returned by QueryPerformanceFrequency stays the same; the same
| is true for recent PIVs, Pentium M and D. So here it's true that the two
| aren't related, and the 3.57954545 MHz clock is derived from the on-board
| graphics controller, or from an external clock source (on the mobo or
| not) when there is no on-board graphics controller, but the value remains
| the same 3.57954545 MHz unless you are using an MP HAL.
|
|| Now, it's entirely possible that the motherboard generates that 200 MHz
|| BCLK by multiplying a color burst crystal by 56 (200.45 MHz), but that's
|| a motherboard detail that's unrelated to the CPU. Without really
|| digging, there's no way I can tell one way or another - just looking at
|| the MB, I see at least 4 different crystal oscillators of unknown
|| frequency. Historically, the only reason color burst crystals are used
|| is that they're cheap - they're manufactured by the gazillion for NTSC
|| televisions.
||
|
| I know, Carl, I've been working for IHVs (HP, before that Compaq, before
| that DEC ....), so I know what you are talking about. Even on DEC Alpha
| (AXP) systems, the frequency returned by QueryPerformanceFrequency was
| 3.57954545 MHz using the mono CPU HAL, while on SMP boxes like the Alpha
| 8400 range (with the MP HAL) it was also not the case. Jeez, what a bunch
| of problems we had when porting W2K (never released, for well-known
| reasons) from Intel code to AXP, just because some drivers and core OS
| components did not expect QueryPerformanceCounter speeds higher than
| 1 GHz (that is, when we overclocked an 800 MHz CPU).
|
|| > | Working on the assumption that #2 is true, I modified the code to
|| > | call QueryPerformanceCounter/QueryPerformanceFrequency directly.
|| > | Here are the results:
|| > |
|| > | C:\Dev\Misc\fortest>fortest0312cpp
|| > | QPC frequency=3052420000
|| > | 0.327608913583321 ns/tick
|| > | 22388910 ticks
|| > | 7334806.48141475 nanoseconds
|| > |
|| > | C:\Dev\Misc\fortest>fortest0312cs
|| > | QPC frequency=3052420000
|| > | 0.327608913583321 ns/tick
|| > | 58980368 ticks
|| > | 19322494.2832245 nanoseconds
|| > |
|| >
|| > How many loops here?
||
|| That's 10,000,000 loops - 2.2 clock cycles per loop sounds like a pretty
|| reasonable rate to me - certainly not off by orders of magnitude.
||
|
| Sure it is, I misread the tick values (it's well past midnight here,
| time to go to bed).
|
|| > | I don't know what's going on here, but two things seem to be true:
|| > |
|| > | 1. The C++ code is faster on these machines. If I increase the loop
|| > | count to 1,000,000,000 I can clearly see the difference in execution
|| > | time with my eyes.
|| >
|| > Assuming the timings are correct, it's simply not possible to execute
|| > that number of instructions during that time, so there must be
|| > something going on here.
||
|| It's completely reasonable based on the times reported directly by QPC,
|| not the bogus values from Stopwatch, which is off by a factor of 1000 on
|| these machines.
||
|| So, any theory why the C++ code consistently runs faster than the C#
|| code on both of my machines? I can't think of any reasonable argument
|| why having a dual core or HT CPU would make the C++ code run faster.
|| Clearly the JIT'd code is different for the two loops - maybe there's
|| some pathological code in the C# case that the P4 executes much more
|| slowly than AMD, or some optimal code in the C++ case that the P4
|| executes much more quickly than AMD. I'd be curious to hear the details
|| of Don's machine - Intel/AMD, Single/HT/Dual, etc.
||
|| -cd
||
|
| Well, I have investigated the native code generated on the Intel PIV
| (see previous).
| Here is (part of) the disassembly (VS2005) for C++:
| ...
| 0000001f 46 inc esi
| 00000020 81 FE 80 96 98 00 cmp esi,989680h
| 00000026 7D 03 jge 0000002B
| 00000028 90 nop ---> not sure what this one is good for, it's ignored by the CPU anyway
| 00000029 EB F4 jmp 0000001F
| ...
|
| That means 4 instructions per loop compared to 6 on AMD.
| And the results are comparable to yours (for C++).
| I did not look at the C# code and its result, but the above shows that
| the JIT compiler generates (better?) code for PIV (I don't know what the
| __cpuid call returns, but I know the CLR checks it when booting). Again,
| notice this is an unoptimized code build (/Od flag set); optimized code
| is a totally different story.
|
| Willy.
|
Last follow-up (before my spouse pulls the plug).
Here is the X86 output of a C# release build on both AMD and Intel PIV:
[1]
0000001c 46 inc esi
0000001d 81 FE 80 96 98 00 cmp esi,989680h
00000023 7C F7 jl
This results in 6.235684 msec on AMD and 7.023547 msec on PIV (10.000.000
loops), while this is the debug build on Intel:
00000030 90 nop
00000031 90 nop
00000032 46 inc esi
00000033 81 FE 80 96 98 00 cmp esi,989680h
00000039 0F 9C C0 setl al
0000003c 0F B6 C0 movzx eax,al
0000003f 8B F8 mov edi,eax
00000041 85 FF test edi,edi
00000043 75 EB jne 00000030
Note that the release build is the most compact x86 code possible for the
loop. The C++/CLI compiler in an optimized build hoists the loop
completely, so I can't compare.
Carl, could you look at the disassembly on your box? Not a problem if you
can't (it doesn't mean that much anyway); it looks like on your box the
C++/CLI output is more like [1] above.
Willy.