Olaf Baeyens
> Hi Olaf, are you looking for my original code, or just code that fails?
I am trying to look at the code that you tried to make work in theory.
I had prepared a big mail with disassembler output, but I did not have
enough time at my company, so I wanted to complete it at home. But I see
that you have already discovered some of my points.
First of all, the code assumes Intel byte order. Mine did not, and would
work in C# and C++, including on non-Intel platforms.
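To illustrate the byte-order point, here is a minimal sketch of a swap that never looks at memory layout. It is not the code from my earlier mail; the names swap_bytes_32 and check_swap_bytes_32 are made up for this example. Because shifts and masks act on the value rather than on its bytes in memory, the result should be the same on Intel and non-Intel hosts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not the code discussed in this thread.
   Reverses the four bytes of a 32-bit value with shifts and masks only,
   so it does not depend on how the host stores the value in memory. */
uint32_t swap_bytes_32(uint32_t x)
{
    return (x << 24)
         | ((x & 0x0000FF00u) << 8)
         | ((x & 0x00FF0000u) >> 8)
         | (x >> 24);
}

/* Quick self-check: a known pattern, and swapping twice restores the input. */
void check_swap_bytes_32(void)
{
    assert(swap_bytes_32(0x11223344u) == 0x44332211u);
    assert(swap_bytes_32(swap_bytes_32(0xDEADBEEFu)) == 0xDEADBEEFu);
}
```

The argument arrives as a value, typically in a register, so the function itself needs no byte-sized memory reads.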
Second, this --P thing was moving in the wrong direction. This way you
unnecessarily slow down memory operations, since memory tends to preload
the next bytes in bursts, and you use the previous byte instead, thus
stalling the memory cycle.
On most processors and memories, if not all, memory reads are optimized to
go from low to high addresses in sequential order.
Interestingly, you also found the way I would have done it, which indeed
works on VC++ 2003.
But then again, the dependency on Intel byte order is not gone, and you had
to use a global variable in your case to optimize it.
The problem with a global variable is that it might not reside in one of
the 32-byte cache lines of your processor cache, so you might lose time
loading it from memory into the cache. And you lose an additional 32 bytes
of processor cache for that one global variable, again slowing down
something else.
__inline int Large_is_First_32(int X)
{
    uint_8_P P = (uint_8_P)&X;
    return *P << 24 | P[1] << 16 | P[2] << 8 | P[3];
}
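A quick way to see the byte-order dependency in the function quoted above is to run it next to an endianness probe. The sketch below is an adaptation, not the original: the typedef for uint_8_P is assumed (the thread never shows it), the types are changed to uint32_t, and the names large_is_first_32, host_is_little_endian and check_large_is_first_32 are made up. The casts to uint32_t are also an addition: without them, *P << 24 shifts a promoted int and can overflow into the sign bit when the first byte is 0x80 or larger.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed typedef; the thread does not show how uint_8_P is declared. */
typedef const unsigned char *uint_8_P;

/* Adapted from the quoted function: reads the four bytes of x as they
   lie in memory and reassembles them with the first byte on top. */
uint32_t large_is_first_32(uint32_t x)
{
    uint_8_P p = (uint_8_P)&x;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 1 when the least significant byte is stored first. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* The result depends on the host: swapped on little-endian machines,
   unchanged on big-endian ones. */
void check_large_is_first_32(void)
{
    uint32_t expected = host_is_little_endian() ? 0x44332211u : 0x11223344u;
    assert(large_is_first_32(0x11223344u) == expected);
}
```

So the function only reverses the bytes on a little-endian host; on a big-endian one it hands the value back untouched.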
But your biggest bottleneck is this thing:
0C movzx eax,byte ptr [rv+2 (403012h)]
....
15 movzx edx,byte ptr [rv+3 (403013h)]
Compared to my code, you need additional memory accesses, and worse, you
only read one byte at a time, while the complete original 32-bit variable
would have been aligned on a 4-byte boundary and thus loaded from memory in
one pass.
Only your compiler's optimizer discovered that it could save one byte read
in its first load.
Both of these assembler instructions indicate that you need additional
reads, and on memory locations that are not divisible by 4, thus giving
your processor another stall.
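The divisible-by-4 remark can be checked directly against the addresses in the listing. A small sketch, assuming rv itself sits at 403010h (inferred from rv+2 being 403012h); the names offset_in_dword and check_listing_addresses are made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Returns the address modulo 4; 0 means the address lies on a
   4-byte boundary. */
unsigned offset_in_dword(uintptr_t address)
{
    return (unsigned)(address & 3u);
}

/* Addresses taken from the disassembly; the base of rv is inferred. */
void check_listing_addresses(void)
{
    assert(offset_in_dword(0x403010u) == 0); /* rv:   on a 4-byte boundary */
    assert(offset_in_dword(0x403012u) == 2); /* rv+2: not on a boundary    */
    assert(offset_in_dword(0x403013u) == 3); /* rv+3: not on a boundary    */
}
```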
You should really try my way and check the optimized results. You will lose
at least the 2 memory reads.
I don't think I need to post the message I was planning to. ;-)
Everything here has been said.