Missed jmp Optimization?

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

I am using MSVC++ 2005, and compiling with /02, Maximum Speed (/0x Full
Optimizations doesn't change anything).

Never mind what my code does, it's what the compiler does with my code that
I'm posting about.

This will generate a forward jump, basically skipping over a move (increment)
if (BinaryHeap[current2].key > BinaryHeap[current2+1].key) {
++current2;
}

This code here is much faster:
current2 += (BinaryHeap[current2].key > BinaryHeap[current2+1].key);
Disassembler shows it generates an xor, setg, and add, but no jmp.

Assuming the branches are roughly 50/50, which in my case they are, the code
without the jump is much faster. In my benchmark, the cycles dropped from
about 14k to 12k, because the jump is inside a tight loop.
I don't know what the ratio of branch taken vs. branch not taken would have
to be for the default code to be faster. I can only guess that this
optimization wasn't used because it wasn't a sure optimization. But IMO it's
pretty close to a sure thing. I'm sure there is a way for a programmer who
knows a jump will be taken less often to write his code in such a way that it
generates a single jump rather than multiple ops, and is therefore faster.
So I think the code without the jump should be generated by default by the
compiler. This allows me to have my cake (readability) and eat it too
(performance).

Thoughts? Code available upon request.
 
AalaarDB said:
I am using MSVC++ 2005, and compiling with /02, Maximum Speed (/0x
Full Optimizations doesn't change anything).

Never mind what my code does, it's what the compiler does with my
code that I'm posting about.

This will generate a forward jump, basically skipping over a move
(increment) if (BinaryHeap[current2].key >
BinaryHeap[current2+1].key) { ++current2;
}

This code here is much faster:
current2 += (BinaryHeap[current2].key > BinaryHeap[current2+1].key);
Disassembler shows it generates an xor, setg, and add, but no jmp.

Assuming the branches are roughly 50/50, which in my case they are,
the code without the jump is much faster. In my benchmark, the
cycles dropped from about 14k to 12k, because the jump is inside a
tight loop.
I don't know what the ratio of branch taken vs. branch not taken
would have to be for the default code to be faster. I can only guess
that this optimization wasn't used because it wasn't a sure
optimization. But IMO it's pretty close to a sure thing. I'm sure
there is a way for a programmer who knows a jump will be taken less
often to write his code in such a way that it generates a single jump
rather than multiple ops, and is therefore faster. So I think the
code without the jump should be generated by default by the compiler.
This allows me to have my cake (readability) and eat it too
(performance).

Thoughts? Code available upon request.

Interesting. My thought is that it's too specialized an optimization to be
worth the trouble of modifying the compiler. It's likely that on a typical
code base it'll make an immeausrably small performance improvement. It
sounds like your case is special.

-cd
 
Back
Top