Curious about loop optimization C++ - assembly

Egbert Nierop (MVP for IIS) · Mar 20, 2006

Hi,

Out of curiosity, I sometimes look at the assembly produced by a
release-mode build.

What you often see is that the C++ compiler always uses explicitly
addressed register moves to copy values from a to b...

Meanwhile stosb, stosw, stosd etc. (and likewise movs[x]) are each a single
instruction that implicitly uses ESI and/or EDI (source, destination) to
copy data.

This seems (imho) more efficient, yet the C++ compiler never uses this
construct... it always emits a lot more instructions.

Imagine this loop (I have simplified the idea; of course, memcpy would
normally be used):

DWORD anArray [10000];
int element = 0;

// copy array while skipping odd element positions

for (int mycounter=5000; mycounter != 0; mycounter--, element+=2)
    anArray[element] = somesource[element];

could be optimized to

// set up source and destination
MOV EDI, [anArray]
MOV ESI, [somesource]
MOV ECX, myCounter
DEC ECX
CLD            // forward copy

mylabel:
MOVSD          // actual copy instruction
LOOP mylabel   // decrement ECX and repeat until ECX == 0

Q: Is the construct above simply not that efficient, or is there a reason
the C++ compiler team decided not to optimize to this level?
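
As a side note for readers comparing the two snippets: a minimal C++ sketch, not from the original post, that models what each one does. The function names copy_even_positions and movsd_loop_model are hypothetical. The model assumes the documented behavior of MOVSD (copy one dword, then advance ESI and EDI by one dword when the direction flag is clear) and of LOOP (decrement ECX, branch while ECX != 0). Under those assumptions the assembly copies consecutive elements rather than every second one, and the extra DEC ECX makes it copy myCounter - 1 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The loop from the post, with `element` declared: copies indices
// 0, 2, 4, ... and leaves the odd positions of dst untouched.
void copy_even_positions(std::vector<std::uint32_t>& dst,
                         const std::vector<std::uint32_t>& src)
{
    assert(dst.size() == src.size());
    std::size_t element = 0;
    for (std::size_t counter = src.size() / 2; counter != 0;
         --counter, element += 2)
        dst[element] = src[element];
}

// Model of the proposed sequence (DEC ECX included). Returns the number
// of dwords copied. Assumes my_counter >= 2, because LOOP on ECX == 0
// would wrap around instead of stopping.
std::size_t movsd_loop_model(std::vector<std::uint32_t>& dst,
                             const std::vector<std::uint32_t>& src,
                             std::uint32_t my_counter)
{
    assert(my_counter >= 2);
    assert(my_counter - 1 <= src.size() && my_counter - 1 <= dst.size());

    std::uint32_t ecx = my_counter - 1;   // MOV ECX, myCounter / DEC ECX
    std::size_t esi = 0, edi = 0;         // indices stand in for ESI / EDI
    std::size_t copied = 0;
    do {
        dst[edi++] = src[esi++];          // MOVSD: copy, advance both by one dword
        ++copied;
    } while (--ecx != 0);                 // LOOP mylabel
    return copied;
}
```

With my_counter = 5000 the model copies elements 0 through 4998, whereas the C++ loop copies the 5000 even-indexed elements, so the two are not equivalent as written.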

Carl Daniel [VC++ MVP] · Mar 20, 2006

The LOOP and MOVS instructions are horribly slow on modern CPUs because they
don't make effective use of the deep pipeline in the CPU. The longer
instruction sequence actually executes many times faster.

IIRC, VC++ did generate LOOP/MOVS years ago (VC1-4 maybe?), but has moved
away from those constructs since maybe the Pentium.

-cd
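
A claim like this is best checked by measurement on the machine in question. The following is a minimal, portable C++ sketch of such a measurement, not code from the thread. The names copy_loop, copy_memcpy and time_copy_us are hypothetical, and it compares a plain loop against std::memcpy rather than against hand-written LOOP/MOVS, since inline assembly is compiler-specific. Timings depend heavily on CPU, compiler and optimization flags, so no expected numbers are given.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Plain element-by-element copy; the optimizer is free to unroll or
// vectorize it as it sees fit.
void copy_loop(std::uint32_t* dst, const std::uint32_t* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}

// Library copy; how it is implemented is up to the runtime library.
void copy_memcpy(std::uint32_t* dst, const std::uint32_t* src, std::size_t n)
{
    std::memcpy(dst, src, n * sizeof(std::uint32_t));
}

// Hypothetical helper: runs `copy` once over the whole buffer and returns
// the elapsed wall-clock time in microseconds. dst must be at least as
// large as src.
template <typename Copy>
long long time_copy_us(Copy copy,
                       std::vector<std::uint32_t>& dst,
                       const std::vector<std::uint32_t>& src)
{
    const auto start = std::chrono::steady_clock::now();
    copy(dst.data(), src.data(), src.size());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

A call such as time_copy_us(copy_loop, dst, src) on a buffer of several megabytes, repeated a few times, gives a rough comparison; a single run on a small buffer is mostly noise.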

Egbert Nierop (MVP for IIS) · Mar 20, 2006

Carl Daniel said:
The LOOP and MOVS instructions are horribly slow on modern CPUs because
they don't make effective use of the deep pipeline in the CPU. The longer
instruction sequence actually executes many times faster.

Interesting!

This seems to confirm a remark by some C++ / ASM programmer somewhere on the
web. He stated that he could no longer optimize code any better than the C++
compiler did. Normally, I tend to think 'OK, so he was not up to the task',
but it seems I was wrong (again :-) ).

Egbert Nierop (MVP for IIS) · Mar 20, 2006

Nope,
I've beaten the C++ optimization by 25% (testing with 100 MB!), but this
might only be true on an Athlon 64 and possibly not on other CPUs...

Anyway, you were right that one cannot say: the fewer ASM instructions,
the faster!

PS: The function below is not meant to be a real 'decoder' (it skips Unicode
encoding). Just for fun...

void __stdcall AnsiToBstr(PCSTR ansi, BSTR bstr, int writtenLen)
{
    //#ifdef _M_IX86
    DWORD ticks = GetTickCount();
    __asm XOR AH, AH              // clear the high byte of our Unicode char (= 2 bytes)
    __asm MOV ECX, writtenLen     // initialize our loop
    __asm DEC ECX                 // our loop counter
    __asm MOV EDI, [bstr]         // destination index
    __asm MOV ESI, [ansi]         // source index
    __asm labell:
    __asm MOV AL, BYTE PTR [ESI]  // copy a string byte
    __asm MOV [EDI], AX           // store it as a 16-bit character
    __asm INC EDI
    __asm INC EDI
    __asm INC ESI
    __asm DEC ECX
    __asm JNZ labell

    //#else
    wprintf(L"%lu\n", GetTickCount() - ticks);  // time taken by the assembly loop
    ticks = GetTickCount();

    for (int loopit = writtenLen - 1; loopit != 0; loopit--, bstr++, ansi++)
        bstr[0] = ansi[0];

    wprintf(L"%lu\n", GetTickCount() - ticks);  // time taken by the C++ loop
    //#endif
}
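
One caveat when comparing the two loops above: they are not quite the same operation for bytes of 0x80 and above. The assembly zero-extends each byte (AH is cleared), while bstr[0] = ansi[0] converts a char, which is signed by default on MSVC/x86, so such bytes come out as 0xFFxx. A minimal portable sketch of both behaviors follows; it is an illustration, not the poster's code. The names widen_zero_extend and widen_plain_assign are hypothetical, and char16_t stands in for the 16-bit characters of a BSTR.

```cpp
#include <cstddef>

// Zero-extends each byte to 16 bits, matching MOV AL, [ESI] / MOV [EDI], AX
// with AH cleared beforehand.
void widen_zero_extend(const char* ansi, char16_t* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<char16_t>(static_cast<unsigned char>(ansi[i]));
}

// Mirrors `bstr[0] = ansi[0]`: where plain char is signed, bytes >= 0x80
// are sign-extended before being stored, giving 0xFF80..0xFFFF.
void widen_plain_assign(const char* ansi, char16_t* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        out[i] = static_cast<char16_t>(ansi[i]);
}
```

For pure ASCII input both functions produce identical output, so a timing comparison on ASCII test data is unaffected; the difference only shows up for bytes outside the 7-bit range.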
