alignment is lost with openmp

  • Thread starter: Gabest

Gabest

I really like the new OpenMP implementation in VC, but I ran into
something which may be a bug. Please see the following little example:

void funct()
{
    __declspec(align(16)) BYTE buff[16*16];
    #pragma omp parallel private(buff) num_threads(2)
    {
        #pragma omp for
        for(int y = 0; y < height; y += 16)
        {
            // here a few instructions involving buff and sse2 intrinsics
            // which require 16 byte alignment
        }
    }
}

If I print the base address of buff inside the loop I can see that it has
lost its alignment and of course it crashes a little later there.
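
For readers who want to check the same thing, here is a minimal sketch of an alignment test on a base address. The helper name `is_aligned` is hypothetical and not part of the original post; the code uses standard C++ only, so it does not depend on `__declspec(align)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when p is a multiple of `alignment`.
// `alignment` must be a power of two (16 for SSE2 loads and stores).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Calling `is_aligned(buff, 16)` inside the loop body and printing the result alongside the address would show directly when the private copy of `buff` has lost its alignment.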
 
Please post a bug report with a complete repro case to
http://lab.msdn.microsoft.com/productfeedback/

-cd
 
You might have taken the time to post a complete repro (there's no standard
include file named stdafx.h) or to actually fill out all the fields...

In any case, I'm unable to reproduce the problem with your sample. How
'bout a few more repro steps, such as the exact command-line arguments to
the compiler?

Here's what I get:

C:\Pub\Dev\cppbugs>cl -MD -arch:SSE2 -openmp ompalign0513.cpp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50215.44 for
80x86
Copyright (C) Microsoft Corporation. All rights reserved.

ompalign0513.cpp
Microsoft (R) Incremental Linker Version 8.00.50215.44
Copyright (C) Microsoft Corporation. All rights reserved.

/out:ompalign0513.exe
ompalign0513.obj

C:\Pub\Dev\cppbugs>ompalign0513
thread_num=0, i=0, buff=0012FB90
thread_num=2, i=8, buff=00AEFE30
thread_num=3, i=12, buff=00BEFE30
thread_num=3, i=13, buff=00BEFE30
thread_num=0, i=1, buff=0012FB90
thread_num=0, i=2, buff=0012FB90
thread_num=0, i=3, buff=0012FB90
thread_num=3, i=14, buff=00BEFE30
thread_num=2, i=9, buff=00AEFE30
thread_num=2, i=10, buff=00AEFE30
thread_num=3, i=15, buff=00BEFE30
thread_num=1, i=4, buff=009EFE30
thread_num=1, i=5, buff=009EFE30
thread_num=2, i=11, buff=00AEFE30
thread_num=1, i=6, buff=009EFE30
thread_num=1, i=7, buff=009EFE30

-cd
 
> You might have taken the time to post a complete repro (there's no
> standard include file named stdafx.h) or to actually fill out all the
> fields...

That was the auto generated precompiled header file. Pretty "standard" in
visual c, every new project gets that automagically. This one was a console
application with the default settings, I have only changed openmp support to
yes.

> In any case, I'm unable to reproduce the problem with your sample. How
> 'bout a few more repro steps, such as the exact command-line arguments to
> the compiler?

As I said, the default settings were used, except the /openmp switch of
course. But if you insist, here is the command line of the debug build:

/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm
/EHsc /RTC1 /MTd /GS- /openmp /Yu"stdafx.h" /Fp"Debug\omptest.pch"
/Fo"Debug\\" /Fd"Debug\vc80.pdb" /W3 /nologo /c /Wp64 /ZI /TP
/errorReport:prompt

/OUT:"Debug\omptest.exe" /INCREMENTAL /NOLOGO /MANIFEST:NO /DEBUG
/PDB:"i:\Progs\omptest\omptest\Debug\omptest.pdb" /SUBSYSTEM:CONSOLE
/MACHINE:X86 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib
winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib
uuid.lib odbc32.lib odbccp32.lib

> Here's what I get:
> ...

Try changing the code around a bit or running it several times; eventually
you will see pointers that do not end in 0, that is, not 16-byte aligned.
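
A stand-alone check along these lines might look like the sketch below. It is not the repro file from this thread (that source is not shown here); the function name `count_misaligned_iterations` is hypothetical, and standard `alignas` is used in place of `__declspec(align(16))`. It counts how many iterations see a misaligned private buffer. If OpenMP is not enabled, the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical check: counts loop iterations in which the thread-private
// copy of a 16-byte-aligned buffer is found at a misaligned address.
int count_misaligned_iterations(int iterations) {
    assert(iterations >= 0);
    int misaligned = 0;
    alignas(16) unsigned char buff[16 * 16];
    buff[0] = 0;
    #pragma omp parallel private(buff) num_threads(2) reduction(+:misaligned)
    {
        #pragma omp for
        for (int i = 0; i < iterations; ++i) {
            buff[0] = static_cast<unsigned char>(i);  // touch the private copy
            if (reinterpret_cast<std::uintptr_t>(buff) % 16 != 0)
                ++misaligned;
        }
    }
    return misaligned;
}
```

On a compiler that preserves the declared alignment for private copies, the count should stay at zero however often the function is called. A nonzero count would correspond to the addresses not ending in 0 described above.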
 
Gabest said:
> That was the auto generated precompiled header file. Pretty
> "standard" in visual c, every new project gets that automagically.
> This one was a console application with the default settings, I have
> only changed openmp support to yes.

I normally do repros as stand-alone CPP files compiled from the command-line
with the minimal options necessary to elicit the bug. Old habit :)

Gabest said:
> As I said, the default settings were used, except the /openmp switch
> of course. But if you insist, here is the command line of the debug
> build:
> /Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE"
> /Gm /EHsc /RTC1 /MTd /GS- /openmp /Yu"stdafx.h" /Fp"Debug\omptest.pch"
> /Fo"Debug\\" /Fd"Debug\vc80.pdb" /W3 /nologo /c /Wp64 /ZI /TP
> /errorReport:prompt
>
> /OUT:"Debug\omptest.exe" /INCREMENTAL /NOLOGO /MANIFEST:NO /DEBUG
> /PDB:"i:\Progs\omptest\omptest\Debug\omptest.pdb" /SUBSYSTEM:CONSOLE
> /MACHINE:X86 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib
> winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib
> oleaut32.lib uuid.lib odbc32.lib odbccp32.lib

Ah Ha!

It's /ZI that does it. I'll add a note to your bug report to that effect.
Thanks for the details.

-cd
 
> Ah Ha!
>
> It's /ZI that does it. I'll add a note to your bug report to that effect.
> Thanks for the details.

Well, I don't think that switch does it. I just tried the release build as
well with /Zi, then completely disabled the generation of debug info too, and
the alignment was still wrong sometimes.

This is how it builds now for the release configuration, but I don't think
any of the switches can affect this problem.

/O2 /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD
/EHsc /MT /openmp /Yu"stdafx.h" /Fp"Release\omptest.pch" /Fo"Release\\"
/Fd"Release\vc80.pdb" /W3 /nologo /c /Wp64 /TP /errorReport:prompt

/OUT:"Release\omptest.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST
/MANIFESTFILE:"Release\omptest.exe.intermediate.manifest" /DEBUG
/PDB:"i:\progs\omptest\omptest\release\omptest.pdb" /SUBSYSTEM:CONSOLE
/OPT:REF /OPT:ICF /MACHINE:X86 /ERRORREPORT:PROMPT kernel32.lib user32.lib
gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib
oleaut32.lib uuid.lib odbc32.lib odbccp32.lib
 
Gabest said:
> Well, I don't think that switch does it. I just tried the release build
> as well with /Zi, then completely disabled the generation of debug
> info too, and the alignment was still wrong sometimes.
>
> This is how it builds now for the release configuration, but I don't
> think any of the switches can affect this problem.

You're right, it's not as simple as just -ZI. -Zi also seems to be
sufficient, as well as -O2. In any case, having at least one concrete repro
will allow someone to triage the bug. Watch it on product feedback center -
you should see something within a few days.

-cd
 
> You're right, it's not as simple as just -ZI. -Zi also seems to be
> sufficient, as well as -O2. In any case, having at least one concrete
> repro will allow someone to triage the bug. Watch it on product feedback
> center - you should see something within a few days.

I just stepped through the assembly and found something strange after I
tailored the code to the following (I also turned off the security check so
it would not get in the way). My comments are inline.

__declspec(align(16)) BYTE buff[1234];

#pragma omp parallel private(buff) num_threads(2)
{
    #pragma omp for
    for(int i = 0; i < 2; i++)
    {
        _mm_store_si128((__m128i*)&buff[16], _mm_setzero_si128());
    }
}

When the debugger reached this multi-threaded code then I saw this:

#pragma omp for
for(int i = 0; i < 2; i++)
00401000 push ebp
00401001 mov ebp,esp
00401003 and esp,0FFFFFFF0h
00401006 sub esp,4E0h

buff was just aligned to 16 bytes, fine!

0040100C lea eax,[esp]
0040100F push eax
00401010 lea ecx,[esp+8]
00401014 push ecx
00401015 push 1
00401017 push 1
00401019 push 1
0040101B push 0
0040101D call _vcomp_for_static_simple_init (40710Ah)

It pushed 6 arguments on stack (esp -= 0x18) and called
_vcomp_for_static_simple_init.

00401022 mov ecx,dword ptr [esp+1Ch]
00401026 mov eax,dword ptr [esp+18h]
0040102A add esp,18h

Looks like it is cleaning up the stack, esp += 0x18

0040102D cmp ecx,eax
0040102F jg wmain$omp$1+4Bh (40104Bh)
00401031 sub eax,ecx
00401033 pxor xmm0,xmm0
00401037 add eax,1
0040103A lea ebx,[ebx]
00401040 sub eax,1
#include <windows.h>
#include <omp.h>
#include <xmmintrin.h>
#include <emmintrin.h>
int _tmain(int argc, _TCHAR* argv[])
{
__declspec(align(16)) BYTE buff[1234];
buff[0] = 0;
#pragma omp parallel private(buff) num_threads(2)
{
#pragma omp for
for(int i = 0; i < 2; i++)
{
_mm_store_si128((__m128i*)&buff[16], _mm_setzero_si128());
00401043 movdqa xmmword ptr [esp+18h],xmm0

Storing at "esp+18h" ???? esp is aligned to 16 bytes right now
(esp=0x0012fa00), what's that +18h doing there? Also, this 18h looks
familiar, could it be a coincidence?

00401049 jne wmain$omp$1+40h (401040h)
{
#pragma omp for
for(int i = 0; i < 2; i++)
0040104B call _vcomp_for_static_end (407104h)
00401050 call _vcomp_barrier (4070FEh)
}
00401055 mov esp,ebp
00401057 pop ebp
00401058 ret
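
The arithmetic behind the comments in this walkthrough can be checked at compile time with the small sketch below. The helper names are hypothetical. One possible reading of the listing, which the thread does not confirm, is that the store offset was computed as if the six pushed arguments (0x18 bytes) were still on the stack.

```cpp
#include <cstdint>

// Hypothetical helper: bytes occupied by `count` 32-bit arguments on x86.
constexpr std::uint32_t pushed_bytes(std::uint32_t count) {
    return count * 4u;
}

// Hypothetical helper: offset of an address inside its 16-byte block.
// movdqa requires this to be 0.
constexpr std::uint32_t offset_in_16(std::uint32_t address) {
    return address & 0xFu;
}

// Six pushes account for the 0x18 seen in "add esp,18h".
static_assert(pushed_bytes(6) == 0x18, "six 4-byte pushes are 0x18 bytes");
// esp itself (0x0012fa00 in the post) is 16-byte aligned...
static_assert(offset_in_16(0x0012fa00u) == 0, "esp is aligned");
// ...but esp+18h is 8 bytes past a 16-byte boundary.
static_assert(offset_in_16(0x0012fa00u + 0x18u) == 8, "esp+18h is misaligned");
```

Whatever the cause, the last assertion shows why the store is a problem: with esp at 0x0012fa00, the `movdqa` targets 0x0012fa18, which is not a multiple of 16.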
 