I'm still testing the FFTE benchmark. On the host, building with -O3 -xAVX -openmp works flawlessly and the performance looks great. Now I want to run the code on the Intel Xeon Phi, so I replaced the -xAVX option with -mmic to create a native binary. I increased the stack size with ulimit and KMP_STACKSIZE to avoid a stack overflow.
Running the code on the Phi gives the following error (built with -g -traceback):
$ ./speed1d
N =
10
forrtl: severe (154): array index out of bounds
Image            PC                Routine   Line     Source
speed1d          000000000049252B  Unknown   Unknown  Unknown
speed1d          0000000000490EB4  Unknown   Unknown  Unknown
speed1d          000000000045BE07  Unknown   Unknown  Unknown
speed1d          000000000043AAC5  Unknown   Unknown  Unknown
speed1d          000000000040F721  Unknown   Unknown  Unknown
libpthread.so.0  00007F47F8B1B800  Unknown   Unknown  Unknown
speed1d          000000000040AD3F  fft5a_    170      kernel.f
speed1d          00000000004061A3  fft235_   148      fft235.f
speed1d          0000000000403F25  zfft1d_   56       zfft1d.f
speed1d          0000000000403634  MAIN__    31       speed1d.f
speed1d          000000000040346C  Unknown   Unknown  Unknown
libc.so.6        00007F47F85CF634  Unknown   Unknown  Unknown
speed1d          0000000000403369  Unknown   Unknown  Unknown
Compiling with -O2 gives the same error, while with -O1 or -O0 everything works without any problems. When I debugged the -O3 build with gdb on the Phi, I noticed that the input variable N is somehow not correctly initialized after the READ call in the code. The out-of-bounds error later in the program may therefore be a consequence of this wrong value.
I tried -heap-arrays without effect. Using -check all did not help either: it disables optimization, so the problem no longer shows up.
Can you point out the direction in which I need to investigate?