All,
A colleague here found an interesting performance regression with Intel Fortran 14. It is probably related to this C regression, but I thought I'd post it here in case there is a more Intel Fortran-specific fix.
To wit, the simple program to look at is:
program dotProduct
   implicit none

   integer, parameter :: npts=100000
   integer, parameter :: nobs=1000
   integer, parameter :: nanals=64

   real, allocatable, dimension(:,:) :: analysis_chunk,analysis_ob,dotProd
   integer :: i,k,j
   real t1,t2, tmpsum

   integer(kind=8) :: cr, cm
   real(kind=8) :: rate
   integer(kind=8) :: random_start, random_end
   integer(kind=8) :: dot_start, dot_end
   real(kind=8) :: random_time, dot_time

   call system_clock(count_rate=cr)
   call system_clock(count_max=cm)
   rate = real(cr,kind=8)

   allocate(analysis_chunk(npts,nanals))
   allocate(analysis_ob(nobs,nanals))
   allocate(dotProd(nobs,npts))

   call system_clock(random_start)
   call random_number(analysis_ob)
   call random_number(analysis_chunk)
   call system_clock(random_end)

   random_time = (random_end-random_start)/rate
   write (*,111) 'Random generation time:', random_time

   call system_clock(dot_start)
   do i=1,nobs
      do k=1,npts
         dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo
   call system_clock(dot_end)

   dot_time = (dot_end-dot_start)/rate
   write (*,111) "dotProd time:", dot_time
   ! Note: dotProd is allocated as (nobs,npts), so the index order
   ! (npts,nobs) is outside the declared bounds; left as posted.
   write (*,111) 'dotProd(npts,nobs):', dotProd(npts,nobs)

   deallocate(analysis_chunk)
   deallocate(analysis_ob)
   deallocate(dotProd)

111 format (1X,A,T20,F12.8)

end program dotProduct
Now if I compile with Intel 13:
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.3.192 Build 20130607
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
(1091) $ ifort -O3 dot.f90 -o dot.ifort13.exe
(1092) $ ./dot.ifort13.exe
Random generation 0.10362200
dotProd time: 0.82572200
dotProd(npts,nobs) 0.21707585
And now Intel 14:
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.
(1065) $ ifort -O3 dot.f90 -o dot.ifort14.exe
(1066) $ ./dot.ifort14.exe
Random generation 0.10825000
dotProd time: 8.31167000
dotProd(npts,nobs) 0.21707585
As you can see, the Intel 14 version is about 10x slower (8.31 s versus 0.83 s for the dotProd loop). Hmm. Looking at the -opt-report1 -vec-report1 output, we can see the difference:
Intel 13:
HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:12 2013
<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
dot.f90(37:66-37:66):VEC:MAIN__: PERMUTED LOOP WAS VECTORIZED
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 2
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 4
dot.f90(37): (col. 66) remark: PARTIAL LOOP WAS VECTORIZED.
PARTIAL LOOP WAS VECTORIZED
Intel 14:
HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:30 2013
<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)
dot.f90(27:9-27:9):VEC:MAIN__: loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(28:9-28:9):VEC:MAIN__: loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(37:25-37:25):VEC:MAIN__: loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36): (col. 7) remark: OUTER LOOP WAS VECTORIZED
dot.f90(36:7-36:7):VEC:MAIN__: OUTER LOOP WAS VECTORIZED
dot.f90(37:25-37:25):VEC:MAIN__: loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36:7-36:7):VEC:MAIN__: loop was not vectorized: low trip count
dot.f90(37:66-37:66):VEC:MAIN__: loop was not vectorized: not inner loop
So it looks like the Intel 14 optimizer isn't making the right choice here. Taking a hint from the "PERMUTED" remarks in the Intel 13 report, one can try interchanging the loops in the calculation by hand, and doing that with Intel 14 gives:
Random generation 0.09767500
dotProd time: 2.87279900
dotProd(npts,nobs) 0.21707585
Much better, but still about 3.5x slower than Intel 13.
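For anyone following along, here is a minimal sketch of what the hand interchange could look like. The actual dot.permute.f90 is not shown in this post, so this is an assumption about its shape, and the small array sizes and the names dot_orig and dot_perm are hypothetical, chosen only so the check runs instantly. It computes the result with both loop orders and confirms they agree.

```fortran
program dot_permute_sketch
   implicit none
   ! Small, hypothetical sizes so the sketch runs instantly;
   ! the program in the post uses npts=100000, nobs=1000, nanals=64.
   integer, parameter :: npts = 200, nobs = 50, nanals = 64
   real, allocatable :: analysis_chunk(:,:), analysis_ob(:,:)
   real, allocatable :: dot_orig(:,:), dot_perm(:,:)
   integer :: i, k

   allocate(analysis_chunk(npts,nanals), analysis_ob(nobs,nanals))
   allocate(dot_orig(nobs,npts), dot_perm(nobs,npts))
   call random_number(analysis_ob)
   call random_number(analysis_chunk)

   ! Loop order as posted: i outer, k inner.
   do i = 1, nobs
      do k = 1, npts
         dot_orig(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/real(nanals-1)
      enddo
   enddo

   ! Interchanged order (assumed): k outer, i inner, so consecutive
   ! inner iterations write adjacent elements of column k.
   do k = 1, npts
      do i = 1, nobs
         dot_perm(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/real(nanals-1)
      enddo
   enddo

   ! Every element comes from the same expression, so the two
   ! results should agree to within rounding.
   if (maxval(abs(dot_orig - dot_perm)) > 1.0e-5) error stop "loop orders disagree"
   print *, "max abs difference:", maxval(abs(dot_orig - dot_perm))
end program dot_permute_sketch
```

Since Fortran arrays are column-major, making i the inner loop turns the stores into dotProd into stride-1 accesses, which is presumably the transformation Intel 13 applied on its own.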
Also, as an aside, it looks like compiling with -xHost on Intel 14 leads to a slowdown even with the permuted code:
(Intel 13) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort13.permute.xHost.exe
(Intel 13) $ ./dot.ifort13.permute.xHost.exe
Random generation 0.09036300
dotProd time: 0.80169300
dotProd(npts,nobs) 0.21707587
(Intel 14) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort14.permute.xHost.exe
(Intel 14) $ ./dot.ifort14.permute.xHost.exe
Random generation 0.09545500
dotProd time: 3.34055000
dotProd(npts,nobs) 0.21707590
Finally, -fast really doesn't like this code with either compiler:
(Intel 13) $ ifort -fast dot.permute.f90 -o dot.ifort13.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448506.mathomp4/ipo_iforta6D8B9.o
(Intel 13) $ ./dot.ifort13.permute.fast.exe
Random generation 0.09720600
dotProd time: 3.20100500
dotProd(npts,nobs) 0.21707590
(Intel 14) $ ifort -fast dot.permute.f90 -o dot.ifort14.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448505.mathomp4/ipo_iforthmRKBV.o
(Intel 14) $ ./dot.ifort14.permute.fast.exe
Random generation 0.10952500
dotProd time: 3.42286700
dotProd(npts,nobs) 0.21707590
(That said, the compilers are even for once.)
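As a cross-check for anyone reproducing this, the loop nest is a scaled matrix product, so it can also be written with the standard matmul and transpose intrinsics. The following is only a sketch under that observation, with hypothetical small sizes and hypothetical names dot_loop and dot_mm. I have not timed it with either compiler, so it is not offered as a fix for the regression.

```fortran
program dot_matmul_sketch
   implicit none
   ! Small, hypothetical sizes; the post uses npts=100000, nobs=1000.
   integer, parameter :: npts = 200, nobs = 50, nanals = 64
   real, allocatable :: analysis_chunk(:,:), analysis_ob(:,:)
   real, allocatable :: dot_loop(:,:), dot_mm(:,:)
   integer :: i, k

   allocate(analysis_chunk(npts,nanals), analysis_ob(nobs,nanals))
   allocate(dot_loop(nobs,npts), dot_mm(nobs,npts))
   call random_number(analysis_ob)
   call random_number(analysis_chunk)

   ! Explicit loop nest using the same expression as the post.
   do k = 1, npts
      do i = 1, nobs
         dot_loop(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/real(nanals-1)
      enddo
   enddo

   ! (nobs x nanals) times (nanals x npts) gives (nobs x npts).
   dot_mm = matmul(analysis_ob, transpose(analysis_chunk))/real(nanals-1)

   ! matmul may sum in a different order, so compare with a tolerance.
   if (maxval(abs(dot_loop - dot_mm)) > 1.0e-4) error stop "matmul form disagrees"
   print *, "max abs difference:", maxval(abs(dot_loop - dot_mm))
end program dot_matmul_sketch
```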