Channel: Intel® Software - Intel® Fortran Compiler for Linux* and macOS*

Performance Regression with Intel 14.0.1.106

All,

A colleague here found an interesting performance regression with Intel Fortran 14. It is probably related to this C regression, but I thought I'd post it here in case there is a more Intel Fortran-specific fix.

To wit, the simple program to look at is:

program dotProduct
   implicit none

   integer, parameter :: npts=100000
   integer, parameter :: nobs=1000
   integer, parameter :: nanals=64

   real, allocatable, dimension(:,:) ::  analysis_chunk,analysis_ob,dotProd
   integer :: i,k,j
   real t1,t2, tmpsum
   integer(kind=8) :: cr, cm
   real(kind=8) :: rate

   integer(kind=8) :: random_start, random_end
   integer(kind=8) :: dot_start, dot_end
   real(kind=8) :: random_time, dot_time

   call system_clock(count_rate=cr)
   call system_clock(count_max=cm)
   rate = real(cr,kind=8)

   allocate(analysis_chunk(npts,nanals))
   allocate(analysis_ob(nobs,nanals))
   allocate(dotProd(nobs,npts))

   call system_clock(random_start)
   call random_number(analysis_ob)
   call random_number(analysis_chunk)
   call system_clock(random_end)

   random_time = (random_end-random_start)/rate
   write (*,111) 'Random generation time:', random_time

   call system_clock(dot_start)
   do i=1,nobs
      do k=1,npts
         dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo
   call system_clock(dot_end)
   dot_time = (dot_end-dot_start)/rate

   write (*,111) "dotProd time:", dot_time

   write (*,111) 'dotProd(npts,nobs):', dotProd(npts,nobs)  ! as posted; dotProd is allocated (nobs,npts), so dim 1 is out of bounds here

   deallocate(analysis_chunk)
   deallocate(analysis_ob)
   deallocate(dotProd)

111 format (1X,A,T20,F12.8)
end program dotProduct

Now if I compile with Intel 13:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.3.192 Build 20130607
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

(1091) $ ifort -O3 dot.f90 -o dot.ifort13.exe
(1092) $ ./dot.ifort13.exe
 Random generation   0.10362200
 dotProd time:       0.82572200
 dotProd(npts,nobs)  0.21707585

And now Intel 14:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008
Copyright (C) 1985-2013 Intel Corporation.  All rights reserved.

(1065) $ ifort -O3 dot.f90 -o dot.ifort14.exe
(1066) $ ./dot.ifort14.exe
 Random generation   0.10825000
 dotProd time:       8.31167000
 dotProd(npts,nobs)  0.21707585

As you can see, the Intel 14 version is about 10x slower (8.31 s versus 0.83 s for the dotProd loop). Hmm. Looking at the output of -opt-report1 -vec-report1, we can see the difference:

Intel 13:

HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:12 2013

<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)

dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
dot.f90(37:66-37:66):VEC:MAIN__:  PERMUTED LOOP WAS VECTORIZED
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 2
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 4
dot.f90(37): (col. 66) remark: PARTIAL LOOP WAS VECTORIZED.
PARTIAL LOOP WAS VECTORIZED

Intel 14:

HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:30 2013

<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)

dot.f90(27:9-27:9):VEC:MAIN__:  loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(28:9-28:9):VEC:MAIN__:  loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(37:25-37:25):VEC:MAIN__:  loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36): (col. 7) remark: OUTER LOOP WAS VECTORIZED
dot.f90(36:7-36:7):VEC:MAIN__:  OUTER LOOP WAS VECTORIZED
dot.f90(37:25-37:25):VEC:MAIN__:  loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36:7-36:7):VEC:MAIN__:  loop was not vectorized: low trip count
dot.f90(37:66-37:66):VEC:MAIN__:  loop was not vectorized: not inner loop
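
For reading the reports above, it may help to spell out the loop that is hidden inside the sum(...) expression. What follows is only a sketch, not code from the original dot.f90: the program name is made up, the sizes are reduced so it runs quickly, and it reuses the j and tmpsum variables that the posted program declares but never uses. If I read the report columns correctly, 37:25 points at the sum(...) and 36:7 at the do k loop, so there are really three nested loops for the vectorizer to choose between.

```fortran
program explicit_reduction_sketch
   implicit none
   ! Assumption: sizes reduced from the post (npts=100000, nobs=1000)
   ! so that this sketch finishes quickly.
   integer, parameter :: npts = 2000, nobs = 100, nanals = 64
   real, allocatable :: analysis_chunk(:,:), analysis_ob(:,:), dotProd(:,:)
   integer :: i, k, j
   real :: tmpsum

   allocate(analysis_chunk(npts,nanals), analysis_ob(nobs,nanals), dotProd(nobs,npts))
   call random_number(analysis_ob)
   call random_number(analysis_chunk)

   do i = 1, nobs
      do k = 1, npts
         ! Explicit form of sum(analysis_chunk(k,:)*analysis_ob(i,:)).
         ! Both operands are strided: consecutive j values are npts and
         ! nobs elements apart in memory, respectively (column-major).
         tmpsum = 0.0
         do j = 1, nanals
            tmpsum = tmpsum + analysis_chunk(k,j)*analysis_ob(i,j)
         end do
         dotProd(i,k) = tmpsum/real(nanals-1)
      end do
   end do

   print *, 'dotProd(nobs,npts):', dotProd(nobs,npts)
end program explicit_reduction_sketch
```

The j loop has unit stride in neither operand, which is presumably why both compilers look at the i and k loops instead of vectorizing the reduction directly.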

So, it looks like the Intel 14 optimizer isn't doing the right thing. Taking a hint from the "PERMUTED" remarks in the Intel 13 report, we can try interchanging the loops in our calculation by hand, and if we do, Intel 14 gives:

 Random generation   0.09767500
 dotProd time:       2.87279900
 dotProd(npts,nobs)  0.21707585

Much better, but still about 3.5x slower than Intel 13 (2.87 s versus 0.83 s).
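
For anyone who wants to try the interchange, here is a minimal sketch of what is meant. The actual dot.permute.f90 is not shown in this post, so this is an assumption about its shape: the program name, the shortened array names and the reduced sizes are all made up for illustration. It runs both loop orders on the same data and prints the largest difference between the two results.

```fortran
program permute_sketch
   implicit none
   ! Assumption: sizes reduced from the post (npts=100000, nobs=1000)
   ! so that this sketch finishes quickly.
   integer, parameter :: npts = 2000, nobs = 100, nanals = 64
   real, allocatable :: chunk(:,:), ob(:,:), a(:,:), b(:,:)
   integer :: i, k

   allocate(chunk(npts,nanals), ob(nobs,nanals), a(nobs,npts), b(nobs,npts))
   call random_number(chunk)
   call random_number(ob)

   ! Original order from the post: i outer, k inner.
   do i = 1, nobs
      do k = 1, npts
         a(i,k) = sum(chunk(k,:)*ob(i,:))/real(nanals-1)
      end do
   end do

   ! Interchanged order: k outer, i inner, so the store to b(i,k)
   ! walks contiguous memory (Fortran arrays are column-major).
   do k = 1, npts
      do i = 1, nobs
         b(i,k) = sum(chunk(k,:)*ob(i,:))/real(nanals-1)
      end do
   end do

   print *, 'max abs difference:', maxval(abs(a - b))
end program permute_sketch
```

The two results should agree up to floating-point rounding. They need not be bit-identical, since the compiler may order the reduction differently for each loop nest.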

Also, as an aside, it looks like compiling with -xHost on Intel 14 leads to a slowdown even with the permuted code:

(Intel 13) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort13.permute.xHost.exe
(Intel 13) $ ./dot.ifort13.permute.xHost.exe
 Random generation   0.09036300
 dotProd time:       0.80169300
 dotProd(npts,nobs)  0.21707587

(Intel 14) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort14.permute.xHost.exe
(Intel 14) $ ./dot.ifort14.permute.xHost.exe
 Random generation   0.09545500
 dotProd time:       3.34055000
 dotProd(npts,nobs)  0.21707590

Finally, -fast really doesn't like this code with either compiler:

(Intel 13) $ ifort -fast dot.permute.f90 -o dot.ifort13.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448506.mathomp4/ipo_iforta6D8B9.o
(Intel 13) $ ./dot.ifort13.permute.fast.exe
 Random generation   0.09720600
 dotProd time:       3.20100500
 dotProd(npts,nobs)  0.21707590

(Intel 14) $ ifort -fast dot.permute.f90 -o dot.ifort14.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448505.mathomp4/ipo_iforthmRKBV.o
(Intel 14) $ ./dot.ifort14.permute.fast.exe
 Random generation   0.10952500
 dotProd time:       3.42286700
 dotProd(npts,nobs)  0.21707590

(That said, the compilers are even for once.)

