Performance Regression with Intel 14.0.1.106

All,

A colleague here found an interesting performance regression with Intel Fortran 14. It's probably related to this C regression, but I thought I'd post it here in case there is a more Fortran-specific fix.

To wit, the simple program to look at is:

program dotProduct
   implicit none

   integer, parameter :: npts=100000
   integer, parameter :: nobs=1000
   integer, parameter :: nanals=64

   real, allocatable, dimension(:,:) ::  analysis_chunk,analysis_ob,dotProd
   integer :: i,k,j
   real t1,t2, tmpsum
   integer(kind=8) :: cr, cm
   real(kind=8) :: rate

   integer(kind=8) :: random_start, random_end
   integer(kind=8) :: dot_start, dot_end
   real(kind=8) :: random_time, dot_time

   call system_clock(count_rate=cr)
   call system_clock(count_max=cm)
   rate = real(cr,kind=8)

   allocate(analysis_chunk(npts,nanals))
   allocate(analysis_ob(nobs,nanals))
   allocate(dotProd(nobs,npts))

   call system_clock(random_start)
   call random_number(analysis_ob)
   call random_number(analysis_chunk)
   call system_clock(random_end)

   random_time = (random_end-random_start)/rate
   write (*,111) 'Random generation time:', random_time

   call system_clock(dot_start)
   do i=1,nobs
      do k=1,npts
         dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo
   call system_clock(dot_end)
   dot_time = (dot_end-dot_start)/rate

   write (*,111) "dotProd time:", dot_time

   ! Note: dotProd is allocated as (nobs,npts), so the index order below
   ! runs past the first dimension; kept as posted to match the outputs.
   write (*,111) 'dotProd(npts,nobs):', dotProd(npts,nobs)

   deallocate(analysis_chunk)
   deallocate(analysis_ob)
   deallocate(dotProd)

111 format (1X,A,T20,F12.8)
end program dotProduct

Now if I compile with Intel 13:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.3.192 Build 20130607
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.

(1091) $ ifort -O3 dot.f90 -o dot.ifort13.exe
(1092) $ ./dot.ifort13.exe
Random generation 0.10362200
dotProd time: 0.82572200
dotProd(npts,nobs) 0.21707585

And now Intel 14:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.1.106 Build 20131008
Copyright (C) 1985-2013 Intel Corporation. All rights reserved.

(1065) $ ifort -O3 dot.f90 -o dot.ifort14.exe
(1066) $ ./dot.ifort14.exe
Random generation 0.10825000
dotProd time: 8.31167000
dotProd(npts,nobs) 0.21707585

As you can see, the Intel 14 version is about 10x slower in the dot-product loop (8.31 s versus 0.83 s). Looking at the output of -opt-report1 -vec-report1, we can see the difference:

Intel 13:

HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:12 2013

<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)

dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
dot.f90(37:66-37:66):VEC:MAIN__: PERMUTED LOOP WAS VECTORIZED
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 2
dot.f90(37): (col. 66) remark: PERMUTED LOOP WAS VECTORIZED.
PERMUTED LOOP WAS VECTORIZED
vectorization support: unroll factor set to 4
dot.f90(37): (col. 66) remark: PARTIAL LOOP WAS VECTORIZED.
PARTIAL LOOP WAS VECTORIZED

Intel 14:

HPO VECTORIZER REPORT (MAIN__) LOG OPENED ON Mon Dec 16 11:56:30 2013

<dot.f90;-1:-1;hpo_vectorization;MAIN__;0>
HPO Vectorizer Report (MAIN__)

dot.f90(27:9-27:9):VEC:MAIN__: loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(28:9-28:9):VEC:MAIN__: loop was not vectorized: unsupported data type
loop was not vectorized: not inner loop
dot.f90(37:25-37:25):VEC:MAIN__: loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36): (col. 7) remark: OUTER LOOP WAS VECTORIZED
dot.f90(36:7-36:7):VEC:MAIN__: OUTER LOOP WAS VECTORIZED
dot.f90(37:25-37:25):VEC:MAIN__: loop was not vectorized: vectorization possible but seems inefficient
dot.f90(36:7-36:7):VEC:MAIN__: loop was not vectorized: low trip count
dot.f90(37:66-37:66):VEC:MAIN__: loop was not vectorized: not inner loop

So it looks like the Intel 14 optimizer is no longer permuting and vectorizing the loop at line 37 the way Intel 13 did. Taking a hint from the "PERMUTED" remark, one can try interchanging the loops in our calculation by hand, and if we do, we get:

Random generation 0.09767500
dotProd time: 2.87279900
dotProd(npts,nobs) 0.21707585

Much better, but still about 3.5x slower than Intel 13 (2.87 s versus 0.83 s).
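
For reference, here is a minimal sketch of what the hand interchange might look like. The post does not show dot.permute.f90, so the loop order below (k outer, i inner) is an assumption, and the array sizes are cut down (also an assumption) so that it runs quickly. The name dotProdPermuted is made up for this illustration. It computes the result in both loop orders and prints the largest difference between them.

```fortran
program dotProductPermuted
   implicit none

   ! Sizes reduced from the original post so the sketch runs quickly (assumption).
   integer, parameter :: npts=2000
   integer, parameter :: nobs=100
   integer, parameter :: nanals=64

   real, allocatable, dimension(:,:) :: analysis_chunk, analysis_ob
   real, allocatable, dimension(:,:) :: dotProd, dotProdPermuted
   integer :: i, k

   allocate(analysis_chunk(npts,nanals))
   allocate(analysis_ob(nobs,nanals))
   allocate(dotProd(nobs,npts))
   allocate(dotProdPermuted(nobs,npts))

   call random_number(analysis_ob)
   call random_number(analysis_chunk)

   ! Loop order as posted: i (nobs) outer, k (npts) inner.
   do i=1,nobs
      do k=1,npts
         dotProd(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo

   ! Interchanged order (assumed): k outer, i inner, so the inner loop
   ! walks the first, contiguous index of dotProdPermuted.
   do k=1,npts
      do i=1,nobs
         dotProdPermuted(i,k) = sum(analysis_chunk(k,:)*analysis_ob(i,:))/float(nanals-1)
      enddo
   enddo

   ! Both orders fill the same elements; any difference should be rounding only.
   write (*,*) 'max abs difference:', maxval(abs(dotProd-dotProdPermuted))

   deallocate(analysis_chunk)
   deallocate(analysis_ob)
   deallocate(dotProd)
   deallocate(dotProdPermuted)
end program dotProductPermuted
```

The two orders should agree to within rounding. Small last-digit differences are possible if the compiler vectorizes the sum differently in each nest.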

Also, as an aside, it looks like compiling with -xHost on Intel 14 leads to a slowdown even with the permuted code:

(Intel 13) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort13.permute.xHost.exe
(Intel 13) $ ./dot.ifort13.permute.xHost.exe
Random generation 0.09036300
dotProd time: 0.80169300
dotProd(npts,nobs) 0.21707587

(Intel 14) $ ifort -O3 -xHost dot.permute.f90 -o dot.ifort14.permute.xHost.exe
(Intel 14) $ ./dot.ifort14.permute.xHost.exe
Random generation 0.09545500
dotProd time: 3.34055000
dotProd(npts,nobs) 0.21707590

Finally, -fast really doesn't like this code with either compiler:

(Intel 13) $ ifort -fast dot.permute.f90 -o dot.ifort13.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448506.mathomp4/ipo_iforta6D8B9.o
(Intel 13) $ ./dot.ifort13.permute.fast.exe
Random generation 0.09720600
dotProd time: 3.20100500
dotProd(npts,nobs) 0.21707590

(Intel 14) $ ifort -fast dot.permute.f90 -o dot.ifort14.permute.fast.exe
ipo: remark #11001: performing single-file optimizations
ipo: remark #11006: generating object file /gpfsm/dnb31/tdirs/pbs/slurm.448505.mathomp4/ipo_iforthmRKBV.o
(Intel 14) $ ./dot.ifort14.permute.fast.exe
Random generation 0.10952500
dotProd time: 3.42286700
dotProd(npts,nobs) 0.21707590

(That said, the compilers are even for once.)
