Dear all,
Below is a Coarray Fortran program that is giving me some trouble:
A large vector (x) is updated in two different ways. For large sizes of x, the updated x is wrong WHEN each image runs on a different node. WHEN all images are on the same node, the results are always correct, whatever the size of x.
When size(x)=10^6, exchanging the full array between images on different nodes gives wrong results, whereas exchanging small subsets of x (2.5*10^5 elements each with 4 images) gives correct results.
When size(x)>2*10^7, exchanging the full array between images on different nodes gives wrong results, AND exchanging subsets of x (size(subset) > 6*10^6) gives wrong results too.
The problem seems to be linked to the size of the array exchanged between images on different nodes. Am I doing something wrong, or could it be a bug?
I use ifort 17.0.0 with -coarray=distributed.
Here is a program that mimics the problem (it may look clumsy, with too many sync all statements, but it is only meant to reproduce my issue):
program testcoarray
  implicit none
  integer(kind=4)::i,j,k,neq
  integer(kind=4)::startrow[*],endrow[*]
  real(kind=8)::val[*]
  real(kind=8),allocatable::x(:)[:]

  neq=1000000
  neq=25806732        ! the second assignment overrides the first
  if(this_image().eq.1)then
    write(*,'(/a,i0)')' Size of the array: ',neq
    write(*,'(a,i0/)')' Number of images : ',num_images()
  endif

  !INITIALISATION: each image owns the block x(startrow:endrow)
  i=neq/num_images()
  startrow=(this_image()-1)*i+1
  endrow=this_image()*i
  if(this_image().eq.num_images())endrow=neq
  allocate(x(neq)[*])
  sync all

  !FIRST UPDATE: image 1 adds the full array of every other image
  x=0.d0
  x(startrow:endrow)=real(this_image(),8)
  sync all
  if(this_image().eq.1)then
    do i=2,num_images()
      x=x+x(:)[i]
    enddo
    write(*,*)' First update : ',sum(x)
  endif
  sync all

  !SECOND UPDATE: image 1 adds only the block owned by each other image
  x=0.d0
  x(startrow:endrow)=real(this_image(),8)
  sync all
  if(this_image().eq.1)then
    do i=2,num_images()
      j=startrow[i]
      k=endrow[i]
      x(j:k)=x(j:k)+x(j:k)[i]
    enddo
    write(*,*)' Second update: ',sum(x)
  endif
  sync all

  !CORRECT ANSWER: each image sums its own block, image 1 adds the partial sums
  x=0.d0
  x(startrow:endrow)=real(this_image(),8)
  val=sum(x)
  sync all
  if(this_image().eq.1)then
    do i=2,num_images()
      val=val+val[i]
    enddo
    write(*,*)' Correct value: ',val
  endif
  sync all
end program
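Since small subsets seem to transfer correctly, one quick experiment is to split each remote read into fixed-size blocks and check whether the FIRST UPDATE then gives the correct value. Below is a minimal sketch, assuming the failures depend only on the size of a single remote transfer (not confirmed). The module chunked_exchange, the subroutine chunked_add and its max_chunk argument are hypothetical names introduced here, and the block size used below is an arbitrary value, not a known threshold.

```fortran
! Minimal sketch. Assumption (unconfirmed): the wrong results depend on the
! size of a single remote transfer. All names here are hypothetical.
module chunked_exchange
  implicit none
contains
  ! Add x(lo:hi) of image src to the local x(lo:hi), reading at most
  ! max_chunk elements from the remote image per coindexed access.
  subroutine chunked_add(x, src, lo, hi, max_chunk)
    real(kind=8), intent(inout) :: x(:)[*]
    integer(kind=4), intent(in) :: src, lo, hi, max_chunk
    integer(kind=4) :: a, b
    if (max_chunk < 1) error stop 'max_chunk must be positive'
    a = lo
    do while (a <= hi)
      ! last index of the current block (never past hi, no overflow)
      b = a + min(max_chunk, hi - a + 1) - 1
      ! one remote read of b-a+1 elements from image src
      x(a:b) = x(a:b) + x(a:b)[src]
      a = b + 1
    end do
  end subroutine chunked_add
end module chunked_exchange
```

With use chunked_exchange added before implicit none in testcoarray, the FIRST UPDATE line x=x+x(:)[i] would become call chunked_add(x, i, 1, neq, 100000), and the SECOND UPDATE line x(j:k)=x(j:k)+x(j:k)[i] would become call chunked_add(x, i, j, k, 100000).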
And here is the output for neq=1000000:
* With all images on the same node:
 Size of the array: 1000000
 Number of images : 4
 First update :  2500000.00000000
 Second update:  2500000.00000000
 Correct value:  2500000.00000000
* With each image on a different node:
 Size of the array: 1000000
 Number of images : 4
 First update :  750000.000000000
 Second update:  2500000.00000000
 Correct value:  2500000.00000000
And here is the output for neq=25806732:
* With all images on the same node:
 Size of the array: 25806732
 Number of images : 4
 First update :  64516830.0000000
 Second update:  64516830.0000000
 Correct value:  64516830.0000000
* With each image on a different node:
 Size of the array: 25806732
 Number of images : 4
 First update :  19355049.0000000
 Second update:  6451727.00000000
 Correct value:  64516830.0000000
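For reference, the expected value can be checked by hand. Writing N for neq, P for num_images() and n_p for the block length endrow-startrow+1 on image p, every image p fills its block with the value p, so the exact arithmetic is as follows (the last line only decomposes the wrong values reported above):

$$
\begin{aligned}
S &= \sum_{p=1}^{P} p\,n_p, \qquad n_p = \lfloor N/P \rfloor \ (p < P), \qquad n_P = N - (P-1)\lfloor N/P \rfloor \\
N = 10^6,\ P = 4:&\quad n_p = 250000, \quad S = 250000\,(1+2+3+4) = 2500000 \\
N = 25806732,\ P = 4:&\quad n_p = 6451683, \quad S = 6451683\,(1+2+3+4) = 64516830 \\
\text{wrong values:}&\quad 750000 = 250000\,(1+2), \quad 19355049 = 6451683\,(1+2), \quad 6451727 = 6451683 + 44
\end{aligned}
$$

Both failed first updates are consistent with image 1 receiving the contribution of image 2 but not those of images 3 and 4, and the failed second update is only 44 above image 1's own block.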
Thank you in advance for your help.
Jeremie