Quantcast
Channel: Intel® Software - Intel® Fortran Compiler for Linux* and macOS*
Viewing all articles
Browse latest Browse all 2746

Extensive memory usage error for large matrices in MPI/OpenMP hybrid Fortran code

$
0
0

Hello everyone, I'm working on some astrophysical simulations and I've written a Fortran code. I have implemented MPI and OpenMP to create a hybrid code. The code works fine, with optimisations -O3, for small-medium array sizes. However, when I try to increase the size of the matrices to obtain higher resolution, I run out of memory on the nodes in the HPC cluster, to be exact 96 GB on each node. I've tried running on 4 nodes with 20 cores each. Is there any way I can run my code while using larger matrix sizes without having such a large memory requirement? I've read about using allocatable arrays and I've tried, but the memory reduction is only by less than 1 %, perhaps I'm allocating and deallocating them wrong. Basically, I read four external input arrays (X,Y,zeta_list,M_list) and I only have one major calculation that's happening in my code within a do loop, it loops over coordinates(in the array zeta_list), so I use MPI_scatter to break these up into chunks. I've shown an example code below so it might be easier to understand what's going on. Would appreciate any help on this!!!

​<span style="font-family: Arial, 宋体, Tahoma, Helvetica, sans-serif; font-size: 1em;"> </span>
PROGRAM TEST_MPI

use mpi

IMPLICIT NONE

DOUBLE PRECISION,DIMENSION(1,1000000)::M_list
DOUBLE PRECISION,DIMENSION(20000,20000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z,alpha_x, alpha_y, sub_alpha_x, sub_alpha_y
DOUBLE PRECISION,DIMENSION(2,1000000)::zeta_list
DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::sub_zeta_list
INTEGER::i,myid,ntasks,ierror,n_elements,chunk,t1,t2,clock_rate,clock_max,required,provided


call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED,provided,ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierror)

if (myid==0) then
	call system_clock ( t1, clock_rate, clock_max )
end if

open(10,file='/home/h1352888/Fortran_Test/10000points_5000grid/M_list.bin',form='UNFORMATTED',access='STREAM')
open(9,file='/home/h1352888/Fortran_Test/10000points_5000grid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
open(8,file='/home/h1352888/Fortran_Test/10000points_5000grid/X.bin',form='UNFORMATTED',access='STREAM')
open(7,file='/home/h1352888/Fortran_Test/10000points_5000grid/Y.bin',form='UNFORMATTED',access='STREAM')

read(10)M_list
read(9)zeta_list
read(8)X
read(7)Y


alpha_x = 0
alpha_y = 0
sub_alpha_x = 0
sub_alpha_y = 0
z_m_z_x = 0
z_m_z_y = 0
dist_z_m_z = 0
n_elements = size(zeta_list,1)*size(zeta_list,2)
chunk = n_elements/ntasks
allocate(sub_zeta_list(2,chunk/2))
sub_zeta_list = 0

!The final output arrays are alpha_x and alpha_y which are obtained from their sub_ counterparts via the loop.

call MPI_Scatter(zeta_list,chunk,MPI_DOUBLE_PRECISION,sub_zeta_list,chunk,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)
!$OMP PARALLEL DO SCHEDULE(GUIDED,50) private(z_m_z_x, z_m_z_y, dist_z_m_z) reduction( +: sub_alpha_x, sub_alpha_y )
do i=1,chunk/2,1
    	z_m_z_x = X - sub_zeta_list(1,i)
	z_m_z_y = Y - sub_zeta_list(2,i)
	dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
	sub_alpha_x = sub_alpha_x + (M_list(1,i)* z_m_z_x / dist_z_m_z)
        sub_alpha_y = sub_alpha_y + (M_list(1,i)* z_m_z_y / dist_z_m_z)
end do
!$OMP END PARALLEL DO

call MPI_Reduce(sub_alpha_x,alpha_x,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)
call MPI_Reduce(sub_alpha_y,alpha_y,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)

if (myid == 0) then
	write(*,*)'Number of processes is', ntasks
	write(*,*)chunk
	!write(*,*)alpha_x
	call system_clock ( t2, clock_rate, clock_max )
	write ( *, * ) 'Elapsed real time = ', real ( t2 - t1 ) / real ( clock_rate ) ,'seconds'
	open(12,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
	write(12)alpha_x
	open(11,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
	write(11)alpha_y
end if

call MPI_FINALIZE(ierror)
stop
END PROGRAM TEST_MPI  

I compile with mpiifort -g -traceback -qopenmp -O3 -ipo -xHost -mt_mpi test.f90, I also know about the -heap-arrays flag(which slows down the code significantly, 2x-3x) which I use sometimes, or I simply set OMP_STACKSIZE, although I know this isn't good for larger matrices. For example, if I run on 4 nodes and 20 cores, I set 4 MPI processes and let OpenMP have 20 threads.

Thanks a lot

 

 


Viewing all articles
Browse latest Browse all 2746

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>