Hello everyone, I'm working on some astrophysical simulations and have written a hybrid Fortran code that uses both MPI and OpenMP. The code works fine, compiled with -O3, for small to medium array sizes. However, when I increase the size of the matrices to get higher resolution, I run out of memory on the HPC cluster nodes, which have 96 GB each; a single 20000x20000 DOUBLE PRECISION array is already about 3.2 GB, and my code declares nine of them. I've been running on 4 nodes with 20 cores each. Is there any way to run my code with larger matrix sizes without such a large memory requirement? I've read about allocatable arrays and tried them, but the memory reduction was less than 1 %, so perhaps I'm allocating and deallocating them incorrectly (I've sketched what I tried after the main code below). Basically, I read four external input arrays (X, Y, zeta_list, M_list), and the only major calculation happens inside one do loop that iterates over the coordinates in zeta_list, so I use MPI_Scatter to split them into chunks across processes. I've included example code below to make it easier to see what's going on. I'd appreciate any help with this!
PROGRAM TEST_MPI
    use mpi
    IMPLICIT NONE
    DOUBLE PRECISION,DIMENSION(1,1000000)::M_list
    DOUBLE PRECISION,DIMENSION(20000,20000)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z,alpha_x,alpha_y,sub_alpha_x,sub_alpha_y
    DOUBLE PRECISION,DIMENSION(2,1000000)::zeta_list
    DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::sub_zeta_list
    INTEGER::i,myid,ntasks,ierror,n_elements,chunk,t1,t2,clock_rate,clock_max,required,provided

    call MPI_INIT_THREAD(MPI_THREAD_SERIALIZED,provided,ierror)
    call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierror)
    call MPI_COMM_SIZE(MPI_COMM_WORLD,ntasks,ierror)

    if (myid==0) then
        call system_clock(t1,clock_rate,clock_max)
    end if

    ! Read the four external input arrays
    open(10,file='/home/h1352888/Fortran_Test/10000points_5000grid/M_list.bin',form='UNFORMATTED',access='STREAM')
    open(9,file='/home/h1352888/Fortran_Test/10000points_5000grid/zeta_list.bin',form='UNFORMATTED',access='STREAM')
    open(8,file='/home/h1352888/Fortran_Test/10000points_5000grid/X.bin',form='UNFORMATTED',access='STREAM')
    open(7,file='/home/h1352888/Fortran_Test/10000points_5000grid/Y.bin',form='UNFORMATTED',access='STREAM')
    read(10)M_list
    read(9)zeta_list
    read(8)X
    read(7)Y

    alpha_x = 0
    alpha_y = 0
    sub_alpha_x = 0
    sub_alpha_y = 0
    z_m_z_x = 0
    z_m_z_y = 0
    dist_z_m_z = 0

    ! Split zeta_list into one chunk per MPI task
    n_elements = size(zeta_list,1)*size(zeta_list,2)
    chunk = n_elements/ntasks
    allocate(sub_zeta_list(2,chunk/2))
    sub_zeta_list = 0

    !The final output arrays are alpha_x and alpha_y, which are obtained from their sub_ counterparts via the loop.
    call MPI_Scatter(zeta_list,chunk,MPI_DOUBLE_PRECISION,sub_zeta_list,chunk,MPI_DOUBLE_PRECISION,0,MPI_COMM_WORLD,ierror)

    !$OMP PARALLEL DO SCHEDULE(GUIDED,50) private(z_m_z_x,z_m_z_y,dist_z_m_z) reduction(+:sub_alpha_x,sub_alpha_y)
    do i=1,chunk/2,1
        z_m_z_x = X - sub_zeta_list(1,i)
        z_m_z_y = Y - sub_zeta_list(2,i)
        dist_z_m_z = z_m_z_x**2 + z_m_z_y**2
        sub_alpha_x = sub_alpha_x + (M_list(1,i)*z_m_z_x/dist_z_m_z)
        sub_alpha_y = sub_alpha_y + (M_list(1,i)*z_m_z_y/dist_z_m_z)
    end do
    !$OMP END PARALLEL DO

    call MPI_Reduce(sub_alpha_x,alpha_x,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)
    call MPI_Reduce(sub_alpha_y,alpha_y,400000000,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierror)

    if (myid == 0) then
        write(*,*)'Number of processes is', ntasks
        write(*,*)chunk
        !write(*,*)alpha_x
        call system_clock(t2,clock_rate,clock_max)
        write(*,*)'Elapsed real time = ', real(t2-t1)/real(clock_rate), 'seconds'
        open(12,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_x.bin',form='UNFORMATTED',access='STREAM')
        write(12)alpha_x
        open(11,file='/home/h1352888/Fortran_Test/10000points_5000grid/alpha_y.bin',form='UNFORMATTED',access='STREAM')
        write(11)alpha_y
    end if

    call MPI_FINALIZE(ierror)
    stop
END PROGRAM TEST_MPI
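For reference, this is roughly the pattern I used when I tried allocatable arrays (just a sketch with the same dimensions as the declarations above; whether I'm allocating and deallocating at the right points is exactly what I'm unsure about):

    DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::X,Y,z_m_z_x,z_m_z_y,dist_z_m_z
    DOUBLE PRECISION,ALLOCATABLE,DIMENSION(:,:)::alpha_x,alpha_y,sub_alpha_x,sub_alpha_y

    ! Allocated right after MPI initialisation, before the reads
    allocate(X(20000,20000),Y(20000,20000))
    allocate(z_m_z_x(20000,20000),z_m_z_y(20000,20000),dist_z_m_z(20000,20000))
    allocate(alpha_x(20000,20000),alpha_y(20000,20000))
    allocate(sub_alpha_x(20000,20000),sub_alpha_y(20000,20000))

    ! ... same reads, scatter, and main loop as in the full program above ...

    ! Deallocated just before MPI_FINALIZE
    deallocate(X,Y,z_m_z_x,z_m_z_y,dist_z_m_z,alpha_x,alpha_y,sub_alpha_x,sub_alpha_y)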
I compile with mpiifort -g -traceback -qopenmp -O3 -ipo -xHost -mt_mpi test.f90. I also know about the -heap-arrays flag, which I use sometimes, but it slows the code down significantly (2x-3x); alternatively I just set OMP_STACKSIZE, although I know that isn't a good solution for larger matrices. For example, when running on 4 nodes with 20 cores each, I use 4 MPI processes and give OpenMP 20 threads per process.
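In case it's relevant, this is roughly how I launch it (just a sketch; the real job script on our cluster is more involved, and the -perhost option and the OMP_STACKSIZE value here are only examples of what I've been trying):

    export OMP_NUM_THREADS=20
    export OMP_STACKSIZE=512m          # only when I'm not compiling with -heap-arrays; the value is just an example
    mpirun -np 4 -perhost 1 ./test     # 4 MPI processes, one per node, 20 OpenMP threads each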
Thanks a lot