Hi,
I found a very strange thing: adding more memory allocations leads to significantly more data load instructions and CPU_TIME.
I know this sounds weird, so I have posted the code below to illustrate my problem (I have tried my best to simplify it):
subroutine ARK2(region)

  USE ModGlobal
  USE ModDataStruct
  USE ModIO
  USE ModDerivBuildOps
  USE ModDeriv
  USE ModMetrics
  USE ModAdvection
  USE ModMPI

  Implicit None

  ! ... incoming variables
  type(t_region), pointer :: region

  ! ... local variables
  integer :: rkStep, i, j, k, ng, ARK2_nStages, ImplicitFlag
  type(t_grid), pointer :: grid
  type(t_mixt), pointer :: state
  type(t_mixt_input), pointer :: input
  real(rfreal), pointer :: cv(:,:), dt(:), b_vec_exp(:), rhs_explicit(:,:,:)
  integer :: nGrids, nCells, nCv

  nGrids = region%nGrids

  if (rk_alloc .eqv. .true.) then
    do ng = 1, nGrids
      grid => region%grid(ng)
      input => grid%input
      call additive_RK_coeff(input, grid, ImplicitFlag)
    end do
    rk_alloc = .false.
    if (myrank == 0) write (*,'(A)') 'PlasComCM: ==> Using ARK2 time integration <=='
  end if

  grid => region%grid(1)
  state => region%state(1)
  input => grid%input
  ARK2_nStages = grid%ARK2_nStages

  ! ----------------------------
  ! ... memory allocation PART 1
  ! ----------------------------
  if (.not. allocated(state%rhs))          allocate(state%rhs(grid%nCells, input%nCv))
  if (.not. allocated(state%rhs_explicit)) allocate(state%rhs_explicit(grid%nCells, input%nCv, ARK2_nStages))
  if (.not. allocated(state%timeOld))      allocate(state%timeOld(grid%nCells))
  if (.not. allocated(state%cfl))          allocate(state%cfl(grid%nCells))
  if (.not. allocated(state%cvOld))        allocate(state%cvOld(grid%nCells, input%nCv))
  if (.not. allocated(state%dt))           allocate(state%dt(grid%nCells))
  if (.not. allocated(time_g))             allocate(time_g(grid%nCells))
  if (.not. allocated(timeOld_g))          allocate(timeOld_g(grid%nCells))
  if (.not. allocated(dt_g))               allocate(dt_g(grid%nCells))
  if (.not. allocated(rhs_explicit_g))     allocate(rhs_explicit_g(grid%nCells, input%nCv, ARK2_nStages))
  if (.not. allocated(state_rhs_g))        allocate(state_rhs_g(grid%nCells, input%nCv))
  if (.not. allocated(JAC_g))              allocate(JAC_g(grid%nCells))
  if (.not. allocated(cv_g))               allocate(cv_g(grid%nCells, input%nCv))
  if (.not. allocated(cvOld_g))            allocate(cvOld_g(grid%nCells, input%nCv))

  ! ------------------------------------------------------------------------
  ! ... memory allocation PART 2 (adding these 5 memory allocations leads to
  ! ... significantly more data load instructions and CPU_TIME for the loop
  ! ... at the bottom!!!)
  ! ------------------------------------------------------------------------
  if (.not. allocated(a_mat_exp_g)) allocate(a_mat_exp_g(ARK2_nStages, ARK2_nStages))
  if (.not. allocated(a_mat_imp_g)) allocate(a_mat_imp_g(ARK2_nStages, ARK2_nStages))
  if (.not. allocated(b_vec_exp_g)) allocate(b_vec_exp_g(ARK2_nStages))
  if (.not. allocated(b_vec_imp_g)) allocate(b_vec_imp_g(ARK2_nStages))
  if (.not. allocated(c_vec_g))     allocate(c_vec_g(ARK2_nStages))

  ! ... dereference pointers
  cv => state%cv
  dt => state%dt
  b_vec_exp => grid%ARK2_b_vec_exp
  rhs_explicit => state%rhs_explicit
  nCv = input%nCv
  nCells = grid%nCells

  ! ----------------------------------------------------------
  ! ... Adding memory allocation PART 2 leads to significantly
  ! ... more data load instructions and CPU_TIME for this loop!!!
  ! ----------------------------------------------------------
  do j = 1, ARK2_nStages
    do k = 1, nCv
      !DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do

end subroutine ARK2
As shown above, ARK2 has two memory-allocation parts and one loop. My findings are:
If I keep only memory allocation PART 1 (without PART 2), the loop runs very fast, with a small number of data load instructions;
If I keep both memory allocation PART 1 and PART 2, the loop runs very slowly, with a much larger number of data load instructions.
Using TAU and PAPI, I measured the CPU_TIME and memory behaviour of the loop for these two cases. Here are the results:
                 CPU_TIME (s)   Data load instr.   L1 hits   L2 hits   L3 hits   Main memory hits
Without PART 2       1.82           3.8E09           87%        3%       10%           0%
With PART 2         13.24           5.6E10           99%        1%        0%           0%
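To put numbers on the increase, the ratios implied by the table work out as follows (simple arithmetic on the measured values):

```python
# Ratios implied by the measurements above
cpu_time_without, cpu_time_with = 1.82, 13.24   # seconds
loads_without, loads_with = 3.8e9, 5.6e10      # data load instructions

print(round(cpu_time_with / cpu_time_without, 1))  # CPU_TIME ratio: ~7.3x
print(round(loads_with / loads_without, 1))        # data-load ratio: ~14.7x
```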
We can see that the CPU_TIME increases by about a factor of 7 (1.82 s to 13.24 s) and the number of data load instructions by about a factor of 15 (3.8E09 to 5.6E10). From the cache-usage results, it looks as if PART 2 adds a huge number of extra L1 loads that all hit. These results are very confusing to me, since PART 2 just allocates a few small arrays. How can it have such a huge influence on the performance of the loop? There seems to be a limit on the amount of memory I can allocate before performance degrades.
I promise these results are repeatable. I compiled the code with ifort 13.1.0 using the -O2 -xHost options and SIMD directives, and ran it on an Intel E5 Sandy Bridge processor with a single MPI process. I would truly appreciate any hints about what is going on here. Thanks for your time, help, and patience in reading this long story.
Best regards,
Wentao