Channel: Intel® Software - Intel® Fortran Compiler for Linux* and macOS*

Memory allocation leads to significant loss of performance


Hi,

I found something very strange: adding more memory allocations leads to significantly more data load instructions and a much higher CPU_TIME.
I know this sounds weird, so I have posted the code below to illustrate the problem (I have tried my best to simplify it):

  subroutine ARK2(region)
    
    USE ModGlobal
    USE ModDataStruct
    USE ModIO
    USE ModDerivBuildOps
    USE ModDeriv
    USE ModMetrics
    USE ModAdvection
    USE ModMPI

    Implicit None

! ... Incoming variables
    type(t_region), pointer :: region

! ... local variables
    integer :: rkStep, i, j, k, ng, ARK2_nStages, ImplicitFlag
    type(t_grid), pointer :: grid
    type(t_mixt), pointer :: state
    type(t_mixt_input), pointer :: input
    real(rfreal), pointer :: cv(:,:), dt(:), b_vec_exp(:), rhs_explicit(:,:,:)
    integer :: nGrids, nCells, nCv

    nGrids = region%nGrids

    if (rk_alloc .eqv. .true.) then
      do ng = 1, nGrids
        grid  => region%grid(ng)
        input => grid%input 
        call additive_RK_coeff(input, grid, ImplicitFlag)
      end do
      rk_alloc = .false.
      if (myrank == 0) write (*,'(A)') 'PlasComCM: ==> Using ARK2 time integration <=='
    end if
    
    grid => region%grid(1)
    state => region%state(1)
    input => grid%input
    ARK2_nStages = grid%ARK2_nStages
    
! ----------------------------
! ... memory allocation PART 1
! ----------------------------
    if (.not.allocated(state%rhs) .eqv. .true.) allocate(state%rhs(grid%nCells, input%nCv))
    if (.not.allocated(state%rhs_explicit) .eqv. .true.) allocate(state%rhs_explicit(grid%nCells, input%nCv, ARK2_nStages))
    if (.not.allocated(state%timeOld) .eqv. .true.) allocate(state%timeOld(grid%nCells))
    if (.not.allocated(state%cfl)     .eqv. .true.) allocate(state%cfl(grid%nCells))
    if (.not.allocated(state%cvOld)   .eqv. .true.) allocate(state%cvOld(grid%nCells,input%nCv))
    if (.not.allocated(state%dt)      .eqv. .true.) allocate(state%dt(grid%nCells))
    if (.not.allocated(time_g) .eqv. .true.) allocate(time_g(grid%nCells))
    if (.not.allocated(timeOld_g) .eqv. .true.) allocate(timeOld_g(grid%nCells))
    if (.not.allocated(dt_g) .eqv. .true.) allocate(dt_g(grid%nCells))
    if (.not.allocated(rhs_explicit_g) .eqv. .true.) allocate(rhs_explicit_g(grid%nCells, input%nCv, ARK2_nStages))
    if (.not.allocated(state_rhs_g) .eqv. .true.) allocate(state_rhs_g(grid%nCells, input%nCv))
    if (.not.allocated(JAC_g) .eqv. .true.) allocate(JAC_g(grid%nCells))
    if (.not.allocated(cv_g)   .eqv. .true.) allocate(cv_g(grid%nCells,input%nCv))
    if (.not.allocated(cvOld_g)   .eqv. .true.) allocate(cvOld_g(grid%nCells,input%nCv))

! ----------------------------------------------------------------------------------
! ... memory allocation PART 2 (adding these 5 memory allocations leads to significantly
! ... more data load instructions and CPU_TIME for the loop at the bottom!!!)
! ----------------------------------------------------------------------------------
    if (.not.allocated(a_mat_exp_g) .eqv. .true.) allocate(a_mat_exp_g(ARK2_nStages,ARK2_nStages))
    if (.not.allocated(a_mat_imp_g) .eqv. .true.) allocate(a_mat_imp_g(ARK2_nStages,ARK2_nStages))
    if (.not.allocated(b_vec_exp_g) .eqv. .true.) allocate(b_vec_exp_g(ARK2_nStages))
    if (.not.allocated(b_vec_imp_g) .eqv. .true.) allocate(b_vec_imp_g(ARK2_nStages))
    if (.not.allocated(c_vec_g) .eqv. .true.) allocate(c_vec_g(ARK2_nStages))

! ... dereference pointers
    cv => state%cv
    dt => state%dt
    b_vec_exp => grid%ARK2_b_vec_exp
    rhs_explicit => state%rhs_explicit
    nCv = input%nCv
    nCells = grid%nCells

! ----------------------------------------------------------
! ... Adding memory allocation PART 2 leads to significantly 
! ... more data load instructions and CPU_TIME for this loop!!!
! ----------------------------------------------------------
    do j = 1, ARK2_nStages
      do k = 1, nCv
!DIR$ SIMD
        do i = 1, nCells
          cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
        end do
      end do
    end do

  end subroutine ARK2

 

As shown above, ARK2 contains two memory allocation parts and one loop. My finding is:
With only memory allocation PART 1 (without PART 2), the loop runs very fast, with a small number of data load instructions.
With both memory allocation PART 1 and PART 2, the loop runs very slowly, with a much larger number of data load instructions.

Using TAU and PAPI, I measured the CPU_TIME and memory access behavior of the loop for these two cases. Here are the results:
                  CPU_TIME (s)   Data load instructions   L1 cache hits   L2 cache hits   L3 cache hits   Main memory hits
Without PART 2        1.82              3.8E09                 87%              3%             10%               0%
With PART 2          13.24              5.6E10                 99%              1%              0%               0%

We can see that CPU_TIME increases by about 7 times (1.82 s to 13.24 s) and the number of data load instructions by about 15 times (3.8E09 to 5.6E10). From the cache usage results, it seems to me that PART 2 adds a lot of useless L1 cache hits. These results are very confusing to me, since PART 2 just allocates some memory. How can it have such a huge influence on the performance of the loop? There seems to be a limit on the amount of memory that I can allocate before performance suffers.
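In case it helps anyone look at the loop in isolation, here is a minimal standalone sketch of it. Everything outside the loop body is an assumption made for illustration: the array sizes, the initial values, the definition of rfreal as double precision, and the use of the cpu_time intrinsic in place of TAU/PAPI. It is not the real code, and it may well not reproduce the slowdown; it only keeps the same pointer-based access pattern.

```fortran
program time_ark2_loop
  implicit none
  ! Assumption: rfreal is double precision (the real definition lives in ModGlobal)
  integer, parameter :: rfreal = kind(1.0d0)
  ! Hypothetical problem sizes, not taken from the real run
  integer, parameter :: nCells = 100000, nCv = 5, nStages = 6
  real(rfreal), allocatable, target :: cv_t(:,:), dt_t(:), b_t(:), rhs_t(:,:,:)
  real(rfreal), pointer :: cv(:,:), dt(:), b_vec_exp(:), rhs_explicit(:,:,:)
  real(rfreal) :: t_start, t_end
  integer :: i, j, k

  allocate(cv_t(nCells,nCv), dt_t(nCells), b_t(nStages), rhs_t(nCells,nCv,nStages))
  cv_t  = 0.0_rfreal
  dt_t  = 1.0e-3_rfreal
  b_t   = 1.0_rfreal / nStages
  rhs_t = 1.0_rfreal

  ! Same pointer-based access as in ARK2
  cv => cv_t
  dt => dt_t
  b_vec_exp => b_t
  rhs_explicit => rhs_t

  call cpu_time(t_start)
  do j = 1, nStages
    do k = 1, nCv
!DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do
  call cpu_time(t_end)

  ! Print a result so the compiler cannot discard the loop
  write (*,'(A,ES12.4)') 'cv(1,1)      = ', cv(1,1)
  write (*,'(A,F10.4)')  'loop seconds = ', t_end - t_start
end program time_ark2_loop
```

Building this with the same options (-O2 -xHost) and comparing the timing, or the generated assembly for the inner loop, with and without extra allocate statements before the loop might show whether the allocations themselves matter or whether something else in the full ARK2 routine changes how the loop is compiled.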

I promise these results are repeatable. I compiled the code using ifort 13.1.0 with the -O2 -xHost options and SIMD directives, and ran it on an Intel E5 (Sandy Bridge) processor with only one MPI process. I would truly appreciate any hints about what is going on here. Thank you for your time, help, and patience in reading this long story.

Best regards,
    Wentao 

 

