Hi,
I found a very strange thing: adding more memory allocations leads to significantly more data load instructions and CPU_TIME.
I know this sounds weird, so I have posted the code below to illustrate my problem (I have tried my best to simplify it):
subroutine ARK2(region)

  USE ModGlobal
  USE ModDataStruct
  USE ModIO
  USE ModDerivBuildOps
  USE ModDeriv
  USE ModMetrics
  USE ModAdvection
  USE ModMPI

  Implicit None

  ! ... incoming variables
  type(t_region), pointer :: region

  ! ... local variables
  integer :: rkStep, i, j, k, ng, ARK2_nStages, ImplicitFlag
  type(t_grid), pointer :: grid
  type(t_mixt), pointer :: state
  type(t_mixt_input), pointer :: input
  real(rfreal), pointer :: cv(:,:), dt(:), b_vec_exp(:), rhs_explicit(:,:,:)
  integer :: nGrids, nCells, nCv

  nGrids = region%nGrids

  if (rk_alloc .eqv. .true.) then
    do ng = 1, nGrids
      grid => region%grid(ng)
      input => grid%input
      call additive_RK_coeff(input, grid, ImplicitFlag)
    end do
    rk_alloc = .false.
    if (myrank == 0) write (*,'(A)') 'PlasComCM: ==> Using ARK2 time integration <=='
  end if

  grid => region%grid(1)
  state => region%state(1)
  input => grid%input
  ARK2_nStages = grid%ARK2_nStages

  ! ----------------------------
  ! ... memory allocation PART 1
  ! ----------------------------
  if (.not. allocated(state%rhs))          allocate(state%rhs(grid%nCells, input%nCv))
  if (.not. allocated(state%rhs_explicit)) allocate(state%rhs_explicit(grid%nCells, input%nCv, ARK2_nStages))
  if (.not. allocated(state%timeOld))      allocate(state%timeOld(grid%nCells))
  if (.not. allocated(state%cfl))          allocate(state%cfl(grid%nCells))
  if (.not. allocated(state%cvOld))        allocate(state%cvOld(grid%nCells, input%nCv))
  if (.not. allocated(state%dt))           allocate(state%dt(grid%nCells))
  if (.not. allocated(time_g))             allocate(time_g(grid%nCells))
  if (.not. allocated(timeOld_g))          allocate(timeOld_g(grid%nCells))
  if (.not. allocated(dt_g))               allocate(dt_g(grid%nCells))
  if (.not. allocated(rhs_explicit_g))     allocate(rhs_explicit_g(grid%nCells, input%nCv, ARK2_nStages))
  if (.not. allocated(state_rhs_g))        allocate(state_rhs_g(grid%nCells, input%nCv))
  if (.not. allocated(JAC_g))              allocate(JAC_g(grid%nCells))
  if (.not. allocated(cv_g))               allocate(cv_g(grid%nCells, input%nCv))
  if (.not. allocated(cvOld_g))            allocate(cvOld_g(grid%nCells, input%nCv))

  ! ------------------------------------------------------------------------
  ! ... memory allocation PART 2 (adding these 5 memory allocations leads to
  ! ... significantly more data load instructions and CPU_TIME for the loop
  ! ... at the bottom!!!)
  ! ------------------------------------------------------------------------
  if (.not. allocated(a_mat_exp_g)) allocate(a_mat_exp_g(ARK2_nStages, ARK2_nStages))
  if (.not. allocated(a_mat_imp_g)) allocate(a_mat_imp_g(ARK2_nStages, ARK2_nStages))
  if (.not. allocated(b_vec_exp_g)) allocate(b_vec_exp_g(ARK2_nStages))
  if (.not. allocated(b_vec_imp_g)) allocate(b_vec_imp_g(ARK2_nStages))
  if (.not. allocated(c_vec_g))     allocate(c_vec_g(ARK2_nStages))

  ! ... dereference pointers
  cv => state%cv
  dt => state%dt
  b_vec_exp => grid%ARK2_b_vec_exp
  rhs_explicit => state%rhs_explicit
  nCv = input%nCv
  nCells = grid%nCells

  ! ----------------------------------------------------------
  ! ... Adding memory allocation PART 2 leads to significantly
  ! ... more data load instructions and CPU_TIME for this loop!!!
  ! ----------------------------------------------------------
  do j = 1, ARK2_nStages
    do k = 1, nCv
      !DIR$ SIMD
      do i = 1, nCells
        cv(i,k) = cv(i,k) + dt(i) * b_vec_exp(j) * rhs_explicit(i,k,j)
      end do
    end do
  end do

end subroutine ARK2
As shown above, ARK2 has two memory-allocation parts and one loop. My findings are:
If I keep only memory allocation PART 1 (without PART 2), the loop runs very fast, with a small number of data load instructions;
If I keep both memory allocation PART 1 and PART 2, the loop runs very slowly, with a much larger number of data load instructions.
Using TAU and PAPI, I measured the CPU_TIME and memory behaviour of the loop for these two cases. Here are the results:
                 CPU_TIME (s)   Data load instr.   L1 hits   L2 hits   L3 hits   Main memory hits
Without PART 2       1.82           3.8E09           87%        3%       10%           0%
With PART 2         13.24           5.6E10           99%        1%        0%           0%
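To put numbers on the increase, the ratios implied by the table work out as follows (simple arithmetic on the measured values):

```python
# Ratios implied by the measurements above
cpu_time_without, cpu_time_with = 1.82, 13.24   # seconds
loads_without, loads_with = 3.8e9, 5.6e10      # data load instructions

print(round(cpu_time_with / cpu_time_without, 1))  # CPU_TIME ratio: ~7.3x
print(round(loads_with / loads_without, 1))        # data-load ratio: ~14.7x
```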
We can see that the CPU_TIME increases by about a factor of 7 (1.82 s to 13.24 s) and the number of data load instructions by about a factor of 15 (3.8E09 to 5.6E10). From the cache-usage results, it looks as if PART 2 adds a huge number of extra L1 loads that all hit. These results are very confusing to me, since PART 2 just allocates a few small arrays. How can it have such a huge influence on the performance of the loop? There seems to be a limit on the amount of memory I can allocate before performance degrades.
I promise these results are repeatable. I compiled the code with ifort 13.1.0 using the -O2 -xHost options and SIMD directives, and ran it on an Intel E5 Sandy Bridge processor with a single MPI process. I would truly appreciate any hints about what is going on here. Thanks for your time, help, and patience in reading this long story.
Best regards,
Wentao