Data movement time spent in rho_g2r_1spin

While running a small GAMMA case (159 atoms, 3 atomic species), I noticed that a non-trivial amount of time is spent on data movement inside rho_g2r_1spin (line 234). The calculation runs fast, but it could be faster!

[Image: Screenshot_2025-04-01_at_11.26.59]

  !------------------------------------------------------------------
  SUBROUTINE rho_g2r_1spin( desc, rhog, rhor )
    !----------------------------------------------------------------
    !! Bring charge density rho from G-space to real space. 1-dimensional
    !! input (1 spin component only).
    !
    USE fft_types,              ONLY: fft_type_descriptor
    USE fft_helper_subroutines, ONLY: fftx_oned2threed
    !
    IMPLICIT NONE
    !
    TYPE(fft_type_descriptor), INTENT(IN) :: desc
    COMPLEX(DP), INTENT(IN)  :: rhog(:)
    REAL(DP),    INTENT(OUT) :: rhor(:)
    !
    INTEGER :: ir
    COMPLEX(DP), ALLOCATABLE :: psi(:)
    !
    ALLOCATE( psi(desc%nnr) )
    !
    !$acc data present_or_copyin(rhog) present_or_copyout(rhor) create(psi)   ! <<<<<<<< HERE
    !
    CALL fftx_oned2threed( desc, psi, rhog )
    !
    !$acc host_data use_device( psi )
    CALL invfft( 'Rho', psi, desc )
    !$acc end host_data
    !
#if defined(_OPENACC)
!$acc parallel loop
#else
!$omp parallel do
#endif
    DO ir = 1, desc%nnr
       rhor(ir) = DBLE(psi(ir))
    ENDDO
#if !defined(_OPENACC)
!$omp end parallel do
#endif
    !
    !$acc end data
    DEALLOCATE( psi )
    !
  END SUBROUTINE rho_g2r_1spin
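
To make the suspected cost concrete, here is a minimal, self-contained sketch of the same data clauses. It is not QE code: the module g2r_sketch, the routine real_part_on_device and the array size are hypothetical, and the FFT is replaced by a plain copy. It assumes the caller is able to keep rhog and rhor resident on the device, which I have not verified for the real call sites. In that case the present_or_copyin and present_or_copyout clauses find the data already present and skip the host/device copies. Without OpenACC the directives are plain comments and the program runs on the host.

```fortran
module g2r_sketch
  implicit none
  integer, parameter :: dp = selected_real_kind(14, 200)
contains
  subroutine real_part_on_device(n, rhog, rhor)
    ! Hypothetical stand-in for rho_g2r_1spin: same data clauses as the
    ! original routine, but the FFT is replaced by a plain copy.
    integer,     intent(in)  :: n
    complex(dp), intent(in)  :: rhog(n)
    real(dp),    intent(out) :: rhor(n)
    integer :: ir
    complex(dp), allocatable :: psi(:)
    allocate(psi(n))
    ! If rhog/rhor are already present on the device, no copy happens here.
    !$acc data present_or_copyin(rhog) present_or_copyout(rhor) create(psi)
    !$acc parallel loop
    do ir = 1, n
       psi(ir) = rhog(ir)
    end do
    !$acc parallel loop
    do ir = 1, n
       rhor(ir) = dble(psi(ir))
    end do
    !$acc end data
    deallocate(psi)
  end subroutine real_part_on_device
end module g2r_sketch

program demo
  use g2r_sketch
  implicit none
  integer, parameter :: n = 1000   ! arbitrary size, for illustration only
  complex(dp), allocatable :: rhog(:)
  real(dp),    allocatable :: rhor(:)
  integer :: i
  allocate(rhog(n), rhor(n))
  do i = 1, n
     rhog(i) = cmplx(real(i, dp), -1.0_dp, kind=dp)
  end do
  ! Caller-side data region: rhog is uploaded once, rhor stays on the device.
  !$acc data copyin(rhog) create(rhor)
  call real_part_on_device(n, rhog, rhor)
  ! Bring rhor back only when the host actually needs it.
  !$acc update self(rhor)
  !$acc end data
  if (maxval(abs(rhor - real(rhog, dp))) > 0.0_dp) error stop "mismatch"
  print '(a)', "rhor matches Re(rhog)"
end program demo
```

If the caller has no enclosing data region, the same clauses fall back to a copy-in of rhog on entry and a copy-out of rhor on exit at every call, which is the kind of transfer I suspect the profile is showing at the marked line.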

I am running QE 7.4 with NVHPC 25.1 (using the HPCX bundled with NVHPC) and NVPL 25.1. The GPU is a GH200.

I have attached the input case.

out.559799 pw_1.in

Edited by Filippo Spiga