Data movement time spent in rho_g2r_1spin
While running a small GAMMA case (159 atoms, 3 atomic species), I noticed that a non-trivial amount of time is spent inside rho_g2r_1spin (line 234). The calculation runs fast, but it could be faster!
!------------------------------------------------------------------
SUBROUTINE rho_g2r_1spin( desc, rhog, rhor )
  !----------------------------------------------------------------
  !! Bring charge density rho from G-space to real space. 1-dimensional
  !! input (1 spin component only).
  !
  USE fft_types, ONLY: fft_type_descriptor
  USE fft_helper_subroutines, ONLY: fftx_oned2threed
  !
  IMPLICIT NONE
  !
  TYPE(fft_type_descriptor), INTENT(IN) :: desc
  COMPLEX(DP), INTENT(IN) :: rhog(:)
  REAL(DP), INTENT(OUT) :: rhor(:)
  !
  INTEGER :: ir
  COMPLEX(DP), ALLOCATABLE :: psi(:)
  !
  ALLOCATE( psi(desc%nnr) )
  !
  !$acc data present_or_copyin(rhog) present_or_copyout(rhor) create(psi)   ! <<<<<<<< HERE
  !
  CALL fftx_oned2threed( desc, psi, rhog )
  !
  !$acc host_data use_device( psi )
  CALL invfft( 'Rho', psi, desc )
  !$acc end host_data
  !
#if defined(_OPENACC)
  !$acc parallel loop
#else
  !$omp parallel do
#endif
  DO ir = 1, desc%nnr
     rhor(ir) = DBLE(psi(ir))
  ENDDO
#if !defined(_OPENACC)
  !$omp end parallel do
#endif
  !
  !$acc end data
  DEALLOCATE( psi )
  !
END SUBROUTINE rho_g2r_1spin
I am running QE 7.4 with NVHPC 25.1 (HPCX bundled with NVHPC) and NVPL 25.1. The GPU is a GH200.
The input case is attached.
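For illustration only, here is a minimal standalone sketch of how the clauses on the marked line behave. It assumes the time goes into host/device transfers triggered by present_or_copyin and present_or_copyout when the arrays are not already on the device, which has not been verified for QE. The program and the names resident_data_sketch, real_part, cin and rout are hypothetical and not part of QE. When the caller keeps the arrays resident in an enclosing data region, the inner clauses find them present and move nothing.

```fortran
! Hypothetical standalone sketch, not QE code.
! Builds with or without OpenACC: without it the directives are plain comments.
PROGRAM resident_data_sketch
  IMPLICIT NONE
  INTEGER, PARAMETER :: DP = KIND(1.0d0)
  INTEGER, PARAMETER :: n = 1024, ncalls = 10
  COMPLEX(DP), ALLOCATABLE :: cin(:)
  REAL(DP), ALLOCATABLE :: rout(:)
  INTEGER :: i
  !
  ALLOCATE( cin(n), rout(n) )
  cin = (1.0_DP, 2.0_DP)
  !
  ! Outer region: cin is copied to the device once on entry,
  ! rout is copied back to the host once on exit.
  !$acc data copyin(cin) copyout(rout)
  DO i = 1, ncalls
     CALL real_part( cin, rout )
  ENDDO
  !$acc end data
  !
  PRINT *, rout(1), rout(n)
  DEALLOCATE( cin, rout )
  !
CONTAINS
  !
  SUBROUTINE real_part( a, b )
    ! Same clause pattern as the marked line in rho_g2r_1spin.
    COMPLEX(DP), INTENT(IN) :: a(:)
    REAL(DP), INTENT(OUT) :: b(:)
    INTEGER :: ir
    ! a and b are already present (outer region), so no transfer happens here.
    ! Called outside any data region, the same clauses would copy on every call.
    !$acc data present_or_copyin(a) present_or_copyout(b)
    !$acc parallel loop
    DO ir = 1, SIZE(a)
       b(ir) = DBLE(a(ir))
    ENDDO
    !$acc end data
  END SUBROUTINE real_part
  !
END PROGRAM resident_data_sketch
```

Whether rhog and rhor are already resident on the device when rho_g2r_1spin is called in this run is exactly what a profile of that line would need to show.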
Edited by Filippo Spiga
