CUDA + RMM-DIIS causes `Illegal memory access` in test-suite when k != GAMMA
I've been trying to compile QE for GPU using:
- HPC-SDK 23.7 (both using `nv...` and `pg...` compilers)
- CUDA 12.1.1
- OpenMPI 4.1.5 (Tested both with and without)
and running on a system with an A100 GPU.
All tests in the pw test-suite pass, but the ones that use RMM-DIIS diagonalization and involve k-points other than GAMMA, in particular:
- system--pw_noncolin--noncolin-rmm
- system--pw_scf--scf-rmm-k
- system--pw_scf--scf-rmm-paro-k
fail right after RMM-DIIS diagonalization first appears in the output, with the error:
cudaMemcpy returned status 700: an illegal memory access was encountered
The GAMMA-only counterparts do pass:
- system--pw_scf--scf-rmm-gamma
- system--pw_scf--scf-rmm-paro-gamma
I've seen from other issues that HPC-SDK with QE can be finicky, but since the failures seem to appear in a specific section of the code, I thought this might be worth investigating.