Davidson ported with OpenACC, CPU and GPU versions merged into one (!1826) · Merge requests · QEF - Quantum ESPRESSO Foundation / q-e

Ivan Carnimeo requested to merge icarnimeo/q-e:ks_acc into develop May 06, 2022

NOTES:

MYDDOT_VECTOR_GPU has been added to UtilXlib to perform ddot inside GPU kernels using the OpenACC "vector" level of parallelism.
Four new subroutines (mp_sum_rm_nc, mp_sum_cm_nc, mp_sum_rm_nc_gpu, mp_sum_cm_nc_gpu, where "nc" means "non-contiguous") have been added to the mp_sum interface to perform mp_sum on non-contiguous arrays: Call mp_sum(a(k1:k2,k3:k4), MPI_COMM) --> Call mp_sum(a, k1, k2, k3, k4, MPI_COMM). The new subroutines allocate a buffer (msg_buff) internally to pack the input array. In the older version of cegterg_gpu the buffer had to be allocated by the caller before the call to mp_sum (see pinned_buffer in the old cegterg_gpu), so the new interface simplifies the calling code.
Regarding memory consumption (cf. mp_sum_cm_nc_gpu vs mp_sum_cm_gpu):

in the __GPU_MPI case, the overall GPU memory consumption is unchanged, because the pinned_buffer allocation in cegterg_gpu has simply been replaced by the msg_buff allocation in mp_sum_cm_nc_gpu;
in the non-__GPU_MPI case, the overall GPU memory consumption is reduced, because the pinned_buffer allocation outside mp_sum is avoided and msg_buff (mp_sum_cm_nc_gpu) is used in place of msg_h (mp_sum_cm_gpu);
the CPU case in cegterg has been guarded with __CUDA, because on the CPU the regular mp_sum (mp_sum_cm_gpu) works well and there is no reason to call mp_sum_cm_nc_gpu and allocate msg_buff.
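
The pack, reduce, unpack pattern behind the "nc" routines can be illustrated with a minimal, self-contained sketch. This is not the code in this merge request. The names nc_sum_sketch, sum_nc_sketch and fake_allreduce are hypothetical, a real(dp) array is used instead of the complex matrices handled by mp_sum_cm_nc_gpu, and the MPI reduction is replaced by a stand-in that doubles the buffer (as if two ranks held identical data) so that the example runs without MPI or a GPU.

```fortran
! Hypothetical sketch of the pack -> reduce -> unpack idea; not the QE source.
module nc_sum_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Stand-in for the MPI reduction (assumption): behaves as if two ranks
  ! contributed identical data, so every element is doubled.
  subroutine fake_allreduce(buf)
    real(dp), intent(inout) :: buf(:)
    buf = 2.0_dp * buf
  end subroutine fake_allreduce

  ! Sum the non-contiguous block a(k1:k2, k3:k4) through an internal buffer.
  subroutine sum_nc_sketch(a, k1, k2, k3, k4)
    real(dp), intent(inout) :: a(:, :)
    integer,  intent(in)    :: k1, k2, k3, k4
    real(dp), allocatable   :: msg_buff(:)
    integer :: i, j, nrow

    nrow = k2 - k1 + 1
    ! The buffer is allocated here, so the caller does not have to provide one.
    allocate(msg_buff(nrow * (k4 - k3 + 1)))

    ! Pack the block column by column into contiguous storage.
    do j = k3, k4
      do i = k1, k2
        msg_buff((j - k3) * nrow + (i - k1) + 1) = a(i, j)
      end do
    end do

    call fake_allreduce(msg_buff)

    ! Unpack the reduced values back into the original block.
    do j = k3, k4
      do i = k1, k2
        a(i, j) = msg_buff((j - k3) * nrow + (i - k1) + 1)
      end do
    end do

    deallocate(msg_buff)
  end subroutine sum_nc_sketch
end module nc_sum_sketch

program demo_nc_sum
  use nc_sum_sketch
  implicit none
  real(dp) :: a(4, 4)

  a = 1.0_dp
  call sum_nc_sketch(a, 2, 3, 2, 4)

  ! 16 elements equal to 1, of which the 2x3 block is doubled: 16 + 6 = 22.
  if (abs(sum(a) - 22.0_dp) > 1.0e-12_dp) error stop "unexpected result"
  print *, "sum(a) =", sum(a)
end program demo_nc_sum
```

Only the 2x3 block a(2:3, 2:4) is touched; the rest of the array is left as it was, which is the behaviour one would expect from a reduction restricted to a sub-block.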

Note that only mp_sum_cm_nc_gpu is used at the moment; the other routines (mp_sum_rm_nc, mp_sum_cm_nc, mp_sum_rm_nc_gpu) are included for completeness.
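
As an illustration of the first note (a dot product executed at the OpenACC "vector" level inside a kernel), here is a minimal sketch. It is not the actual MYDDOT_VECTOR_GPU source: the names ddot_sketch and ddot_vector_sketch, the argument list and the driver program are assumptions made for the example. Without an OpenACC compiler the directives are plain comments and the program runs serially on the CPU.

```fortran
! Hypothetical sketch of a vector-level dot product; not the QE source.
module ddot_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  function ddot_vector_sketch(n, x, y) result(res)
    !$acc routine vector
    integer,  intent(in) :: n
    real(dp), intent(in) :: x(n), y(n)
    real(dp) :: res
    integer :: i

    res = 0.0_dp
    ! The vector lanes share the iterations and combine their partial sums.
    !$acc loop vector reduction(+:res)
    do i = 1, n
      res = res + x(i) * y(i)
    end do
  end function ddot_vector_sketch
end module ddot_sketch

program demo_ddot
  use ddot_sketch
  implicit none
  integer, parameter :: n = 4, m = 3
  real(dp) :: a(n, m), b(n, m), d(m)
  integer :: i, j

  do j = 1, m
    do i = 1, n
      a(i, j) = real(i, dp)
      b(i, j) = real(j, dp)
    end do
  end do

  ! One gang per column; each gang calls the vector routine on its column.
  !$acc parallel loop gang copyin(a, b) copyout(d)
  do j = 1, m
    d(j) = ddot_vector_sketch(n, a(:, j), b(:, j))
  end do

  ! d(j) = j * (1 + 2 + 3 + 4) = 10 * j
  if (any(abs(d - [10.0_dp, 20.0_dp, 30.0_dp]) > 1.0e-12_dp)) &
    error stop "unexpected result"
  print *, d
end program demo_ddot
```

Declaring the function with "routine vector" lets it be called from inside an outer gang-level loop, so many small dot products can run within one kernel rather than each requiring a separate library call.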

Edited May 16, 2022 by Ivan Carnimeo
