This repository hosts the experimental GPU-accelerated version of Quantum ESPRESSO.
This project aims at developing, testing and stabilizing the GPU adaptation of a number of components of the QE suite. Mature components are eventually merged into the official repository (QEF/q-e).
Currently only pw.x can take advantage of offloading computation to NVIDIA GPUs. The code has been tested with K40, K80, P100 and V100 cards. It provides all the functionalities of the equivalent CPU-only version, but the level of acceleration obtained depends on the task being performed.
Quantum ESPRESSO GPU follows the same software versioning as the official Quantum ESPRESSO releases, with an additional alpha version used to indicate advances in the GPU-accelerated parts.
For example, v6.4-a1 indicates that the release is fully compatible with the CPU version v6.4 and is the first alpha release of the GPU accelerated part.
In order to compile this version of the code you need:
CUDA Toolkit v8+
PGI Compilers v17.10+
Both can be downloaded free of charge from the NVIDIA and PGI websites.
The code is compiled with the standard ./configure and make pw sequence. Compiling the GPU part requires a recent version of the PGI compilers; the latest Community Edition can be downloaded for free. The environment must also provide recent NVIDIA drivers and at least CUDA Toolkit v8.
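A typical invocation is sketched below, using the CUDA-related options of QE's configure script (adapt the values to your system):

```
./configure --with-cuda=XX --with-cuda-runtime=Y.y --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]
```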
where XX is the location of the CUDA Toolkit (in HPC environments it is generally $CUDA_HOME), Y.y is the version of the CUDA Toolkit (Y and y are the major and minor release numbers, e.g. 9.0) and ZZ is the compute capability of the card (for example 60 for P100 or 70 for V100).
OpenMP is required in order to successfully compile the accelerated version.
Moreover, it is generally a good idea to disable ScaLAPACK when running small test cases, since the serial GPU eigensolver can outperform the parallel CPU eigensolver in many circumstances. If you want to keep the ScaLAPACK interface, remember to run with -ndiag 1 when using GPUs.
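For example, a run that keeps the ScaLAPACK build but forces the serial (GPU) eigensolver might look like the following sketch, where the process count, binary path and input file name are placeholders:

```
mpirun -np 4 ./pw.x -ndiag 1 -inp scf.in > scf.out
```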
Finally, if you have multiple GPUs per node you may activate the experimental implementation of
CUDA-Aware MPI. This must be done manually by adding __GPUMPI just after __MPI in the file make.inc.
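A minimal sketch of the edit, assuming a make.inc whose DFLAGS line already contains -D__MPI (the other preprocessor flags shown here are purely illustrative and will differ in your file):

```
# in make.inc: append __GPUMPI right after __MPI
DFLAGS = -D__PGI -D__CUDA -D__MPI -D__GPUMPI
```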
Only pw.x is supported at the moment. To compile it just do
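```
make pw
```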
(parallel compilation with -j works).
NB: only make pw will work; the other targets of the Makefile are not supported yet.
Optimized install instructions
If your HPC cluster is in the following list, you may find ready-to-use configure and compilation commands, including modules and the correct acceleration options:
As a rule, the code should be run with one MPI process per GPU. In some cases, if allowed by the node configuration, performance may be improved by running two processes per GPU device. The code will print a warning if more than two processes try to run on the same GPU, as this is discouraged.
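As an illustration, assuming a single-node run where nvidia-smi is available (binary path and input file name are placeholders):

```
# launch one MPI rank per GPU detected on the node
NGPU=$(nvidia-smi --list-gpus | wc -l)
mpirun -np "$NGPU" ./pw.x -ndiag 1 -inp scf.in > scf.out
```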
The reduction in time to solution provided by the GPU version, compared to the CPU-only version running on the same host node, depends on many factors. The GPU-accelerated version of pw.x has been reported to provide speedups between 2x and 4x.
More details can be found in these presentations:
Here's a list of common problems when compiling and using the accelerated version of pw.x.
Cannot link to OpenMP; the error looks like:
libcusolver.so: undefined reference to `GOMP_parallel_start'
libcusolver.so: undefined reference to `GOMP_loop_dynamic_start'
libcusolver.so: undefined reference to `GOMP_critical_end'
libcusolver.so: undefined reference to `GOMP_loop_end_nowait'
libcusolver.so: undefined reference to `GOMP_parallel_start'
libcusolver.so: undefined reference to `GOMP_parallel_end'
PGI is linking the final binary against the wrong cusolver library (it should use the one with PGI threading support). Try removing the CUDA Toolkit module or its paths from the environment, so that PGI uses its internal version.
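For example, on a cluster that uses environment modules (the module name is site-specific and purely illustrative):

```
module unload cuda     # remove the external CUDA Toolkit from the environment
make veryclean         # then reconfigure and rebuild pw.x as described above
```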