Quantum ESPRESSO GPU
This repository hosts the experimental GPU-accelerated version of Quantum ESPRESSO.
This project aims at developing, testing and stabilizing the GPU adaptation of a number of components of the QE suite. Mature components are eventually merged into the official Quantum ESPRESSO repository.
`pw.x` can take advantage of offloading computation to NVIDIA GPUs. The code has been tested with K40, K80, P100 and V100 cards. It provides all the functionalities of the equivalent CPU-only version, but the level of acceleration obtained depends on the task performed.
Quantum ESPRESSO GPU follows the same software versioning as the official Quantum ESPRESSO releases, but an additional alpha version is used to indicate advances in the GPU-accelerated parts.
For example, `v6.4-a1` indicates that the release is fully compatible with the CPU version `v6.4` and is the first alpha release of the GPU-accelerated part.
In order to compile this version of the code you need:
- CUDA Toolkit v8+
- PGI Compilers v17.10+
- OpenMP package v3+
Both the CUDA Toolkit and the PGI compilers can be downloaded free of charge from the NVIDIA and PGI websites.
The code can be compiled with the standard `./configure` and `make pw` sequence. Compiling the GPU part requires a recent version of the PGI compilers; the latest versions of the community edition can be downloaded for free. The environment should of course also be equipped with recent NVIDIA drivers and at least CUDA Toolkit v8.
An example is:

```shell
./configure --with-cuda=XX --with-cuda-runtime=Y.y --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]
```
`XX` is the location of the CUDA Toolkit (in HPC environments it is often `$CUDA_HOME`; make sure that this variable is not empty with a simple `echo $CUDA_HOME`).
`Y.y` is the version of the CUDA Toolkit, where `Y` and `y` are the two numbers identifying the major and minor release (e.g. `9.0` for CUDA Toolkit v9.0).
`ZZ` is the compute capability (cc) of the card. This information can be found on the internet using the model name of the GPU card, or with a tool such as `pgaccelinfo`, which is shipped with the PGI compilers.
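Putting the pieces together, a configure invocation for a node with V100 cards (compute capability 70) might look like the sketch below; the toolkit path and runtime version are placeholders to be adapted to your own installation:

```shell
# Illustrative example only: adjust the toolkit path, runtime version
# and compute capability (70 = V100, 60 = P100, 35 = K40) to your system.
./configure --with-cuda=$CUDA_HOME \
            --with-cuda-runtime=10.1 \
            --with-cuda-cc=70 \
            --enable-openmp \
            --with-scalapack=no
make pw
```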
OpenMP is required in order to successfully compile the accelerated version.
Moreover, it is generally a good idea to disable ScaLAPACK when running small test cases, since the serial GPU eigensolver can outperform the parallel CPU eigensolver in many circumstances. If you want to keep the ScaLAPACK interface, remember to run with `-ndiag 1` when using GPUs.
Finally, if you have multiple GPUs per node you may activate the experimental implementation of CUDA-aware MPI. This must be done manually by adding `__GPU_MPI` just after `__MPI` in the `make.inc` file generated by configure.
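For instance, assuming the preprocessor flags live in the `DFLAGS` line of `make.inc` (the exact set of flags depends on your configuration), the change would look like:

```makefile
# Before (illustrative line generated by configure):
#   DFLAGS = -D__PGI -D__CUDA -D__MPI
# After, with experimental CUDA-aware MPI enabled:
DFLAGS = -D__PGI -D__CUDA -D__MPI -D__GPU_MPI
```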
Only `pw.x` is supported at the moment. To compile it just run `make pw` (parallel compilation, e.g. `make -j8 pw`, will also work); all other targets of the Makefile are not supported yet.
As a rule, the code should run with one MPI process per GPU. In some cases, if allowed by the node configuration, performance may be improved by running two processes per GPU device. The code prints a warning if more than two processes try to run on the same GPU, as this is discouraged.
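As an illustration, on a hypothetical node with 4 GPUs a run honoring the one-process-per-GPU rule (and keeping the ScaLAPACK eigensolver out of the way, as recommended above) might look like the following; the input file name is a placeholder:

```shell
# 4 MPI processes for a node with 4 GPUs; -ndiag 1 disables the
# parallel (ScaLAPACK) eigensolver in favor of the serial GPU one.
mpirun -np 4 pw.x -ndiag 1 -inp pw.scf.in > pw.scf.out
```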
The page accelerated features describes the components of the code that can exploit GPU acceleration.
A list of benchmarks is also available.
The reduction of the time to solution provided by the GPU version, compared with the CPU-only version running on the same host node, depends on many factors. The GPU-accelerated version of `pw.x` has been reported to provide speedups between 2x and 4x. More details can be found in these presentations:
Authors and contributors
In alphabetical order:
- Fabio Affinito (CINECA),
- Pietro Bonfà (Univ. Parma),
- Ivan Carnimeo (SISSA),
- Carlo Cavazzoni (CINECA),
- Anoop Chandran,
- Brandon Cook (NERSC),
- Pietro Delugas (SISSA),
- Elena De Paoli,
- Massimiliano Fatica (NVIDIA),
- Paolo Giannozzi (Univ. Udine),
- Ivan Girotto (ICTP),
- Miloš Marić (NVIDIA),
- Everett Philips (NVIDIA),
- Josh Romero (NVIDIA),
- Fabrizio Ferrari Ruffino (SISSA),
- Filippo Spiga (NVIDIA),
- Kurth Thorsten (NVIDIA).
Many other people contributed with comments, suggestions, bug reports, great libraries and benchmarks; our thanks go to everybody we have not mentioned above.
Here's a list of common problems encountered when compiling and using the accelerated version of the code.
Cannot link to OpenMP; the error looks like:

```
libcusolver.so: undefined reference to `GOMP_parallel_start'
libcusolver.so: undefined reference to `GOMP_loop_dynamic_start'
libcusolver.so: undefined reference to `GOMP_critical_end'
libcusolver.so: undefined reference to `GOMP_loop_end_nowait'
libcusolver.so: undefined reference to `GOMP_parallel_start'
libcusolver.so: undefined reference to `GOMP_parallel_end'
```
PGI is linking the final binary against the wrong `cusolver` library (it should use the one with PGI threading support). Try removing the CUDA Toolkit module or paths from the environment; PGI will then use its internal version.
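One way to check which library is actually picked up (assuming the binary has been built and `ldd` is available; the binary path below is the usual QE location, adjust as needed) is:

```shell
# The resolved path should point inside the PGI installation,
# not the system-wide CUDA Toolkit.
ldd bin/pw.x | grep cusolver
```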
Compile problems with PGI 19.10:

```
PGF90-S-0038-Symbol, gird, becsum_nc_d, has not been explicitly declared (sum_band_gpu.f90)
```
This has been fixed in QE-GPU v6.5.
```
configure: error: You do not have the cudafor module. Are you using a PGI compiler?
```
If you are trying to build an MPI-enabled version of QE-GPU, check the output of `mpif90 --version`: if it starts with something different from `pgfortran`, your MPI implementation is not wrapping the PGI compilers. The configure command detects this and uses the Fortran compiler wrapped by MPI to perform all tests.
To solve this, use an MPI implementation that wraps PGI Fortran. The PGI compilers ship with an OpenMPI implementation in their install package; see the compiler documentation for further instructions on how to enable it.