Draft: OpenMP 5 offloading (WIP)
tests UPDATED to commit: adb9433b
(previously fully tested commit: 14dc80fa)
To be done:
- more OMP offload in PW (Done: noncolin and gamma cases in vloc_psi.f90 fully offloaded, except the task-group (tg) case; v_xc offloaded; h_psi offloaded (no real space) with inner data mapping; calbec_k, calbec_gamma, calbec_nc offloaded with inner data mapping, still to be tested on more than 1 GPU)
- OMP offload in XClib (Done)
- configure build system (currently OMP offload only works by manually editing make.inc)
- cmake build system
- LAXlib
- merge AMD libraries (Done)
To be fixed:
- MPI + GPU with Cray
- PPCG algorithm with Intel compiler on devcloud
- ugly workaround in Modules/Makefile: the ifx compiler fails to compile space_group.o, so ifort must be used for that file only
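The per-file compiler switch mentioned above can be expressed as an explicit rule in make that overrides the default compiler for one object. A hypothetical sketch (rule and variable names assumed, not taken from the actual Modules/Makefile):

```makefile
# Hypothetical sketch: build everything with ifx, but force ifort
# for the one file ifx cannot compile (space_group.f90).
F90 = ifx

space_group.o: space_group.f90
	ifort $(FFLAGS) -c $< -o $@
```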
Fixed:
- Intel compiler compiles without -D__USE_DISPATCH, but then crashes at runtime (Done: a temporary OMP-offloaded buffer psic_omp has been defined alongside the usual psic, for the OMP-offloaded FFTs)
- find a workaround for omp dispatch with the Cray compiler (GNU and NVHPC also complain about dispatch) (Done: dispatch directives have been protected and can be switched on with the __USE_DISPATCH flag)
- gfortran complains when it finds map clauses with derived-type components, e.g. !$omp target exit data map(delete:dfft%nl) (Done: those directives have been protected with __OPENMP_GPU)
- crashes on stress calculation with nvfortran (on GPU) + MKL (Done: there was a small bug in PW/src/gradutils.f90)
- simpler OPENMP_GPU logic inside FFTXlib interfaces (Done: the FFTXlib interfaces are now the same as in the official develop)
- clearer distinction between CUDA and OMP5 routines (Done: _omp is now appended to OMP routines and modules to distinguish them from the CUDA _gpu ones)
- bug fix: hpc-sdk (GPU) + MKL (CPU) (see Tests) (Done: fft_scalar.DFTI.f90 has been restored to the official develop version, and a new file fft_scalar.DFTIOMP.f90 has been introduced specifically for OMP offloading)
Tests:
1. Intel software stack (GPU) + MKL
- Setup: ifx (IFORT) 2022.1.0 20220316
- Hardware: devcloud Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz + gen9 GPUs
- Compilation (with __USE_DISPATCH): OK
- FFTXlib tests (with __USE_DISPATCH): All passed
- PW test-suite (with __USE_DISPATCH): 228/232 Passed (PPCG fails)
2. AMD software stack
- Compilation (without __USE_DISPATCH): OK; some PW tests passed with 1 rank
3. GNU software stack (CPU) + MKL
- Setup: GNU Fortran (GCC) 10.2.0 + MKL 2020.4.304
- Hardware: local cluster with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz + 2 GV100 GPUs
- Compilation: OK
- PW test-suite (4 mpi, 2 threads): 232/232 Passed
4. hpc-sdk software stack (CPU) + MKL
- Setup: nvfortran 21.3-0 LLVM 64-bit target on x86-64 Linux -tp skylake + MKL 2020.4.304
- Hardware: local cluster with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz + 2 GV100 GPUs
- Compilation: OK
- PW test-suite (4 mpi, 2 threads): 232/232 Passed
5. hpc-sdk software stack (GPU) + MKL
- Setup: nvfortran 21.3-0 LLVM 64-bit target on x86-64 Linux -tp skylake + MKL 2020.4.304
- Hardware: local cluster with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz + 2 GV100 GPUs
- Compilation: OK
- FFTXlib tests: All passed
- PW test-suite (4 mpi, 2 threads, 2 gpu): 232/232 Passed
6. hpc-sdk software stack (CPU) on m100
- Setup: nvfortran 21.5 + spectrum_mpi/10.4.0
- Hardware: m100
- Compilation: OK
- PW test-suite (4 mpi, 8 threads): 232/232 Passed
7. hpc-sdk software stack (GPU) on m100
- Setup: nvfortran 21.5 + spectrum_mpi/10.4.0 + cuda/11.0
- Hardware: m100
- Compilation: OK
- FFTXlib tests: to be done
- PW test-suite (4 mpi, 8 threads, 4 GPUs): 232/232 Passed
Edited by Laura Bellentani