
Draft: OpenMP5 offloading (WIP)

Ivan Carnimeo requested to merge devel_omp5 into develop

Tests updated to commit: adb9433b
(previously fully tested commit: 14dc80fa)

To be done:

  • more OMP offload in PW (Done: noncolin and gamma cases in vloc_psi.f90 fully offloaded, except tg; v_xc offloaded; h_psi offloaded (no real space) with inner data mapping; calbec_k, calbec_gamma, calbec_nc offloaded with inner data mapping, to be tested on more than 1 GPU)
  • OMP offload in XClib (Done)
  • configure build system (currently OMP offload works by manually setting make.inc)
  • cmake build system
  • LAXlib
  • merge AMD libraries (Done)

To be fixed:

  • MPI + GPU with Cray
  • PPCG algorithm with Intel compiler on devcloud
  • ugly fix in Modules/Makefile: the ifx compiler fails to compile space_group.o, so ifort must be used for that file only

Fixed:

  • Intel compiler compiles without -D__USE_DISPATCH, but then crashes at runtime
    (Done: a temporary OMP-offloaded buffer psic_omp has been defined alongside the usual psic, for OMP-offloaded FFTs)
  • find a workaround for omp dispatch with cray compiler (also GNU and NVHPC complain with dispatch)
    (Done: dispatch directives have been protected and can be switched on with __USE_DISPATCH flag)
  • gfortran complains when it finds map clauses with data structures, e.g. !$omp target exit data map(delete:dfft%nl)
    (Done: those directives have been protected with __OPENMP_GPU)
  • crashes on stress calculation with nvfortran (on GPU) + MKL
    (Done: There was a small bug in PW/src/gradutils.f90)
  • simpler OPENMP_GPU logic inside FFTXlib interfaces
    (Done: Now FFTXlib interfaces are the same as the official develop)
  • clearer distinction between CUDA and OMP5 routines
    (Done: Now _omp is appended to OMP routines and modules to distinguish them from CUDA _gpu)
  • bug fix: hpc-sdk (GPU) + mkl (CPU) (see Tests)
    (Done: fft_scalar.DFTI.f90 has been restored to the official develop version, and a new file fft_scalar.DFTIOMP.f90 has been introduced specifically for OMP offloading)
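
The __OPENMP_GPU protection mentioned above can be sketched roughly as follows. This is a minimal standalone illustration, not code from this branch: the type `fft_desc_sketch` and the program name are hypothetical stand-ins, and only the `nl` component and the `map(delete:dfft%nl)` directive come from the item above. It assumes the file is preprocessed (e.g. a `.F90` suffix) and that `-D__OPENMP_GPU` is passed only together with the compiler's offload flags.

```fortran
! Sketch: data-mapping directives on a derived-type component are only
! visible to the compiler when __OPENMP_GPU is defined, so a CPU-only
! build (e.g. with gfortran) never sees them.
program guard_sketch
  implicit none

  ! Hypothetical stand-in for the FFT descriptor type.
  type :: fft_desc_sketch
     integer, allocatable :: nl(:)
  end type fft_desc_sketch

  type(fft_desc_sketch) :: dfft
  integer :: i

  allocate(dfft%nl(8))
  dfft%nl = [(i, i = 1, 8)]

#if defined(__OPENMP_GPU)
  ! Create the device copy of the component (offload builds only).
  !$omp target enter data map(to:dfft%nl)
#endif

  ! ... offloaded work using dfft%nl would go here ...

#if defined(__OPENMP_GPU)
  ! Release the device copy (offload builds only).
  !$omp target exit data map(delete:dfft%nl)
#endif

  deallocate(dfft%nl)
  print *, 'guard_sketch finished'
end program guard_sketch
```

Without `-D__OPENMP_GPU` the preprocessor drops both directives and the program builds as plain host Fortran; the same pattern would apply to the dispatch directives guarded by __USE_DISPATCH.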

Tests:

1. Intel software stack (GPU) + MKL

  • Setup: ifx (IFORT) 2022.1.0 20220316
  • Hardware: devcloud Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz + gen9 GPUs
  • Compilation (with __USE_DISPATCH): OK
  • FFTXlib tests (with __USE_DISPATCH): All passed
  • PW test-suite (with __USE_DISPATCH): 228/232 Passed (PPCG fails)

2. AMD software stack

  • compilation without __USE_DISPATCH worked, and some PW tests passed with 1 rank

3. GNU software stack (CPU) + MKL

  • Setup: GNU Fortran (GCC) 10.2.0 + MKL 2020.4.304
  • Hardware: local cluster with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz + 2 GV100 GPUs
  • Compilation: OK
  • PW test-suite (4 mpi, 2 threads): 232/232 Passed

4. hpc-sdk software stack (CPU) + MKL

  • Setup: nvfortran 21.3-0 LLVM 64-bit target on x86-64 Linux -tp skylake + MKL 2020.4.304
  • Hardware: local cluster with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz + 2 GV100 GPUs
  • Compilation: OK
  • PW test-suite (4 mpi, 2 threads): 232/232 Passed

5. hpc-sdk software stack (GPU) + MKL

  • Setup: nvfortran 21.3-0 LLVM 64-bit target on x86-64 Linux -tp skylake + MKL 2020.4.304
  • Hardware: local cluster with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz + 2 GV100 GPUs
  • Compilation: OK
  • FFTXlib tests: All passed
  • PW test-suite (4 mpi, 2 threads, 2 GPUs): 232/232 Passed

6. hpc-sdk software stack (CPU) on m100

  • Setup: nvfortran 21.5 + spectrum_mpi/10.4.0
  • Hardware: m100
  • Compilation: OK
  • PW test-suite (4 mpi, 8 threads): 232/232 Passed

7. hpc-sdk software stack (GPU) on m100

  • Setup: nvfortran 21.5 + spectrum_mpi/10.4.0 + cuda/11.0
  • Hardware: m100
  • Compilation: OK
  • FFTXlib tests: to be done
  • PW test-suite (4 mpi, 8 threads, 4 GPUs): 232/232 Passed
Edited by Laura Bellentani