Enable CUDA + Complex (!2822) · Merge requests · PETSc / petsc

Junchao Zhang requested to merge jczhang/fix-cuda-complex into master May 30, 2020

When petsc is configured with CUDA and complex, a simple test ex1.c gives a segfault with ./ex1 -vec_type cuda,

  1 #include <petscvec.h>
  2 int main(int argc,char **argv)
  3 {
  4   PetscErrorCode ierr;
  5   Vec            x;
  6   PetscScalar    alpha = 1234.0+5678*PETSC_i;
  7
  8   ierr = PetscInitialize(&argc,&argv,(char*)0,NULL);if (ierr) return ierr;
  9   ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
 10   ierr = VecSetSizes(x,PETSC_DECIDE,10);CHKERRQ(ierr);
 11   ierr = VecSetFromOptions(x);CHKERRQ(ierr);
 12   ierr = VecSet(x,alpha);CHKERRQ(ierr);
 13   ierr = VecDestroy(&x);CHKERRQ(ierr);
 14   ierr = PetscFinalize();
 15   return ierr;
 16 }

Thread 1 "ex1" received signal SIGSEGV, Segmentation fault.
0x00007ffff59f01f2 in thrust::complex<double>::real (this=0x0) at /nfs/gce/software/custom/linux-ubuntu18.04-x86_64/cuda/10.2/include/thrust/complex.h:381
381	  T real() const { return data.x; }
(gdb) bt
#0  0x00007ffff59f01f2 in thrust::complex<double>::real (this=0x0) at /nfs/gce/software/custom/linux-ubuntu18.04-x86_64/cuda/10.2/include/thrust/complex.h:381
#1  0x00007ffff59f04b6 in thrust::operator==<double, double> (x=..., y=...) at /nfs/gce/software/custom/linux-ubuntu18.04-x86_64/cuda/10.2/include/thrust/detail/complex/complex.inl:273
#2  0x00007ffff59e5b08 in VecSet_SeqCUDA (xin=0x118ce2b0, alpha=<error reading variable: Cannot access memory at address 0x0>) at /home/jczhang/petsc-cuda/src/vec/vec/impls/seq/seqcuda/veccuda2.cu:806
#3  0x00007ffff5d7eaf6 in VecSet (x=0x118ce2b0, alpha=...) at /home/jczhang/petsc-cuda/src/vec/vec/interface/rvector.c:529
#4  0x00007ffff5d268fe in VecCreate_SeqCUDA (V=0x118ce2b0) at /home/jczhang/petsc-cuda/src/vec/vec/impls/seq/seqcuda/veccuda.c:268
#5  0x00007ffff5d71c56 in VecSetType (vec=0x118ce2b0, method=0x7ffff7629544 "seqcuda") at /home/jczhang/petsc-cuda/src/vec/vec/interface/vecreg.c:50
#6  0x00007ffff59fc6b1 in VecCreate_CUDA (v=0x118ce2b0) at /home/jczhang/petsc-cuda/src/vec/vec/impls/mpi/mpicuda/mpicuda.cu:215
#7  0x00007ffff5d71c56 in VecSetType (vec=0x118ce2b0, method=0x7fffffffb5b0 "cuda") at /home/jczhang/petsc-cuda/src/vec/vec/interface/vecreg.c:50
#8  0x00007ffff5d6860b in VecSetTypeFromOptions_Private (PetscOptionsObject=0x7fffffffb710, vec=0x118ce2b0) at /home/jczhang/petsc-cuda/src/vec/vec/interface/vector.c:1249
#9  0x00007ffff5d68c38 in VecSetFromOptions (vec=0x118ce2b0) at /home/jczhang/petsc-cuda/src/vec/vec/interface/vector.c:1282
#10 0x0000000000400f6d in main (argc=3, argv=0x7fffffffb898) at ex1.c:11

Similar errors were reported by petsc users, ex., https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2019-February/037677.html. The segfault is indeed due to an ABI mismatch between C/C++ complex and CUDA thrust complex, as Jed suggested in a petsc-maint thread.

To investigate this problem, let's look at the code in petscsystypes.h,

#if defined(__cplusplus) && defined(PETSC_HAVE_CXX_COMPLEX) && !defined(PETSC_USE_REAL___FLOAT128)
#  if !defined(PETSC_SKIP_COMPLEX)
     /* C++ support of complex number */
#    define PETSC_HAVE_COMPLEX 1
#    if defined(PETSC_HAVE_CUDA) && __CUDACC_VER_MAJOR__ > 6
       /* complex headers in thrust only available in CUDA 7.0 and above */
#      define petsccomplexlib thrust
#      include <thrust/complex.h>
#    else
#      define petsccomplexlib std
#      include <complex>
#    endif
...

It makes PetscScalar/PetscComplex to be C complex with C compiler, std::complex with C++ compiler, and thrust::complex with nvcc compiler. All three complex implementations make a complex out of two reals. So it looks our code is perfect. But a subtle issue can make it catastrophic.

In struct _VecOps *VecOps from vecimpl.h, we declare function pointer PetscErrorCode (*set)(Vec,PetscScalar); In cudavecimpl.h, we declare PetscErrorCode VecSet_SeqCUDA(Vec,PetscScalar); In veccuda.c, we do V->ops->set = VecSet_SeqCUDA; Assume --with-clangauge=c, veccuda.c, which includes vecimpl.h and cudavecimpl.h, will be smoothly compiled by a C compiler, since these three files have the same prototype for the function pointer, e.g., (*set)(Vec, double _Complex). If instead --with-clangauge=cxx, the second parameter will be of type std::complex. Still no compilation problems.

In veccuda2.cu, we define PetscErrorCode VecSet_SeqCUDA(Vec xin,PetscScalar alpha) {...}. Though we also include vecimpl.h and cudavecimpl.h, veccuda2.cu is compiled with nvcc (internally calls a C++ compiler) and PetscScalar will be thrust::complex.

Note linker only sees symbol VecSet_SeqCUDA, not its parameters. So users will build petsc successfully with no problems in compilation or link.

But Let's see how PetscScalar is passed as a parameter. According to p.20~22 of System V Application Binary Interface AMD64 Architecture Processor Supplement, parameters are passed based on their type:

If the size of an object is larger than eight eightbytes, or it contains unaligned fields, it has class MEMORY

Arguments of complex T where T is one of the types float or double are treated as if they are implemented as:
   struct complex T {
     T real;
     T imag;
   };
If a C++ object is non-trivial for the purpose of calls, as specified in the C++ ABI, it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER)

By the rules, size of struct complex T{...} does not exceeds 8 eightbytes, so a C complex PetscScalar is passed by vector registers in VecSet_SeqCUDA. The tricky ones are C++ std::complex and thrust::complex. Definition of non-trivial for the purpose of calls goes to Itanium C++ ABI. You did not read wrong, Itanium C++ ABI! It is arch-independent and says:

A type is considered non-trivial for the purposes of calls if:

*it has a non-trivial copy constructor, move constructor, or destructor, or

*all of its copy and move constructors are deleted.

This definition, as applied to class types, is intended to be the complement of the definition in [class.temporary]p3 of types for which an extra temporary is allowed when passing or returning a type. A type which is trivial for the purposes of the ABI will be passed and returned according to the rules of the base C ABI, e.g. in registers; often this has the effect of performing a trivial copy of the type.

According to C++ standard,

The copy constructor for class T is trivial if all of the following are true:

it is not user-provided (that is, it is implicitly-defined or defaulted), ...

...

GNU stdc++ library implemented std::complex with a trivial-copy constructor at here

template<typename _Tp>
struct complex
{
      // Let the compiler synthesize the copy constructor
#if __cplusplus >= 201103L
      constexpr complex(const complex&) = default;
#endif
  ...
};

However, Thrust 1.9.7 in CUDA Toolkit, release 10.2, V10.2.89 implemented complex with a non-trivial, user-provided copy constructor at here

template <typename T>
__host__ __device__
complex<T>::complex(const complex<T>& z)
#if THRUST_CPP_DIALECT >= 2011
  // Initialize the storage in the member initializer list using C++ unicorn
  // initialization. This allows `complex<T const>` to work.
  : data{z.real(), z.imag()}
{}
#else
{
  real(z.real());
  imag(z.imag());
}
#endif

We can test their trivial-copyablity with

#include <iostream>
#include <type_traits>
#include <complex>
#include <thrust/complex.h>

int main()
{
std::cout << std::boolalpha;
std::cout << std::is_trivially_copyable<std::complex<double> >::value << '\n';
std::cout << std::is_trivially_copyable<thrust::complex<double> >::value << '\n';
}

Output:

true
false

So, by the ABI, when we call ierr = (*x->ops->set)(x,alpha) in caller VecSet(Vec x,PetscScalar alpha), alpha of C or C++ complex will be passed by registers. In callee VecSet_SeqCUDA(Vec xin,PetscScalar alpha) in veccuda2.cu, alpha of thrust::complex is supposed to be passed by pointer and will be retrieved from memory. Therefore segfault is deserved with this fatal mismatch.

Types with trivial copy constructor make passing object parameters as trivial as memcpy. Thrust::complex breaks it, such that the compiler has to allocate a new temp object (e.g., on stack), call user-provided copy constructor, and then pass pointer of the temp to callee.

We can confirm the above analysis by dumping assembly code of ex1.c (e.g., with gcc -S). When PetscScalar is C/C++ complex, VecSet(x,alpha) is called in this way:

    .loc 3 12 16 is_stmt 1
    movq    -16(%rbp), %rax     // Note Linux x86 assembly has format: `mov src, dst`
    movsd   -32(%rbp), %xmm0    // pass real(alpha) in register %xmm0 
    movsd   -24(%rbp), %xmm1    // pass imag(alpha) in register %xmm1 
    movq    %rax, %rdi          // pass the first argument (pointer x) in %rdi, 
    call    VecSet@PLT          // call VecSet

If we somehow make PetscScalar to be thrust::complex, and compile ex1.c with a C++ compiler, VecSet will be called in another way:

.LBE4:
  .loc 3 12 16 is_stmt 1
  leaq    -32(%rbp), %rdx
  leaq    -16(%rbp), %rax  // -16(%rbp) = address of a temp complex object on stack
  movq    %rdx, %rsi
  movq    %rax, %rdi  // %rdi is 1st arg. of the copy constructor. It is the 'this' in C++, also the address of the temp.
  call    _ZN6thrust7complexIdEC1ERKS1_ // c++filt decodes it as thrust::complex<double>::complex(thrust::complex<double> const&),
  movq    -40(%rbp), %rax               // indicating the copy constructor is called to construct the temp object.
  leaq    -16(%rbp), %rdx
  movq    %rdx, %rsi       // %rsi is 2nd arg. register. It contains address of the temp complex object.
  movq    %rax, %rdi       // %rdi is 1st arg. register. It contains pointer x.
  call    VecSet@PLT

Fortunately, Thrust 1.9.8 had a change. The commit message said "Make thrust::complex be actually trivially copyable whenever possible."

#if THRUST_CPP_DIALECT >= 2011
  /*! Default construct a complex number.
   */
  complex() = default;

  complex(const complex<T>& z) = default;
#else
  __host__ __device__
  complex();

  __host__ __device__
  complex(const complex<T>& z);
#endif

I believe with this thrust version and C++ dialect >= 2011, PETSc CUDA+Complex should work. Note the latest CUDATookit release 10.2.89 (on Summit and on CELS machine) only has thrust 1.9.7. The latest thrust release is 1.9.10.

The above analysis also implies the approach once suggested by Jed, e.g., changing VecSet_SeqCUDA()'s prototype to VecSet_SeqCUDA(Vec x, PetscScalar *alpha) won't solve our problem. The issue happens in many other places, e.g., VecWAXPY_SeqCUDA(Vec win,PetscScalar alpha,Vec xin, Vec yin). It could also happen in users code, when they declare functions with PetscScalar parameters in C files and define them in *.cu files. It is terrifying to see that in C++ such a tiny (also looks harmless or useless) change could have such significant effect.

I approached the problem in many ways:

Override the default thrust in CUDAToolkit. I git clone --recurse-submodules the latest thrust from its github repo and added the local path to petsc configure --CFLAGS="-I$HOME/include/thrust". Then everything ran smoothly.

Thrust after 1.9.7 has a dependant submodule, so I used git clone instead of direct tarball download.
Using --CUDAPPFLAGS= seems a better choice. However I found we do not make use of it and in ${PETSC_DIR}/lib/petsc/conf/variables we have

PETSC_CUCOMPILE_SINGLE = ${CUDAC} -o $*.o {CUDAC_FLAGS} -c --compiler-options="{PCC_FLAGS} ${CFLAGS} ${CCPPFLAGS}"

nvcc calls the host C++ preprocessor. We must have the new path included in --compiler-options to make this approach work. Otherwise, see this unsuccessful attempt, in which I put the new include dir to the very beginning, but nvcc pushed it back:

nvcc -I/home/jczhang/include/thrust -c -Xcompiler -fPIC -g -I/home/jczhang/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-8.3.0/openmpi-4.0.2-e2zcbqzufxnowcjpwgxq6sgl6frskjs4/include -g --generate-line-info --compiler-options="-fPIC -g -O0 -I/home/jczhang/petsc-cuda/include -I/home/jczhang/petsc-cuda/linux-cuda-complex/include -I/nfs/gce/software/custom/linux-ubuntu18.04-x86_64/cuda/10.2/include " /home/jczhang/petsc-cuda/src/vec/vec/impls/seq/seqcuda/veccuda2.cu -o linux-cuda-complex/obj/vec/vec/impls/seq/seqcuda/veccuda2.o --verbose

#$ gcc -D__CUDA_ARCH__=300 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -fPIC -fPIC -g -O0 -I/home/jczhang/petsc-cuda/include -I/home/jczhang/petsc-cuda/linux-cuda-complex/include -I/nfs/gce/software/custom/linux-ubuntu18.04-x86_64/cuda/10.2/include -I"/home/jczhang/include/thrust" -I"/home/jczhang/spack/opt/spack/linux-ubuntu18.04-skylake_avx512/gcc-8.3.0/openmpi-4.0.2-e2zcbqzufxnowcjpwgxq6sgl6frskjs4/include" "-I/nfs/gce/software/custom/linux-ubuntu18.04-x86_64/cuda/10.2/bin/../targets/x86_64-linux/include" -D__CUDACC_VER_MAJOR__=10 -D__CUDACC_VER_MINOR__=2 -D__CUDACC_VER_BUILD__=89 -include "cuda_runtime.h" -m64 -g -gdwarf-2 "/home/jczhang/petsc-cuda/src/vec/vec/impls/seq/seqcuda/veccuda2.cu" -o "/tmp/tmpxft_00020735_00000000-6_veccuda2.cpp1.ii"

Use thrust::complex on both host and device (experimented in branch jczhang/force-use-thrust-complex)

DONE:
- Directly defined PetscScalar as thrust::complex in petscsystypes.h
- Added a CUDA+Complex CI. Fixed bugs showing up
  - Unlike std C++ library, thrust arithmetic functions, e.g., abs(), do not overload real numbers. So we can not do PetscAbsScalar(PetscReal) anymore. In case the arg is real, we need to use PetscAbsReal().
  - A test failure indicated thrust::pow() did not provide enough accuracy. I had to use std::pow(). For many arithmetic functions, I replaced them with their counterpart from std. As long as we do not call PetscPowScalar(complex) etc in kernels, we are fine.
TODO:
- When users configure --with-cuda --with-scalar-type=complex, if users also have --with-clanguage=c, raise an error and ask them to use --with-clanguage=cxx. Use C++ to compile the whole petsc code.
- Define PETSC_USE_THRUST_COMPLEX_ON_HOST_AND_DEVICE in petsconf.h.
- In petscsystypes.h, if defined(PETSC_USE_THRUST_COMPLEX_ON_HOST_AND_DEVICE), then define PetscScalar as thrust::complex.
- Since host C++ compiler and device NVCC compilers all see the same PetscScalar type, the ABI mismatch problem does not exist any more. If users compile their code with C compilers, there will be compilation errors since petscsystypes.h contains C++ stuff, thrust::complex.
CI result shows this approach worked.

Throw away thrust::complex (we can still use other thrust stuff) and use another complex library on device

Use cuDoubleComplex. CUDA has its C style complex type cuFloatComplex and cuDoubleComplex. Nvidia only provides few simple functions to do e.g., +, *, / operations and no pow, sin, cos, etc.
- With this basic CUDA C style complex, users cannot directly do "PetscComplex + PetscComplex". We have to extend it to overload +,-,*,/, which can be non-member operators and are simple to implement. The hard part is +=, *= etc, which have to be member operators and we need to break struct double2 {...}!

    struct __device_builtin__ __builtin_align__(16) double2
    {
        double x, y;
    };
    typedef double2 cuDoubleComplex

Use std C++ complex on both host and device
- Tried and there were "calling host only function from device function"-like errors. I guess this is because many std::complex constructors/functions are called without notice (hidden by beautiful C++ syntax).
Use std C complex on both host and device
- C complex (compilers) supports +, += etc, but I found I could not even do conj(x) in device code. Probably without a little twist, this will work as long as we do not need complicated arithmetic functions on device.

I prefer approach 1, which is the easiest and only requires a C++ compiler supporting c++11 and beyond. More importantly, it does not change complex types on host (still use C/C++ complex), so we are free of other potential interfacing issues.

Hope this analysis makes us wise when choosing complex libraries on Intel/AMD GPUs.

Edited Jun 05, 2020 by Junchao Zhang

Enable CUDA + Complex

Merge request reports