ARM NEON and CUDA in the same build cause SIMD-related build errors

This issue has been observed when building LAMMPS with CUDA (across multiple versions) on the NVIDIA Grace Hopper architecture within EESSI. Similar failures have been reported in other projects such as Kokkos and TensorFlow: builds in which nvcc drives either clang or gcc fail as soon as arm_neon.h is included. Error sample:

/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/include/arm_neon.h(40): error: identifier "__Int8x8_t" is undefined
  typedef __Int8x8_t int8x8_t;
          ^

/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/include/arm_neon.h(41): error: identifier "__Int16x4_t" is undefined
  typedef __Int16x4_t int16x4_t;
          ^

/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/include/arm_neon.h(42): error: identifier "__Int32x2_t" is undefined
  typedef __Int32x2_t int32x2_t;
          ^

/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/include/arm_neon.h(43): error: identifier "__Int64x1_t" is undefined
  typedef __Int64x1_t int64x1_t;
          ^

/cvmfs/software.eessi.io/versions/2023.06/software/linux/aarch64/nvidia/grace/software/GCCcore/12.3.0/lib/gcc/aarch64-unknown-linux-gnu/12.3.0/include/arm_neon.h(835): error: identifier "__builtin_aarch64_addhn2v2di_uuuu" is undefined
    return __builtin_aarch64_addhn2v2di_uuuu (__a, __b, __c);
           ^

Error limit reached.

Workarounds for building LAMMPS with CUDA on ARM:

1- @laraPPr: Use a hook that excludes Kokkos when compiling LAMMPS with CUDA enabled.

2- Disable NEON support explicitly to prevent header conflicts by setting the following compiler flags through a hook: CXXFLAGS="-O2 -ftree-vectorize -fno-math-errno -fopenmp -march=armv8-a+nosimd"
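
For workaround 2, a hook would need to make sure the -march value carries the +nosimd modifier. The following is only a minimal sketch of that flag rewrite in plain Python. The helper name add_nosimd and its default_march parameter are hypothetical, and the sketch does not show the actual EESSI or EasyBuild hook code, which is not part of this report.

```python
import shlex


def add_nosimd(cxxflags: str, default_march: str = "armv8-a") -> str:
    """Return cxxflags with +nosimd appended to the -march value.

    Hypothetical helper for illustration only. If no -march flag is
    present, -march=<default_march>+nosimd is appended instead.
    """
    result = []
    seen_march = False
    for flag in shlex.split(cxxflags):
        if flag.startswith("-march="):
            seen_march = True
            # Only append the modifier if it is not already there.
            if "+nosimd" not in flag:
                flag += "+nosimd"
        result.append(flag)
    if not seen_march:
        result.append(f"-march={default_march}+nosimd")
    return " ".join(result)


# Example: start from flags without the modifier.
print(add_nosimd("-O2 -ftree-vectorize -fno-math-errno -fopenmp -march=armv8-a"))
```

With the input above, the printed string matches the CXXFLAGS value given in workaround 2. The sketch only handles -march; other options such as -mcpu are left untouched and would need separate handling if a toolchain sets them.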

Testing with CUDA 12.9.0 and GCC 12.3 (CUDA-12.9.0.eb, UCX-CUDA-1.14.1-GCCcore-12.3.0-CUDA-12.9.0.eb, NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.9.0.eb, LAMMPS-2Aug2023_update2-foss-2023a-kokkos-CUDA-12.9.0.eb) confirms that the issue remains unresolved.