Do not set EIGEN_HAS_ARM64_FP16_SCALAR_ARITHMETIC for CUDA compilation
The previous version assumed that ARM and CUDA never mix. For code that is shared between host and device, this leads to a compilation error (reference to host function '__builtin_neon_vabsh_f16' in host device function).
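
For illustration only, a minimal sketch of this kind of guard. The SKETCH_* macros and sketch_abs are hypothetical names, not the actual Eigen code; the sketch assumes that __CUDACC__ is defined whenever a CUDA source is being compiled and that __ARM_FEATURE_FP16_SCALAR_ARITHMETIC is the ACLE feature macro for scalar FP16 arithmetic.

```cpp
// Hypothetical macro names for illustration; these are not Eigen's macros.
// Assumption: __CUDACC__ is defined when compiling CUDA sources.
#if defined(__CUDACC__)
#define SKETCH_CUDA_COMPILE 1
#else
#define SKETCH_CUDA_COMPILE 0
#endif

// Enable the NEON FP16 scalar path only on AArch64 with the ACLE feature
// macro set, and never while compiling CUDA code.
#if defined(__aarch64__) && defined(__ARM_FEATURE_FP16_SCALAR_ARITHMETIC) && \
    !SKETCH_CUDA_COMPILE
#define SKETCH_HAS_ARM64_FP16_SCALAR_ARITHMETIC 1
#include <arm_neon.h>
#else
#define SKETCH_HAS_ARM64_FP16_SCALAR_ARITHMETIC 0
#endif

// Hypothetical helper that would be shared between host and device code.
inline float sketch_abs(float x) {
#if SKETCH_HAS_ARM64_FP16_SCALAR_ARITHMETIC
  // Host-only intrinsic; must not be reachable from a host device function.
  return static_cast<float>(vabsh_f16(static_cast<float16_t>(x)));
#else
  // Portable fallback, usable on both host and device.
  return x < 0.0f ? -x : x;
#endif
}
```

With the CUDA check in place, the intrinsic path is compiled only for plain ARM64 host builds, and CUDA builds on ARM64 hosts take the portable fallback.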