evaluate hipSYCL with native HIP kernels

If native SYCL code does not deliver the desired performance on AMD with hipSYCL, we can fall back to native HIP kernels in one of two ways:

SYCL kernel invocation with a native HIP kernel body
hipSYCL backend interop calling a separately compiled HIP kernel object

On the device side, the difference between the two options above comes down to the compiler: upstream clang versus the ROCm clang HIP compiler. These may or may not behave significantly differently, which is something we should verify. The host-side code may also differ (especially if explicit multipass is used), but I'm hoping this will not have a major performance impact.
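To check whether the two compile paths actually differ, one option is to time both variants with the same harness. The following is only a minimal sketch in plain C++: `medianMicroseconds` is a hypothetical helper, not part of hipSYCL or HIP, and it assumes each launcher passed in enqueues its kernel and blocks until the kernel has completed (for example by waiting on the queue or stream).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper (not a hipSYCL or HIP API).
// Runs `launch` once as a warm-up, then `repeats` more times, and returns the
// median wall-clock time per run in microseconds (the upper middle sample
// when `repeats` is even).
// Assumption: `launch` blocks until the kernel it enqueues has finished,
// otherwise only the enqueue overhead is measured.
double medianMicroseconds(const std::function<void()>& launch, int repeats)
{
    assert(repeats > 0);

    launch(); // warm-up run, excluded from the samples

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        launch();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
                std::chrono::duration<double, std::micro>(stop - start).count());
    }

    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Calling this once with a launcher for the SYCL-invoked kernel and once with a launcher for the separately compiled HIP kernel, on the same inputs, gives two medians that can be compared directly. Wall-clock timing includes launch overhead, so for short kernels a device-side profiler is likely the better tool.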