Integrate HIP/CUDA branch into the mainline
This MR introduces GPU kernels and basic functionality, but changes on the Python side are kept to a minimum. For the changes that were removed and are to be reintroduced later, see the following commits: