[OpenCL] Use amd_bitalign() for rotr64 if possible

This speeds up Argon2 by about 1-1.5% on AMD GPUs.
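
For illustration, the idea behind the change can be sketched in plain C. The commit's actual kernel code is not shown here, so this is only a sketch under stated assumptions: `bitalign_emul` is a hypothetical host-side stand-in that emulates `amd_bitalign()` as described by the `cl_amd_media_ops` extension (concatenate two 32-bit words, shift right by the low 5 bits of the third argument, keep the low 32 bits), and `rotr64_bitalign` and `rotr64_ref` are illustrative names, not the project's. In an OpenCL kernel the fast path would presumably be guarded by a check such as `#ifdef cl_amd_media_ops`, with the generic shift-and-or rotation as fallback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of amd_bitalign(src0, src1, src2):
 * low 32 bits of ((src0:src1) >> (src2 & 31)). */
static inline uint32_t bitalign_emul(uint32_t src0, uint32_t src1, uint32_t src2)
{
    uint64_t joined = ((uint64_t)src0 << 32) | src1;
    return (uint32_t)(joined >> (src2 & 31u));
}

/* Generic 64-bit right rotation, used as the reference (fallback) path. */
static inline uint64_t rotr64_ref(uint64_t x, uint32_t n)
{
    n &= 63u;
    return n ? (x >> n) | (x << (64u - n)) : x;
}

/* 64-bit right rotation built from two 32-bit bitalign operations. */
static inline uint64_t rotr64_bitalign(uint64_t x, uint32_t n)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t new_lo, new_hi;

    n &= 63u;
    if (n < 32u) {
        new_lo = bitalign_emul(hi, lo, n);
        new_hi = bitalign_emul(lo, hi, n);
    } else {
        /* Rotating by 32 swaps the halves; rotate the rest by n - 32. */
        new_lo = bitalign_emul(lo, hi, n - 32u);
        new_hi = bitalign_emul(hi, lo, n - 32u);
    }
    return ((uint64_t)new_hi << 32) | new_lo;
}

/* Compare both paths for the rotation amounts used by Argon2's
 * BLAKE2b-based round function (32, 24, 16, 63). */
static void self_check(void)
{
    static const uint32_t amounts[] = { 32u, 24u, 16u, 63u };
    const uint64_t x = 0x0123456789ABCDEFULL;
    for (unsigned i = 0; i < sizeof amounts / sizeof amounts[0]; i++)
        assert(rotr64_bitalign(x, amounts[i]) == rotr64_ref(x, amounts[i]));
}
```

Because the rotation amounts in Argon2 are compile-time constants, the `n < 32` branch would be resolved at compile time in a kernel, leaving two bitalign operations per rotation.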

Pipeline: 7 jobs for master in 40 minutes and 27 seconds (queued for 29 minutes and 57 seconds)

Stage  Status  Job ID     Name                Duration
Build  passed  #33952102  build-clang-cuda    00:02:50
Build  passed  #33952103  build-clang-nocuda  00:02:31
Build  passed  #33952100  build-gcc-cuda      00:03:00
Build  passed  #33952101  build-gcc-nocuda    00:02:52
Build  passed  #33952105  update-pocl         00:02:11
Test   passed  #33952108  test-clang-nocuda   00:19:15
Test   passed  #33952106  test-gcc-nocuda     00:32:03