[CUDA,OpenCL] Use RAM buffer for in/out blocks

This allows the CPU pre-/post-processing of a password batch to be
done in parallel with the GPU computation. This means we can now
assume the BLAKE2 computation cost to be hidden behind the GPU
computation time (for real).

The only overhead this adds to the GPU computation time is copying the
data to/from the RAM buffer, which is fast thanks to the rectangular
copy operations that are used. This should significantly affect only
hashes with low cost parameters, for which the benchmark tool was
reporting overly optimistic times before this commit.
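
For illustration only, here is a host-side C++ sketch of what a
rectangular (pitched) copy does. It is not the code from this commit:
copy_rect and scatter_in_blocks are hypothetical names, and the layout
(tightly packed in-blocks in the RAM buffer, one larger slot per
password on the device side) is an assumption. On a real device the
row loop would be replaced by a single rectangular transfer call.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `height` rows of `width` bytes each. Consecutive rows start
// `src_pitch` bytes apart in the source and `dst_pitch` bytes apart
// in the destination, so either side may have gaps between rows.
void copy_rect(unsigned char *dst, std::size_t dst_pitch,
               const unsigned char *src, std::size_t src_pitch,
               std::size_t width, std::size_t height)
{
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(dst + row * dst_pitch, src + row * src_pitch, width);
    }
}

// Hypothetical helper: scatter tightly packed in-blocks from a RAM
// staging buffer into a (simulated) device buffer in which each
// password owns a slot of `slot_bytes` bytes. Only the first
// `in_bytes` bytes of every slot are written; the rest stays zero.
// Assumes in_bytes <= slot_bytes and staging.size() >= batch * in_bytes.
std::vector<unsigned char> scatter_in_blocks(
    const std::vector<unsigned char> &staging,
    std::size_t batch, std::size_t in_bytes, std::size_t slot_bytes)
{
    std::vector<unsigned char> device(batch * slot_bytes, 0);
    copy_rect(device.data(), slot_bytes,  // destination rows: one per slot
              staging.data(), in_bytes,   // source rows: tightly packed
              in_bytes, batch);
    return device;
}
```

The point of the rectangular form is that the whole batch moves in one
operation even though the rows are not contiguous on one side, instead
of one small transfer per password.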