[CUDA,OpenCL] Use RAM buffer for in/out blocks
This allows the CPU pre-/post-processing of a password batch to be done in parallel with the GPU computation, so we can now assume that the BLAKE2 computation cost is hidden behind the GPU computation time (for real). The only overhead this adds to the GPU computation time is copying the data between the RAM buffer and the GPU, but this is fast thanks to the rectangular copy operations that are used. This should significantly affect only hashes with low cost parameters. For these, the benchmark tool was reporting overly optimistic times before this commit.
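
For illustration only, here is a minimal host-side sketch of what a rectangular (pitched) copy does: only a narrow strip of each row is moved between a wide buffer and a compact staging buffer. The names copyRect and gatherRows, and the row layout, are hypothetical and are not taken from this commit. On the device side the same pattern is expressed by calls such as cudaMemcpy2D or clEnqueueReadBufferRect; whether the commit uses exactly those calls is an assumption.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy a rectangle of `rows` rows, each `widthBytes`
// wide, between two buffers whose consecutive rows start `srcPitch` and
// `dstPitch` bytes apart. This mirrors the semantics of a pitched copy,
// but runs entirely on the host.
void copyRect(unsigned char *dst, std::size_t dstPitch,
              const unsigned char *src, std::size_t srcPitch,
              std::size_t widthBytes, std::size_t rows)
{
    for (std::size_t r = 0; r < rows; ++r) {
        std::memcpy(dst + r * dstPitch, src + r * srcPitch, widthBytes);
    }
}

// Hypothetical demo: gather the leading `width` bytes of every row of a
// wide buffer (row stride `pitch`) into a tightly packed staging buffer.
std::vector<unsigned char> gatherRows(const std::vector<unsigned char> &wide,
                                      std::size_t pitch, std::size_t width,
                                      std::size_t rows)
{
    std::vector<unsigned char> staging(width * rows);
    copyRect(staging.data(), width, wide.data(), pitch, width, rows);
    return staging;
}
```

Because only width * rows bytes are transferred instead of pitch * rows, the copy cost stays small relative to the GPU computation when the strip is much narrower than a full row.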