GPU force buffer ops, reduction and comms on virial steps

Currently, force buffer ops, force reduction and halo exchange fall back to the CPU on virial steps. PME-PP comms are still active, but go from the PME GPU to the PP CPU.
Virial steps are irregular, so there would be no significant overall performance benefit from having GPU force buffer ops, reduction and comms on virial steps.
However, code quality could improve, because more uniformity across steps would reduce the complexity of the conditionals in do_force()/do_md().
GPU force buffer ops on virial steps are relatively straightforward, but don't offer much improvement on their own:
- see patch at https://gerrit.gromacs.org/c/gromacs/+/15960
GPU force reduction on virial steps is much more complex:
- On non-virial steps, there is only one active force buffer and this is reduced with both the NB and PME forces.
- On virial steps, there are two separate force buffers: forceOut.forceWithShiftForces().force() and forceOut.forceWithVirial().force_. The former is reduced with the NB force, and the latter with the PME force.
- We currently have no concept of a second force buffer in the State Propagator or the GPU force reduction, and no concept of these distinct reductions.
- It seems that the additional complexity associated with introducing this would outweigh any reduced complexity in conditionals from having the GPU codepaths active on virial steps.
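To make the difference between the two cases concrete, here is a minimal CPU-side sketch of the two reduction patterns described above. It is not GROMACS code: Vec3, ForceBuffer, accumulate(), reduceNonVirialStep() and reduceVirialStep() are hypothetical names chosen for illustration, and the real buffers hold more state than a plain vector.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not the GROMACS types.
using Vec3        = std::array<float, 3>;
using ForceBuffer = std::vector<Vec3>;

// Add src into dst element by element.
void accumulate(ForceBuffer& dst, const ForceBuffer& src)
{
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i)
    {
        for (std::size_t d = 0; d < 3; ++d)
        {
            dst[i][d] += src[i][d];
        }
    }
}

// Non-virial step: a single active buffer receives both the NB and PME forces.
void reduceNonVirialStep(ForceBuffer& force, const ForceBuffer& nb, const ForceBuffer& pme)
{
    accumulate(force, nb);
    accumulate(force, pme);
}

// Virial step: two buffers, each with its own reduction.
// The NB force goes into the shift-force buffer, the PME force into the virial buffer.
void reduceVirialStep(ForceBuffer&       forceWithShiftForces,
                      ForceBuffer&       forceWithVirial,
                      const ForceBuffer& nb,
                      const ForceBuffer& pme)
{
    accumulate(forceWithShiftForces, nb);
    accumulate(forceWithVirial, pme);
}
```

A GPU force reduction that handled virial steps would need an equivalent of the second function, plus a second device-side buffer to reduce into, which is the extra machinery the points above refer to.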
GPU force halo exchange on virial steps is relatively straightforward, but it is not clear that there is any benefit to this without GPU force reduction.
My feeling is that the simplest solution is to keep force buffer ops, reductions and halo exchange on the CPU for virial steps, as we have at the moment.

Edited Sep 29, 2020 by Alan Gray