Missing intra-warp synchronization in LINCS and SETTLE CUDA kernels
## Summary
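When doing shared memory reduction, we don't call `__syncwarp` in the last iterations, where all remaining active threads belong to a single warp. This is not correct for Volta+ GPUs with independent thread scheduling, where the threads of a warp are not guaranteed to execute in lockstep.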
The issue has existed since the LINCS and SETTLE kernels were introduced and can lead to incorrect results when using GPU update.
Note: SYCL kernels are fine, although SETTLE had a related, even more serious, problem; see !2484.
## Exact steps to reproduce
Build GROMACS with CUDA support. Run any code using GPU update with either SETTLE or LINCS under `compute-sanitizer --tool racecheck`. Observe:
```
$ GMX_FORCE_UPDATE_DEFAULT_GPU=1 compute-sanitizer --tool racecheck ./bin/mdrun-output-test --gtest_filter=MdrunCanWrite/Trajectories.ThatDifferInNstxout/2
...
========= Warning: Race reported between Write access at 0x1cc0 in void gmx::settle_kernel<(bool)1, (bool)1>(int, const int3 *, gmx::SettleParameters, const float3 *, float3 *, PbcAiuc, float, float3 *, float *)
========= and Read access at 0x1c90 in void gmx::settle_kernel<(bool)1, (bool)1>(int, const int3 *, gmx::SettleParameters, const float3 *, float3 *, PbcAiuc, float, float3 *, float *) [60 hazards]
========= and Read access at 0x1ec0 in void gmx::settle_kernel<(bool)1, (bool)1>(int, const int3 *, gmx::SettleParameters, const float3 *, float3 *, PbcAiuc, float, float3 *, float *) [4 hazards]
=========
...
```
## For developers: Why is this important?
We don't want data races on the most popular ([citation needed]) end-user GPUs.
## Possible fixes
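Add the missing `__syncwarp` calls to the final, intra-warp iterations of the shared memory reductions in the LINCS and SETTLE kernels.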
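
To illustrate the pattern, here is a minimal sketch of a shared memory reduction that synchronizes in every iteration. It is not the actual LINCS or SETTLE code: `blockSumKernel`, `sumOnGpu`, `sm_buffer`, and `c_threadsPerBlock` are hypothetical names, the kernel assumes a single block of 256 threads, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical block size for this sketch; must be a power of two
// and a multiple of the warp size.
constexpr int c_threadsPerBlock = 256;

// Sums c_threadsPerBlock floats from `in` into `*out` using one thread block.
__global__ void blockSumKernel(const float* in, float* out)
{
    __shared__ float sm_buffer[c_threadsPerBlock];
    const int tid = static_cast<int>(threadIdx.x);
    sm_buffer[tid] = in[tid];
    __syncthreads();

    for (int dividedAt = c_threadsPerBlock / 2; dividedAt > 0; dividedAt /= 2)
    {
        if (tid < dividedAt)
        {
            sm_buffer[tid] += sm_buffer[tid + dividedAt];
        }
        if (dividedAt > warpSize)
        {
            // Values written here are read by another warp in the
            // next iteration, so a block-wide barrier is required.
            __syncthreads();
        }
        else
        {
            // Only the first warp still reads and writes, but on Volta+
            // its threads may not run in lockstep, so the warp must
            // still synchronize between iterations.
            __syncwarp();
        }
    }
    if (tid == 0)
    {
        *out = sm_buffer[0];
    }
}

// Host helper (assumption for this sketch): copies the input to the GPU,
// runs the kernel on a single block, and returns the sum.
float sumOnGpu(const float* h_in)
{
    float* d_in  = nullptr;
    float* d_out = nullptr;
    float  h_out = 0.0f;
    cudaMalloc(&d_in, c_threadsPerBlock * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, c_threadsPerBlock * sizeof(float), cudaMemcpyHostToDevice);
    blockSumKernel<<<1, c_threadsPerBlock>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Dropping the `else` branch reproduces the kind of hazard described in this issue: the last iterations would then rely on implicit warp-synchronous execution. A sketch like this can be checked under `compute-sanitizer --tool racecheck` in the same way as the reproduction steps above.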