thread affinity handling ignores errors
On recent clusters it has become more common for some threads to fail to set their affinities when a batch job's resource request is incorrect (e.g. the cores per MPI task are off and the total core count is smaller than the number of threads mdrun starts). This is a consequence of Slurm's use of cgroups: a node reservation that does not explicitly request all cores and GPUs of a node implies non-exclusive node access. The only sign of this failure in the log is a quite harmless-looking note, "NOTE: Thread affinity was not set.", and the errors indicating that affinity setting failed are printed to the console only as notes (not even warnings). This can easily lead to performance loss; see e.g. the attached case, which is 2x slower than the identical run that manages to set affinities.
Moreover, when -pin on is passed to mdrun and setting affinities fails, the pinning request is ignored and mdrun continues without affinities.
Unless affinities were set through the job/MPI launcher, such user errors have not in the past led to performance degradation when mdrun -pin on was used. It is, however, no longer possible to just recommend that users override external affinities, and these failures easily go unnoticed.
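To illustrate the mismatch described above, here is a minimal pre-flight sketch in Python that a job script could run before launching mdrun. It is not how mdrun itself handles affinities. The function names are hypothetical, and it assumes that on Linux the process affinity mask returned by os.sched_getaffinity reflects the cores the scheduler's cgroup/cpuset actually grants to the job.

```python
import os


def allowed_cpu_count():
    """Return how many CPUs this process is allowed to run on.

    On Linux, os.sched_getaffinity(0) returns the affinity mask of the
    current process, which is assumed here to reflect cpuset restrictions.
    On platforms without it, fall back to the total CPU count.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def check_pinning_feasible(requested_threads, allowed_cpus=None):
    """Check whether each requested thread could get its own allowed CPU.

    Returns (ok, message). allowed_cpus may be passed explicitly for
    testing; by default it is detected from the current process.
    """
    if allowed_cpus is None:
        allowed_cpus = allowed_cpu_count()
    if requested_threads > allowed_cpus:
        return False, (
            f"{requested_threads} threads requested but only "
            f"{allowed_cpus} CPUs are available to this job; "
            "thread pinning is likely to fail"
        )
    return True, (
        f"{requested_threads} threads fit within {allowed_cpus} allowed CPUs"
    )


if __name__ == "__main__":
    # Hypothetical usage: take the thread count from OMP_NUM_THREADS.
    threads = int(os.environ.get("OMP_NUM_THREADS", "1"))
    ok, message = check_pinning_feasible(threads)
    print(("OK: " if ok else "ERROR: ") + message)
    raise SystemExit(0 if ok else 1)
```

Exiting with a non-zero status makes the mismatch a hard failure in the job script instead of a note that is easy to overlook.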
test_8R-4GPU_8x16_allgpu_dynCUDART_nstlist200_jID937184.log slurm-937184.out