incorrect runtime assertion catches CUDA API errors from GPU sanity checking - Redmine #2415
The compatibility/sanity checking implemented in
is_gmx_supported_gpu_id() leaves the last error set in the CUDA runtime
state when the checks are interrupted by an API error and an
“insane” state is reported. However, the caller, findGpus(),
has a runtime assertion on the API state, which means that it will catch and
abort on errors that should not be fatal.
As a result, runs that detect an error during GPU detection will abort instead of skipping the device(s) that can’t be used.
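The failure mode and a tolerant alternative can be modelled in a few lines of C++. This is only a sketch, not the GROMACS code: the `mock` namespace is a hypothetical stand-in for the CUDA runtime's sticky last-error state (its two functions imitate the read-only `cudaPeekAtLastError()` and the read-and-reset `cudaGetLastError()`), and `Status`, `isDeviceSane()` and `detectUsableDevices()` are invented names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical error codes; the real runtime uses cudaError_t.
enum class Status { Success, DeviceUnavailable };

// Hypothetical stand-in for the CUDA runtime's sticky "last error" state.
namespace mock
{
Status g_lastError = Status::Success;

// Reads the error state without changing it (like cudaPeekAtLastError()).
Status peekAtLastError() { return g_lastError; }

// Reads the error state and resets it to Success (like cudaGetLastError()).
Status getLastError()
{
    const Status s = g_lastError;
    g_lastError    = Status::Success;
    return s;
}
} // namespace mock

// Hypothetical sanity check: a failing device leaves the error state set,
// which is the situation described in this issue.
bool isDeviceSane(bool deviceWorks)
{
    if (!deviceWorks)
    {
        mock::g_lastError = Status::DeviceUnavailable;
        return false;
    }
    return true;
}

// Tolerant detection: skip devices that fail the check, record a note,
// and clear the error state instead of asserting on it.
std::vector<int> detectUsableDevices(const std::vector<bool>& devices,
                                     std::vector<std::string>* notes)
{
    std::vector<int> usable;
    for (std::size_t i = 0; i < devices.size(); ++i)
    {
        if (isDeviceSane(devices[i]))
        {
            usable.push_back(static_cast<int>(i));
            continue;
        }
        // Read and reset the state so the error cannot leak to later checks.
        if (mock::getLastError() != Status::Success)
        {
            notes->push_back("Device " + std::to_string(i)
                             + " failed the sanity check and will be skipped");
        }
    }
    // Post-condition: detection returns with a clean error state.
    assert(mock::peekAtLastError() == Status::Success);
    return usable;
}
```

Asserting on the peeked state without first clearing it, as the issue describes, would abort as soon as one device failed its check; here the same condition is only checked as a post-condition after every failure has been consumed.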
(from redmine: issue id 2415, created on 2018-02-16 by pszilard, closed on 2018-11-05)
- Changesets:
- Revision 74400c15 by Szilárd Páll on 2018-02-20T00:20:50Z:
Avoid aborting mdrun when GPU sanity check detects errors
A release assertion was added which assumed that the GPU
compatibility/sanity checks return with a clean CUDA API state.
Consequently, any run that encountered a non-success return value from
the CUDA API would abort the run instead of continuing the run without
using the GPU in question.
This change adds code to handle and issue a note on the error
encountered, and ensures that the CUDA API error state is cleared
when GPU detection returns.
Fixes #2415
Change-Id: I5d7ed59ef8e4052a75b51c9a526b8dcb465ff611
- Revision 6a897857 by Szilárd Páll on 2018-08-15T19:43:55Z:
Improve GPU detection sanity check error message
When the unexpected condition is triggered, extra information about the
type of error left behind after successful detection of a
compatible GPU is now printed to aid in identifying issues.
Refs #2415
Change-Id: I85e0da4c339df8184aa2dec49440ce2d0e83e8bf