Fatal Error when launching mdrun on host with busy/unavailable GPU(s) - Redmine #3178
Archive from user: Artem Shekhovtsov

Launching an mdrun job that does not require any GPU exits with a fatal error if at least one GPU on the host is busy at that time.

<pre>
gmx grompp -f test.mdp -c spc216.gro -p topol.top -o test.tpr
gmx mdrun -deffnm test -ntmpi 1 -ntomp 1 -nb cpu -bonded cpu

Result:
-------------------------------------------------------
Program:     gmx mdrun, version 2019.2
Source file: src/gromacs/gpu_utils/gpu_utils.cu (line 100)

Fatal error:
cudaFuncGetAttributes failed: all CUDA-capable devices are busy or unavailable

For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
-------------------------------------------------------

I get this error in 2019.2, 2019.3, and 2020.beta.
Version 2018.6 is not affected.
All versions were built with the same flags:
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DREGRESSIONTEST_DOWNLOAD=ON -DGMX_GPU=on -DCMAKE_INSTALL_PREFIX=/data/user/shehovtsov/SOFTWARE/GROMACS/2019.2_test

#report #request
</pre>

*(from redmine: issue id 3178, created on 2019-10-25 by gmxdefault, closed on 2020-02-28)*

* Relations:
  * relates #3399
* Changesets:
  * Revision 0a9b0ba74db6c70afc066eba21c767225b5feab3 by Szilárd Páll on 2020-02-26T15:18:29Z:
```
Avoid mdrun terminate due to GPU sanity check errors

When a GPU is in exclusive or prohibited mode, early detection calls can
fail and, as a result, an mdrun run aborts with an error, even if all GPU
offload is explicitly disabled by the user.
This change adds a status code to handle the case of devices being
unavailable.

Additionally, other errors may be encountered during the dummy kernel
sanity check (e.g. out of memory), but since the change that switched to
using the launchGpuKernel() wrapper did not handle the exception in the
sanity checking, this can also abort a run even if the GPU in question is
not selected to be used. This change adds code to catch the exception,
report the error, and avoid aborting the run.

Fixes #3178 #3399

Change-Id: I0cdedbc02769084c172e4a42fe5c1af192007cec
```
* Uploads:
  * [test.mdp](/uploads/d564f8dab06b57b91baa6cca77d4b054/test.mdp)
  * [spc216.gro](/uploads/9bc12338da4fec7b118b28799a7aa37b/spc216.gro)
  * [topol.top](/uploads/480267457c96a1aa221e1037e28ef6b9/topol.top)
  * [test.log](/uploads/4eab72464ee5cbd4115f1eebbda731c8/test.log)
  * [check_log](/uploads/074a0eeb8821c71275da28d24ade317d/check_log)
  * [cmake_log](/uploads/ffb46b3d9be46e58f88a2ce8b495c38d/cmake_log)
  * [make_log](/uploads/7138cb0ef40e25cacf03bbe0d782df63/make_log)
  * [printenv](/uploads/fe37ec283ed2f76df232275619e2a3ab/printenv)
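
The idea described in the changeset above (record a per-device status instead of aborting during detection, and catch exceptions thrown by the sanity check) can be sketched roughly as follows. This is a minimal standalone C++ illustration under stated assumptions, not the GROMACS code: `DeviceStatus`, `DeviceUnavailableError`, `checkDevice`, and `canStartRun` are hypothetical names, and the sanity check is passed in as a callable rather than launching a real CUDA kernel.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical status codes; the names are illustrative only.
enum class DeviceStatus
{
    Compatible,   // sanity check passed
    Unavailable,  // device busy, or in exclusive/prohibited compute mode
    NonFunctional // sanity check failed for another reason (e.g. out of memory)
};

// Hypothetical exception standing in for a "devices are busy or unavailable" error.
struct DeviceUnavailableError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Runs the sanity check and converts any failure into a status code,
// so that device detection by itself never terminates the program.
DeviceStatus checkDevice(const std::function<void()>& sanityCheck, std::string* message)
{
    try
    {
        sanityCheck();
        return DeviceStatus::Compatible;
    }
    catch (const DeviceUnavailableError& e)
    {
        *message = e.what();
        return DeviceStatus::Unavailable;
    }
    catch (const std::exception& e)
    {
        *message = e.what();
        return DeviceStatus::NonFunctional;
    }
}

// A failed device only matters if the run actually asked for GPU offload.
bool canStartRun(const std::vector<DeviceStatus>& statuses, bool gpuOffloadRequested)
{
    if (!gpuOffloadRequested)
    {
        return true; // CPU-only run: device problems are reported, not fatal
    }
    for (DeviceStatus status : statuses)
    {
        if (status == DeviceStatus::Compatible)
        {
            return true;
        }
    }
    return false;
}
```

In this sketch, a CPU-only invocation such as the `-nb cpu -bonded cpu` command from the report corresponds to `gpuOffloadRequested == false`, so a busy device ends up as an `Unavailable` status with a message that can be logged, and the run proceeds.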