Closed
Created Feb 25, 2020 by Paul Bauer (@acmnpv, Owner)

out of memory errors cause abort during GPU detection - Redmine #3399

Report from a user of an issue where the GPU detection / sanity checks abort mdrun due to an out-of-memory error, instead of just marking the device as not available:
https://mailman-1.sys.kth.se/pipermail/gromacs.org_gmx-users/2020-February/128568.html

#report #request

(from redmine: issue id 3399, created on 2020-02-25 by pszilard, closed on 2020-02-28)

  • Relations:
    • relates #3178 (closed)
  • Changesets:
    • Revision 0a9b0ba7 by Szilárd Páll on 2020-02-26T15:18:29Z:
Avoid mdrun terminate due to GPU sanity check errors

When a GPU is in exclusive or prohibited mode, early detection calls can
fail and as a result mdrun aborts with an error, even if all GPU
offload is explicitly disabled by the user.
This change adds a status code to handle the case of devices being
unavailable.

Additionally, other errors may be encountered during the dummy kernel
sanity check (e.g. out of memory), but since the change that switched
to using the launchGpuKernel() wrapper did not handle the exception in the
sanity checking, this can also abort a run even if the GPU in question
is not selected to be used.
This change adds code to catch this exception, report the error,
and avoid aborting the run.

Fixes #3178 #3399

Change-Id: I0cdedbc02769084c172e4a42fe5c1af192007cec
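
The pattern the commit message describes (map an unavailable device to its own status code, and catch any other exception from the sanity check instead of letting it abort the run) can be sketched roughly as below. This is only an illustration under stated assumptions, not the GROMACS code: DeviceStatus, DeviceUnavailableError, DeviceCheckResult and checkDevice are hypothetical names, and the sanity check is passed in as a plain callable rather than a real kernel launch.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical status codes; the actual GROMACS enum and its names may differ.
enum class DeviceStatus
{
    Compatible,   // sanity check passed
    Unavailable,  // e.g. device is in exclusive or prohibited mode
    NonFunctional // sanity check failed for another reason (e.g. out of memory)
};

// Hypothetical exception type standing in for "device cannot be accessed".
struct DeviceUnavailableError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Outcome of checking one device: a status plus the error text, if any.
struct DeviceCheckResult
{
    DeviceStatus status;
    std::string  message; // empty on success
};

// Runs the supplied sanity check and converts any exception into a status,
// so a failing device is only marked unusable and never terminates the caller.
DeviceCheckResult checkDevice(const std::function<void()>& sanityCheck)
{
    try
    {
        sanityCheck();
        return { DeviceStatus::Compatible, "" };
    }
    catch (const DeviceUnavailableError& e)
    {
        return { DeviceStatus::Unavailable, e.what() };
    }
    catch (const std::exception& e)
    {
        return { DeviceStatus::NonFunctional, e.what() };
    }
}
```

With this shape, a caller can record the returned status and message for each detected device and continue; whether a non-compatible device is an actual problem can then be decided later, once it is known which devices the user asked to use.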