
The platform should support GPU - support single GPU device

Description

This user story covers the tasks we will do to enable training or inference of a machine learning model that uses a GPU from the grid of connected devices.

Acceptance Criteria

  1. Detect and show the type of GPU with its specifications
  2. Check whether the necessary libraries are installed to allow usage of CUDA cores (see the sketch after this list)
  3. Onboard the device and show GPU stats on Nomad
  4. Since the GPU cannot be shared among tasks, onboarding should require tasks to explicitly request it
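To make criteria 1 and 2 concrete, here is a minimal sketch (Python for illustration; the actual onboarding script may be a shell script). It assumes an NVIDIA GPU with `nvidia-smi` on the PATH, queries the card's name, memory, and driver version, and then checks whether the CUDA runtime library can be located. The function names are illustrative and not part of the real script.

```python
#!/usr/bin/env python3
"""Sketch of the GPU-detection step for onboarding (illustrative only)."""
import ctypes.util
import shutil
import subprocess


def detect_gpu():
    """Return a list of (name, total memory, driver version) tuples, or [] if no GPU is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(field.strip() for field in line.split(","))
            for line in out.strip().splitlines()]


def cuda_runtime_available():
    """Check whether the CUDA runtime library can be located on this machine."""
    return ctypes.util.find_library("cudart") is not None


if __name__ == "__main__":
    gpus = detect_gpu()
    print("GPUs:", gpus or "none detected")
    print("CUDA runtime found:", cuda_runtime_available())
```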
| # | Task | Description | Duration | Status | Comment |
|---|------|-------------|----------|--------|---------|
| 1 | Explore onboarding script | Explore how the onboarding script works | 4 hrs | Done | |
| 2 | Onboard GPU | Edit the current onboarding script to check for a GPU and onboard it | 8 hrs | Done | |
| 3 | Show GPU stats | Show GPU stats on Nomad | 8 hrs | Done | Nomad has no way of showing GPU stats |
| 4 | Save GPU stats | Save GPU stats on the machine to a file (see the sketch below the table) | 1 hr | Done | |
| 5 | Test usage | Run a GPU-intensive task and test that it works (what was checked: whether GPU stats could be tracked and saved while the task ran) | 8 hrs | Done | |
| 6 | Explore GPU-related issues | Check the Nomad GitHub issues to solidify decisions on GPU usage and Nomad's GPU stats | 8 hrs | Done | |
| 7 | Update Nomad | Update Nomad to the latest version and test whether it supports GPUs | 8 hrs | Done | |
| 8 | Test GPU usage | Modify the Docker containers to use the GPU and see whether Nomad can schedule them | 16 hrs | Done | |
| 9 | Test on machines | Verify that the onboarding script identifies GPUs on other machines | 4 hrs | Done | Tested on Dagim's and Tedros's machines; the output is recorded in the comment |
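Because Nomad had no way of showing GPU stats (task 3), the stats were saved to a file on the machine instead (tasks 4 and 5). The sketch below shows one way such sampling could work; the file path, interval, and function names are assumptions rather than details of the actual script.

```python
#!/usr/bin/env python3
"""Sketch of sampling GPU stats and saving them to a file (illustrative only)."""
import subprocess
import time

STATS_FILE = "/var/log/gpu_stats.csv"   # hypothetical location
INTERVAL_SECONDS = 10                   # hypothetical sampling interval

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu",
         "--format=csv,noheader"]


def sample_once():
    """Run nvidia-smi once and return one CSV line per GPU."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def main():
    # Append samples so the file can be inspected while a GPU-intensive task runs.
    with open(STATS_FILE, "a") as f:
        while True:
            f.write(sample_once())
            f.flush()
            time.sleep(INTERVAL_SECONDS)


if __name__ == "__main__":
    main()
```

Note that `nvidia-smi` can also loop and write to a file on its own (its `-l` and `-f` options), which may be simpler than a wrapper script.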