The platform should support GPU: support a single GPU device
Description
This user story covers the tasks needed to enable training or inference of a machine learning model that uses a GPU from the grid of connected devices.

Acceptance Criteria
- Detect and show the type of GPU with its specifications
- Check whether the necessary libraries are installed to allow use of CUDA cores
- Onboard the device and show its GPU stats in Nomad
- Since a GPU cannot be shared among tasks, the onboarding criteria should require tasks to request it explicitly
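The first two criteria (detect the GPU and check for CUDA libraries) could be sketched roughly as below. This is a minimal illustration, assuming an NVIDIA GPU with the standard `nvidia-smi` and `ldconfig` tools available; the exact checks in the real onboarding script may differ.

```shell
#!/bin/sh
# Sketch: detect the GPU and check for the CUDA runtime library.
# Assumes an NVIDIA GPU; all messages and checks are illustrative.

detect_gpu() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Show GPU model, total memory, and driver version as CSV.
        nvidia-smi --query-gpu=name,memory.total,driver_version \
                   --format=csv,noheader
    else
        echo "nvidia-smi not found: no NVIDIA GPU driver detected"
    fi

    # Check whether the CUDA runtime library is visible to the linker.
    if ldconfig -p 2>/dev/null | grep -q libcudart; then
        echo "CUDA runtime library found"
    else
        echo "CUDA runtime library not found"
    fi
}

detect_gpu
```

Either branch prints a human-readable status line, so the onboarding script can log the result whether or not a GPU is present.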
| # | Task | Description | Duration | Status | Comment |
|---|---|---|---|---|---|
| 1 | Explore onboarding script | Explore how the onboarding script works | 4 hrs | Done | |
| 2 | Onboard GPU | Edit the current onboarding script to check for a GPU and onboard it | 8 hrs | Done | |
| 3 | Show GPU stats | Show GPU stats in Nomad | 8 hrs | Done | Nomad has no way of showing GPU stats |
| 4 | Save GPU stats | Save GPU stats to a file on the machine | 1 hr | Done | |
| 5 | Test usage | Run a GPU-intensive task and verify that GPU stats can be tracked and saved while it runs | 8 hrs | Done | |
| 6 | Explore GPU-related issues | Check the Nomad GitHub issues to solidify decisions on GPU usage and Nomad GPU stats | 8 hrs | Done | |
| 7 | Update Nomad | Update Nomad to the latest version and test whether it supports GPU | 8 hrs | Done | |
| 8 | Test GPU usage | Modify the Docker containers to use the GPU and check whether Nomad can schedule them | 16 hrs | Done | |
| 9 | Test on machines | Verify that the onboarding script identifies GPUs | 4 hrs | Done | Tested on Dagim's and Tedros's machines; output recorded in the comment |
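Since Nomad turned out to have no way of showing GPU stats, tasks 4 and 5 fall back to saving them to a file on the machine. A minimal sketch of that sampling loop, again assuming `nvidia-smi` is available (the file path, sample count, and interval below are illustrative assumptions, not the actual values used):

```shell
#!/bin/sh
# Sketch: periodically sample GPU stats with nvidia-smi and append them
# to a CSV file, as a workaround for Nomad not exposing GPU stats.
# STATS_FILE, SAMPLES, and INTERVAL are illustrative defaults.

STATS_FILE="${STATS_FILE:-gpu_stats.csv}"
SAMPLES="${SAMPLES:-2}"     # number of samples to take
INTERVAL="${INTERVAL:-1}"   # seconds between samples

# Write the CSV header once.
if [ ! -f "$STATS_FILE" ]; then
    echo "timestamp,utilization.gpu [%],memory.used [MiB]" > "$STATS_FILE"
fi

i=0
while [ "$i" -lt "$SAMPLES" ]; do
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
                   --format=csv,noheader >> "$STATS_FILE"
    else
        # No GPU on this machine: record that instead of failing.
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ),no-gpu,no-gpu" >> "$STATS_FILE"
    fi
    i=$((i + 1))
    sleep "$INTERVAL"
done
```

Running this alongside a GPU-intensive task (as in task 5) leaves a CSV trace of utilization and memory use that can be inspected after the run.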
Edited by Janaina Senna