Run an application on NuNet that uses GPU resources
Description
Who
- This story falls within the scope of Decentralized GPU ML Cloud -- Phase 1, which implies interactions with:
Externally:
- PGWADA is our partner for the milestone Decentralized GPU ML Cloud -- Phase 1;
- PGWADA has the skills to take part in testing and to help with development and architecture;
- PGWADA is willing to provide a machine to join the NuNet network for testing;
- @pgwadapool
Internally:
- @avimanyu786 and the core development team dealing with related issues;
- @dagiopia, @janaina.senna and @kabir.kbr for conceptual architecture and design decisions;
What
- Run an ML application on NuNet using GPU resources. PGWADA will code this application.
- Create a webapp, as suggested by PGWADA, that can launch a training run and, once training is complete, launch a test run. The webapp will provide the inputs needed to train the model. The model can be assumed to be available on SingularityNET. PGWADA will build this webapp.
- Define an API so that the webapp can interact with the ML application running on NuNet (perhaps something like a Google Colab environment). The NuNet team and PGWADA will define this API together.
- We will use handwritten digit images (the MNIST dataset) to train the model.
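As a starting point for the API discussion above, here is a minimal sketch of what the webapp-to-ML-application contract could look like. The endpoint names, payload fields and status codes are assumptions for illustration only, not the agreed spec; the dispatcher stands in for the service that would run inside the container on NuNet.

```python
# Hypothetical sketch of the webapp <-> ML-application API.
# Endpoints, fields and statuses are assumptions, not the final design.
import json

# In-memory stand-in for the ML application's job state.
JOBS = {}

def handle_request(method, path, body=None):
    """Dispatch a webapp call to the ML application (sketch only)."""
    if method == "POST" and path == "/train":
        # The webapp launches a training run with its input parameters.
        job_id = f"job-{len(JOBS) + 1}"
        JOBS[job_id] = {"status": "training", "params": body or {}}
        return 202, {"job_id": job_id}
    if method == "GET" and path.startswith("/status/"):
        # The webapp polls until training is complete.
        job_id = path.rsplit("/", 1)[-1]
        job = JOBS.get(job_id)
        if job is None:
            return 404, {"error": "unknown job"}
        return 200, {"job_id": job_id, "status": job["status"]}
    if method == "POST" and path == "/test":
        # Launched once training is complete, as described above.
        return 202, {"status": "test-started"}
    return 405, {"error": "unsupported"}

if __name__ == "__main__":
    code, resp = handle_request("POST", "/train", {"epochs": 5})
    print(code, json.dumps(resp))
```

In a real deployment these handlers would sit behind an HTTP server; the sketch only pins down the request/response shapes that the NuNet team and PGWADA would need to agree on.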
How
- In this issue, NuNet will connect five stakeholder types (see figure above):
- Application logic: the webapp described in the What section;
- AI developer: ML code packaged into docker container according to SNET and NuNet specifications;
- Compute providers: hardware with GPUs (we will use PGWADA's, Kabir's and Avi's machines);
- Data providers: MNIST dataset;
- Data storage providers: we can get the data directly from the source if it has an accessible weblink, or we can host it elsewhere (on IPFS or, at worst, in a separate docker volume on NuNet).
- NuNet will cover the providers (compute, data and storage) and PGWADA will cover the webapp and the ML code.
- The first test will be built with PyTorch or TensorFlow, depending on what NuNet supports: a simple model that learns to recognize handwritten digits from the MNIST dataset. We will use 5000 training samples, 500 validation samples and 1000 test samples.
- The ML application will run inside a docker container on NuNet with:
- OS: Ubuntu 20.04 LTS or 21.04
- PyTorch - an open source machine learning framework that accelerates the path from research prototyping to production deployment; or
- TensorFlow - an end-to-end open source machine learning platform;
- CUDA - a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs);
- cuDNN - the NVIDIA CUDA Deep Neural Network library, a GPU-accelerated library of primitives for deep neural networks.
- The code generated by the NuNet team related to data providers should be added to the nunet-demo-gpu repository (to be created). Code added or changed by the NuNet team in core modules, such as the NuNet Adapter, should go to the specific module's repository. Where to store other code will be decided as we go.
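The 5000/500/1000 split described above can be sketched as an index-based selection from MNIST's 60,000 training images. This is a minimal sketch only; the actual image loading via torchvision or Keras is omitted, and the sizes and seed are the values stated in this issue.

```python
# Sketch of the dataset split: 5000 train / 500 validation / 1000 test
# samples drawn without overlap from MNIST's 60k training images.
import random

def split_indices(n_total=60000, n_train=5000, n_val=500, n_test=1000, seed=42):
    """Return disjoint train/validation/test index lists."""
    rng = random.Random(seed)
    # Sample all needed indices at once so the three sets cannot overlap.
    picked = rng.sample(range(n_total), n_train + n_val + n_test)
    train = picked[:n_train]
    val = picked[n_train:n_train + n_val]
    test = picked[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices()
print(len(train_idx), len(val_idx), len(test_idx))  # 5000 500 1000
```

The index lists can then be passed to a framework-specific sampler (e.g. a PyTorch `SubsetRandomSampler`) once the dataset access method from task A is in place.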
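For the container environment listed above (CUDA/cuDNN plus PyTorch or TensorFlow), a small pre-flight check can confirm the GPU stack before launching training. This is a hedged sketch: the framework imports are optional so the script also runs on machines without a GPU stack, and the report keys are made up for illustration.

```python
# Sketch of a pre-flight check for the GPU container environment.
# Keys in the returned report are illustrative, not a NuNet interface.
import shutil

def gpu_stack_report():
    """Report which pieces of the GPU stack are visible to this process."""
    report = {"nvidia-smi": shutil.which("nvidia-smi") is not None}
    try:
        import torch  # present only if the PyTorch image is used
        report["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda"] = None
    try:
        import tensorflow as tf  # present only if the TensorFlow image is used
        report["tf_gpus"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        report["tf_gpus"] = None
    return report

if __name__ == "__main__":
    print(gpu_stack_report())
```

Running this at container start would let the ML application fail fast with a clear message if, say, CUDA is missing on the onboarded machine.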
Why
- Training a machine learning (ML) model requires a lot of processing power, which can be costly or difficult to obtain. GPUs are well suited to training artificial intelligence and deep learning models because they can run many computations in parallel.
When
- After we are able to onboard a device and show GPU stats on Nomad, as Israel did here.
Acceptance Criteria
- Code added to the nunet-demo-gpu repository;
- Merge request (split into more than one logical request if it is large);
- Code review (at least one approval);
- Unit tests written for new features;
- Pass the CI pipeline (ideally all of the items below should be in place, but if not, pass whichever checks we do have):
- Standard formatted code (beautify)
- Static code analysis
- Pass all unit tests
- UI tests
- API tests
- Stress tests
Work Breakdown Structure (WBS)
Task | Description | Duration (tentative) | Status | Start Date | End Date | Comment |
---|---|---|---|---|---|---|
A | Data storage provider: Define if we will use a weblink to access the MNIST dataset, or if we will host it somewhere else (on IPFS, on a separate docker volume on NuNet) | 1 Day | Done | July 1 2022 | July 1 2022 | #126 (comment 1000798643) - We chose weblink based on meeting |
B | Configure a TensorFlow and PyTorch shell for ML to be able to interact with ML applications running on NuNet | 3 Days | Done | July 8 2022 | July 12 2022 | #126 (comment 1020443522) |
C | Run the ML containers with an onboarded GPU on NuNet | 11 Days | Done | June 28 2022 | July 12 2022 | |
C.1 | Onboard a machine with GPU resources on NuNet | 1 Week | Done | June 28 2022 | July 6 2022 | 1. #126 (comment 1010571010) 2. #126 (comment 1017026480) |
C.2 | Get the workflow with PGWADA and deploy docker containers for TensorFlow & PyTorch | 6 Days | Done | July 5 2022 | July 12 2022 | 1. #126 (comment 1016249720) 2. #126 (comment 1023747394) |
C.3 | Run the ML containers on the onboarded machine in task C.1 | 4 Days | Done | July 7 2022 | July 12 2022 | 1. #126 (comment 1020443522) 2. #126 (comment 1023846565) |
D | Run ML applications with the onboarded GPU command line | 1 Day | Done | July 14 2022 | July 14 2022 | #126 (comment 1026919433) |
D.1 | Provide access to the dataset as defined in task A - Fashion MNIST using TensorFlow's Keras | 1 Day | Done | July 14 2022 | July 14 2022 | |
D.2 | Implement the application using the command line configured in task B (PGWADA) | 1 Day | Done | July 14 2022 | July 14 2022 | |
D.3 | Launch the ML training using the command line configured in task B to communicate with NuNet | 1 Day | Done | July 14 2022 | July 14 2022 | |
D.4 | Launch the inference test using the trained model (@kabir.kbr should the model be available at SingularityNet?) | 1 Day | Done | July 14 2022 | July 14 2022 | |
E | Create an additional option to pre-install PyTorch & TensorFlow when onboarding GPU | 1 Day | Done | July 13 2022 | July 13 2022 | #126 (comment 1025208429) |
Edited by Avimanyu Bandyopadhyay