Run an application on NuNet that uses GPU resources
Description
Who
- This story falls within the scope of Decentralized GPU ML Cloud -- Phase 1, which implies interactions with:
Externally:
- PGWADA is our partner for the milestone Decentralized GPU ML Cloud -- Phase 1;
- PGWADA has the skills to take part in testing and to help with development and architecture;
- PGWADA is willing to provide a machine to join the NuNet network for testing;
- @pgwadapool
Internally:
- @avimanyu786 and the core development team dealing with related issues;
- @dagiopia, @janaina.senna and @kabir.kbr for conceptual architecture and design decisions;
What
- Run an ML application on NuNet using GPU resources. PGWADA will code this application.
- Create a webapp, as suggested by PGWADA, that can launch a training run and, once training is complete, launch a test run. The webapp will provide the inputs needed to train the model. The model can be assumed to be available on SingularityNET. PGWADA will build this webapp.
- Define an API so that the webapp can interact with the ML application running on NuNet (perhaps something like a Google Colab environment). The NuNet team and PGWADA will define this API together.
- We will use handwritten digit images (the MNIST dataset) to train the model.
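As a starting point for the API discussion above, here is a minimal sketch of what the webapp-to-ML-application contract could look like. The endpoint names, payload fields and status codes are assumptions for illustration only, not the agreed spec; the dispatcher stands in for the service that would run inside the container on NuNet.

```python
# Hypothetical sketch of the webapp <-> ML-application API.
# Endpoints, fields and statuses are assumptions, not the final design.
import json

# In-memory stand-in for the ML application's job state.
JOBS = {}

def handle_request(method, path, body=None):
    """Dispatch a webapp call to the ML application (sketch only)."""
    if method == "POST" and path == "/train":
        # The webapp launches a training run with its input parameters.
        job_id = f"job-{len(JOBS) + 1}"
        JOBS[job_id] = {"status": "training", "params": body or {}}
        return 202, {"job_id": job_id}
    if method == "GET" and path.startswith("/status/"):
        # The webapp polls until training is complete.
        job_id = path.rsplit("/", 1)[-1]
        job = JOBS.get(job_id)
        if job is None:
            return 404, {"error": "unknown job"}
        return 200, {"job_id": job_id, "status": job["status"]}
    if method == "POST" and path == "/test":
        # Launched once training is complete, as described above.
        return 202, {"status": "test-started"}
    return 405, {"error": "unsupported"}

if __name__ == "__main__":
    code, resp = handle_request("POST", "/train", {"epochs": 5})
    print(code, json.dumps(resp))
```

In a real deployment these handlers would sit behind an HTTP server; the sketch only pins down the request/response shapes that the NuNet team and PGWADA would need to agree on.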
How
- In this issue, NuNet will connect five stakeholder types (see figure above):
- Application logic: the webapp described in the What section;
- AI developer: ML code packaged into docker container according to SNET and NuNet specifications;
- Compute providers: hardware with GPUs (we will use PGWADA's, Kabir's and Avi's machines);
- Data providers: MNIST dataset;
- Data storage providers: we can get the data directly from the source if it has an accessible weblink, or we can host it elsewhere (on IPFS or, at worst, in a separate docker volume on NuNet).
- NuNet will cover the providers (compute, data and storage) and PGWADA will cover the webapp and the ML code.
- The first test will be built with PyTorch or TensorFlow, depending on what NuNet supports: a simple model that learns to recognize handwritten digits from the MNIST dataset. We will use 5000 training samples, 500 validation samples and 1000 test samples.
- The ML application will run inside a docker container on NuNet with:
- OS: Ubuntu 20.04 LTS or 21.04
- PyTorch - an open source machine learning framework that accelerates the path from research prototyping to production deployment; or
- TensorFlow - an end-to-end open source machine learning platform;
- CUDA - a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs);
- cuDNN - the NVIDIA CUDA Deep Neural Network library, a GPU-accelerated library of primitives for deep neural networks.
- The code generated by the NuNet team related to data providers should be added to the nunet-demo-gpu repository (to be created). Code added or changed by the NuNet team in core modules, such as the NuNet Adapter, should go to the specific module's repository. Where to store other code will be decided as we go.
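The 5000/500/1000 split described above can be sketched as an index-based selection from MNIST's 60,000 training images. This is a minimal sketch only; the actual image loading via torchvision or Keras is omitted, and the sizes and seed are the values stated in this issue.

```python
# Sketch of the dataset split: 5000 train / 500 validation / 1000 test
# samples drawn without overlap from MNIST's 60k training images.
import random

def split_indices(n_total=60000, n_train=5000, n_val=500, n_test=1000, seed=42):
    """Return disjoint train/validation/test index lists."""
    rng = random.Random(seed)
    # Sample all needed indices at once so the three sets cannot overlap.
    picked = rng.sample(range(n_total), n_train + n_val + n_test)
    train = picked[:n_train]
    val = picked[n_train:n_train + n_val]
    test = picked[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices()
print(len(train_idx), len(val_idx), len(test_idx))  # 5000 500 1000
```

The index lists can then be passed to a framework-specific sampler (e.g. a PyTorch `SubsetRandomSampler`) once the dataset access method from task A is in place.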
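For the container environment listed above (CUDA/cuDNN plus PyTorch or TensorFlow), a small pre-flight check can confirm the GPU stack before launching training. This is a hedged sketch: the framework imports are optional so the script also runs on machines without a GPU stack, and the report keys are made up for illustration.

```python
# Sketch of a pre-flight check for the GPU container environment.
# Keys in the returned report are illustrative, not a NuNet interface.
import shutil

def gpu_stack_report():
    """Report which pieces of the GPU stack are visible to this process."""
    report = {"nvidia-smi": shutil.which("nvidia-smi") is not None}
    try:
        import torch  # present only if the PyTorch image is used
        report["torch_cuda"] = torch.cuda.is_available()
    except ImportError:
        report["torch_cuda"] = None
    try:
        import tensorflow as tf  # present only if the TensorFlow image is used
        report["tf_gpus"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        report["tf_gpus"] = None
    return report

if __name__ == "__main__":
    print(gpu_stack_report())
```

Running this at container start would let the ML application fail fast with a clear message if, say, CUDA is missing on the onboarded machine.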
Why
- Training a machine learning (ML) model requires a lot of processing power, which can be costly or difficult to obtain. GPUs are well suited to training artificial intelligence and deep learning models because they can run many computations in parallel.
When
- After we are able to onboard a device and show GPU stats on Nomad, as Israel did here.
Acceptance Criteria
- Code added to the nunet-demo-gpu repository;
- Merge request (split into more than one logical request if it is large);
- Code review (at least one approval);
- Unit tests written for new features;
- Pass the CI pipeline (ideally all of the items below should be in place, but if not, pass whichever checks we do have):
- Standard formatted code (beautify)
- Static code analysis
- Pass all unit tests
- UI tests
- API tests
- Stress tests
Work Breakdown Structure (WBS)
Task | Description | Duration (tentative) | Status | Start Date | End Date | Comment |
---|---|---|---|---|---|---|
A | Data storage provider: Define if we will use a weblink to access the MNIST dataset, or if we will host it somewhere else (on IPFS, on a separate docker volume on NuNet) | 1 Day | Done | July 1 2022 | July 1 2022 | #126 (comment 1000798643) - We chose weblink based on meeting |
B | Configure a TensorFlow and PyTorch shell for ML to be able to interact with ML applications running on NuNet | 3 Days | Done | July 8 2022 | July 12 2022 | #126 (comment 1020443522) |
C | Run the ML containers with an onboarded GPU on NuNet | 11 Days | Done | June 28 2022 | July 12 2022 | |
C.1 | Onboard a machine with GPU resources on NuNet | 1 Week | Done | June 28 2022 | July 6 2022 | 1. #126 (comment 1010571010) 2. #126 (comment 1017026480) |
C.2 | Get the workflow with PGWADA and deploy docker containers for TensorFlow & PyTorch | 6 Days | Done | July 5 2022 | July 12 2022 | 1. #126 (comment 1016249720) 2. #126 (comment 1023747394) |
C.3 | Run the ML containers on the onboarded machine in task C.1 | 4 Days | Done | July 7 2022 | July 12 2022 | 1. #126 (comment 1020443522) 2. #126 (comment 1023846565) |
D | Run ML applications with the onboarded GPU command line | 1 Day | Done | July 14 2022 | July 14 2022 | #126 (comment 1026919433) |
D.1 | Provide access to the dataset as defined in task A - Fashion MNIST using TensorFlow's Keras | 1 Day | Done | July 14 2022 | July 14 2022 | |
D.2 | Implement the application using the command line configured in task B (PGWADA) | 1 Day | Done | July 14 2022 | July 14 2022 | |
D.3 | Launch the ML training using the command line configured in task B to communicate with NuNet | 1 Day | Done | July 14 2022 | July 14 2022 | |
D.4 | Launch the inference test using the trained model (@kabir.kbr should the model be available at SingularityNet?) | 1 Day | Done | July 14 2022 | July 14 2022 | |
E | Create an additional option to pre-install PyTorch & TensorFlow when onboarding GPU | 1 Day | Done | July 13 2022 | July 13 2022 | #126 (comment 1025208429) |
Edited by Avimanyu Bandyopadhyay