Optimize Single GPU Allocation Per Container and Enable Inter-Container Communication for Distributed ML Jobs
Description
Who
The proposed changes and optimizations can be implemented, tested, and validated by @avimanyu786. The wider development and QA team should also be involved in testing and validation.
What
We have identified an optimization opportunity in our Docker container script for running ML jobs on decentralized networks such as those built on libp2p. The current script allocates a single GPU per container, which aligns well with the decentralized and parallel nature of these networks. However, inter-container communication can be improved for distributed ML jobs that span multiple containers in such networks.
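For illustration, here is a minimal sketch of this one-GPU-per-container allocation using the Docker SDK for Python. The real container script is not included in this issue and may well be shell-based; the image name, command, and GPU index below are placeholders.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Expose exactly one GPU (index "0" used as a placeholder) to the container,
# matching the one-GPU-per-container strategy described above.
container = client.containers.run(
    image="tensorflow/tensorflow:latest-gpu",  # placeholder image
    command="python ml_job.py",
    device_requests=[
        docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])
    ],
    detach=True,
)
print(f"Started {container.short_id} with a single GPU attached")
```

If the script stays shell-based, the same allocation corresponds to `docker run --gpus device=0 ...`.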
Why
Improving GPU allocation and inter-container communication will allow for more efficient utilization of available hardware resources, particularly in scenarios where multiple containers are performing a single ML job in a decentralized network. Addressing this could lead to significant performance improvements for distributed ML jobs and better alignment with the decentralized, parallel, and fault-tolerant design of these networks.
How
To optimize GPU allocation and inter-container communication in a decentralized network context, we propose the following changes to the current script:
- Maintain the strategy of assigning a single GPU per container. This aligns with the decentralization and parallelization inherent in networks like libp2p, simplifies the setup by avoiding complex coordination between multiple GPUs on a single node, minimizes communication overhead, and supports the fault-tolerant design of these networks.
- Implement libp2p or a similar technology to enhance inter-container communication. This will allow containers to share intermediate results, synchronize model parameters, and so on, thereby improving the efficiency of distributed ML jobs.
- Add functionality to the script to check the nature of the ML job and allocate GPUs and containers accordingly. If the ML job is designed for distributed computing in a decentralized network, it should be distributed across multiple containers, each with its own GPU (see the sketch after this list).
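As a counterpart on the container side, the sketch below shows how ml_job.py might pin itself to the single GPU visible inside its container and detect whether it is part of a distributed run. It assumes Horovod on TensorFlow (as used in WBS tasks B and C below); how the workers are launched across containers, for example over a Hyprspace/libp2p overlay, is outside the scope of this snippet.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; rank/size describe this worker's place in the overall job.
hvd.init()

# Each container is expected to see exactly one GPU, so local_rank() indexes
# into a single-element device list on that node.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

if hvd.size() > 1:
    # Distributed job: gradients/parameters are synchronized across containers.
    print(f"Worker {hvd.rank()} of {hvd.size()}: running as part of a distributed job")
else:
    print("Single-container job: no inter-container synchronization required")
```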
When
Given the potential for significant performance improvements and better alignment with decentralized network design principles, we recommend making these enhancements as soon as possible. The exact timeline should align with the current development schedule and resource availability.
Acceptance Criteria
- The Docker container script successfully assigns a single GPU to each container, aligning with the parallelization, node autonomy, communication patterns, and fault tolerance inherent in decentralized networks.
- Multiple containers running a single distributed ML job in a decentralized network can communicate efficiently with each other.
- The ml_job.py script, when designed for distributed computing in a decentralized network, can utilize the allocated GPU and distribute the computation across multiple containers.
- If high-severity security issues are detected, the ML job and its dependencies are not used and the system prints appropriate warning messages (see the sketch after this list).
- All existing functionality works as expected without any regressions.
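The security criterion above is stated at a high level; the sketch below only illustrates the intended gating behaviour, assuming a hypothetical scan step that returns findings with id, package, and severity fields (the actual scanner and severity scale are not specified in this issue).

```python
from typing import Dict, List


def should_run_job(findings: List[Dict[str, str]]) -> bool:
    """Block the ML job when any dependency finding is high severity."""
    high = [f for f in findings if f.get("severity", "").lower() in ("high", "critical")]
    for f in high:
        print(f"WARNING: high severity issue {f['id']} in {f['package']}; "
              "the ML job and its dependencies will not be used.")
    return not high


# Hypothetical scanner output, used purely for illustration.
findings = [{"id": "EXAMPLE-0001", "package": "example-lib", "severity": "high"}]
if not should_run_job(findings):
    raise SystemExit(1)
```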
Endnote
This optimization needs thorough testing, especially in decentralized network environments with multiple GPUs and containers, to ensure that the changes work as expected and that they truly enhance the performance of distributed ML jobs. The specifics of the ML job, as well as the capabilities of the hardware and software, should also be considered in this evaluation.
Work Breakdown Structure (WBS)
| Task | Description | Duration | Status | Start Date | End Date | Comment |
|---|---|---|---|---|---|---|
| A | Optimize single GPU allocation per container workflow | 6 Hours | Done | July 14 2023 | July 14 2023 | |
| B | Develop TensorFlow Dockerfile (NVIDIA GPU) for Distributed Computing on Horovod | 2 Hours | Done | July 14 2023 | July 14 2023 | |
| C | Develop Horovod version of Fashion MNIST code for Distributed Computing on TensorFlow | 4 Hours | Done | July 17 2023 | July 17 2023 | |
| D | Develop TensorFlow Dockerfile (AMD GPU) for Distributed Computing on Horovod | 2 Hours | Done | July 17 2023 | July 17 2023 | |
| E | Debugging an issue with AMD GPU visibility for jobs on Horovod containers | 3 Hours | Done | July 24 2023 | July 24 2023 | |
| F | Cross-vendor (AMD+NVIDIA) GPU Testing of TensorFlow-based code on AMD Radeon VII (machine 1) + NVIDIA RTX 2060 (machine 2) as a single job | 1 Hour | Done | July 24 2023 | July 24 2023 | |
| G | Demo session with @sam.lake to demonstrate how each NVIDIA GPU-bound container shares a single job, to inform research on extending the functionality as P2P on other machines | 30 minutes | Done | July 19 2023 | July 19 2023 | |
| H | Successful testing session with @sam.lake to demonstrate how NVIDIA GPU-bound containers, one on each of two machines, can share a single job through Hyprspace | 30 minutes | Done | July 21 2023 | July 21 2023 | |
| I | Successful testing session with @sam.lake to demonstrate how an AMD GPU-bound container and an NVIDIA GPU-bound container, each on a separate machine, can share a single job through Hyprspace | 30 minutes | Done | July 24 2023 | July 24 2023 | |
| J | PyTorch equivalent tasks for the above workflows (NVIDIA GPUs) | 4 Days | Done | July 24 2023 | July 28 2023 | |
| K | PyTorch equivalent tasks for the above workflows (AMD+NVIDIA GPU) | 3 Days | Infeasible | July 31 2023 | Aug 2 2023 | |
| L | Cross-vendor (AMD+NVIDIA) GPU Testing of TensorFlow-based code on AMD Radeon VII (machine 1) + NVIDIA RTX 2060 (machine 2) as a single job (improved version) on Hyprspace | 8 Hours | Done | July 25 2023 | July 25 2023 | https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/horovod/horovod-tf-job-hyprspace-test.py |
| M | Researching alternative ways to synchronize Horovod jobs | 3 Days | Done | July 26 2023 | July 28 2023 | |
| M.1 | Brainstorming on a peer-to-peer mechanism/readiness for inter-container communication on Hyprspace | 3 Days | Done | July 26 2023 | July 28 2023 | |
| M.2 | Developed a mechanism to auto-generate Hyprspace configuration for all workers - this can be an SPD feature | 8 Hours | Done | July 27 2023 | July 27 2023 | |