Optimize Single GPU Allocation Per Container and Enable Inter-Container Communication for Distributed ML Jobs
Description
Who
The proposed changes and optimizations can be implemented, tested, and validated by @avimanyu786. The wider development and QA team should also be involved in testing and validation.
What
We have identified an optimization opportunity in our Docker container script for running ML jobs on decentralized networks such as those built on libp2p. The current script allocates a single GPU per container, which aligns well with the decentralized and parallel nature of these networks. However, inter-container communication can be improved for distributed ML jobs that span multiple containers in such networks.
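For illustration, here is a minimal sketch of this one-GPU-per-container allocation using the Docker SDK for Python. The real container script is not included in this issue and may well be shell-based; the image name, command, and GPU index below are placeholders.

```python
import docker  # Docker SDK for Python

client = docker.from_env()

# Expose exactly one GPU (index "0" used as a placeholder) to the container,
# matching the one-GPU-per-container strategy described above.
container = client.containers.run(
    image="tensorflow/tensorflow:latest-gpu",  # placeholder image
    command="python ml_job.py",
    device_requests=[
        docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])
    ],
    detach=True,
)
print(f"Started {container.short_id} with a single GPU attached")
```

If the script stays shell-based, the same allocation corresponds to `docker run --gpus device=0 ...`.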
Why
Improving GPU allocation and inter-container communication will allow for more efficient utilization of available hardware resources, particularly in scenarios where multiple containers are performing a single ML job in a decentralized network. Addressing this could lead to significant performance improvements for distributed ML jobs and better alignment with the decentralized, parallel, and fault-tolerant design of these networks.
How
To optimize GPU allocation and inter-container communication in a decentralized network context, we propose the following changes to the current script:
- Maintain the strategy of assigning a single GPU per container. This aligns with the decentralization and parallelization inherent in networks like libp2p, simplifies the setup by avoiding complex coordination between multiple GPUs on a single node, minimizes communication overhead, and supports the fault-tolerant design of these networks.
- Implement libp2p or a similar technology to enhance inter-container communication. This will allow containers to share intermediate results, synchronize model parameters, and so on, thereby improving the efficiency of distributed ML jobs.
- Add functionality to the script to check the nature of the ML job and allocate GPUs and containers accordingly. If the ML job is designed for distributed computing in a decentralized network, it should be distributed across multiple containers, each with its own GPU (see the sketch after this list).
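As a counterpart on the container side, the sketch below shows how ml_job.py might pin itself to the single GPU visible inside its container and detect whether it is part of a distributed run. It assumes Horovod on TensorFlow (as used in WBS tasks B and C below); how the workers are launched across containers, for example over a Hyprspace/libp2p overlay, is outside the scope of this snippet.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod; rank/size describe this worker's place in the overall job.
hvd.init()

# Each container is expected to see exactly one GPU, so local_rank() indexes
# into a single-element device list on that node.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], True)
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

if hvd.size() > 1:
    # Distributed job: gradients/parameters are synchronized across containers.
    print(f"Worker {hvd.rank()} of {hvd.size()}: running as part of a distributed job")
else:
    print("Single-container job: no inter-container synchronization required")
```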
When
Given the potential for significant performance improvements and better alignment with decentralized network design principles, we recommend making these enhancements as soon as possible. The exact timeline should align with the current development schedule and resource availability.
Acceptance Criteria
- The Docker container script successfully assigns a single GPU to each container, aligning with the parallelization, node autonomy, communication patterns, and fault tolerance inherent in decentralized networks.
- Multiple containers running a single distributed ML job in a decentralized network can communicate efficiently with each other.
- The ml_job.py script, when designed for distributed computing in a decentralized network, can utilize the allocated GPU and distribute the computation across multiple containers.
- If high-severity security issues are detected, the ML job and its dependencies are not used and the system prints appropriate warning messages (see the sketch after this list).
- All existing functionality works as expected without any regressions.
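The security criterion above is stated at a high level; the sketch below only illustrates the intended gating behaviour, assuming a hypothetical scan step that returns findings with id, package, and severity fields (the actual scanner and severity scale are not specified in this issue).

```python
from typing import Dict, List


def should_run_job(findings: List[Dict[str, str]]) -> bool:
    """Block the ML job when any dependency finding is high severity."""
    high = [f for f in findings if f.get("severity", "").lower() in ("high", "critical")]
    for f in high:
        print(f"WARNING: high severity issue {f['id']} in {f['package']}; "
              "the ML job and its dependencies will not be used.")
    return not high


# Hypothetical scanner output, used purely for illustration.
findings = [{"id": "EXAMPLE-0001", "package": "example-lib", "severity": "high"}]
if not should_run_job(findings):
    raise SystemExit(1)
```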
Endnote
This optimization needs thorough testing, especially in decentralized network environments with multiple GPUs and containers, to ensure that the changes work as expected and that they truly enhance the performance of distributed ML jobs. The specifics of the ML job, as well as the capabilities of the hardware and software, should also be considered in this evaluation.
Work Breakdown Structure (WBS)
| Task | Description | Duration | Status | Start Date | End Date | Comment |
|---|---|---|---|---|---|---|
| A | Optimize single GPU allocation per container workflow | 6 Hours | Done | July 14 2023 | July 14 2023 | |
| B | Develop TensorFlow Dockerfile (NVIDIA GPU) for Distributed Computing on Horovod | 2 Hours | Done | July 14 2023 | July 14 2023 | |
| C | Develop Horovod version of Fashion MNIST code for Distributed Computing on TensorFlow | 4 Hours | Done | July 17 2023 | July 17 2023 | |
| D | Develop TensorFlow Dockerfile (AMD GPU) for Distributed Computing on Horovod | 2 Hours | Done | July 17 2023 | July 17 2023 | |
| E | Debugging an issue with AMD GPU visibility for jobs on Horovod containers | 3 Hours | Done | July 24 2023 | July 24 2023 | |
| F | Cross-vendor (AMD+NVIDIA) GPU Testing of TensorFlow-based code on AMD Radeon VII (machine 1) + NVIDIA RTX 2060 (machine 2) as a single job | 1 Hour | Done | July 24 2023 | July 24 2023 | |
| G | Demo session with @sam.lake to demonstrate how each NVIDIA GPU-bound container shares a single job, to inform research on extending the functionality as P2P on other machines | 30 minutes | Done | July 19 2023 | July 19 2023 | |
| H | Successful testing session with @sam.lake to demonstrate how NVIDIA GPU-bound containers, one on each of two machines, can share a single job through Hyprspace | 30 minutes | Done | July 21 2023 | July 21 2023 | |
| I | Successful testing session with @sam.lake to demonstrate how an AMD GPU-bound container and an NVIDIA GPU-bound container, each on a separate machine, can share a single job through Hyprspace | 30 minutes | Done | July 24 2023 | July 24 2023 | |
| J | PyTorch equivalent tasks for the above workflows (NVIDIA GPUs) | 4 Days | Done | July 24 2023 | July 28 2023 | |
| K | PyTorch equivalent tasks for the above workflows (AMD+NVIDIA GPU) | 3 Days | Infeasible | July 31 2023 | Aug 2 2023 | |
| L | Cross-vendor (AMD+NVIDIA) GPU Testing of TensorFlow-based code on AMD Radeon VII (machine 1) + NVIDIA RTX 2060 (machine 2) as a single job (improved version) on Hyprspace | 8 Hours | Done | July 25 2023 | July 25 2023 | https://gitlab.com/nunet/ml-on-gpu/ml-on-gpu-service/-/raw/develop/examples/horovod/horovod-tf-job-hyprspace-test.py |
| M | Researching alternative ways to synchronize Horovod jobs | 3 Days | Done | July 26 2023 | July 28 2023 | |
| M.1 | Brainstorming on a peer-to-peer mechanism/readiness for inter-container communication on Hyprspace | 3 Days | Done | July 26 2023 | July 28 2023 | |
| M.2 | Developed a mechanism to auto-generate Hyprspace configuration for all workers - this can be an SPD feature | 8 Hours | Done | July 27 2023 | July 27 2023 | |