Feature Request: Distributed GPU Training Support in GitLab CI
Problem to solve
GitLab GPU runners are currently limited to executing workloads on a single machine. This constraint presents challenges for teams requiring the distributed training capabilities that are essential for AI/ML workloads. For example:
- Organizations must rely on alternative platforms (e.g. Kubernetes, Ray, Spark) for distributed training. This requires them to set up and maintain external compute clusters, which adds complexity and operational overhead, and it forces them to train development teams on yet another system
- Development teams are forced to exit the GitLab ecosystem when dealing with multi-node, multi-GPU workloads. This results in a fragmented workflow with all of the negative consequences of context-switching
- Modern AI/ML tasks—for example, training large models or handling high-demand inference—demand the ability to scale horizontally across multiple GPUs and nodes, which is not currently possible in GitLab. This creates a barrier to migrating AI/ML workloads into GitLab
Proposal
Enhance GitLab CI by integrating native support for distributed GPU training across multiple runners in order to facilitate orchestration of AI/ML and Data Science workloads without needing to rely on external compute clusters. This could involve:
- Developing an internal orchestration layer to manage job distribution across multiple GPU runners
- Supporting popular training libraries such as PyTorch DDP, Ray, and Spark
- Abstracting the complexities of horizontal scaling so that developers can focus on model development rather than infrastructure management
- Providing an easy way for developers to specify resource requirements
- Defining practical scaling limits aligned with common use-cases (e.g. supporting up to ~8 GPUs/nodes per job for inference tasks)
- Integrating seamlessly with current GitLab CI capabilities and workflows
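To make the proposal concrete, the job definition below sketches what such a pipeline might look like. This is purely illustrative: `parallel:` and `tags:` are existing GitLab CI keywords, but the `distributed:` block and its `gpus_per_node`/`rendezvous` settings are hypothetical syntax invented for this request, not anything GitLab supports today.

```yaml
# Hypothetical syntax sketch — the `distributed:` block does not exist in
# GitLab CI today; `parallel:` and `tags:` are existing keywords.
train-model:
  stage: train
  tags: [gpu]
  parallel: 4                 # existing keyword: spawn 4 parallel jobs
  distributed:                # proposed: coordinate the 4 jobs as one group
    gpus_per_node: 2
    rendezvous: automatic     # runners exchange master address/port
  script:
    - torchrun --nnodes=4 --nproc_per_node=2 train.py
```

In this sketch, the orchestration layer would treat the four parallel jobs as one training group and inject the rendezvous details each worker needs, so the `script:` section stays identical to what a developer would run on a standalone cluster.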
The benefits of implementing this feature include:
- Keeping all CI and AI/ML workflows within GitLab
- Simplifying horizontal scaling complexities so developers can specify resource needs without having to manage the underlying infrastructure
- Streamlining the deployment process by removing the need to transition to external compute clusters for distributed workloads
- Positioning GitLab as a robust solution for AI/ML and Data Science teams
Primary use-cases:
- Facilitating the training of large-scale models across multiple GPU runners using frameworks such as PyTorch DDP: https://pytorch.org/tutorials/beginner/dist_overview.html
- Supporting high-demand inference tasks by efficiently distributing workloads across available GPU resources
- Enabling multi-step AI workflows that require coordination across various resources
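For the PyTorch DDP use-case above, the main coordination task for an orchestration layer is deriving the environment variables that `torch.distributed`'s `env://` rendezvous expects (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`). The helper below is a minimal sketch of that derivation; the function name and its parameters are illustrative, not part of any existing GitLab API.

```python
# Hypothetical sketch: how an orchestration layer could compute the
# environment variables PyTorch DDP's env:// rendezvous expects for one
# worker process. `ddp_env` and its parameters are illustrative only.

def ddp_env(node_index: int, local_rank: int, nodes: int,
            gpus_per_node: int, master_addr: str,
            master_port: int = 29500) -> dict:
    """Build the env block for one worker in a multi-node training job."""
    return {
        "MASTER_ADDR": master_addr,                # rank-0 node's address
        "MASTER_PORT": str(master_port),           # rendezvous port
        "WORLD_SIZE": str(nodes * gpus_per_node),  # total worker count
        "RANK": str(node_index * gpus_per_node + local_rank),  # global rank
        "LOCAL_RANK": str(local_rank),             # GPU index on this node
    }
```

For example, the first GPU on the second of two 4-GPU runners would get `RANK="4"` and `WORLD_SIZE="8"`, which is exactly what a `torchrun`-style launcher computes per process today; the feature proposed here would move that bookkeeping into the runner infrastructure.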