Feature Request: Distributed GPU Training Support in GitLab CI
Problem to solve
GitLab GPU runners are currently limited to executing workloads on a single machine. This constraint presents challenges for teams requiring the distributed training capabilities that are essential for AI/ML workloads. For example:
- Organizations must rely on alternative platforms (e.g. Kubernetes, Ray, Spark) for distributed training. This requires them to set up and maintain external compute clusters, which adds complexity and operational overhead, and it forces them to train development teams on yet another system
- Development teams are forced to exit the GitLab ecosystem when dealing with multi-node, multi-GPU workloads. This results in a fragmented workflow with all of the negative consequences of context-switching
- Modern AI/ML tasks—for example, training large models or handling high-demand inference—demand the ability to scale horizontally across multiple GPUs and nodes, which is not currently possible in GitLab. This creates a barrier to migrating AI/ML workloads into GitLab
Proposal
Enhance GitLab CI by integrating native support for distributed GPU training across multiple runners in order to facilitate orchestration of AI/ML and Data Science workloads without needing to rely on external compute clusters. This could involve:
- Developing an internal orchestration layer to manage job distribution across multiple GPU runners
- Supporting popular training libraries such as PyTorch DDP, Ray, and Spark
- Abstracting the complexities of horizontal scaling so that developers can focus on model development rather than infrastructure management
- Providing an easy way for developers to specify resource requirements
- Defining practical scaling limits aligned with common use-cases (e.g. supporting up to ~8 GPUs/nodes per job for inference tasks)
- Integrating seamlessly with current GitLab CI capabilities and workflows
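To make the proposal concrete, the job definition below sketches what such a pipeline might look like. This is purely illustrative: `parallel:` and `tags:` are existing GitLab CI keywords, but the `distributed:` block and its `gpus_per_node`/`rendezvous` settings are hypothetical syntax invented for this request, not anything GitLab supports today.

```yaml
# Hypothetical syntax sketch — the `distributed:` block does not exist in
# GitLab CI today; `parallel:` and `tags:` are existing keywords.
train-model:
  stage: train
  tags: [gpu]
  parallel: 4                 # existing keyword: spawn 4 parallel jobs
  distributed:                # proposed: coordinate the 4 jobs as one group
    gpus_per_node: 2
    rendezvous: automatic     # runners exchange master address/port
  script:
    - torchrun --nnodes=4 --nproc_per_node=2 train.py
```

In this sketch, the orchestration layer would treat the four parallel jobs as one training group and inject the rendezvous details each worker needs, so the `script:` section stays identical to what a developer would run on a standalone cluster.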
The benefits of implementing this feature include:
- Keeping all CI and AI/ML workflows within GitLab
- Simplifying horizontal scaling complexities so developers can specify resource needs without having to manage the underlying infrastructure
- Streamlining the deployment process by removing the need to transition to external compute clusters for distributed workloads
- Positioning GitLab as a robust solution for AI/ML and Data Science teams
Primary use-cases:
- Facilitating the training of large-scale models across multiple GPU runners using frameworks such as PyTorch DDP: https://pytorch.org/tutorials/beginner/dist_overview.html
- Supporting high-demand inference tasks by efficiently distributing workloads across available GPU resources
- Enabling multi-step AI workflows that require coordination across various resources
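For the PyTorch DDP use-case above, the main coordination task for an orchestration layer is deriving the environment variables that `torch.distributed`'s `env://` rendezvous expects (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`, `LOCAL_RANK`). The helper below is a minimal sketch of that derivation; the function name and its parameters are illustrative, not part of any existing GitLab API.

```python
# Hypothetical sketch: how an orchestration layer could compute the
# environment variables PyTorch DDP's env:// rendezvous expects for one
# worker process. `ddp_env` and its parameters are illustrative only.

def ddp_env(node_index: int, local_rank: int, nodes: int,
            gpus_per_node: int, master_addr: str,
            master_port: int = 29500) -> dict:
    """Build the env block for one worker in a multi-node training job."""
    return {
        "MASTER_ADDR": master_addr,                # rank-0 node's address
        "MASTER_PORT": str(master_port),           # rendezvous port
        "WORLD_SIZE": str(nodes * gpus_per_node),  # total worker count
        "RANK": str(node_index * gpus_per_node + local_rank),  # global rank
        "LOCAL_RANK": str(local_rank),             # GPU index on this node
    }
```

For example, the first GPU on the second of two 4-GPU runners would get `RANK="4"` and `WORLD_SIZE="8"`, which is exactly what a `torchrun`-style launcher computes per process today; the feature proposed here would move that bookkeeping into the runner infrastructure.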