High Availability support for Runner Manager with Docker Autoscaler and AWS Fleet
Proposal
High Availability Runner Manager with Docker Autoscaler and AWS Fleet Support
Overview
Enable GitLab Runner Manager to operate in a highly available configuration, eliminating single points of failure and enabling zero-downtime deployments. This feature will allow multiple Runner Manager instances to coordinate through a shared state backend, ensuring continuous CI/CD pipeline execution even during instance replacements, updates, or failures.
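The proposal does not say how instances would coordinate through the shared state backend. One common approach is a time-limited lease: a single instance holds the lease and renews it, and another instance takes over once the lease expires. The following is a minimal sketch of that idea only, not the planned design. The class `InMemoryStateBackend`, its methods, the key `fleet-lease`, and the instance names are all hypothetical, and the in-memory dictionary stands in for a real backend, which would need an atomic compare-and-set operation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str        # ID of the instance that currently owns the lease
    expires_at: float  # time (in seconds) at which the lease lapses


class InMemoryStateBackend:
    """Hypothetical stand-in for a shared state backend (assumption)."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, key: str, holder: str, ttl: float, now: float) -> bool:
        """Acquire or renew the lease; return True on success.

        The lease is granted if it is free, expired, or already held by
        the caller (a renewal). A real backend must do this atomically.
        """
        lease = self._leases.get(key)
        if lease is None or lease.expires_at <= now or lease.holder == holder:
            self._leases[key] = Lease(holder=holder, expires_at=now + ttl)
            return True
        return False

    def holder(self, key: str, now: float) -> Optional[str]:
        """Return the current lease holder, or None if free or expired."""
        lease = self._leases.get(key)
        if lease is None or lease.expires_at <= now:
            return None
        return lease.holder


def demo() -> Optional[str]:
    backend = InMemoryStateBackend()
    # manager-a becomes active at t=0 with a 10-second lease.
    backend.try_acquire("fleet-lease", "manager-a", ttl=10.0, now=0.0)
    # manager-b is refused while the lease is still valid.
    backend.try_acquire("fleet-lease", "manager-b", ttl=10.0, now=5.0)
    # manager-a stops renewing; manager-b takes over after expiry.
    backend.try_acquire("fleet-lease", "manager-b", ttl=10.0, now=11.0)
    return backend.holder("fleet-lease", now=12.0)
```

In this sketch, time is passed in explicitly as `now` so the takeover behaviour can be checked without waiting; a real instance would read a clock and renew the lease on a timer.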
Problem Statement
Currently, each Runner Manager instance is a single point of failure. When a Runner Manager must be replaced or updated, or fails unexpectedly:
- All associated runners and jobs are affected
- Pipeline failures occur
- Manual intervention is required to restore service
- Maintenance windows cause CI/CD downtime
Success Criteria
- Availability: "Zero downtime" for HA Runner deployments
- Failover Performance: Complete failover in < 30 seconds
- Job Continuity: "Zero" job failures during planned maintenance
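To make the 30-second failover target concrete, the sketch below estimates worst-case failover time under one assumed detection scheme: the active instance holds a time-limited lease, and a standby instance polls for its expiry. Under that assumption, the worst case is roughly the lease duration (the active instance fails just after renewing), plus one poll interval, plus the time the standby needs to take over. The function names and all parameter values are illustrative assumptions, not figures from the proposal.

```python
def worst_case_failover_seconds(
    lease_ttl: float, poll_interval: float, takeover_time: float
) -> float:
    """Upper bound on failover time under the assumed lease-and-poll scheme.

    lease_ttl:      how long a lease stays valid after its last renewal
    poll_interval:  how often a standby checks whether the lease expired
    takeover_time:  how long the standby needs to start serving jobs
    """
    return lease_ttl + poll_interval + takeover_time


def meets_target(
    lease_ttl: float,
    poll_interval: float,
    takeover_time: float,
    target_seconds: float = 30.0,
) -> bool:
    """Check the bound against the "< 30 seconds" success criterion."""
    bound = worst_case_failover_seconds(lease_ttl, poll_interval, takeover_time)
    return bound < target_seconds


# Illustrative values only: 15 s lease, 5 s polling, 5 s takeover.
example_bound = worst_case_failover_seconds(15.0, 5.0, 5.0)
example_ok = meets_target(15.0, 5.0, 5.0)
```

With these illustrative values the bound is 25 seconds, which fits the target; raising the lease duration to 20 seconds would put the bound at exactly 30 seconds and no longer satisfy a strict "< 30 seconds" criterion.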
Related Issues and Documentation