High Availability support for Runner Manager with Docker Autoscaler and AWS Fleet

Proposal

High Availability Runner Manager with Docker Autoscaler and AWS Fleet Support

Overview

Enable GitLab Runner Manager to operate in a highly available configuration, eliminating single points of failure and enabling zero-downtime deployments. This feature will allow multiple Runner Manager instances to coordinate through a shared state backend, ensuring continuous CI/CD pipeline execution even during instance replacements, updates, or failures.

Problem Statement

Currently, each Runner Manager instance is a single point of failure. When a Runner Manager needs to be replaced or updated, or fails unexpectedly:

All associated runners and jobs are affected
Pipeline failures occur
Manual intervention is required to restore service
Maintenance windows cause CI/CD downtime

Success Criteria

Availability: "Zero downtime" for HA Runner deployments
Failover Performance: Complete failover in < 30 seconds
Job Continuity: "Zero" job failures during planned maintenance