High Availability support for Runner Manager with Docker Autoscaler and AWS Fleet

Proposal

High Availability Runner Manager with Docker Autoscaler and AWS Fleet Support

Overview

Enable GitLab Runner Manager to operate in a highly available configuration, eliminating single points of failure and enabling zero-downtime deployments. This feature will allow multiple Runner Manager instances to coordinate through a shared state backend, ensuring continuous CI/CD pipeline execution even during instance replacements, updates, or failures.
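One way to picture the coordination described above is a time-limited lease held in the shared state backend. The sketch below is illustrative only and is not GitLab Runner's design. The backend is an in-memory stand-in, and every name in it (`SharedStateBackend`, `RunnerManager`, `LEASE_KEY`, the 10-second TTL) is a hypothetical chosen for the example. A real backend would need an atomic compare-and-set for the acquire step.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical key under which the active-manager lease is stored.
LEASE_KEY = "runner-manager-leader"


@dataclass
class Lease:
    holder: str        # ID of the manager instance that currently holds the lease
    expires_at: float  # time (in seconds) at which the lease becomes stale


class SharedStateBackend:
    """In-memory stand-in for a shared state backend (assumption, not a real API)."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, key: str, holder: str, now: float, ttl: float) -> bool:
        """Acquire or renew the lease if it is free, expired, or already ours."""
        lease = self._leases.get(key)
        if lease is None or lease.expires_at <= now or lease.holder == holder:
            self._leases[key] = Lease(holder, now + ttl)
            return True
        return False

    def current_holder(self, key: str, now: float) -> Optional[str]:
        """Return the holder of a live lease, or None if no live lease exists."""
        lease = self._leases.get(key)
        if lease is None or lease.expires_at <= now:
            return None
        return lease.holder


class RunnerManager:
    """A manager instance that is active only while it holds the lease."""

    def __init__(self, instance_id: str, backend: SharedStateBackend, ttl: float = 10.0) -> None:
        self.instance_id = instance_id
        self.backend = backend
        self.ttl = ttl
        self.is_leader = False

    def heartbeat(self, now: float) -> bool:
        """Try to take or renew the lease; return whether this instance is active."""
        self.is_leader = self.backend.try_acquire(LEASE_KEY, self.instance_id, now, self.ttl)
        return self.is_leader


if __name__ == "__main__":
    backend = SharedStateBackend()
    a = RunnerManager("manager-a", backend)
    b = RunnerManager("manager-b", backend)

    a.heartbeat(0.0)   # manager-a takes the lease
    b.heartbeat(1.0)   # manager-b stays on standby while the lease is live
    # manager-a stops sending heartbeats (failure or replacement)
    b.heartbeat(10.0)  # the lease has expired, so manager-b takes over
    print(backend.current_holder(LEASE_KEY, 10.0))
```

In this sketch the worst-case takeover delay is bounded by the lease TTL plus the standby's heartbeat interval, so both values would have to be chosen to fit whatever failover target the deployment sets.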

Problem Statement

Currently, each Runner Manager instance is a single point of failure. When a Runner Manager needs to be replaced or updated, or fails unexpectedly:

  • All associated runners and jobs are affected
  • Pipeline failures occur
  • Manual intervention is required to restore service
  • Maintenance windows cause CI/CD downtime

Success Criteria

  • Availability: Zero downtime for HA Runner deployments
  • Failover Performance: Failover completes in under 30 seconds
  • Job Continuity: Zero job failures during planned maintenance

Related Issues and Documentation
