High Availability support for Runner Manager with Docker Autoscaler and AWS Fleet

Proposal

High Availability Runner Manager with Docker Autoscaler and AWS Fleet Support

Overview

Enable GitLab Runner Manager to operate in a highly available configuration, eliminating single points of failure and enabling zero-downtime deployments. This feature will allow multiple Runner Manager instances to coordinate through a shared state backend, ensuring continuous CI/CD pipeline execution even during instance replacements, updates, or failures.
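One way to picture the coordination described above is a lease kept in the shared state backend: the manager holding an unexpired lease is active, and a standby takes over once that lease lapses. The sketch below is illustrative only and is not part of this proposal. `InMemoryStateBackend`, `try_acquire`, `current_holder`, the key, and the manager names are all hypothetical, and an in-memory dictionary stands in for whatever backend is eventually chosen.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str        # ID of the Runner Manager holding the lease
    expires_at: float  # timestamp (seconds) after which the lease is void


class InMemoryStateBackend:
    """Hypothetical stand-in for a shared state backend (not a GitLab Runner API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, key: str, holder: str, ttl: float, now: float) -> bool:
        """Acquire or renew the lease if it is free, expired, or already ours."""
        with self._lock:
            lease = self._leases.get(key)
            if lease is None or lease.expires_at <= now or lease.holder == holder:
                self._leases[key] = Lease(holder, now + ttl)
                return True
            return False

    def current_holder(self, key: str, now: float) -> Optional[str]:
        """Return the active holder, or None if the lease is free or expired."""
        with self._lock:
            lease = self._leases.get(key)
            if lease is None or lease.expires_at <= now:
                return None
            return lease.holder


if __name__ == "__main__":
    backend = InMemoryStateBackend()
    key, ttl = "runner-manager/active", 10.0  # hypothetical key and TTL

    # manager-a becomes active; manager-b is refused while the lease is live.
    print(backend.try_acquire(key, "manager-a", ttl, now=0.0))
    print(backend.try_acquire(key, "manager-b", ttl, now=5.0))
    # manager-a stops renewing (crash or replacement); manager-b takes over.
    print(backend.try_acquire(key, "manager-b", ttl, now=10.0))
    print(backend.current_holder(key, now=10.0))
```

Timestamps are passed in explicitly so the takeover sequence is deterministic. A real deployment would need a backend with atomic compare-and-set semantics and a clock source that all managers agree on.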

Problem Statement

Currently, each Runner Manager instance is a single point of failure. When a Runner Manager is replaced, updated, or fails unexpectedly:

  • All associated runners and jobs are affected
  • Pipeline failures occur
  • Manual intervention is required to restore service
  • Maintenance windows cause CI/CD downtime

Success Criteria

  • Availability: Zero downtime for HA Runner deployments
  • Failover Performance: Failover completes in under 30 seconds
  • Job Continuity: Zero job failures during planned maintenance
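The 30-second failover target can be turned into a rough timing budget. The sketch below assumes a lease-based failover scheme, which this proposal does not specify: the active manager may fail immediately after renewing its lease, a standby notices the expiry on its next poll, and then needs some time to take over. The function names and the example values are hypothetical.

```python
FAILOVER_TARGET_SECONDS = 30.0  # from the success criteria above


def worst_case_failover(lease_ttl: float, poll_interval: float, takeover_time: float) -> float:
    """Upper bound on failover time under the assumed lease-based scheme.

    lease_ttl:     the active manager fails right after renewing, so the
                   stale lease remains valid for a full TTL.
    poll_interval: the standby checks the lease periodically, so it may
                   notice the expiry up to one interval late.
    takeover_time: time for the standby to start serving after detection.
    """
    return lease_ttl + poll_interval + takeover_time


def meets_target(lease_ttl: float, poll_interval: float, takeover_time: float) -> bool:
    """True if the worst case stays strictly under the 30-second target."""
    return worst_case_failover(lease_ttl, poll_interval, takeover_time) < FAILOVER_TARGET_SECONDS


if __name__ == "__main__":
    # Hypothetical settings: 15 s lease, 5 s poll, 5 s takeover.
    print(worst_case_failover(15.0, 5.0, 5.0), meets_target(15.0, 5.0, 5.0))
    # A 20 s lease with the same poll and takeover times exhausts the budget.
    print(worst_case_failover(20.0, 5.0, 5.0), meets_target(20.0, 5.0, 5.0))
```

Under these assumptions, shortening the lease TTL is the most direct way to stay within the target, at the cost of more frequent renewals against the shared state backend.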

Related Issues and Documentation

  • Runner Fleet Scaling Architecture
  • Docker Autoscaler Documentation
  • AWS Fleet Integration Guide
Edited Jul 08, 2025 by 🤖 GitLab Bot 🤖