Enable independent Gitaly deployments without Rails rollout order dependency
Executive Summary
GitLab's current deployment architecture requires Gitaly to be upgraded before Rails components to maintain zero downtime, creating significant operational complexity and limiting deployment flexibility. This constraint is particularly problematic for Kubernetes environments where independent component rollouts are the default behavior.
Business Impact:
- GitLab Dedicated: Risk of losing zero-downtime upgrade capabilities in production
- Self-Managed: Blocking progress toward offering zero-downtime upgrades on cloud native deployments
- Operational Overhead: Complex orchestration requirements across all deployment methods
Root Cause: Lack of backward-forward API compatibility between Gitaly and Rails components means deploying Rails before Gitaly results in failed gRPC calls and potential outages lasting the entire rollout window.
Proposed Solutions:
- Long-term: Build API compatibility to enable independent, out-of-order deployments
- Short-term: Implement orchestration-based solutions with separate Helm charts
Overview
This issue documents the technical requirements and challenges for deploying Gitaly independently from GitLab Rails applications in all environments, based on discussions about rollout order dependencies and zero downtime upgrade capabilities across different GitLab deployment scenarios (Slack link).
Problem Statement
Currently, GitLab production deployments require careful orchestration to ensure Gitaly is upgraded before Rails components. This dependency creates significant operational complexity and limits deployment flexibility, particularly in Kubernetes environments where independent component rollouts are the default behavior for software released as a single Helm chart.
Key Question: Can we remove the constraint that requires Gitaly to be deployed before Rails components?
As of today, if we introduce a new gRPC call in Gitaly and deploy the Rails components first, Rails will start invoking an RPC that does not yet exist on the server, and those requests will fail.
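To make the failure mode concrete, here is a minimal Go sketch of the client side of such a call. The `/gitaly.ExampleService/NewMethod` RPC and the address are hypothetical, used only to illustrate what happens when the client is newer than the server, and how a forward-compatible client could degrade gracefully instead of failing the request.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/emptypb"
)

func main() {
	// Connect to a Gitaly node; the address is illustrative.
	conn, err := grpc.NewClient("gitaly.internal:8075",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Invoke a hypothetical RPC that only exists in the newer Gitaly version.
	// When the Rails side ships first, the still-old server answers with
	// codes.Unimplemented, and today the request simply fails.
	err = conn.Invoke(ctx, "/gitaly.ExampleService/NewMethod",
		&emptypb.Empty{}, &emptypb.Empty{})

	switch status.Code(err) {
	case codes.OK:
		log.Println("new RPC succeeded: the server is already upgraded")
	case codes.Unimplemented:
		// A forward-compatible client would fall back to the old RPC (or a
		// degraded code path) here instead of surfacing an error to the user.
		log.Println("server does not know this RPC yet: falling back to the old call")
	default:
		log.Fatalf("unexpected error: %v", err)
	}
}
```

Today there is no such fallback for new RPCs on the client side, which is what makes the deployment order load-bearing.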
Current State Analysis
GitLab.com Production (VM-based)
- Current approach: Manual orchestration ensures Gitaly nodes are upgraded before Rails
- Critical constraint: Gitaly must roll out before Ruby nodes to maintain zero downtime
- Mechanism: The old Gitaly process is swapped in place with a new one that inherits the existing sockets
GitLab Dedicated (Kubernetes - In Development)
In the current workstreams, nothing has been done to address the rollout order requirement.
Current limitations identified:
- Pod restarts inherently cause downtime (no equivalent to VM graceful reload)
- Without Raft, Gitaly won't be HA, resulting in downtime during upgrades
- With a single Helm chart we have no control over the rollout order of Gitaly pods and GitLab Rails pods, which breaks the assumption that Gitaly is always deployed first.
Open questions:
- Have we measured the downtime induced by pod restarts compared to VM-based graceful reloads, especially during an application upgrade where the new Gitaly image is not yet present on the node pools?
GitLab.com Staging/Pre-production (Kubernetes)
- Approach: Factor Gitaly out from the main Helm chart reference
- Implementation: Enable Gitaly only in the first release deployment, then deploy the rest of GitLab in a subsequent release
- Result: Ensures Gitaly pods are deployed before Rails pods
Self-Managed Charts
Current status: No documented or publicly supported approach to zero-downtime upgrades (ZDU) in Cloud Hybrid
- The single-chart deployment model means components roll out independently; we cannot enforce the rollout order
- As an alternative, an orchestrator would be necessary (e.g. GET or an Operator), but we do not control how customers install our charts
Technical Challenges
1. Component Inter-dependencies and API Compatibility
Root issue: Lack of proper API contracts and compatibility across GitLab components
- Gitaly and Rails cannot tolerate version skew: even slight differences in the available API calls cause failures
- Version variances require careful orchestration instead of built-in compatibility
- Current gRPC patterns mandate Gitaly-before-Rails deployment order
2. Kubernetes StatefulSet Rollout Behavior
Questions requiring investigation:
- What is the rollout timeline for StatefulSet updates? (One way to observe this is sketched after this list.)
- Does it kill the old pod then start creating the new one (including image download)?
- How does this compare to VM-based graceful reload capabilities?
- Will we be able to make use of Gitaly on Kubernetes in our platforms before Raft HA is available?
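One way to answer the rollout-timeline and downtime questions empirically is to watch the StatefulSet status while an upgrade is in flight. Below is a minimal client-go sketch, assuming kubeconfig access and that Gitaly runs as a StatefulSet named `gitaly` in the `gitlab` namespace (both names are assumptions):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; inside a cluster, rest.InClusterConfig()
	// would be used instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Poll the Gitaly StatefulSet and log how many replicas are updated and
	// ready over time. Run this across an upgrade to measure the rollout
	// window and any interval in which a shard is not serving traffic.
	for {
		sts, err := client.AppsV1().StatefulSets("gitlab").Get(
			context.Background(), "gitaly", metav1.GetOptions{})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s strategy=%s replicas=%d updated=%d ready=%d\n",
			time.Now().Format(time.RFC3339),
			sts.Spec.UpdateStrategy.Type,
			sts.Status.Replicas,
			sts.Status.UpdatedReplicas,
			sts.Status.ReadyReplicas)
		time.Sleep(5 * time.Second)
	}
}
```

For context, the default RollingUpdate strategy replaces pods one at a time in reverse ordinal order, and the old pod is deleted before the replacement is scheduled and its image pulled (unless pre-pulled on the node); that is the gap the VM-based in-place socket handover avoids.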
Ideal State
1. Component Independence
Goal: Make each component deployable independently and out of order (in alignment with gitlab-com/gl-infra/delivery#21572)
- Requirement: Build backward-forward compatibility into GitLab/Gitaly (see the client-side sketch after this list)
- Benefit: Eliminates need for orchestration-based solutions
- Challenge: Requires significant investment in API contract standardization and validation
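As a rough illustration of what such compatibility could look like from the client side, the sketch below gates use of a newer RPC on the server's advertised version rather than on deployment order. The `gitalyServerVersion` helper and the minimum-version constant are assumptions made for this sketch, not existing GitLab code; an alternative is to skip the version check entirely and rely on an Unimplemented fallback, as shown earlier.

```go
package compat

import (
	"context"

	"golang.org/x/mod/semver"
	"google.golang.org/grpc"
)

// minVersionForNewRPC is a hypothetical threshold: the first Gitaly release
// that ships the new RPC this client wants to use.
const minVersionForNewRPC = "v16.11.0"

// gitalyServerVersion is assumed to exist for this sketch. In practice it
// would wrap whatever RPC exposes the server's version, and cache the result
// per connection so the check does not add a round trip to every call.
func gitalyServerVersion(ctx context.Context, conn *grpc.ClientConn) (string, error) {
	// ... query the connected Gitaly and return something like "v16.10.2"
	return "v16.10.2", nil
}

// UseNewRPC decides at call time whether the connected Gitaly is new enough.
// With this kind of gate, the same client build behaves correctly on both
// sides of an upgrade, regardless of which component rolled out first.
func UseNewRPC(ctx context.Context, conn *grpc.ClientConn) (bool, error) {
	version, err := gitalyServerVersion(ctx, conn)
	if err != nil {
		return false, err
	}
	return semver.Compare(version, minVersionForNewRPC) >= 0, nil
}
```

Either approach moves the deployment order constraint out of the orchestration layer and into the API contract itself.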
2. Orchestration-based Solutions (Interim solution)
- Use deployment orchestrators to handle application components that cannot tolerate API differences
- Challenge: Requires implementing the same concept in different tools: release-tools/deployer, k8s-workload (for each environment independently), GET or Instrumentor
Separate Helm releases:
- Factor Gitaly into independent chart
- Deploy separately before main GitLab components
The interim solution will still be valuable once we have full component deployment independence: in environments we closely control, having a chart for each component allows us to deploy components independently without waiting for the complete rollout of other components. Conversely, for our self-managed users, the benefit of a single Helm chart that upgrades the whole stack is paramount to avoid version drift between components of a release.
Questions for Gitaly Team
- API Maturity: What is the current status of backward-forward compatibility between Gitaly versions and Rails versions?
- Breaking Changes: What are the specific API incompatibilities that require the current deployment order constraint?
- Roadmap: Is there a plan to achieve API compatibility that would allow out-of-order deployments?
Impact Assessment
High Risk Scenarios
- Order violation: Deploying Rails before Gitaly could cause outages lasting the entire rollout window
- Dedicated deployment gaps: New Kubernetes-based approach may lose zero-downtime capabilities
Benefits of Resolution
- Operational simplicity: Eliminate complex orchestration requirements
- Deployment flexibility: Enable independent component scaling and updates
- Consistent experience: Align self-managed and SaaS deployment capabilities
Technical References
- Helmfile Configuration: GitLab.com Kubernetes Workloads
- GitLab Dedicated Blueprint: Zero Downtime Upgrades Architecture