Enable independent Gitaly deployments without Rails rollout order dependency
Executive Summary
GitLab's current deployment architecture requires Gitaly to be upgraded before Rails components to maintain zero downtime, creating significant operational complexity and limiting deployment flexibility. This constraint is particularly problematic for Kubernetes environments where independent component rollouts are the default behavior.
Business Impact:
- GitLab Dedicated: Risk of losing zero-downtime upgrade capabilities in production
- Self-Managed: Blocking progress toward offering zero-downtime upgrades on cloud native deployments
- Operational Overhead: Complex orchestration requirements across all deployment methods
Root Cause: Lack of backward-forward API compatibility between Gitaly and Rails components means deploying Rails before Gitaly results in failed gRPC calls and potential outages lasting the entire rollout window.
Proposed Solutions:
- Long-term: Build API compatibility to enable independent, out-of-order deployments
- Short-term: Implement orchestration-based solutions with separate Helm charts
Overview
This issue documents the technical requirements and challenges for deploying Gitaly independently from GitLab Rails applications in all environments, based on discussions about rollout order dependencies and zero downtime upgrade capabilities across different GitLab deployment scenarios (Slack link).
Problem Statement
Currently, GitLab production deployments require careful orchestration to ensure Gitaly is upgraded before Rails components. This dependency creates significant operational complexity and limits deployment flexibility, particularly in Kubernetes environments where independent component rollouts are the default behavior for software released as a single Helm chart.
Key Question: Can we remove the constraint that requires Gitaly to be deployed before Rails components?
As of today, if we introduce a new gRPC call in Gitaly and deploy the Rails components first, Rails will start invoking an RPC that does not yet exist on the server, and those requests will fail.
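To make the failure mode concrete, here is a minimal Go sketch of the client side of such a call. The `/gitaly.ExampleService/NewMethod` RPC and the address are hypothetical, used only to illustrate what happens when the client is newer than the server, and how a forward-compatible client could degrade gracefully instead of failing the request.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/emptypb"
)

func main() {
	// Connect to a Gitaly node; the address is illustrative.
	conn, err := grpc.NewClient("gitaly.internal:8075",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Invoke a hypothetical RPC that only exists in the newer Gitaly version.
	// When the Rails side ships first, the still-old server answers with
	// codes.Unimplemented, and today the request simply fails.
	err = conn.Invoke(ctx, "/gitaly.ExampleService/NewMethod",
		&emptypb.Empty{}, &emptypb.Empty{})

	switch status.Code(err) {
	case codes.OK:
		log.Println("new RPC succeeded: the server is already upgraded")
	case codes.Unimplemented:
		// A forward-compatible client would fall back to the old RPC (or a
		// degraded code path) here instead of surfacing an error to the user.
		log.Println("server does not know this RPC yet: falling back to the old call")
	default:
		log.Fatalf("unexpected error: %v", err)
	}
}
```

Today there is no such fallback for new RPCs on the client side, which is what makes the deployment order load-bearing.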
Current State Analysis
GitLab.com Production (VM-based)
- Current approach: Manual orchestration ensures Gitaly nodes are upgraded before Rails
- Critical constraint: Gitaly must roll out before Ruby nodes to maintain zero downtime
- Mechanism: The old Gitaly process is swapped in place with a new one that inherits the existing sockets
GitLab Dedicated (Kubernetes - In Development)
In the current workstreams, nothing has been done to address the rollout order requirement.
Current limitations identified:
- Pod restarts inherently cause downtime (no equivalent to VM graceful reload)
- Without Raft, Gitaly won't be HA, resulting in downtime during upgrades
- With a single Helm chart we have no control over the rollout order of Gitaly pods and GitLab Rails pods, which breaks the assumption that Gitaly is always deployed first.
Open questions:
- Have we measured the downtime induced by pod restarts compared to VM-based graceful reloads, especially during an application upgrade where the new Gitaly image is not yet present on the node pools?
GitLab.com Staging/Pre-production (Kubernetes)
- Approach: Factor Gitaly out from the main Helm chart reference
- Implementation: Enable Gitaly only in the first release deployment, then deploy the rest of GitLab in a subsequent release
- Result: Ensures Gitaly pods are deployed before Rails pods
Self-Managed Charts
Current status: No documented or publicly supported approach to zero-downtime upgrades (ZDU) in Cloud Hybrid
- The single-chart deployment model means components roll out independently; we cannot enforce the rollout order
- As an alternative, an orchestrator would be necessary (e.g. GET or an Operator), but we do not control how customers install our charts
Technical Challenges
1. Component Inter-dependencies and API Compatibility
Root issue: Lack of proper API contracts and compatibility across GitLab components
- Gitaly and Rails cannot tolerate version skew: even slight differences in the available API calls cause failures
- Version variances require careful orchestration instead of built-in compatibility
- Current gRPC patterns mandate Gitaly-before-Rails deployment order
2. Kubernetes StatefulSet Rollout Behavior
Questions requiring investigation:
- What is the rollout timeline for StatefulSet updates? (One way to observe this is sketched after this list.)
- Does it kill the old pod then start creating the new one (including image download)?
- How does this compare to VM-based graceful reload capabilities?
- Will we be able to make use of Gitaly on Kubernetes in our platforms before Raft HA is available?
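One way to answer the rollout-timeline and downtime questions empirically is to watch the StatefulSet status while an upgrade is in flight. Below is a minimal client-go sketch, assuming kubeconfig access and that Gitaly runs as a StatefulSet named `gitaly` in the `gitlab` namespace (both names are assumptions):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; inside a cluster, rest.InClusterConfig()
	// would be used instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Poll the Gitaly StatefulSet and log how many replicas are updated and
	// ready over time. Run this across an upgrade to measure the rollout
	// window and any interval in which a shard is not serving traffic.
	for {
		sts, err := client.AppsV1().StatefulSets("gitlab").Get(
			context.Background(), "gitaly", metav1.GetOptions{})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s strategy=%s replicas=%d updated=%d ready=%d\n",
			time.Now().Format(time.RFC3339),
			sts.Spec.UpdateStrategy.Type,
			sts.Status.Replicas,
			sts.Status.UpdatedReplicas,
			sts.Status.ReadyReplicas)
		time.Sleep(5 * time.Second)
	}
}
```

For context, the default RollingUpdate strategy replaces pods one at a time in reverse ordinal order, and the old pod is deleted before the replacement is scheduled and its image pulled (unless pre-pulled on the node); that is the gap the VM-based in-place socket handover avoids.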
Ideal State
1. Component Independence
Goal: Make each component deployable independently and out of order (in alignment with gitlab-com/gl-infra/delivery#21572)
- Requirement: Build backward-forward compatibility into GitLab/Gitaly (see the client-side sketch after this list)
- Benefit: Eliminates need for orchestration-based solutions
- Challenge: Requires significant investment in API contract standardization and validation
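As a rough illustration of what such compatibility could look like from the client side, the sketch below gates use of a newer RPC on the server's advertised version rather than on deployment order. The `gitalyServerVersion` helper and the minimum-version constant are assumptions made for this sketch, not existing GitLab code; an alternative is to skip the version check entirely and rely on an Unimplemented fallback, as shown earlier.

```go
package compat

import (
	"context"

	"golang.org/x/mod/semver"
	"google.golang.org/grpc"
)

// minVersionForNewRPC is a hypothetical threshold: the first Gitaly release
// that ships the new RPC this client wants to use.
const minVersionForNewRPC = "v16.11.0"

// gitalyServerVersion is assumed to exist for this sketch. In practice it
// would wrap whatever RPC exposes the server's version, and cache the result
// per connection so the check does not add a round trip to every call.
func gitalyServerVersion(ctx context.Context, conn *grpc.ClientConn) (string, error) {
	// ... query the connected Gitaly and return something like "v16.10.2"
	return "v16.10.2", nil
}

// UseNewRPC decides at call time whether the connected Gitaly is new enough.
// With this kind of gate, the same client build behaves correctly on both
// sides of an upgrade, regardless of which component rolled out first.
func UseNewRPC(ctx context.Context, conn *grpc.ClientConn) (bool, error) {
	version, err := gitalyServerVersion(ctx, conn)
	if err != nil {
		return false, err
	}
	return semver.Compare(version, minVersionForNewRPC) >= 0, nil
}
```

Either approach moves the deployment order constraint out of the orchestration layer and into the API contract itself.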
2. Orchestration-based Solutions (Interim solution)
- Use deployment orchestrators to handle application components that cannot tolerate API differences
- Challenge: Requires implementing the same concept in different tools: release-tools/deployer, k8s-workload (for each environment independently), GET or Instrumentor
Separate Helm releases:
- Factor Gitaly into independent chart
- Deploy separately before main GitLab components
The interim solution will still be valuable once we have full component deployment independence: in environments we closely control, having a chart for each component allows us to deploy components independently without waiting for the complete rollout of other components. Conversely, for our self-managed users, the benefit of a single Helm chart that upgrades the whole stack is paramount to avoid version drift between components of a release.
Questions for Gitaly Team
- API Maturity: What is the current status of backward-forward compatibility between Gitaly versions and Rails versions?
- Breaking Changes: What are the specific API incompatibilities that require the current deployment order constraint?
- Roadmap: Is there a plan to achieve API compatibility that would allow out-of-order deployments?
Impact Assessment
High Risk Scenarios
- Order violation: Deploying Rails before Gitaly could cause outages lasting the entire rollout window
- Dedicated deployment gaps: New Kubernetes-based approach may lose zero-downtime capabilities
Benefits of Resolution
- Operational simplicity: Eliminate complex orchestration requirements
- Deployment flexibility: Enable independent component scaling and updates
- Consistent experience: Align self-managed and SaaS deployment capabilities
Technical References
- Helmfile Configuration: GitLab.com Kubernetes Workloads
- GitLab Dedicated Blueprint: Zero Downtime Upgrades Architecture