Headcount reset: Accelerate Gitaly HA for GitLab.com availability and customer pathway off NFS
I propose that we deliver a Generally Available MVC of Gitaly HA, gitlab-org&842 (closed), in the 2020-03-22 release of GitLab (6 releases from now). It would include:
- eventual consistency (less than a minute replication delay)
- manual failover (simple admin interface)
This should be usable by:
- GitLab.com to improve availability (e.g. mitigate gitlab-com/gl-infra/production#1222 (closed))
- key customers to provide enough availability to eliminate NFS (https://gitlab.my.salesforce.com/00161000013aRjG, https://na34.salesforce.com/0016100000KvaIg etc)
- unblock AWS quickstart
- assist with scalability of self-managed GitLab instances
Gitaly HA is strategically important for winning and retaining large enterprise customers because GitLab administrators must meet internally set SLAs for tools like GitLab that are critical to engineer productivity or to making deployments.
The professional services team is also looking at bespoke HA workarounds to avoid NFS. We should accelerate Gitaly HA and avoid supporting custom workarounds. https://gitlab.com/gitlab-com/customer-success/professional-services-group/tools/proliferate/issues/67
Ask
3 engineers for 6 releases, starting in 12.5, to:
- parallelize work on failover and replication
- rapidly address feedback from the GitLab production team
- support an ambitious verification/migration timeline for GitLab.com
Problem
At the current capacity, in the most optimistic scenario, it will be the end of January 2020 by the time we have the minimum foundation to do replication and a manual failover of Gitaly HA in staging or production.
Migrating an entire Gitaly node, and eventually all Gitaly nodes, to an HA configuration is at least a 3-month project, and should be expected to generate a substantial number of production requests, bugs and urgent feedback. It is unlikely we have sufficient capacity to support a migration at this pace.
Best guess timeframes:
Scenario | HA Foundation ETA | HA Validation ETA |
---|---|---|
Current | Feb 2020 | Jul 2020 |
Current + 3 | Dec 2019 | Mar 2020 |
gitlab-org&842 (closed)
Stage 1: HA Foundation
The current priority is to implement, and enable for one project on GitLab.com, an alpha minimal workflow. This is being tracked by the following three epics:
- In development: Single node, no replication configuration gitlab-org&1877 (closed)
- Multi node, replication, no failover gitlab-org&289 (closed)
- Manual failover support gitlab-org&1185 (closed)
In order to reach General Availability by the 2020-03-22 release, we think that all three items above must be complete and enabled in Staging and Production for a Gitaly node containing only one project, likely gitlab-org/gitaly.
Concerns:
- anticipating feedback from production team that will need to be addressed
- assuming no feedback or unforeseen challenges, it seems more likely that this will actually take until the end of January. The pessimistic estimate is more like the end of February.
gitlab-org&2006 (closed)
Stage 2: HA Validation and Production Readiness
Before we can recommend this feature to large customers, we need to validate it at production scale and deploy it to GitLab.com.
- Enable Praefect in Rails application tests and Quality tests
- Simulate and test production-like replication load
- Verify behavior in 10k, 25k and 50k reference architectures
- Migrate one entire Gitaly node to HA configuration (200k projects)
- Migrate remaining Gitaly nodes to HA configuration (7m projects)
Concerns:
- much like testing Geo, we should expect feedback and need sufficient capacity to address it quickly
- need to build extensive performance testing tools to prevent regressions
- taking a conservative approach to even migrating a single Gitaly node could take longer than a month if we encounter problems