Draft: Deployments & Releases for Cells and Dedicated: Delivery PoV
DRAFT, FEEDBACK WELCOME
Purpose
The purpose of this issue is to share the Delivery Group's PoV on GitLab's current horizontal scaling direction and to facilitate discussion with other groups so that we align on strategy and implementation details over the next 6-18 months. Cells - Workstreams: Dependencies & timings is a key document describing the current Cells roadmap and high-level considerations. The Cell Blueprint contains further information about the technical roadmap.
Context
Cells is moving extremely fast (Cells: Deployment architecture (gitlab-org/gitlab!131657 - merged)). As the Delivery Group, we have built deep expertise in the packaging, rollout, deployment, and release of GitLab across technology stacks (VMs, K8s, multi-cluster) and across deployment targets (GitLab.com, staging environments, and release-specific environments/installations). In a world with many Dedicated tenants and many Cells, we expect to provide and manage the tooling that allows us to roll out new changes to GitLab in a predictable, controllable, efficient, and safe manner.
Deployments
Deployments to GitLab.com roll out an auto-deploy package of the latest changes on a near-automated basis. The deployment pipeline runs multiple times per day.
Post-deploy migrations (PDM) are run once per day from a separate pipeline. Because PDM are a blocking function that prevents auto-deploy packages from being rolled back, we improve our chance of rapid recovery by limiting how often PDM are executed.
Rollout, rollback, and failure resolution are manual activities that are carried out by Release Managers with support from the EOC, Quality, and Development.
Releases
Monthly and Patch releases consist of a package of all changes that have been successfully applied to GitLab.com since the previous release.
Security releases run from the Security mirror and include steps to merge and deploy all security fixes to GitLab.com as part of the release preparation.
Buckets of work
Business & Product Constraints
- What are the cost constraints?
- Blue-green deployments could double costs while making rollout easier and rollback possible. Is that an acceptable trade-off?
- What are the ideal compromises?
PDM
- Pre- and post-deployment migrations come with application dependencies. For example, pre-deployment migrations always run before the application rollout, and post-deploy migrations always run after it. What are the Cells database and application migration requirements and constraints?
- If Cells databases need to support post-deploy migrations, what could the application do to validate that the required migrations have been executed successfully before attempting to use the updated schema?
- Should PDM be an admin function?
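
One possible answer to the validation question above is a gate that compares the migration versions a release depends on with the versions recorded in the cell's database. The following is only a minimal sketch under stated assumptions: it assumes a Rails-style `schema_migrations` table with a `version` column, uses an in-memory SQLite database as a stand-in for a cell database, and the function names and version strings are hypothetical.

```python
import sqlite3


def applied_versions(conn):
    """Return the set of migration versions recorded in schema_migrations."""
    rows = conn.execute("SELECT version FROM schema_migrations").fetchall()
    return {row[0] for row in rows}


def missing_migrations(conn, required):
    """Return the required versions that have not been applied, sorted."""
    return sorted(set(required) - applied_versions(conn))


def can_use_new_schema(conn, required):
    """True only when every required migration has been recorded as applied."""
    return not missing_migrations(conn, required)


# Stand-in for a cell database (assumption: Rails-style bookkeeping table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_migrations (version TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO schema_migrations (version) VALUES (?)",
    [("20230901000000",), ("20230902000000",)],
)

# Hypothetical versions that a new code path depends on.
required = ["20230902000000", "20230903000000"]

if can_use_new_schema(conn, required):
    print("updated schema is safe to use")
else:
    print("still waiting for:", missing_migrations(conn, required))
```

A check like this only confirms that a migration was recorded, not that its effects are correct, so it would complement rather than replace the existing migration tooling.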
Release & Packaging
The existing auto-deploy package is a bleeding-edge package containing all the latest merged changes. At the same time, Dedicated is running the n-1 GitLab version. We will need to decide on the best package to roll out to Cells, considering:
- Package creation. Currently, the auto-deploy process includes package creation, validation, and rollout. This may not fit well with a fleet of Cells.
- Package stability. The existing auto-deploy rollout includes multiple test phases and a baking time on canary. We will need to consider how to map a similar validation process into a Cells rollout.
- Speed of mitigation/resolution. Current release processes include steps to apply fixes to GitLab.com before packaging. This gives us confidence in the fix and makes sure GitLab.com remains secure. We will need a similar process to roll out urgent fixes to Cells.
Deployment & Rollout
- Deployments must be forward- and backward-compatible - a rolling deployment means that multiple application versions can be running on a single cell at the same time
- A failing deployment or a failing cell health check should initiate an automatic rollback to the previous version
- A failure appearing across the fleet should trigger a rollback of the entire fleet. For example, if 30% of cell rollouts fail, we should roll back 100% of cells.
- Proposed rollout strategy - #12770 (comment 1578738365) key points:
  - Drop the use of Canary, use the cells fleet as a gradual rollout model
  - Today, it takes about 50 minutes to roll out a version upgrade with Dedicated in a 3k reference architecture (without Geo). We could break the fleet into tiers and propagate the version upgrades sequentially between tiers and in parallel inside a single tier
  - As confidence in a package grows, we can roll out to more cells simultaneously
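
The rollout rules in this list can be sketched as a small orchestration loop. This is an illustrative sketch, not a proposal for the actual tooling: the `upgrade` and `rollback` callables, the cell names, and the choice to measure the 30% threshold against the rollouts attempted so far are all assumptions. It also assumes that a cell upgrade takes roughly the 50 minutes quoted for Dedicated.

```python
from concurrent.futures import ThreadPoolExecutor


def rollout(tiers, upgrade, rollback, failure_threshold=0.3):
    """Upgrade tiers sequentially and the cells inside each tier in parallel.

    upgrade(cell) returns True on success; rollback(cell) restores the
    previous version. Returns a dict of cell -> status.
    """
    status = {cell: "pending" for tier in tiers for cell in tier}
    attempted = failed = 0
    for tier in tiers:
        if not tier:
            continue
        with ThreadPoolExecutor(max_workers=len(tier)) as pool:
            results = list(pool.map(upgrade, tier))
        for cell, ok in zip(tier, results):
            attempted += 1
            if ok:
                status[cell] = "upgraded"
            else:
                # A failing cell is rolled back on its own straight away.
                rollback(cell)
                status[cell] = "rolled_back"
                failed += 1
        if failed / attempted >= failure_threshold:
            # Fleet-wide failure: roll back every cell upgraded so far and
            # stop, so pending cells stay on the previous version.
            for cell, state in status.items():
                if state == "upgraded":
                    rollback(cell)
                    status[cell] = "rolled_back"
            break
    return status


def estimated_minutes(tiers, minutes_per_upgrade=50):
    """Tiers run one after another; cells inside a tier run in parallel."""
    return sum(minutes_per_upgrade for tier in tiers if tier)


# Hypothetical fleet: tier sizes grow as confidence in the package grows.
tiers = [["cell-1"], ["cell-2", "cell-3"], ["cell-4", "cell-5", "cell-6", "cell-7"]]
result = rollout(tiers, upgrade=lambda cell: cell != "cell-3", rollback=lambda cell: None)
print(result)
print(estimated_minutes(tiers), "minutes if every tier succeeds")
```

Under these assumptions, three tiers take about 150 minutes end to end, compared with about 350 minutes for upgrading the same seven cells one at a time.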