Stage 4: Change the order of deployment: gstg-cny, gprd-cny, gstg, gprd
Problem statement
Issue opened as a response to gitlab-org&6401 (comment 659429028)
We've recently added the gstg-cny environment into the deployment pipeline to support testing mixed-deployments using the gstg-cny and staging environments.
The intention is to simulate the gprd-cny and gprd setup by deploying new packages to gstg-cny and running backwards compatibility tests to validate against the gstg-cny environment and the staging environment.
Because production promotion is manual, not every package that deploys to the staging or canary environments is guaranteed to reach production. As a result, the new staging environment will test the compatibility of every package with the previous one, but we may have gaps between the versions running on gprd and gprd-cny.
As an example, if we consider the following timeline:
- production and staging are both running V1
- deploy V2 on gstg-cny
- test V1 - V2 compatibility
- deploy V2 to gprd-cny
- deploy V3 on gstg-cny
- test V2 - V3 compatibility
- deploy V3 to gprd-cny
At this point in time, we are running V1 in gprd and V3 in gprd-cny, and this pair of versions is untested on staging.
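The timeline above can be replayed in a small sketch to show where the untested pair appears. The environment names come from this issue; the function, its arguments, and the event list are hypothetical. It also assumes that gstg received V2 before V3 reached gstg-cny, which the "test V2 - V3" step implies but the timeline does not list explicitly.

```python
# Illustrative model only: not part of any real deployment tooling.

def untested_production_pairs(events, initial_version="V1"):
    """Replay deploy events and return the (gprd, gprd-cny) version pairs
    that ran in production without being tested as a (gstg, gstg-cny) pair."""
    envs = ("gstg", "gstg-cny", "gprd", "gprd-cny")
    running = {env: initial_version for env in envs}
    tested = set()
    untested = []
    for env, version in events:
        running[env] = version
        if env == "gstg-cny":
            # Mixed-deployment QA runs after each gstg-cny deploy.
            tested.add((running["gstg"], running["gstg-cny"]))
        elif env in ("gprd", "gprd-cny"):
            pair = (running["gprd"], running["gprd-cny"])
            mixed = pair[0] != pair[1]
            if mixed and pair not in tested and pair not in untested:
                untested.append(pair)
    return untested


# Assumed event order: the gstg deploy of V2 is inferred, see the note above.
timeline = [
    ("gstg-cny", "V2"),
    ("gstg", "V2"),
    ("gprd-cny", "V2"),
    ("gstg-cny", "V3"),
    ("gprd-cny", "V3"),
]
print(untested_production_pairs(timeline))
```

With this event order the only pair reported is gprd on V1 with gprd-cny on V3, which matches the gap described above.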
Proposed solution
After discussion, we've agreed to implement Option 4:
Option 4 - Tie the canaries together and move post-deployment migrations after package deployments.
gstg-cny is built out to be a full environment, and the traffic weight is changed to automatically route the majority of staging traffic to gstg-cny. We'll use the Staging environment as the dependable environment that maintains consistency with production. This will allow mixed-deployment tests to run against gstg-cny using staging.
Post-deployment migrations will move to the end of the pipeline to maintain database consistency through package deployments.
- deploy gstg-cny. Run full QA including mixed deployment tests using the gstg-cny + gstg environment
- deploy gprd-cny.
- Run full QA including mixed deployment tests using the gprd-cny + gprd environment
- baking time 30mins
- manual promotion
- deploy gprd and gstg almost in parallel to keep in sync. gstg will start deployment first to test for deployment issues
- run post-deployment migrations on gstg
- run full QA
- run post-deployment migrations on gprd
Full details have been mapped out in https://docs.google.com/presentation/d/1pj1vUI7EI1gBiKWjzRk6WEo39FXpnEvFNDqxgmqN0Yc/edit#slide=id.gf6d6720339_1_96
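As a rough sketch, the step order listed above can be written down as data and checked for the property this option relies on: post-deployment migrations run only after all package deployments. The stage labels, tuple layout, and helper function are hypothetical and do not reflect the actual pipeline configuration.

```python
# Stage labels paraphrase the steps listed above; they are illustrative only.
OPTION_4_PIPELINE = [
    ("deploy", "gstg-cny"),
    ("qa_mixed", "gstg-cny+gstg"),
    ("deploy", "gprd-cny"),
    ("qa_mixed", "gprd-cny+gprd"),
    ("bake", "30min"),
    ("manual_promotion", None),
    ("deploy", "gstg"),  # starts first to surface deployment issues
    ("deploy", "gprd"),  # runs almost in parallel with gstg
    ("post_deploy_migrations", "gstg"),
    ("qa_full", None),   # target environment is not stated in the list
    ("post_deploy_migrations", "gprd"),
]


def migrations_follow_deploys(pipeline):
    """Return True if every post-deployment migration step comes after
    the last package deploy step."""
    deploys = [i for i, (kind, _) in enumerate(pipeline) if kind == "deploy"]
    migrations = [
        i for i, (kind, _) in enumerate(pipeline)
        if kind == "post_deploy_migrations"
    ]
    if not deploys or not migrations:
        return True
    return min(migrations) > max(deploys)


print(migrations_follow_deploys(OPTION_4_PIPELINE))
```

A check like this could guard against a later reordering that moves a migration step ahead of a package deploy, which would make rolling back gstg or gprd harder.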
Pros:
- Solves the problem of needing to control package versions for mixed-deployment tests without needing additional tooling
- Existing tests and new mixed-deployment tests all add value
- Post-deployment migrations are moved to the end of the deployment pipeline. This keeps environments consistent and allows gstg and gprd to be rolled back if needed
Cons:
- Deployment pipelines need to be re-worked to change the deployment order
In this issue we consider approaches to improve the likelihood that the package versions tested on the staging environments are the same versions that are later deployed to production.
Alternative options considered
Option 1 - use rollbacks to keep staging in sync with production:
Keep gstg and gprd in sync using rollbacks. If a package doesn't reach production, we would need to check and roll back the staging environment before we can test from gstg-cny.
Pros:
- we bake rollbacks into our deployment pipeline
Cons:
- We need complicated magic to coordinate the deployments and rollbacks of several environments. Additional tooling is likely to be needed and the release manager would likely have extra work to keep track of the additional events required to manage this approach.
- Rollbacks can only be performed on packages that don't include post-deployment migrations. Currently, at least one package per day includes migrations, so this approach may be of limited use.
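To illustrate the rollback constraint, here is a minimal sketch of the decision a release manager or tool would have to make under Option 1. The `Package` type, the function name, and the returned labels are hypothetical. The sketch also simplifies by looking only at the single package currently on gstg rather than at every package deployed since the last production promotion.

```python
from dataclasses import dataclass


@dataclass
class Package:
    version: str
    has_post_deploy_migrations: bool


def staging_sync_action(gstg: Package, gprd: Package) -> str:
    """Decide how to bring gstg back in line with gprd before the next
    mixed-deployment test (simplified, single-package view)."""
    if gstg.version == gprd.version:
        return "in_sync"
    if gstg.has_post_deploy_migrations:
        # Rollback is not possible once post-deployment migrations have run.
        return "rollback_blocked"
    return "rollback_gstg"


print(staging_sync_action(Package("V2", False), Package("V1", False)))
print(staging_sync_action(Package("V2", True), Package("V1", False)))
```

The second call shows the limiting case: a package with post-deployment migrations that never reached production leaves staging out of sync with no rollback path.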
Option 2 - Re-order the environments to deploy to canaries before main staging & production fleets
We reorder the deployment pipeline so that we deploy to gstg-cny and then gprd-cny. We only deploy gstg after the manual approval to continue deploying through to production.
The new pipeline will look something like this:
- deploy gstg-cny
- run QA (mixed deployment)
- deploy gprd-cny
- run QA (mixed deployment - same versions we tested in gstg)
- baking time 1hr
- manual promotion
- deploy gstg
- run QA only on the main stage
- deploy gprd
Pros:
- solves the problem of needing to control package versions for mixed-deployment tests without needing additional tooling
Cons:
- Changes the natural order of environment progression
- We need everyone in engineering to adapt to the change - developers would shift to testing on gstg-cny, infra would test config changes on gstg-cny
- gstg-cny isn't yet a full environment. We would need to add in gitaly and praefect plus staging-like monitoring to support the extended testing
- Extends the pipeline duration by ~2hrs
- Staging QA could still fail and leave us with different package versions between staging and production
Option 3 - rename gstg-cny and consider it as a dedicated mixed-deployment test environment that is locked in sync with production.
Change the use of gstg-cny to lock it in sync with production. Mixed-deployment tests run as part of the staging deployment testing staging against gstg-cny.
We would need to work out exactly how to order post-deployment migrations around the mixed-deployment tests but I think it would be possible to make this work.
- deploy gstg
- run QA (mixed deployment using the gstg + gstg-cny environment)
- deploy gprd-cny
- run QA (mixed deployment - same versions we tested in gstg)
- baking time 1hr
- manual promotion
- deploy gprd and gstg-cny in parallel to keep in sync
Pros:
- solves the problem of needing to control package versions for mixed-deployment tests without needing additional tooling
- developers don't need to update their workflow, testing would continue on staging
- Removes the need to extend gstg-cny into a full environment with gitaly, praefect and extended monitoring
Cons:
- We need to rename gstg-cny to make it clear it isn't a canary but a mixed-deployment test environment
- Deployment pipelines need to be re-worked to change the deployment order
- This environment becomes specific to this use-case - this may be good or bad, it would also be the only environment that runs in this way which could be confusing
Visuals
Evidence of these three options solving the mixed-deployment test problem: https://docs.google.com/presentation/d/1pj1vUI7EI1gBiKWjzRk6WEo39FXpnEvFNDqxgmqN0Yc/edit?usp=sharing (small caveat - this is horribly complicated so please shout if you see mistakes in the logic)
