Check deployment strategies to determine if anything prevents us from upgrading in some areas
Summary
We run mixed production deployments: our canary fleet runs the latest production code before it is rolled out to the rest of the production fleet. Backward-incompatible changes can block deployments and cause serious production issues.
In general, in a multi-node setup two versions of the application can be running at any given time, so the code needs to be forward compatible: the version still running on the main fleet must tolerate data and requests produced by the newer version on canary.
The following incident and corrective action point to a case where we can't upgrade:
- Incident gitlab-com/gl-infra/production#3441 (closed)
- Corrective action #10546 (closed)
Other examples of incidents related to mixed deployment compatibility issues:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9176
- gitlab-com/gl-infra/production#3632 (comment 509626364)
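
As a rough illustration of the forward-compatibility requirement described above, here is a minimal sketch of a "tolerant reader". It is not taken from the GitLab codebase: the payload shape, the field names (`project_id`, `action`, `priority`), and the `parse_job` helper are all hypothetical. The idea is that a node running the older version ignores fields it does not know about and supplies defaults for fields it expects but does not receive, so payloads from either version can be processed during a mixed deployment.

```python
import json

# Hypothetical set of fields understood by the version running on this node.
KNOWN_FIELDS = {"project_id", "action"}


def parse_job(raw: str) -> dict:
    """Parse a job payload while tolerating version differences."""
    payload = json.loads(raw)
    # Keep only the fields this version understands; fields added by a
    # newer version are ignored instead of raising an error.
    job = {key: payload[key] for key in KNOWN_FIELDS if key in payload}
    # Supply a default for a field that another version may not send.
    job.setdefault("action", "noop")
    return job


# Payload as a newer (canary) node might produce it, with an extra field.
new_payload = json.dumps({"project_id": 1, "action": "sync", "priority": "high"})
# Payload as an older node might produce it, without the "action" field.
old_payload = json.dumps({"project_id": 1})

print(parse_job(new_payload))
print(parse_job(old_payload))
```

Whether this pattern applies depends on where the two versions meet (background job arguments, cached data, database columns, API payloads), which is what the audit below asks each team to assess.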
Proposal
We need to audit all teams to determine whether anything else falls into this category. This is an action item that we are tracking for the next engineering-wide retrospective.
Outcome
Responses were summarized in #11010 (comment 545261471) and a follow-up issue was opened to create a training issue template using the information gathered here.
Instructions
- Please see the summary and example incidents for more context about this issue. If you have other examples to add, it would be much appreciated!
- If your team is working on (or planning to work on) anything that may have backward compatibility issues with our mixed deployment strategy, please comment in the issue with more detail and answer the following questions as well:
  - What strategies does your team have to identify, prevent, and mitigate potential canary/main compatibility issues in production?
  - How can we better identify, prevent, and mitigate this type of problem as an organization?
- Add a ✅ in `Audit Complete?` for your team in the table below.
- Ping @nhxnguyen if you have any questions or feedback about this audit. Ideally, we will complete this before the 13.10 live retrospective. Thank you in advance!
| Team | Eng Manager | Audit Complete? |
|---|---|---|
| Configure | @nicholasklick | |
| Create:Editor | @rkuba | |
| Create:Source Code BE | @sean_carroll | |
| Create:Source Code, Code Review FE | @andr3 | |
| Create:Code Review BE | @m_gill | |
| Database | @craig-gomes | |
| Distribution | @mendeni | |
| Ecosystem BE | @mnohr | |
| Ecosystem FE | @leipert | |
| Global Search | @changzhengliu | |
| Growth:Activation, Adoption, Conversion, Expansion | @pcalder | |
| Fulfillment:License | @jameslopez | |
| Fulfillment:Purchase | @chris_baus | |
| Fulfillment:Purchase FE | @rhardarson | |
| Fulfillment:Utilization | @csouthard | |
| Geo | @nhxnguyen | |
| Gitaly | @zj-gitlab | |
| Manage:Access, Import BE | @lmcandrew | |
| Manage:Access, Compliance, Import FE | @dennis | |
| Manage:Optimize, Compliance BE | @djensen | |
| Manage:Optimize FE | @wortschi | |
| Memory | @craig-gomes | |
| Monitor | @crystalpoole | |
| Package | @dcroft | |
| Plan | @johnhope | |
| Product Intelligence | @jeromezng | |
| Release | @nicolewilliams | |
| Secure:Composition Analysis BE | @gonzoyumo | |
| Secure:Dynamic Analysis, Fuzz Testing BE | @sethgitlab | |
| Secure:Static Analysis BE | @twoodham | |
| Secure FE | @nmccorrison | |
| Threat Management BE | @thiagocsf | |
| Threat Management FE | @lkerr | |
| Verify:CI, Pipeline Authoring BE | @cheryl.li | |
| Verify:Pipeline Authoring, CI FE | @samdbeckham | |
| Verify:Runner | @erushton | |
| Verify:Testing | @rickywiens |