Proposal: Deploy at most one change per deployment
Summary
During production#7284 (closed), it took a long time to find the actual root cause. Part of the problem was that the deployment pipeline had upgraded multiple parts of the application stack at the same time: the upgrade included both changes to Gitaly itself and changes to the Gitaly configuration via Omnibus. Because changes in Gitaly itself are the most frequent cause of incidents in production, SREs naturally went down this path first and tried to revert the upgraded Gitaly version.
Speaking from my own experience, it is always hard to debug something when multiple parts of a system have changed at the same time. We might want to investigate whether it makes sense to adjust the deployment strategy to roll out at most one change per deployment, in order to keep the number of changes minimal and test each change in isolation. Chances are high that this would have significantly sped up finding the actual root cause in this incident.
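To make the idea concrete, here is a minimal sketch of the difference between a batched deployment and single-change deployments. It is not how the deployment pipeline is implemented; the `Change` type, the component labels, the version strings, and both helper functions are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    # Hypothetical labels and versions, used for illustration only.
    component: str
    from_version: str
    to_version: str


def plan_deployments(pending):
    """Split a batch of pending changes into deployments of one change each."""
    return [[change] for change in pending]


def suspects_after(plan, failed_index):
    """Return the changes shipped by the deployment after which a problem appeared."""
    return plan[failed_index]


pending = [
    Change("gitaly", "v1", "v2"),
    Change("omnibus-gitaly-config", "v1", "v2"),
]

# Batched: both changes ship in a single deployment.
batched = [pending]

# Isolated: each change ships in its own deployment.
isolated = plan_deployments(pending)
```

With the batched plan, a regression after the deployment leaves both changes as suspects. With the isolated plan, a regression that first shows up after the second deployment points at the configuration change alone.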
Related Incident(s)
Originating issue(s): production#7284 (closed)
Desired Outcome/Acceptance Criteria
It becomes easier to find the root cause of an incident when changes are introduced "atomically", that is, when only one thing changes at a time. Ultimately, this can help resolve incidents more quickly in a subset of cases.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'priority::4')