Improve visibility into changed Omnibus configuration
Summary
It took quite a long time to discover that the root cause of production#7284 (closed) was a configuration change in Omnibus. As SREs pointed out, this is mainly because changes in Omnibus don't frequently cause incidents like this, so nobody looked at those changes at first and only changes in Gitaly itself were inspected.
We should increase visibility into the Omnibus changes that are rolled out as part of a deployment, to give SREs a better indicator. This could be done, for example, by adding annotations whenever the Omnibus version changes. Ideally, it would also be possible to retrieve an exact diff of the configuration changes caused by the Omnibus version bump.
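As a rough illustration of the idea, the following sketch builds an annotation payload whenever the Omnibus version changes between two deployments and attaches a unified diff of the rendered configuration. It is only a sketch under stated assumptions: the function names, the payload fields, and the tags are hypothetical, the configuration snapshots are assumed to be available as plain text, and how the payload is sent to a dashboard or annotation system is left out entirely.

```python
import difflib
from typing import Optional


def config_diff(old_text: str, new_text: str,
                old_version: str, new_version: str) -> str:
    """Return a unified diff between two rendered configuration snapshots."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"config@{old_version}",
        tofile=f"config@{new_version}",
        lineterm="",
    )
    # An empty string means the two snapshots are identical.
    return "\n".join(diff_lines)


def annotation_for_deploy(old_version: str, new_version: str,
                          old_text: str, new_text: str) -> Optional[dict]:
    """Build a (hypothetical) annotation payload for an Omnibus version change.

    Returns None when the version did not change, so no annotation is emitted.
    """
    if old_version == new_version:
        return None
    diff = config_diff(old_text, new_text, old_version, new_version)
    return {
        # Field names and tags are assumptions made for this sketch.
        "text": f"Omnibus version changed: {old_version} -> {new_version}",
        "tags": ["deploy", "omnibus"],
        "config_changed": bool(diff),
        "diff": diff,
    }
```

Calling `annotation_for_deploy` with the versions and configuration snapshots of the previous and current deployment yields a payload that flags whether the configuration actually changed, which is the indicator SREs were missing during the incident.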
Related Incident(s)
Originating issue(s): production#7284 (closed)
Desired Outcome/Acceptance Criteria
SREs get better indicators that the root cause of an incident might have been changes in the Omnibus configuration.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose out of
- Give context for what problem this corrective action is trying to prevent from re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'priority::4')