Discussion issue - corrective action - technical safety in Puma version upgrades
Corrective action related to the 2021-01-19 incident - Elevated Error rates across the fleet.
The infrastructure team would like to discuss better safety mechanisms for changes like this that get rolled out to production. In this case, the new Puma version, running on a new Ruby version, introduced GC compaction, which created instability across the entire web fleet. The impact of the incident itself was relatively low this time. However, we categorized it as a near miss. Had this instability affected even 20% of the fleet at once, rather than just 1-2 web or API nodes at a time, GitLab.com would have been in a rapidly degrading or outage situation.
Following the initial incident review, we decided to open a corrective action to discuss questions such as:
- Can we roll this type of change out across a subset of web nodes and monitor for 2-5 days before further rolling out?
- Given that the Ruby version was rolled out with the omnibus, how can Infra help control this type of rollout?
- What other safeguards can we put in place for this kind of change?
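
As a starting point for the first question, here is a minimal sketch of what a staged rollout gate could look like. Everything in it is an assumption for discussion: the stage fractions, the error-ratio threshold, and the `next_step` function are hypothetical and do not describe existing GitLab tooling. Only the minimum soak time is taken from the 2-5 day window suggested above.

```python
# Hypothetical staged-rollout gate; all values are placeholders for discussion.
STAGES = (0.05, 0.20, 0.50, 1.00)  # assumed fractions of the web fleet per stage
MIN_SOAK_DAYS = 2                  # lower bound of the proposed 2-5 day window
MAX_ERROR_RATIO = 1.5              # assumed tolerated canary/baseline error ratio


def next_step(current_fraction, soak_days, canary_error_rate, baseline_error_rate):
    """Decide what to do with a rollout that currently covers `current_fraction`.

    Returns a tuple of (action, target_fraction), where action is one of
    "rollback", "hold", "expand", or "done".
    """
    # Roll back as soon as the upgraded nodes are clearly worse than the rest.
    if canary_error_rate > baseline_error_rate * MAX_ERROR_RATIO:
        return ("rollback", 0.0)
    # Otherwise keep the current subset until it has soaked long enough.
    if soak_days < MIN_SOAK_DAYS:
        return ("hold", current_fraction)
    # Move to the next larger stage, if there is one.
    later_stages = [s for s in STAGES if s > current_fraction]
    if not later_stages:
        return ("done", current_fraction)
    return ("expand", later_stages[0])


if __name__ == "__main__":
    print(next_step(0.05, 3, 0.010, 0.010))
    print(next_step(0.20, 3, 0.050, 0.010))
```

A real gate would presumably compare more than a single error rate (for example memory growth or worker restarts), but the shape of the decision would be similar.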
To start, adding @andrewn, @craig-gomes, @marin, and @amyphillips, but please add other team members as you see fit.
I'll aim to get this topic discussed at the 2021-02-16 incident review, even though the incident itself has already been reviewed.