Reliability Improvements Focus Epic
In the last month we have seen the number of alerts and incidents rise significantly. [insert graph here]

When alerts arrive at a high frequency, it becomes more challenging for the EOC to fully investigate each one, which leads to mental fatigue. This has multiple negative consequences, from poor quality of life for EOCs to potentially longer Times to Mitigation or Resolution, and the inability to fully investigate all alerts/incidents. [see comment](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15446#note_882045964)

This epic captures a Rapid Action style process to address the above concern, with ~6 SREs (not on call) in multiple time zones dedicated to this work for 1-2 weeks beginning on 2022-03-28. Those engineers will set aside project work for this time frame. While we understand that this may impact delivery timelines for those projects, the benefit of this work outweighs that risk. Some projects may be identified as too time sensitive, however, and engineers will not be pulled from those projects to work on this.

Lead: @cmcfarland

Engineers:

- @devin
- @nnelson
- @pguinoiseau
- @steveazz

Engineering Managers:

- @afappiano
- @amoter

Proposed areas of action:

- [x] Review production alerts from the last few weeks
- [ ] Remove alerts that are un-actionable
- [ ] Tune noisy alerts that we want to keep
- [ ] Review long-standing silences
- [ ] Ensure that all links in alerts are working and pointing to the correct places
- [ ] Otherwise improve our infrastructure/configuration in ways that help address the number of recent alerts
- [ ] Track alert metrics to confirm that the above actions are resulting in a lower number of overall alerts

[This spreadsheet](https://docs.google.com/spreadsheets/d/1w6bBYf8pMygF7iqDK32Jx-uKszpfnX37Hp9uRqJf4vw/edit?usp=sharing) captures the alerts for the past month and identifies the noisiest ones.
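As a sketch of what "tuning a noisy alert" could look like, assuming our alerts are defined as Prometheus-style rules (the alert name, threshold, and runbook path below are hypothetical, for illustration only):

```yaml
# Hypothetical example only: a noisy alert tuned rather than removed.
groups:
  - name: example-tuned-alerts
    rules:
      - alert: HighErrorRateExample   # hypothetical alert name
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        # Extending the `for:` duration lets short-lived spikes self-resolve
        # before anyone is paged, cutting un-actionable pages for the EOC.
        for: 15m
        labels:
          severity: s3
        annotations:
          # Part of the epic is verifying that links like this resolve.
          runbook: docs/example/high-error-rate.md
```

The same review would check each rule's `expr` threshold against recent history (e.g. the noisiest alerts in the spreadsheet) before deciding whether to tune, silence, or delete it.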