Improve incident management tooling
Several open issues for improving incident management still need to be finished:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5543
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6424
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5359
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5508 (was this ever used?)
- gitlab-com/runbooks#19 (moved)
We should also try to reduce the number of bots that need to be used, and make the ones we keep easy and consistent to use and test.
During the last incident, the /start-incident Slack command failed to create an incident issue. We should test our tooling regularly and consider holding regular incident trainings.
Edited by Henri Philipps