Improve Incident Management process when gitlab.com has availability issues
Problem to solve
When an incident impacts the availability of gitlab.com itself, the current Incident Management process does not give enough guidance and should be improved.
Proposal
Following the feedback from Incident Review for Site-wide Outage for GitLab... (production#15999 - closed), let's identify the gaps and how to close them. The term "availability issues" covers several situations that we need to distinguish, as they may call for different solutions:
- Problem: the incident issue can't be created or is not accessible.
  - Impact: the team can't communicate internally in written/async form during the incident, and we lose transparency because our users can't follow the incident management in detail (only the status page might be available).
  - Suggested solutions:
    - Update the handbook page with a dedicated process to follow.
      - We usually fall back to a Google Doc, but once the incident issue is eventually created we have to reconcile the content.
    - Host a dedicated public GitLab instance for incident handling.
      - We had held off on using ops and a public issue/project because we did not want heavy traffic on the instance we were using to recover.
    - Use a different system for incident updates that is linked from status.gitlab.com (IMHO this contradicts dogfooding the product).
      - Existing proposal in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15255#note_850562577
    - Dogfood the GitLab Status Page feature, as suggested in gitlab-com/www-gitlab-com#7012.
- Problem: the runbooks git repo is not accessible.
  - Impact: team members don't have access to the critical documentation and processes needed to manage and resolve the incident.
  - Suggested solutions:
    - The runbooks repo is mirrored on ops: https://ops.gitlab.net/gitlab-com/runbooks/-/tree/master
    - We should mention this backup location in the IM handbook and the IMOC onboarding issue.
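To illustrate the runbooks fallback, here is a minimal sketch of how a responder's tooling could pick whichever runbooks remote is currently reachable. It is not an existing GitLab script. The two clone URLs are assumptions derived from the project paths (the ops one from the mirror link above), and `git_reachable` and `first_reachable` are hypothetical helper names. The ops instance may also require authentication, which this sketch treats as "not reachable" rather than prompting for credentials.

```python
import os
import subprocess
from typing import Callable, Iterable, Optional

# Assumption: clone URLs inferred from the project paths. The primary is
# assumed to live on gitlab.com; the second is the ops mirror.
RUNBOOK_REMOTES = [
    "https://gitlab.com/gitlab-com/runbooks.git",
    "https://ops.gitlab.net/gitlab-com/runbooks.git",
]


def git_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if `git ls-remote` can read HEAD from the remote in time."""
    # Disable interactive credential prompts so an auth wall fails fast.
    env = {**os.environ, "GIT_TERMINAL_PROMPT": "0"}
    try:
        result = subprocess.run(
            ["git", "ls-remote", "--exit-code", url, "HEAD"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            env=env,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def first_reachable(
    urls: Iterable[str],
    probe: Callable[[str], bool] = git_reachable,
) -> Optional[str]:
    """Return the first URL the probe accepts, or None if none respond."""
    for url in urls:
        if probe(url):
            return url
    return None
```

During an incident, `first_reachable(RUNBOOK_REMOTES)` would return the gitlab.com URL when it responds and the ops mirror otherwise. Keeping an up-to-date local clone is still the safer option, since it works even when both remotes are down.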
Edited by Olivier Gonzalez