Improve Incident Management process when gitlab.com has availability issues

Problem to solve

When we face incidents that impact the availability of gitlab.com, the current Incident Management process does not provide enough guidance and should be improved.

Proposal

Following the feedback from Incident Review for Site-wide Outage for GitLab... (production#15999 - closed), let's identify gaps and how to close them. The "availability issues" term can cover several situations that we need to distinguish as they might call for different solutions:

  1. Problem: the incident issue can't be created or is not accessible
    • Impact: the team can't communicate internally in written/async form during the incident, and we lack transparency because our users can't follow the incident management in detail (only the status page might be available).
    • Suggested solutions:
      • We should update the handbook page with a dedicated process to follow.
      • We usually fall back to a Google Doc, though when the incident issue eventually gets created we have to reconcile the content.
      • Hosting a dedicated public GitLab instance for incident handling
        • We had held off on using ops and a public issue/project because we did not want heavy traffic on the instance we were using to recover.
      • A different system for incident updates that is linked from status.gitlab.com (imho contradicts dogfooding the product)
  2. Problem: the runbooks git repo is not accessible
    • Impact: team members don't have access to the critical documentation and processes needed to manage and resolve the incident.
    • Suggested solutions:
Edited by Olivier Gonzalez