Incident Management

GitLab's broad scope gives us a unique perspective (data) and set of capabilities for a number of problems, one of which is incident management.

  • We have a large quantity of inbound signals from metrics, logging, tracing, and error tracking. These allow us to quickly detect incidents and begin tracking them.
  • We also have an automation tool in GitLab CI, which can be used for a broad set of tasks, including deploying updated software and ChatOps.
  • Because we host the git repository, we know about the source code, and we frequently host the content of the runbooks themselves as well.
  • We're also working on integration with Jupyter notebooks, which could provide another interesting set of capabilities.

You could imagine a flow like the one below:

  1. One or more alerts are triggered from metrics/logging/error tracking/tracing
  2. An issue is automatically created to begin tracking work on the incident (https://gitlab.com/gitlab-org/gitlab-ee/issues/4925)
  3. Notify the proper channels, such as Slack or PagerDuty (https://gitlab.com/gitlab-org/gitlab-ee/issues/3627)
  4. Update service status page, optionally link to issue
  5. Search runbook repository for applicable runbooks
  6. Spin up the desired runbook (or a new one) in something like a Python notebook
  7. After incident is resolved, persist a copy of the runbook in the issue. If it's a new runbook, create an MR so it can be captured
  8. Notify proper channels of the resolution (https://gitlab.com/gitlab-org/gitlab-ee/issues/3627)
  9. Update service status page
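
The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a proposed implementation: the `Alert` and `Incident` types, the function names, and the keyword-based runbook index are all hypothetical, and each integration (issue creation, Slack/PagerDuty, status page) is reduced to a log entry rather than a real API call.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str   # e.g. "metrics", "logging", "tracing", "error tracking"
    service: str
    summary: str


@dataclass
class Incident:
    alert: Alert
    log: list = field(default_factory=list)       # ordered record of steps taken
    runbooks: list = field(default_factory=list)  # matching runbook paths
    resolved: bool = False


def find_runbooks(alert, runbook_index):
    """Step 5: return runbook paths with a keyword found in the alert text.

    runbook_index is a hypothetical mapping of runbook path -> keywords.
    """
    text = f"{alert.service} {alert.summary}".lower()
    return [path for path, keywords in runbook_index.items()
            if any(keyword.lower() in text for keyword in keywords)]


def open_incident(alert, runbook_index):
    """Steps 1-6: track the alert, notify, update status, pick a runbook."""
    incident = Incident(alert=alert)
    incident.log.append("issue created")
    incident.log.append("channels notified: incident opened")
    incident.log.append("status page updated: investigating")
    incident.runbooks = find_runbooks(alert, runbook_index)
    if incident.runbooks:
        incident.log.append("runbook opened: " + incident.runbooks[0])
    else:
        incident.log.append("new runbook started")
    return incident


def resolve_incident(incident):
    """Steps 7-9: persist the runbook, notify, and update the status page."""
    incident.log.append("runbook copy attached to issue")
    if not incident.runbooks:
        # No existing runbook matched, so the new one is proposed via an MR.
        incident.log.append("merge request opened for new runbook")
    incident.log.append("channels notified: incident resolved")
    incident.log.append("status page updated: resolved")
    incident.resolved = True
    return incident
```

For example, with a made-up index such as `{"runbooks/postgres.md": ["postgres", "database"]}`, an alert for the `postgres` service would match that runbook, while an alert for an unlisted service would fall through to the "new runbook" path and end with a merge request.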
Edited May 16, 2018 by silv