Incident Management
GitLab's broad scope gives us both the data and the capabilities to bring a unique perspective to a number of problems, one of which is incident management.
- We have a large quantity of inbound signals in metrics, logging, tracing, and error tracking. These allow us to quickly detect incidents and begin tracking them.
- We also have an automation tool in GitLab CI, which can be used for a broad set of tasks, including deploying updated software and ChatOps.
- As a Git repository host, we know about the source code, and we frequently host the content of the runbooks themselves as well.
- We're also working on integration with Jupyter notebooks, which could provide another interesting set of capabilities.
You could imagine a flow like the one below:
- One or more alerts are triggered from metrics/logging/error tracking/tracing
- An issue is automatically created to begin tracking work on the incident (https://gitlab.com/gitlab-org/gitlab-ee/issues/4925)
- Notify the proper channels like Slack/PagerDuty/etc. (https://gitlab.com/gitlab-org/gitlab-ee/issues/3627)
- Update service status page, optionally link to issue
- Search runbook repository for applicable runbooks
- Spin up the desired runbook (or a new one) in something like a Python notebook
- After the incident is resolved, persist a copy of the runbook in the issue. If it's a new runbook, create an MR so it can be captured
- Notify proper channels of the resolution (https://gitlab.com/gitlab-org/gitlab-ee/issues/3627)
- Update service status page
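
To make the flow above a bit more concrete, here is a minimal Python sketch of the steps as plain functions. It is only an illustration under stated assumptions: the `Alert` and `Incident` shapes, the function names, the log messages, and the naive path-based runbook search are all hypothetical, and nothing here calls GitLab, Slack, PagerDuty, or a status page. Each external step is just recorded in a log so the ordering is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert shape; field names are illustrative only."""
    source: str   # e.g. "metrics", "logging", "error tracking", "tracing"
    service: str
    summary: str


@dataclass
class Incident:
    """Hypothetical in-memory record of one incident."""
    alert: Alert
    status: str = "open"
    runbooks: list = field(default_factory=list)
    log: list = field(default_factory=list)  # ordered record of steps taken


def issue_payload(alert):
    """Build an illustrative title/description/labels dict for the tracking issue."""
    return {
        "title": f"Incident: {alert.service}: {alert.summary}",
        "description": f"Triggered by {alert.source} alert.",
        "labels": "incident",
    }


def find_runbooks(runbook_paths, alert):
    """Naive search: keep runbooks whose path mentions the affected service."""
    needle = alert.service.lower()
    return [path for path in runbook_paths if needle in path.lower()]


def open_incident(alert, runbook_paths):
    """Steps from alert to runbook lookup; external calls are only logged."""
    incident = Incident(alert=alert)
    incident.log.append("issue created: " + issue_payload(alert)["title"])
    incident.log.append("channels notified: incident opened")
    incident.log.append("status page updated: degraded")
    incident.runbooks = find_runbooks(runbook_paths, alert)
    incident.log.append(f"runbooks found: {len(incident.runbooks)}")
    return incident


def resolve_incident(incident, new_runbook=False):
    """Steps after resolution; a new runbook additionally gets a merge request."""
    incident.status = "resolved"
    incident.log.append("runbook copy attached to issue")
    if new_runbook:
        incident.log.append("merge request opened for new runbook")
    incident.log.append("channels notified: incident resolved")
    incident.log.append("status page updated: operational")
    return incident


# Example run with made-up data.
alert = Alert(source="metrics", service="redis", summary="high memory usage")
incident = open_incident(
    alert, ["runbooks/redis-memory.md", "runbooks/postgres-failover.md"]
)
resolve_incident(incident, new_runbook=not incident.runbooks)
```

In a real implementation each logged step would be a call to the relevant integration, and the runbook search would likely be smarter than a substring match on file paths.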