Alert and Incident Automation UX Research
What’s this issue all about?
We have two closely related problem validation efforts currently running: Alerting Problem Validation and Incident Automation Problem Validation.
We want to better understand how users manage and fine-tune their alerts so that they are only alerted when something actually matters. We also want to understand what users do after they have been alerted to an incident, so that we can determine how much of that process GitLab could automate.
Currently, users have the option to automatically create incidents (i.e. issues in GitLab) when GitLab receives an alert. Once GitLab has received the alert, users can select an issue template to customize what their incidents look like; the template is static markdown. An incident in GitLab is therefore composed of the alert payload plus the hard-coded content of the selected issue template. After a user has been paged, they first look at the incident to start understanding what happened. From there, they typically move into other tools to begin their initial investigation. This process can be arduous and confusing. We have the opportunity to automate much of it and present the user with helpful, relevant information directly in the incident.
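To make that composition concrete, below is a minimal sketch of how an incident description could be assembled from an alert payload and a static issue template. This is an illustration only, not GitLab's actual implementation; the payload fields, template text, and function name are all hypothetical.

```python
# Sketch: compose an incident description from an alert payload plus a
# static markdown issue template. Illustration only; the payload shape
# and template below are hypothetical, not GitLab's actual code.

ALERT_PAYLOAD = {
    "title": "High error rate on web fleet",
    "severity": "critical",
    "monitoring_tool": "Prometheus",
    "starts_at": "2020-02-20T14:05:00Z",
}

# A static markdown template, standing in for the selected issue template.
ISSUE_TEMPLATE = """\
## Impact

_Describe who is affected._

## Timeline

_Record key events here._
"""

def compose_incident(payload: dict, template: str) -> str:
    """Render the alert payload as markdown and append the static template."""
    payload_section = "\n".join(f"**{key}:** {value}" for key, value in payload.items())
    return f"## Alert details\n\n{payload_section}\n\n{template}"

if __name__ == "__main__":
    print(compose_incident(ALERT_PAYLOAD, ISSUE_TEMPLATE))
```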
In terms of how alerts currently link to GitLab issues: as we've built it today, every alert creates an incident issue. Our internal teams aren't using this functionality because not every alert should become an issue. If we had a way for users to view and fine-tune alerts within GitLab, we could potentially improve the process by which alerts are turned into issues, perhaps through automation, or at least by keeping all the relevant information within a single tool.
Who is the target user of the feature?
- Primary persona: Devon the DevOps Engineer
- Secondary persona: Sasha the Software Developer
What questions are you trying to answer?
- How are they currently getting alerts?
- Are they doing any fine-tuning of their alerts?
- What information is important to users during a firefight?
- What information is important when trying to figure out what has gone wrong?
Core questions
- How are you alerted?
- After you've received an alert, what do you do?
- What information is most important to you when figuring out where you need to start your investigation?
- What do your incidents contain today that you find helpful?
- When investigating incidents and searching through metrics, logs, and traces, how do you filter your searches? (This will help us understand whether we could include links to pre-filtered searches of metrics, logs, and traces; see the sketch after this list.)
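To illustrate the kind of automation the last question points at, here is a small sketch of building a pre-filtered log-search link that an incident could embed, so a responder lands directly on the relevant data. The base URL and query parameters are invented for illustration and do not correspond to any specific tool's API.

```python
from urllib.parse import urlencode

def prefiltered_log_search(base_url: str, pod: str, since: str) -> str:
    """Build a log-search URL pre-scoped to one pod and a start time."""
    # Hypothetical query parameters; real tools define their own schemes.
    query = urlencode({"pod_name": pod, "start": since})
    return f"{base_url}?{query}"

# Example: a link an incident could embed for one-click investigation.
print(prefiltered_log_search(
    "https://logs.example.com/search",
    pod="web-7d9f",
    since="2020-02-20T14:00:00Z",
))
```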
What hypotheses and/or assumptions do you have?
- The incident contains too much information
- The majority of users look for the service or tool that triggered the incident and immediately navigate to that tool for investigation
- Incident issues do not contain enough relevant context
- Collecting pertinent info and adding it to the incident is a manual process and often not done
- Determining correct thresholds is ambiguous and iterative
- Configuring alerts for multiple tools in a monitoring stack is hard and particularly stressful during an incident
- Alerts should have a natural link to other parts of the DevOps workflow, e.g. creating issues for bugs, providing input to an incident, or pinpointing the code that led to a failed deployment
What decisions will you make based on the research findings?
Based on the research findings, we will prioritize the next most important improvements we can make to the beginning of the Triage workflow, specifically what happens after a user is paged about an incident.
When do you need this research to be completed? (Milestone or date)
End of 12.9