Record observations from SRE shadowing as consumable issues or content
Below are the notes I took while shadowing an SRE during incidents. I want to turn these into a few issues & share what I've learned with the team where applicable. This issue serves as a place to track that work and keep me from forgetting about it.
## SRE On-Call Observations
#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1624
*Problem:* The GitLab registry is down.
*Cause:* A dry run wasn’t actually a dry run; an MR impacted the live environment.
*Fix:* Redeploy the config from the latest master.
*Validation:* Alerts resolved; pulled an image successfully.
Things that were harder than they needed to be:
- Access to the current incident management process - what are our current procedures for communication & investigations?
- What _should_ this metric/data look like on a normal day?
- Access to the pods & services - do you have access to the servers and services you need for this piece of the architecture?
- Out-of-date documentation
- Debugging through the layers of logs/charts/etc. is like a puzzle. Knowing when to jump to the next service, what that service is, and what and where to look is something that currently requires experience.
- Knowing who has the context is invaluable.
#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1630
*Problem:* CPU usage on a single server was higher than normal.
*Cause:* Unknown.
*Fix:* Wait.
*Validation:* Observation.
Things that were harder than they needed to be:
- This incident occurred over hand-off, so it was hard to know what action was required, if any
- When you aren’t starting from a metric or alert, getting to the metrics/data you need is hard
Bonus bite: The command-line tool `htop` was more helpful than `top` for observing the CPU usage breakdown.
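For context on that tip, here's a minimal sketch, assuming the third-party `psutil` package (not something we used during the incident), of the per-core breakdown that `htop` surfaces by default and plain `top` hides behind the `1` toggle:

```python
# Rough sketch (assumes the third-party psutil package is installed)
# approximating the per-core view htop shows by default.
import psutil

# One utilization percentage per logical core, sampled over one second.
# Plain `top` aggregates these into a single figure unless toggled with `1`.
for core, pct in enumerate(psutil.cpu_percent(interval=1, percpu=True)):
    print(f"core {core:2d}: {pct:5.1f}%")
```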
#### General Takeaways
These are the questions I want to be able to answer in the initial response to an incident:
- Where can I look for information?
- Which information am I looking for?
- What impacts this metric?
- Do I have access to this (service, server, etc.)?
- Who is a subject matter expert that can help?
Some ideas on addressing these questions:
- Can we associate error file paths with recent authors (see the sketch after this list)? Can metrics be given labels? Could we then view recently merged MRs with those labels?
- Can we link runbooks in the metric dropdown on a chart?
  - Let users link a URL
  - Give the option to create a new wiki page for the project from the UI
- Can we display the alert in the UI next to the metric which triggered it?
- Can we potentially point people to domain experts from an incident/alert/error?
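For the recent-authors idea above, a hypothetical sketch of mapping an error's file path to its recent committers by shelling out to `git log`; the helper name and example path are made up for illustration:

```python
# Hypothetical sketch: given a file path pulled from an error or stack
# trace, list the most recent commit authors for it via `git log`.
import subprocess

def recent_authors(path: str, limit: int = 5) -> list[str]:
    # --format="%an <%ae>" prints "Author Name <email>" per commit;
    # "--" guards against the path being parsed as a revision.
    result = subprocess.run(
        ["git", "log", f"-{limit}", "--format=%an <%ae>", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()

# The path below is illustrative only; run from inside a git checkout.
print(recent_authors("app/models/project.rb"))
```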
Some sticking points/observations outside of incidents:
- Confusion and false positives can stem from hard-to-distinguish colors on charts
- Alerting, queries, and metric definitions are very much a part of the development cycle. Ideally, they are configured as new work is added, not reactively.
- I don’t think there’s currently a good visual representation of “what alerts are firing right now?” from the metrics dashboard (sketch below)
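On that last point, a minimal sketch of answering “what alerts are firing right now?” programmatically, assuming a reachable Prometheus server (the URL is a placeholder, not a real endpoint) and the `requests` package:

```python
# Sketch: list currently firing alerts via Prometheus's HTTP API.
# The server URL below is a placeholder.
import requests

resp = requests.get("http://prometheus.example.internal:9090/api/v1/alerts")
resp.raise_for_status()

# /api/v1/alerts returns both "pending" and "firing" alerts; keep the latter.
for alert in resp.json()["data"]["alerts"]:
    if alert["state"] == "firing":
        print(alert["labels"].get("alertname"), "firing since", alert.get("activeAt"))
```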