Record observations from SRE shadow as consumable issues or content
Below are the notes I took while shadowing an SRE during incidents. I want to turn this into a few issues & share the things I've learned with the team where applicable. This issue serves as a place to track that work and keep me from forgetting about it.
## SRE On-Call Observations

#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1624

*Problem:* GitLab registry is down.
*Cause:* A dry-run wasn't actually a dry-run; the MR impacted the live environment.
*Fix:* Redeploy the config from the last master.
*Validation:* Alerts resolved, pull an image.

Things that were harder than they needed to be:

- Access to the current incident management process - what are our current procedures for communication & investigations?
- What _should_ this metric/data look like on a normal day? (one approach is sketched below)
- Access to the pods & services - do you have access to the servers and services you need for this piece of the architecture?
- Out-of-date documentation
- Debugging through the layers of logs/charts/etc. is like a puzzle. Knowing when to jump to the next service, what that service is, and what & where to look is something that currently requires experience.
- Knowing who has the context is invaluable.

#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1630

*Problem:* CPU usage on a single server was higher than normal.
*Cause:* Unknown.
*Fix:* Wait.
*Validation:* Observation.

Things that were harder than they needed to be:

- This incident occurred over hand-off, so it was hard to know what action was required, if any
- When you aren't starting from a metric or alert, getting to the metrics/data you need is hard

Bonus bite: The command line tool `htop` was more helpful than `top` for observing the CPU usage breakdown (see the `htop` sketch below).

#### General Takeaways

These are the questions I want to be able to answer in the initial response to an incident:

- Where can I look for information?
- Which information am I looking for?
- What impacts this metric?
- Do I have access to this? (service, server, etc.)
- Who is a subject matter expert that can help?

Some ideas on addressing these questions:

- Can we associate error file paths with recent authors? Can metrics be given labels? Then can we view recently merged MRs with those labels? (a rough sketch of the first step is below)
- Can we link runbooks in the metric dropdown on a chart?
  - let users link a URL
  - give the option to create a new page in the project wiki from the UI
- Can we display the alert in the UI next to the metric which triggered it?
- Can we point people to domain experts from an incident/alert/error?

Some sticking points/observations outside of incidents:

- Confusion and false positives can stem from color confusion on charts
- Alerting, queries, and metric definitions are very much a part of the development cycle. Ideally, they are configured as new work is added, not reactively. (sketched below)
- I don't think there's currently a good visual representation of "what alerts are firing right now?" from the metrics dashboard (sketched below)
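#### Rough sketches

A few of the points above lend themselves to quick command-line sketches. These are illustrative only and haven't been vetted against our actual setup.

For "what _should_ this metric look like on a normal day?", one hedged approach (assuming a Prometheus backend) is to let the metric answer for itself: compare the live rate of a series against the same series a week earlier using PromQL's `offset` modifier. The host and metric name here are placeholders, not our real ones.

```shell
# Ratio of the current request rate to the same window one week ago;
# values far from 1 suggest "not a normal day". Host and metric are
# illustrative.
curl -s 'http://prometheus.example.com:9090/api/v1/query' \
  --data-urlencode 'query=rate(registry_http_requests_total[5m]) / rate(registry_http_requests_total[5m] offset 1w)'
```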
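On the `htop` vs `top` bonus bite, the difference during the CPU incident was the breakdown view; a non-interactive option is worth knowing for hand-offs, too.

```shell
# htop shows a per-core CPU meter and a sortable process tree by default;
# plain top hides the per-CPU lines until you press `1`.
htop

# Non-interactive alternative for capturing the per-CPU picture in an
# incident log (assumes the sysstat package is installed):
# sample all CPUs once per second, five times.
mpstat -P ALL 1 5
```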
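For "associate error file paths with recent authors", a minimal first step needs no new tooling: given a file path from a stack trace, git already knows who touched it recently. The path below is hypothetical.

```shell
# List the distinct authors of the last five commits that touched the
# file implicated by an error (path is illustrative).
git log -n 5 --format='%an <%ae>' -- lib/registry/storage.rb | sort -u
```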
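For treating alert definitions as part of the development cycle, one sketch is to keep the rule file in the repo next to the feature it covers and validate it in CI with `promtool`. The rule, job label, and thresholds below are made up for illustration.

```shell
# Write an example Prometheus alerting rule (contents are illustrative)...
cat > registry_alerts.yml <<'EOF'
groups:
  - name: registry
    rules:
      - alert: RegistryDown
        expr: up{job="registry"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container registry has been unreachable for 5 minutes"
EOF

# ...and lint it, the way a CI job could on every MR that adds a metric.
promtool check rules registry_alerts.yml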
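On "what alerts are firing right now?": while the dashboard may lack a good view, Prometheus does expose this directly, both as the built-in `ALERTS{alertstate="firing"}` series and via its HTTP API. The host below is a placeholder; `jq` just trims the response.

```shell
# Ask Prometheus for all active alerts and reduce the output to
# alert name and state (firing vs pending).
curl -s http://prometheus.example.com:9090/api/v1/alerts \
  | jq '.data.alerts[] | {name: .labels.alertname, state: .state}'
```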