Record observations from SRE shadowing as consumable issues or content
Below are the notes I took while shadowing an SRE during incidents. I want to turn these into a few issues & share what I've learned with the team where applicable. This issue serves as a place to track that work and keep me from forgetting about it.
## SRE On-Call Observations
#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1624
*Problem:* The GitLab registry is down.
*Cause:* A dry run wasn’t actually a dry run; an MR impacted the live environment.
*Fix:* Redeploy the config from the latest master.
*Validation:* Alerts resolved; pulled an image successfully.
Things that were harder than they needed to be:
- Access to the current incident management process - what are our current procedures for communication & investigations?
- What _should_ this metric/data look like on a normal day?
- Access to the pods & services - do you have access to the servers and services you need for this piece of the architecture?
- Out-of-date documentation
- Debugging through the layers of logs/charts/etc. is like a puzzle. Knowing when to jump to the next service, what that service is, and what and where to look is something that currently requires experience.
- Knowing who has the context is invaluable.
#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1630
*Problem:* CPU usage on a single server was higher than normal.
*Cause:* Unknown.
*Fix:* Wait.
*Validation:* Observation.
Things that were harder than they needed to be:
- This incident occurred over hand-off, so it was hard to know what action was required, if any
- When you aren’t starting from a metric or alert, getting to the metrics/data you need is hard
Bonus bite: The command-line tool `htop` was more helpful than `top` for observing the CPU usage breakdown.
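For context on that tip, here's a minimal sketch, assuming the third-party `psutil` package (not something we used during the incident), of the per-core breakdown that `htop` surfaces by default and plain `top` hides behind the `1` toggle:

```python
# Rough sketch (assumes the third-party psutil package is installed)
# approximating the per-core view htop shows by default.
import psutil

# One utilization percentage per logical core, sampled over one second.
# Plain `top` aggregates these into a single figure unless toggled with `1`.
for core, pct in enumerate(psutil.cpu_percent(interval=1, percpu=True)):
    print(f"core {core:2d}: {pct:5.1f}%")
```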
#### General Takeaways
These are the questions I want to be able to answer in the initial response to an incident:
- Where can I look for information?
- Which information am I looking for?
- What impacts this metric?
- Do I have access to this (service, server, etc.)?
- Who is a subject matter expert that can help?
Some ideas on addressing these questions:
- Can we associate error file paths with recent authors (see the sketch after this list)? Can metrics be given labels? Could we then view recently merged MRs with those labels?
- Can we link runbooks in the metric dropdown on a chart?
  - Let users link a URL
  - Give the option to create a new wiki page for the project from the UI
- Can we display the alert in the UI next to the metric which triggered it?
- Can we potentially point people to domain experts from an incident/alert/error?
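For the recent-authors idea above, a hypothetical sketch of mapping an error's file path to its recent committers by shelling out to `git log`; the helper name and example path are made up for illustration:

```python
# Hypothetical sketch: given a file path pulled from an error or stack
# trace, list the most recent commit authors for it via `git log`.
import subprocess

def recent_authors(path: str, limit: int = 5) -> list[str]:
    # --format="%an <%ae>" prints "Author Name <email>" per commit;
    # "--" guards against the path being parsed as a revision.
    result = subprocess.run(
        ["git", "log", f"-{limit}", "--format=%an <%ae>", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()

# The path below is illustrative only; run from inside a git checkout.
print(recent_authors("app/models/project.rb"))
```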
Some sticking points/observations outside of incidents:
- Confusion and false positives can stem from hard-to-distinguish colors on charts
- Alerting, queries, and metric definitions are very much a part of the development cycle. Ideally, they are configured as new work is added, not reactively.
- I don’t think there’s currently a good visual representation of “what alerts are firing right now?” from the metrics dashboard (sketch below)
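On that last point, a minimal sketch of answering “what alerts are firing right now?” programmatically, assuming a reachable Prometheus server (the URL is a placeholder, not a real endpoint) and the `requests` package:

```python
# Sketch: list currently firing alerts via Prometheus's HTTP API.
# The server URL below is a placeholder.
import requests

resp = requests.get("http://prometheus.example.internal:9090/api/v1/alerts")
resp.raise_for_status()

# /api/v1/alerts returns both "pending" and "firing" alerts; keep the latter.
for alert in resp.json()["data"]["alerts"]:
    if alert["state"] == "firing":
        print(alert["labels"].get("alertname"), "firing since", alert.get("activeAt"))
```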