# Record observations from SRE shadow as consumable issues or content

Created Feb 22, 2020 by Sarah Yasonik (@syasonik), Maintainer

Below are the notes I took while shadowing an SRE during incidents. I want to turn this into a few issues & share the things I've learned with the team where applicable. This issue serves as a place to keep track of that work and keep me from forgetting about it.

## SRE On-Call Observations

#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1624

*Problem:* The GitLab registry is down.
*Cause:* A dry run wasn't actually a dry run; the MR impacted the live environment.
*Fix:* Redeploy the config from the latest master.
*Validation:* Alerts resolved; pull an image.
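
(For reference, a minimal sketch of scripting that validation step in Python. The image path is a placeholder, not the one used during the incident; the real check was simply pulling a known image by hand.)

```python
import subprocess

# Hypothetical smoke test: pull a known image from the GitLab registry.
# The image path below is a placeholder for illustration only.
IMAGE = "registry.gitlab.com/some-group/some-project:latest"

result = subprocess.run(
    ["docker", "pull", IMAGE],
    capture_output=True,
    text=True,
)

if result.returncode == 0:
    print(f"Registry OK: pulled {IMAGE}")
else:
    print(f"Registry still unhealthy:\n{result.stderr}")
```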

Things that were harder than they needed to be:
- Access to the current incident management process - what are our current procedures for communication & investigations?
- What _should_ this metric/data look like on a normal day?
- Access to the pods & services - do you have access to the servers and services you need for this piece of the architecture?
- Out-of-date documentation
- Debugging through the layers of logs, charts, etc. is like a puzzle. Knowing when to jump to the next service, what that service is, and what and where to look is something that currently requires experience.
- Knowing who has the context is invaluable. 

#### Incident! - https://gitlab.com/gitlab-com/gl-infra/production/issues/1630

*Problem:* CPU usage on a single server was higher than normal.
*Cause:* Unknown.
*Fix:* Wait.
*Validation:* Observation.

Things that were harder than they needed to be:
- This incident occurred over hand-off, so it was hard to know what action was required, if any.
- When you aren't starting from a metric or alert, getting to the metrics/data you need is hard.

Bonus bite: The command-line tool `htop` was more helpful than `top` for observing the CPU usage breakdown.
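
(A minimal sketch of the per-core breakdown that makes `htop` nicer than plain `top` here, assuming the third-party `psutil` package is installed:)

```python
import psutil

# Sample per-core CPU usage over a one-second window, similar to the
# per-core meters htop shows at the top of its screen.
per_core = psutil.cpu_percent(interval=1, percpu=True)

for core, usage in enumerate(per_core):
    bar = "#" * int(usage / 5)  # crude 20-character usage bar
    print(f"core {core:2d}: {usage:5.1f}% {bar}")
```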

#### General Takeaways

These are the questions I want to be able to answer in the initial response to an incident:
- Where can I look for information?
- Which information am I looking for?
- What impacts this metric?
- Do I have access to this (service, server, etc.)?
- Who is a subject matter expert that can help?

Some ideas on addressing these questions:
- Can we associate error file paths with recent authors? Can metrics be given labels? Then can we view recently merged MRs with those labels? (A rough sketch of the first part follows this list.)
- Can we link runbooks in the metric dropdown on a chart?
	- Let users link a URL
	- Give the option to create a new page in the project wiki from the UI
- Can we display the alert in the UI next to the metric which triggered it?
- Can we potentially point people to domain experts from an incident/alert/error?
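
As a rough illustration of the first idea above, associating an error's file path with its recent authors can be approximated with plain `git log`. This is only a sketch: the repo path and file path are placeholders, and a real implementation would need to map the error's stack frame to a path inside the repository.

```python
import subprocess

def recent_authors(repo_dir: str, file_path: str, limit: int = 5) -> list[str]:
    """Return the most recent distinct authors who touched file_path."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", f"-n{limit}",
         "--format=%an <%ae>", "--", file_path],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # Preserve order while dropping duplicate authors.
    return list(dict.fromkeys(line for line in out.splitlines() if line))

# Placeholder paths for illustration only.
print(recent_authors("/path/to/checkout", "app/models/ci/pipeline.rb"))
```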

Some sticking points/observations outside of incidents:
- Confusion and false positives can stem from hard-to-distinguish colors on charts.
- Alerting, queries, and metric definitions are very much a part of the development cycle. Ideally, they are configured as new work is added, not reactively.
- I don't think there's currently a good visual representation of "what alerts are firing right now?" from the metrics dashboard. (See the sketch below for one way to pull that list programmatically.)
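
(To make the last point concrete, a minimal sketch of pulling the currently firing alerts straight from the Prometheus HTTP API; the server URL is a placeholder and this assumes the `requests` package:)

```python
import requests

# Placeholder URL; point this at the environment's Prometheus instance.
PROMETHEUS_URL = "http://prometheus.example.com"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/alerts", timeout=10)
resp.raise_for_status()

alerts = resp.json()["data"]["alerts"]
firing = [a for a in alerts if a.get("state") == "firing"]

print(f"{len(firing)} alert(s) firing right now")
for alert in firing:
    labels = alert.get("labels", {})
    print(f"- {labels.get('alertname', '<unnamed>')}: {labels}")
```
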
Edited Feb 28, 2020 by Sarah Yasonik