Site Reliability Engineer Onboarding Issue 3 - Oncall Onboarding - Steve Azzopardi
Welcome to your oncall onboarding issue!
This is the third and final of your onboarding issues.
In order to join oncall, at a high level you should:
- Alerting - know how to find silences and create them.
- Join the shadow rotation in PagerDuty for a few days and shadow a current oncall.
- Join the shadow rotation in PagerDuty a second time and communicate with the EOC that you will take primary, with them as a fallback. Record a log of:
  - how many alerts you acknowledged
  - how many alerts felt actionable
  - how many alerts "made sense", meaning you knew what you needed to do or where to look
- Ideally, you are ready to join when the ratio of alerts that made sense to alerts acknowledged is above 80%, based on at least 10 alerts.
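As a quick sanity check, the readiness ratio from your shadow-shift log can be computed directly. A minimal sketch; the counts below are illustrative, not real data:

```shell
# Illustrative counts from a shadow-shift log (assumed values).
acknowledged=12   # alerts you acknowledged
made_sense=10     # alerts where you knew what to do or where to look

# Integer percentage of alerts that "made sense".
ratio=$((made_sense * 100 / acknowledged))
echo "made-sense ratio: ${ratio}%"   # prints "made-sense ratio: 83%"

# Ready when the ratio is above 80% across at least 10 alerts.
if [ "$acknowledged" -ge 10 ] && [ "$ratio" -gt 80 ]; then
  echo "ready to join oncall"
fi
```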
Generalized investigation steps during an incident
An incident starts with efficiently identifying the nature of the problem, drilling down through:
- PagerDuty alert
- Grafana dashboard for the alerting service
- Kibana log events for that service, often starting with one of the quick links from the Grafana dashboard
- Possibly looking at other Grafana dashboards if the above indicates that the alerting service is having trouble due to its dependency on another service (e.g. Rails having lots of SQL statement timeouts may indicate trouble on the database or its connection pooler).
Once we identify the affected component and the nature of its problem, that usually gives us enough info to understand what kind of solutions are likely to be helpful -- and that may mean getting help from domain experts in whatever component of the app code or infrastructure that we identified as contributing causes of the incident.
Remember that you are not alone. At any point you can ask for help from other SREs in the #infrastructure-lounge channel; someone will be happy to join you in Zoom. You can also escalate to the Incident Manager On Call (IMOC) at any time if you need a second opinion, a different perspective, or help figuring out who to reach out to on other teams.
The rest of this issue gives some practical steps/exercises for things you should know how to do.
Asking for help
Make sure you know how to:
- Page IMOC by typing `/pd trigger` in Slack, then choosing `GitLab Production - IMOC` under `Impacted Service`.
- Page CMOC by typing `/pd trigger` in Slack, then choosing `GitLab Production - CMOC` under `Impacted Service`.
- Page Security by typing `/security Please join us for incident #123` in Slack. Handbook.
- Page Dev by typing `/devoncall incident-issue-url` into #dev-escalation. Handbook.
Tools
Incident Management
- To declare an incident via Slack: `/incident declare`.
- When you're ready, add yourself to the EOC Shadow PD Schedule.
- Check out an example Alert in #production. Explore the Runbook, the Dashboard, the description, and the related Prometheus graph by clicking `show more`. Note that any of these links could be outdated, so proceed with caution.
- Understand when an Incident Review is required by reading the Incident Review Handbook.
- Check out the Scenario 3 YouTube recording in this Firedrill doc to get an idea of the k8s-related issues you might encounter on gitlab.com.
Security
- Explain all the traffic policing mechanisms we have available: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/rate-limiting/README.md
- How to block a user: see the Runbook for dealing with CI/CD Abuse.
- How to add a rate limit for a path: see the Runbook.
- Disabling things in HAProxy: see the Project import - Block example.
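For the HAProxy item above, disabling a server usually means a state change through the runtime admin socket. A minimal sketch; the backend name, server name, and socket path are illustrative placeholders, not the real production values:

```shell
# Assumed names: "web" backend, "web01" server, and an assumed socket path.
BACKEND="web"
SERVER="web01"
SOCKET="/run/haproxy/admin.sock"

# Compose the runtime API command that takes the server out of rotation.
CMD="set server ${BACKEND}/${SERVER} state maint"
echo "$CMD"   # prints "set server web/web01 state maint"

# On an actual LB node you would send it over the socket, e.g.:
#   echo "$CMD" | sudo socat stdio "$SOCKET"
```

In practice this is normally done via a runbook script or ChatOps rather than by hand, so treat the raw socket command as background knowledge.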
Delivery
- Create a hot-patch against production with a single change to a source file that adds a comment. Assign the MR to one of the current release managers. view documentation
- Get the current state of the GitLab.com Canary stage using GitLab ChatOps. view documentation
- Find the latest auto-deploy pipeline on ops.gitlab.net and get the current deploy status of all environments using GitLab ChatOps. view documentation
- Set up your workstation to ensure you have access to the zonal and regional k8s clusters. view documentation
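Cluster access for the last item typically means fetching kubeconfig credentials per cluster. A sketch assuming GKE; the project, region, and cluster names are illustrative placeholders, not the real ones (use the linked documentation for those):

```shell
# All names below are illustrative placeholders.
PROJECT="my-gitlab-project"
REGION="us-east1"
CLUSTER="zonal-cluster-a"

# Compose the command you would run once per zonal/regional cluster.
echo "gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECT}"

# Then verify access with the generated context, e.g.:
#   kubectl --context "gke_${PROJECT}_${REGION}_${CLUSTER}" get nodes
```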
Observability
- Locate the General SLA dashboard and find the panel for Sidekiq Queue Lengths per Queue.
- Read the SLI apdex troubleshooting tutorial: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/apdex-alerts-guide.md
- Ensure you know how to silence an alert. view documentation
- Create a visualization in Kibana of all errors grouped by status code. view documentation
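Silences can be created from the Alertmanager UI or, where `amtool` is available, from the command line. A sketch with illustrative matcher and duration values; follow the linked documentation for the actual workflow:

```shell
# Illustrative alert name and duration; adjust to the alert you are silencing.
ALERTNAME="SomeNoisyAlert"
DURATION="2h"

# Compose the amtool invocation (run it where amtool can reach Alertmanager).
echo "amtool silence add alertname=${ALERTNAME} --duration=${DURATION} --comment='EOC investigating'"
```

Always set a comment so the next oncall knows why the silence exists and when it can be dropped.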
Reliability
- Familiarize yourself with how to create incidents from Slack.
- Get the current HAProxy state of all nodes using the command line. view documentation
- Drain and then re-ready connections from one of the zonal clusters in staging. view documentation
- Join the following Slack channels: #incident-management, #production, #releases, #f_upcoming_release, #alerts_general, #alerts, #dev-escalation
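The drain/ready exercise above maps to two HAProxy runtime-state changes. A sketch with illustrative backend/server names (on real nodes this is usually wrapped by ChatOps or a runbook script; see the linked documentation):

```shell
# Illustrative names for a zonal cluster's entry in the LB config.
BACKEND="zonal_cluster"
SERVER="cluster-a"

# Drain: stop accepting new connections, let existing ones finish.
echo "set server ${BACKEND}/${SERVER} state drain"
# Ready: put the server back into rotation once the work is done.
echo "set server ${BACKEND}/${SERVER} state ready"

# Each line would be sent to the HAProxy admin socket, e.g.:
#   echo "set server ${BACKEND}/${SERVER} state drain" | sudo socat stdio /run/haproxy/admin.sock
```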