Site Reliability Engineer Onboarding Issue 3 - Oncall Onboarding - Steve Azzopardi
Welcome to your oncall onboarding issue!
This is the third and final of your onboarding issues.
In order to join oncall, at a high level you should:
- Alerting - know how to find silences and create them.
- Join the shadow rotation in PagerDuty for a few days and shadow a current oncall.
- Join the shadow rotation in PagerDuty a second time and communicate with the EOC that you will take primary, with them as a fallback. Record a log of:
  - how many alerts you acknowledged
  - how many alerts felt actionable
  - how many alerts "made sense", meaning you knew what you needed to do or where to look
- Ideally, you are ready to join when the ratio of alerts that made sense to alerts acknowledged is above 80%, based on at least 10 alerts.
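As a quick sanity check, the readiness ratio from your shadow-shift log can be computed directly. A minimal sketch; the counts below are illustrative, not real data:

```shell
# Illustrative counts from a shadow-shift log (assumed values).
acknowledged=12   # alerts you acknowledged
made_sense=10     # alerts where you knew what to do or where to look

# Integer percentage of alerts that "made sense".
ratio=$((made_sense * 100 / acknowledged))
echo "made-sense ratio: ${ratio}%"   # prints "made-sense ratio: 83%"

# Ready when the ratio is above 80% across at least 10 alerts.
if [ "$acknowledged" -ge 10 ] && [ "$ratio" -gt 80 ]; then
  echo "ready to join oncall"
fi
```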
Generalized investigation steps during an incident
An incident starts with efficiently identifying the nature of the problem, drilling down through:
- PagerDuty alert
- Grafana dashboard for the alerting service
- Kibana log events for that service, often starting with one of the quick links from the Grafana dashboard
- Possibly looking at other Grafana dashboards if the above indicates that the alerting service is having trouble due to its dependency on another service (e.g. Rails having lots of SQL statement timeouts may indicate trouble on the database or its connection pooler).
Once we identify the affected component and the nature of its problem, that usually gives us enough info to understand what kind of solutions are likely to be helpful -- and that may mean getting help from domain experts in whatever component of the app code or infrastructure that we identified as contributing causes of the incident.
Remember that you are not alone. At any point you can ask for help from other SREs in the #infrastructure-lounge channel; someone will be happy to join you in Zoom. You can also escalate to the Incident Manager On Call (IMOC) at any time if you need a second opinion, a different perspective, or help figuring out who to reach out to on other teams.
The rest of this issue gives some practical steps/exercises for things you should know how to do.
Asking for help
Make sure you know how to:
- Page IMOC by typing `/pd trigger` in Slack, then choosing `GitLab Production - IMOC` under `Impacted Service`.
- Page CMOC by typing `/pd trigger` in Slack, then choosing `GitLab Production - CMOC` under `Impacted Service`.
- Page Security by typing `/security Please join us for incident #123` in Slack. Handbook.
- Page Dev by typing `/devoncall incident-issue-url` into #dev-escalation. Handbook.
Tools
Incident Management
- To declare an incident via Slack: `/incident declare`.
- When you're ready, add yourself to the EOC Shadow PD Schedule.
- Check out an example Alert in #production. Explore the Runbook, the Dashboard, the description, and the related Prometheus graph by clicking `show more`. Note that any of these links could be outdated, so proceed with caution.
- Understand when an Incident Review is required by reading the Incident Review Handbook.
- Check out the Scenario 3 YouTube recording in this Firedrill doc to get an idea of the k8s-related issues you might encounter on gitlab.com.
Security
- Explain all the traffic policing mechanisms we have available: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/rate-limiting/README.md
- How to block a user: see the Runbook for dealing with CI/CD Abuse.
- How to add a rate limit for a path: see the Runbook.
- Disabling things in HAProxy: see the Project import - Block example.
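For the HAProxy item above, disabling a server usually means a state change through the runtime admin socket. A minimal sketch; the backend name, server name, and socket path are illustrative placeholders, not the real production values:

```shell
# Assumed names: "web" backend, "web01" server, and an assumed socket path.
BACKEND="web"
SERVER="web01"
SOCKET="/run/haproxy/admin.sock"

# Compose the runtime API command that takes the server out of rotation.
CMD="set server ${BACKEND}/${SERVER} state maint"
echo "$CMD"   # prints "set server web/web01 state maint"

# On an actual LB node you would send it over the socket, e.g.:
#   echo "$CMD" | sudo socat stdio "$SOCKET"
```

In practice this is normally done via a runbook script or ChatOps rather than by hand, so treat the raw socket command as background knowledge.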
Delivery
- Create a hot-patch against production with a single change to a source file that adds a comment. Assign the MR to one of the current release managers. view documentation
- Get the current state of the GitLab.com Canary stage using GitLab ChatOps. view documentation
- Find the latest auto-deploy pipeline on ops.gitlab.net and get the current deploy status of all environments using GitLab ChatOps. view documentation
- Set up your workstation to ensure you have access to the zonal and regional k8s clusters. view documentation
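Cluster access for the last item typically means fetching kubeconfig credentials per cluster. A sketch assuming GKE; the project, region, and cluster names are illustrative placeholders, not the real ones (use the linked documentation for those):

```shell
# All names below are illustrative placeholders.
PROJECT="my-gitlab-project"
REGION="us-east1"
CLUSTER="zonal-cluster-a"

# Compose the command you would run once per zonal/regional cluster.
echo "gcloud container clusters get-credentials ${CLUSTER} --region ${REGION} --project ${PROJECT}"

# Then verify access with the generated context, e.g.:
#   kubectl --context "gke_${PROJECT}_${REGION}_${CLUSTER}" get nodes
```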
Observability
- Locate the General SLA dashboard and find the panel for Sidekiq Queue Lengths per Queue.
- Read the SLI apdex troubleshooting tutorial: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/apdex-alerts-guide.md
- Ensure you know how to silence an alert. view documentation
- Create a visualization in Kibana of all errors grouped by status code. view documentation
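Silences can be created from the Alertmanager UI or, where `amtool` is available, from the command line. A sketch with illustrative matcher and duration values; follow the linked documentation for the actual workflow:

```shell
# Illustrative alert name and duration; adjust to the alert you are silencing.
ALERTNAME="SomeNoisyAlert"
DURATION="2h"

# Compose the amtool invocation (run it where amtool can reach Alertmanager).
echo "amtool silence add alertname=${ALERTNAME} --duration=${DURATION} --comment='EOC investigating'"
```

Always set a comment so the next oncall knows why the silence exists and when it can be dropped.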
Reliability
- Familiarize yourself with how to create incidents from Slack.
- Get the current HAProxy state of all nodes using the command line. view documentation
- Drain and then re-ready connections from one of the zonal clusters in staging. view documentation
- Join the following Slack channels: #incident-management, #production, #releases, #f_upcoming_release, #alerts_general, #alerts, #dev-escalation
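The drain/ready exercise above maps to two HAProxy runtime-state changes. A sketch with illustrative backend/server names (on real nodes this is usually wrapped by ChatOps or a runbook script; see the linked documentation):

```shell
# Illustrative names for a zonal cluster's entry in the LB config.
BACKEND="zonal_cluster"
SERVER="cluster-a"

# Drain: stop accepting new connections, let existing ones finish.
echo "set server ${BACKEND}/${SERVER} state drain"
# Ready: put the server back into rotation once the work is done.
echo "set server ${BACKEND}/${SERVER} state ready"

# Each line would be sent to the HAProxy admin socket, e.g.:
#   echo "set server ${BACKEND}/${SERVER} state drain" | sudo socat stdio /run/haproxy/admin.sock
```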