Title: Site Reliability Engineer Onboarding Issue 3 - Oncall Onboarding [Fill in name and start date]
Welcome to your oncall onboarding issue!
This is the third and final issue in your onboarding series.
In order to join oncall, at a high level you should:
- Alerting: know how to find silences and create them (view documentation).
- Join the shadow rotation in PagerDuty for a few days and shadow a current oncall. You might consider setting up a layer within the schedule to shadow oncall once a week on a consistent day. This lets you work with different people and see different issues over time, without as much mental drain as a full week oncall.
- Join the shadow rotation in PagerDuty a second time and communicate with the EOC that you will take primary with them as a fallback. Record a log of:
  - how many alerts you acknowledge
  - how many alerts felt actionable
  - how many alerts "made sense", where you knew what you needed to do or where to look
- Ideally, you are ready to join when the ratio of made-sense alerts to acknowledged alerts is above 80%, based on at least 10 alerts.
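The readiness check above is just arithmetic over your shadow log. A minimal sketch (the counts and variable names are illustrative, not from a real tool):

```shell
# Hypothetical shadow-week tally; replace the numbers with your own log.
acknowledged=12   # alerts you acknowledged during the shadow rotation
made_sense=10     # alerts where you knew what to do or where to look
ratio=$(( 100 * made_sense / acknowledged ))
echo "made-sense ratio: ${ratio}%"
# Ready when the ratio is above 80 and acknowledged is at least 10.
```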
Generalized investigation steps during an incident
An incident investigation starts with efficiently identifying the nature of the problem, drilling down through:
- PagerDuty alert
- Grafana dashboard for the alerting service
- Kibana log events for that service, often starting with one of the quick links from the Grafana dashboard
- Possibly looking at other Grafana dashboards if the above indicates that the alerting service is having trouble due to its dependency on another service (e.g. Rails having lots of SQL statement timeouts may indicate trouble on the database or its connection pooler).
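As a concrete (hypothetical) example of that drill-down: if the Rails dashboards show a spike in SQL statement timeouts, a Kibana search along these lines can narrow the log events — the field names here are assumptions, so match them to the actual index mapping:

```
json.exception.class: "ActiveRecord::QueryCanceled" AND json.stage: "main"
```

If the matching events cluster on one endpoint or one database host, that points at the dependency to investigate next.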
Once we identify the affected component and the nature of its problem, that usually gives us enough information to understand what kind of solutions are likely to help -- and that may mean getting help from domain experts in whatever component of the app code or infrastructure we identified as a contributing cause of the incident.
Remember that you are not alone. At any point you can ask for help from other SREs in the #infrastructure-lounge channel; someone will be happy to join you in Zoom. You can also escalate to the Incident Manager On Call (IMOC) at any time if you need a second opinion, a different perspective, or help knowing who to reach out to on other teams.
If you find any abnormal or suspicious activity during the course of your investigation, please do not hesitate to contact security.
The rest of this issue gives some practical steps/exercises for things you should know how to do.
Asking for help
Make sure you know how to:
- Page IMOC by typing `/pd trigger` in Slack, then choosing `GitLab Production - Incident Manager` under `Impacted Service`.
- Page CMOC by typing `/pd trigger` in Slack, then choosing `Incident Management - CMOC` under `Impacted Service`.
- Page Security: for medium/high severity incidents, refer to how to engage the SEOC. For lower severity incidents, refer to the incident severity table to determine the right course of action.
- Page Dev by typing `/devoncall incident-issue-url` into `#dev-escalation`. See the handbook.
Tools
Incident Management
- To declare an incident via Slack: `/incident declare`.
- When you're ready, add yourself to the EOC Shadow PD Schedule.
- Check out an example alert in `#production`; explore the Runbook, the Dashboard, the description, and the related Prometheus graph by clicking `show more`. Note that any of these links could be outdated, so evaluate them with caution.
- Understand when an Incident Review is required by viewing the Incident Review Handbook.
- Check out the Scenario 3 YouTube recording in this Firedrill doc to get an idea of the k8s-related issues you might encounter on GitLab.com.
Security
- Explain all the traffic policing mechanisms we have available (view the runbook).
- How to block a user: see the runbook for dealing with CI/CD abuse.
- How to add a rate limit for a path (view the runbook).
- Disabling things in HAProxy: see the Project import - Block example.
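For the rate-limit exercise, the general shape of a path-based rate limit in HAProxy looks roughly like this. This is a sketch only -- the frontend name, path, and threshold are illustrative, not our production config; follow the runbook for the real procedure:

```
# Illustrative only: track per-source request rate on a path and deny above a threshold.
frontend https
    stick-table type ip size 100k expire 1m store http_req_rate(1m)
    acl import_path path_beg /import
    http-request track-sc0 src if import_path
    http-request deny deny_status 429 if import_path { sc_http_req_rate(0) gt 100 }
```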
Delivery
- Create a hot-patch against production with a single change to a source file that adds a comment. Assign the MR to one of the current release managers (view documentation).
- Get the current state of the GitLab.com Canary stage using GitLab ChatOps (view documentation).
- Find the latest auto-deploy pipeline on ops.gitlab.net and get the current deploy status on all environments using GitLab ChatOps (view documentation).
- Set up your workstation to ensure you have access to the zonal and regional k8s clusters (view documentation).
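The ChatOps checks above are run as Slack slash commands rather than from a terminal. For example, deploy status is commonly queried with something like the following (verify the exact invocation against the linked documentation, and run it from a channel where the ChatOps bot is present):

```
/chatops run auto_deploy status
```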
Observability
- Take a look at the following dashboards:
  - Locate the `general: SLAs` dashboard.
  - Locate the `sidekiq: Overview` dashboard and find the panel for `Sidekiq Queue Lengths per Queue`.
- Read these documents about notifications and troubleshooting:
  - Ensure you know how to silence an alert (view documentation).
  - Ensure you can run `make generate` in the runbooks repository.
  - Create a visualization in Kibana for all errors grouped by status code (view documentation).
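For the Kibana exercise, the underlying Elasticsearch query is a terms aggregation over the status field. A rough sketch -- the `json.severity` and `json.status` field names are assumptions, so match them to the actual index mapping:

```json
{
  "query": { "match": { "json.severity": "ERROR" } },
  "aggs": {
    "by_status": {
      "terms": { "field": "json.status" }
    }
  }
}
```

Building the same thing in the Kibana UI (a vertical bar or data table visualization with a terms bucket) avoids writing the JSON by hand.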
Reliability
- Familiarize yourself with how to create incidents from Slack (view documentation).
- Get the current HAProxy state of all nodes using the command line (view documentation).
- First drain and then ready connections from one of the zonal clusters in staging (view documentation).
- Join the following Slack channels: `#incident-management`, `#production`, `#releases`, `#f_upcoming_release`, `#feed_alerts-general`, `#alerts`, `#dev-escalation`.
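On the HAProxy exercise: one standard way to read per-server state is HAProxy's admin socket, which returns CSV from `show stat`. The sketch below parses one canned row of that CSV so it runs anywhere; the socket path is an assumption, and the linked documentation describes the supported procedure for our nodes:

```shell
# On a real HAProxy node you would typically feed in live data, e.g.:
#   echo "show stat" | socat stdio /run/haproxy/admin.sock   # path is an assumption
# In `show stat` CSV output, field 1 is the proxy name, field 2 the server
# name, and field 18 the status (UP, DOWN, MAINT, ...).
sample_stat="https,web-01,0,0,12,40,2000,9,7,0,0,0,0,0,0,0,0,UP"
echo "$sample_stat" | cut -d, -f1,2,18
```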