Title: Site Reliability Engineer Onboarding Issue 3 - Oncall Onboarding [Fill in name and start date]
Welcome to your oncall onboarding issue!
This is the third and final issue in your onboarding series.
In order to join oncall, at a high level you should:
- Alerting: know how to find silences and create them (view documentation).
- Join the shadow rotation in PagerDuty for a few days and shadow a current oncall. You might consider setting up a layer within the schedule to shadow oncall once a week on a consistent day. This lets you work with different people and see different issues over time, without as much mental drain as a full week oncall.
- Join the shadow rotation in PagerDuty a second time and communicate with the EOC that you will take primary with them as a fallback. Record a log of:
  - how many alerts you acknowledge
  - how many alerts felt actionable
  - how many alerts "made sense", where you knew what you needed to do or where to look
- Ideally, you are ready to join when the ratio of made-sense alerts to acknowledged alerts is above 80%, based on at least 10 alerts.
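The readiness check above is just arithmetic over your shadow log. A minimal sketch (the counts and variable names are illustrative, not from a real tool):

```shell
# Hypothetical shadow-week tally; replace the numbers with your own log.
acknowledged=12   # alerts you acknowledged during the shadow rotation
made_sense=10     # alerts where you knew what to do or where to look
ratio=$(( 100 * made_sense / acknowledged ))
echo "made-sense ratio: ${ratio}%"
# Ready when the ratio is above 80 and acknowledged is at least 10.
```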
Generalized investigation steps during an incident
An incident investigation starts with efficiently identifying the nature of the problem, drilling down through:
- PagerDuty alert
- Grafana dashboard for the alerting service
- Kibana log events for that service, often starting with one of the quick links from the Grafana dashboard
- Possibly looking at other Grafana dashboards if the above indicates that the alerting service is having trouble due to its dependency on another service (e.g. Rails having lots of SQL statement timeouts may indicate trouble on the database or its connection pooler).
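As a concrete (hypothetical) example of that drill-down: if the Rails dashboards show a spike in SQL statement timeouts, a Kibana search along these lines can narrow the log events — the field names here are assumptions, so match them to the actual index mapping:

```
json.exception.class: "ActiveRecord::QueryCanceled" AND json.stage: "main"
```

If the matching events cluster on one endpoint or one database host, that points at the dependency to investigate next.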
Once we identify the affected component and the nature of its problem, that usually gives us enough information to understand what kind of solutions are likely to help -- and that may mean getting help from domain experts in whatever component of the app code or infrastructure we identified as a contributing cause of the incident.
Remember that you are not alone. At any point you can ask for help from other SREs in the #infrastructure-lounge channel; someone will be happy to join you in Zoom. You can also escalate to the Incident Manager On Call (IMOC) at any time if you need a second opinion, a different perspective, or help knowing who to reach out to on other teams.
If you find any abnormal or suspicious activity during the course of your investigation, please do not hesitate to contact security.
The rest of this issue gives some practical steps/exercises for things you should know how to do.
Asking for help
Make sure you know how to:
- Page IMOC by typing `/pd trigger` in Slack, then choosing `GitLab Production - Incident Manager` under `Impacted Service`.
- Page CMOC by typing `/pd trigger` in Slack, then choosing `Incident Management - CMOC` under `Impacted Service`.
- Page Security: for medium/high severity incidents, refer to how to engage the SEOC. For lower severity incidents, refer to the incident severity table to determine the right course of action.
- Page Dev by typing `/devoncall incident-issue-url` into `#dev-escalation`. See the handbook.
Tools
Incident Management
- To declare an incident via Slack: `/incident declare`.
- When you're ready, add yourself to the EOC Shadow PD Schedule.
- Check out an example alert in `#production`; explore the Runbook, the Dashboard, the description, and the related Prometheus graph by clicking `show more`. Note that any of these links could be outdated, so evaluate them with caution.
- Understand when an Incident Review is required by viewing the Incident Review Handbook.
- Check out the Scenario 3 YouTube recording in this Firedrill doc to get an idea of the k8s-related issues you might encounter on GitLab.com.
Security
- Explain all the traffic policing mechanisms we have available (view the runbook).
- How to block a user: see the runbook for dealing with CI/CD abuse.
- How to add a rate limit for a path (view the runbook).
- Disabling things in HAProxy: see the Project import - Block example.
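For the rate-limit exercise, the general shape of a path-based rate limit in HAProxy looks roughly like this. This is a sketch only -- the frontend name, path, and threshold are illustrative, not our production config; follow the runbook for the real procedure:

```
# Illustrative only: track per-source request rate on a path and deny above a threshold.
frontend https
    stick-table type ip size 100k expire 1m store http_req_rate(1m)
    acl import_path path_beg /import
    http-request track-sc0 src if import_path
    http-request deny deny_status 429 if import_path { sc_http_req_rate(0) gt 100 }
```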
Delivery
- Create a hot-patch against production with a single change to a source file that adds a comment. Assign the MR to one of the current release managers (view documentation).
- Get the current state of the GitLab.com Canary stage using GitLab ChatOps (view documentation).
- Find the latest auto-deploy pipeline on ops.gitlab.net and get the current deploy status on all environments using GitLab ChatOps (view documentation).
- Set up your workstation to ensure you have access to the zonal and regional k8s clusters (view documentation).
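The ChatOps checks above are run as Slack slash commands rather than from a terminal. For example, deploy status is commonly queried with something like the following (verify the exact invocation against the linked documentation, and run it from a channel where the ChatOps bot is present):

```
/chatops run auto_deploy status
```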
Observability
- Take a look at the following dashboards:
  - Locate the `general: SLAs` dashboard.
  - Locate the `sidekiq: Overview` dashboard and find the panel for `Sidekiq Queue Lengths per Queue`.
- Read these documents about notifications and troubleshooting:
  - Ensure you know how to silence an alert (view documentation).
  - Ensure you can run `make generate` in the runbooks repository.
  - Create a visualization in Kibana for all errors grouped by status code (view documentation).
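For the Kibana exercise, the underlying Elasticsearch query is a terms aggregation over the status field. A rough sketch -- the `json.severity` and `json.status` field names are assumptions, so match them to the actual index mapping:

```json
{
  "query": { "match": { "json.severity": "ERROR" } },
  "aggs": {
    "by_status": {
      "terms": { "field": "json.status" }
    }
  }
}
```

Building the same thing in the Kibana UI (a vertical bar or data table visualization with a terms bucket) avoids writing the JSON by hand.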
Reliability
- Familiarize yourself with how to create incidents from Slack (view documentation).
- Get the current HAProxy state of all nodes using the command line (view documentation).
- First drain and then ready connections from one of the zonal clusters in staging (view documentation).
- Join the following Slack channels: `#incident-management`, `#production`, `#releases`, `#f_upcoming_release`, `#feed_alerts-general`, `#alerts`, `#dev-escalation`.
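On the HAProxy exercise: one standard way to read per-server state is HAProxy's admin socket, which returns CSV from `show stat`. The sketch below parses one canned row of that CSV so it runs anywhere; the socket path is an assumption, and the linked documentation describes the supported procedure for our nodes:

```shell
# On a real HAProxy node you would typically feed in live data, e.g.:
#   echo "show stat" | socat stdio /run/haproxy/admin.sock   # path is an assumption
# In `show stat` CSV output, field 1 is the proxy name, field 2 the server
# name, and field 18 the status (UP, DOWN, MAINT, ...).
sample_stat="https,web-01,0,0,12,40,2000,9,7,0,0,0,0,0,0,0,0,UP"
echo "$sample_stat" | cut -d, -f1,2,18
```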