Osato Ojo - On-Call Basics

module-name: "On-Call Basics"
area: "Customer Service"
maintainers:
  - lyle

On-Call Basics

Goal: Be familiar with the responsibilities of being on-call in Support.

This module is not a standalone training module. It covers content that overlaps the Customer Emergency On-Call (CEOC) and Communication Manager On-Call (CMOC) roles, and should be done in conjunction with one of the related training modules.

Stage 1: PagerDuty Setup & Scheduling

  • Done with Stage 1

You might already have done parts of this section during onboarding or earlier modules. If so, just mark them accordingly – this section is to make sure everyone has the same baseline.

  1. Sign up on PagerDuty with the link that was emailed to you, and install the app on your phone.
    1. Familiarize yourself with the interface and the functionality.
    2. Use this PagerDuty guide to keep your PagerDuty contact information up to date.
    3. Use this PagerDuty guide to configure your notifications. Consider allowing PagerDuty to bypass "Do not disturb" mode.
  2. Configure your personal notification rules in PagerDuty under "My Profile" > "Notification Rules".
    1. The customer emergency escalation policy is currently set to 10 minutes. If you do not respond to the notification within this period, the emergency escalates to the rest of the team. Make sure your personal notification rules take this into account.
    2. Remember to update these rules whenever your contact details change.
  3. Use this PagerDuty guide to subscribe to your on-call schedule.
  4. Link PagerDuty with your Slack account by opening a direct message with the PagerDuty app and clicking the option to link your accounts when prompted. You'll see a confirmation page if the link was successful. (If you're unsure whether you've already done this, run /pd help and check the output. If it lists /pd unlink as an option, your accounts are already linked.)
  5. Watch the CMOC training recording where a member of the site reliability team provides guidance on the role and expectations of being on CMOC PagerDuty.
  6. OPTIONAL: Now that you have access to PagerDuty, consider joining a shadow rotation and getting paged right along with the Support Engineer On-Call! Your manager can help you with this.
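
The timing constraint in step 2 can be sketched as a quick sanity check. This is a minimal illustration only: it does not call PagerDuty, and `NotificationRule`, `rules_within_window`, and `check_rules` are hypothetical names invented for this sketch. The only value taken from this module is the 10-minute escalation timeout.

```python
from dataclasses import dataclass

# Escalation timeout stated in this module (minutes); adjust if the policy changes.
ESCALATION_TIMEOUT_MIN = 10


@dataclass
class NotificationRule:
    """Hypothetical stand-in for one personal notification rule."""
    delay_min: int  # minutes after the page is assigned to you
    channel: str    # e.g. "push", "sms", "phone"


def rules_within_window(rules, timeout_min=ESCALATION_TIMEOUT_MIN):
    """Return the rules that fire before the emergency escalates."""
    return [r for r in rules if r.delay_min < timeout_min]


def check_rules(rules, timeout_min=ESCALATION_TIMEOUT_MIN):
    """Return a list of warnings; an empty list means no problems were found."""
    warnings = []
    if not rules_within_window(rules, timeout_min):
        warnings.append("No rule fires before escalation.")
    if not any(r.delay_min == 0 for r in rules):
        warnings.append("No rule notifies immediately.")
    for r in rules:
        if r.delay_min >= timeout_min:
            warnings.append(
                f"{r.channel} at {r.delay_min} min fires at or after escalation."
            )
    return warnings


if __name__ == "__main__":
    my_rules = [
        NotificationRule(0, "push"),
        NotificationRule(3, "sms"),
        NotificationRule(12, "phone"),  # too late to help before escalation
    ]
    for warning in check_rules(my_rules):
        print(warning)
```

In this sketch, any rule with a delay of 10 minutes or more is flagged, because the emergency would already have escalated by the time that notification reaches you.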

Stage 2: GitLab.com Incident Basics

  • Done with Stage 2

It's also essential to know what an incident is at its most basic level and how we classify one. Regardless of your on-call role, if you believe there is an incident, declare one.

  1. Read the Incident Management page from the Infrastructure section of the GitLab handbook to understand how to collaborate with the Site Reliability Engineers on-call for GitLab.com emergencies. Take special note of:
    • What an incident is.
    • What roles are assumed during an incident.
    • The definitions of the different states of operation that the GitLab.com platform may be in during an incident.
  2. CEOC OPTIONAL (Required for CMOC): Familiarize yourself with the hot patching process on GitLab.com and how it works. Make sure to bookmark the patcher project. See !42 and gitlab-com/gl-infra/patcher/-/pipelines/252165 for an example hot-patch MR and corresponding pipeline.

GitLab.com Staging Access

During an emergency, you may use staging.gitlab.com to verify a patch before it is pushed out to production. You don't need a special account type; a normal user account should be sufficient.

  1. Create an account by using the Google or SSO sign-in on staging.gitlab.com. If that doesn't work, open an Individual Access Request.

Stage 3: GitLab.com Architecture, Monitoring and Logs

  • Done with Stage 3

A basic understanding of the architecture behind the GitLab.com platform, and of how our infrastructure team monitors it, is essential to understanding how issues with individual components affect end-users.

  1. Read through the Production Architecture document to gain a basic understanding of the infrastructural layout of GitLab.com.
  2. Read about the Monitoring of GitLab.com to understand how our infrastructure team monitors the performance of GitLab.com.
  3. Read about which critical dashboards show if GitLab.com is experiencing an incident and then bookmark the following ones:
  4. OPTIONAL: If you need a refresher on searching GitLab.com logs, review the Searching logs section in the GitLab.com Basics training.

Stage 4: Basic GitLab.com Incident Preparation

  • Done with Stage 4

Configure your workspace so that you are as prepared as possible for an incident. This means bookmarking essential issue trackers, joining incident management Slack channels, and knowing where reports of issues from end-users will surface.

Issue Trackers

  1. Bookmark these issue trackers.

Slack Channels

  1. Join these Slack channels.