# GitLab.com CMOC - Jason Young

module: GitLab.com CMOC
area: Helping Customers
level: Beginner
maintainer: TBD
pathways:
  gitlab-com-saas-support:
    position: 2
Goal of this module: Instruct the taker on the duties and responsibilities of being the Communications Manager On Call (CMOC) for an active GitLab.com incident. This includes providing a clear understanding of what an incident is, how to work with reliability engineering during one, and how to use the tools at our disposal to effectively communicate updates to incidents both internally and externally to end-users and stakeholders.
Tackle each stage in sequential order, but first:

- [ ] Ping your manager on this issue to notify them you have started.
- [ ] Open an access request for Status.io and PagerDuty (if you don't already have accounts).
## Stage 1: GitLab.com Architecture, Monitoring, and Incident Basics

- [ ] Done with Stage 1
A basic understanding of the architecture of the GitLab.com platform, along with how our infrastructure team monitors it, is paramount to understanding how issues with certain components of the platform affect end-users. It's also essential to know what an incident is at its most basic level and how we classify one.
### Architecture & Monitoring

- [ ] Read through the Production Architecture document to gain a basic understanding of the infrastructural layout of GitLab.com.
- [ ] Read about the Monitoring of GitLab.com to understand how our infrastructure team monitors the performance of GitLab.com.
- [ ] Read about which critical dashboards show if GitLab.com is experiencing an incident and then bookmark the following ones:
### Incident Basics

- [ ] Read the Incident Management page from the Infrastructure section of the GitLab handbook. Take special note of:
  - What an incident is.
  - What roles are assumed during an incident.
  - The definitions of the different states of operation that the GitLab.com platform may be in during an incident.
You should also know:

- [ ] What the hot patching process is on GitLab.com and how it works. Make sure to bookmark the patcher project.
Lastly:

- [ ] Be aware of PagerDuty schedules, specifically the on-call hours. They may fall slightly outside of your normal working hours.
- [ ] Understand the expectations for being on-call.
## Stage 2: Incident Preparation

- [ ] Done with Stage 2

Your workspace should be configured to be as prepared as possible for an incident. This means having essential issue trackers bookmarked, joining incident-management-related Slack channels, and being aware of where reports of issues from end-users will surface.
### Issue Trackers

- [ ] Bookmark these issue trackers.
### Slack Channels

- [ ] Join these Slack channels.

Optionally, consider joining the following channels as well. They aren't necessary to monitor when working through most incidents, but they will be useful eventually.

- #mgcp_gitlab_ops
- #ongres-gitlab
- #cloud-provider-alerts
- #alerts
- #alerts-general
- #announcements
- #dev-escalation
### GitLab Community Forum

Reports of problems with GitLab.com will primarily come from end-users through Zendesk, but they may come from other sources as well.

- [ ] Bookmark the GitLab Community Forum and be prepared to check it for reports of issues with GitLab.com.
## Stage 3: Managing Incidents

- [ ] Done with Stage 3

The incident management process normally begins with a PagerDuty page from the EOC to the CMOC when they've detected a problem with GitLab.com severe enough to require our involvement.

However, if we or our users notice an issue that we suspect could turn into an incident, we're encouraged to contact the on-call EOC for their opinion.
- [ ] Read about how to determine who the on-call EOC is.
In more severe cases, if we've received enough reports from users of a particular issue with GitLab.com that we feel is indicative of an incident, we can page the on-call EOC through PagerDuty.

- [ ] Read about how to page the on-call EOC.

As a general rule, at least three separate reports of the same type of issue within a relatively short timeframe are a strong indicator that GitLab.com may be facing an incident and that a page to the EOC is justified. Be sure to include links to Zendesk tickets and any other information the EOC may need to understand the issue clearly and concisely.
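The "three reports in a short timeframe" rule of thumb can be sketched as a simple sliding-window check. The 30-minute window below is an assumption for illustration; the handbook does not prescribe an exact duration, and your judgment always takes precedence over any threshold.

```python
from datetime import datetime, timedelta

def should_page_eoc(report_times, window=timedelta(minutes=30), threshold=3):
    """Return True if `threshold` or more reports fall within `window`.

    The window length and threshold are illustrative defaults, not
    handbook-mandated values.
    """
    times = sorted(report_times)
    # Slide over every run of `threshold` consecutive reports and check
    # whether the first and last fall within the window.
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False

now = datetime(2024, 1, 1, 12, 0)
reports = [now, now + timedelta(minutes=5), now + timedelta(minutes=12)]
print(should_page_eoc(reports))  # True
```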
### The Status Page - Status.io

Once an incident has been declared, the EOC will page the CMOC via PagerDuty, and it'll be up to you to start managing Status.io. Along with this, one of your first tasks should be to join The Situation Room Zoom call so that you can follow along with the EOC and anyone else involved in working the incident.

Effective communication with production engineering during an incident is crucial, as the content of our status updates will largely come from them. Keeping yourself informed of the progress of an incident will allow you to communicate updates quickly and concisely to stakeholders and affected users.

Incidents and maintenance events are managed by the CMOC throughout their entire lifecycle through our status page, powered by Status.io. Updates made to incidents and maintenance events in Status.io are automatically tweeted out via @gitlabstatus using Status.io's broadcast feature.
- [ ] Learn how to Create, Update, and Resolve incidents in Status.io.
- [ ] Read about Frequency of Updates to learn how often you should aim to update Status.io depending on the severity of the incident.
- [ ] Learn how to Perform a Handover at the end of your on-call shift.
## Stage 4: Review

- [ ] Done with Stage 4

### Review Past Incidents

- [ ] Review the following past incidents to get an idea of the proper tone to use and which details to include when posting status updates.
- [S2] Errors returned on CI job artifact uploads: A recent deployment to GitLab.com began to cause CI job artifacts to return a 500 error on upload.
- [S3] Increased Error Rate on GitLab.com: Infrastructure was alerted to a CPU saturation issue on a specific Gitaly node, which was causing overall slowness and timeouts on GitLab.com. This incident involved the CMOC blocking a .com user and contacting them via Zendesk.
- [S1] GitLab.com Registry Issues: The GitLab.com registry began returning 503 errors for pushes and pulls.
- [S1] CI Runner Delays: A deployment to GitLab.com temporarily rendered Sidekiq inoperable, causing CI pipelines to not be picked up.
If you need to review more examples, browse the incident history section of Status.io.
### Status Page Testing

Our Status.io instance includes a testing environment that lets you create, update, and resolve incidents on an internal version of our status page.

Once your access request has been fulfilled and you've logged in to Status.io, you can access the testing environment by clicking the GitLab System Status dropdown box in the top right of the window and selecting Test Page. If you skip this step, you will be creating an incident on our LIVE status page.

Test your knowledge by creating an incident, guiding it through the entire incident lifecycle, and then providing a link to it.
- [ ] Create, update, change to monitoring, and then resolve one incident on our test status page.
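The lifecycle you walk through on the test page can be pictured as a small state machine over Status.io's standard incident states. The class below is purely illustrative (it is not part of any Status.io API), and real incidents may receive multiple updates within a single state.

```python
# Illustrative model of the incident lifecycle: Investigating ->
# Identified -> Monitoring -> Resolved. Not a Status.io API; just a
# sketch of the progression described above.

class TestIncident:
    TRANSITIONS = {
        "Investigating": {"Identified"},
        "Identified": {"Monitoring"},
        "Monitoring": {"Resolved"},
        "Resolved": set(),
    }

    def __init__(self, title):
        self.title = title
        self.state = "Investigating"
        self.updates = []

    def update(self, new_state, message):
        """Advance the incident and record the status update text."""
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.state = new_state
        self.updates.append((new_state, message))

incident = TestIncident("Test: elevated error rates")
incident.update("Identified", "Root cause traced to a recent deploy.")
incident.update("Monitoring", "Fix deployed; watching error rates.")
incident.update("Resolved", "Error rates back to baseline.")
print(incident.state)  # Resolved
```

Writing one status-update message per transition, as above, mirrors the cadence you'll use on the real status page.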
## Completion

- [ ] Ping your manager on this issue to let them know that you've completed this module and are ready to be added to the CMOC schedule rotation in PagerDuty.
- [ ] Send an MR to declare yourself a GitLab.com CMOC on the team page.