# GitLab.com CMOC - Jason Young

module: GitLab.com CMOC
area: Helping Customers
level: Beginner
maintainer: TBD
pathways:
  gitlab-com-saas-support:
    position: 2
Goal of this module: Instruct the taker on the duties and responsibilities of being the Communications Manager On Call (CMOC) for an active GitLab.com incident. This includes providing a clear understanding of what an incident is, how to work with reliability engineering during one, and how to use the tools at our disposal to effectively communicate updates to incidents both internally and externally to end-users and stakeholders.
Tackle each stage in sequential order, but first:

- [ ] Ping your manager on this issue to notify them you have started.
- [ ] Open an access request for Status.io and PagerDuty (if you don't already have accounts).
## Stage 1: GitLab.com Architecture, Monitoring, and Incident Basics

- [ ] Done with Stage 1
A basic understanding of the architecture of the GitLab.com platform, along with how our infrastructure team monitors it, is paramount to understanding how issues with certain components of the platform affect end-users. It's also essential to know what an incident is at its most basic level and how we classify one.
### Architecture & Monitoring

- [ ] Read through the Production Architecture document to gain a basic understanding of the infrastructural layout of GitLab.com.
- [ ] Read about the Monitoring of GitLab.com to understand how our infrastructure team monitors the performance of GitLab.com.
- [ ] Read about which critical dashboards show if GitLab.com is experiencing an incident and then bookmark the following ones:
### Incident Basics

- [ ] Read the Incident Management page from the Infrastructure section of the GitLab handbook. Take special note of:
  - What an incident is.
  - What roles are assumed during an incident.
  - The definitions of the different states of operation that the GitLab.com platform may be in during an incident.
You should also know:

- [ ] What the hot patching process is on GitLab.com and how it works. Make sure to bookmark the patcher project.
Lastly:

- [ ] Be aware of PagerDuty schedules, specifically the on-call hours. They may fall slightly outside of your normal working hours.
- [ ] Understand the expectations for being on-call.
## Stage 2: Incident Preparation

- [ ] Done with Stage 2

Your workspace should be configured to be as prepared as possible for an incident. This means having essential issue trackers bookmarked, joining incident-management-related Slack channels, and being aware of where reports of issues from end-users will surface.
### Issue Trackers

- [ ] Bookmark these issue trackers.
### Slack Channels

- [ ] Join these Slack channels.

Optionally, consider joining the following channels as well. They aren't necessary to monitor when working through most incidents, but they will be useful eventually.

- #mgcp_gitlab_ops
- #ongres-gitlab
- #cloud-provider-alerts
- #alerts
- #alerts-general
- #announcements
- #dev-escalation
### GitLab Community Forum

Reports of problems with GitLab.com will primarily come from end-users through Zendesk, but they may come from other sources as well.

- [ ] Bookmark the GitLab Community Forum and be prepared to check it for reports of issues with GitLab.com.
## Stage 3: Managing Incidents

- [ ] Done with Stage 3

The incident management process normally begins with a PagerDuty page from the EOC to the CMOC when they've detected a problem with GitLab.com severe enough to require our involvement.

However, if we or our users notice an issue that we suspect could turn into an incident, we're encouraged to contact the on-call EOC for their opinion.
- [ ] Read about how to determine who the on-call EOC is.
In more severe cases, if we've received enough reports from users of a particular issue with GitLab.com that we feel is indicative of an incident, we can page the on-call EOC through PagerDuty.

- [ ] Read about how to page the on-call EOC.

As a general rule, at least three separate reports of the same type of issue within a relatively short timeframe are a strong indicator that GitLab.com may be facing an incident and that a page to the EOC is justified. Be sure to include links to Zendesk tickets and any other information the EOC may need to understand the issue clearly and concisely.
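The "three reports in a short timeframe" rule of thumb can be sketched as a simple sliding-window check. The 30-minute window below is an assumption for illustration; the handbook does not prescribe an exact duration, and your judgment always takes precedence over any threshold.

```python
from datetime import datetime, timedelta

def should_page_eoc(report_times, window=timedelta(minutes=30), threshold=3):
    """Return True if `threshold` or more reports fall within `window`.

    The window length and threshold are illustrative defaults, not
    handbook-mandated values.
    """
    times = sorted(report_times)
    # Slide over every run of `threshold` consecutive reports and check
    # whether the first and last fall within the window.
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False

now = datetime(2024, 1, 1, 12, 0)
reports = [now, now + timedelta(minutes=5), now + timedelta(minutes=12)]
print(should_page_eoc(reports))  # True
```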
### The Status Page - Status.io

Once an incident has been declared, the EOC will page the CMOC via PagerDuty, and it'll be up to you to start managing Status.io. Along with this, one of your first tasks should be to join The Situation Room Zoom call so that you can follow along with the EOC and anyone else involved in working the incident.

Effective communication with production engineering during an incident is crucial, as the content of our status updates will largely come from them. Keeping yourself informed of the progress of an incident will allow you to communicate updates quickly and concisely to stakeholders and affected users.

Incidents and maintenance events are managed by the CMOC throughout their entire lifecycle through our status page, powered by Status.io. Updates made to incidents and maintenance events in Status.io are automatically tweeted out via @gitlabstatus using Status.io's broadcast feature.
- [ ] Learn how to Create, Update, and Resolve incidents in Status.io.
- [ ] Read about Frequency of Updates to learn how often you should aim to update Status.io depending on the severity of the incident.
- [ ] Learn how to Perform a Handover at the end of your on-call shift.
## Stage 4: Review

- [ ] Done with Stage 4

### Review Past Incidents

- [ ] Review the following past incidents to get an idea of the proper tone to use and which details to include when posting status updates.
- [S2] Errors returned on CI job artifact uploads: A recent deployment to GitLab.com began to cause CI job artifacts to return a 500 error on upload.
- [S3] Increased Error Rate on GitLab.com: Infrastructure was alerted to a CPU saturation issue on a specific Gitaly node, which was causing overall slowness and timeouts on GitLab.com. This incident involved the CMOC blocking a .com user and contacting them via Zendesk.
- [S1] GitLab.com Registry Issues: The GitLab.com registry began returning 503 errors for pushes and pulls.
- [S1] CI Runner Delays: A deployment to GitLab.com temporarily rendered Sidekiq inoperable, causing CI pipelines to not be picked up.
If you need to review more examples, browse the incident history section of Status.io.
### Status Page Testing

Our Status.io instance includes a testing environment that lets you create, update, and resolve incidents on an internal version of our status page.

Once your access request has been fulfilled and you've logged in to Status.io, you can access the testing environment by clicking the GitLab System Status dropdown box in the top right of the window and selecting Test Page. If you skip this step, you will be creating an incident on our LIVE status page.

Test your knowledge by creating an incident, guiding it through the entire incident lifecycle, and then providing a link to it.
- [ ] Create, update, change to monitoring, and then resolve one incident on our test status page.
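The lifecycle you walk through on the test page can be pictured as a small state machine over Status.io's standard incident states. The class below is purely illustrative (it is not part of any Status.io API), and real incidents may receive multiple updates within a single state.

```python
# Illustrative model of the incident lifecycle: Investigating ->
# Identified -> Monitoring -> Resolved. Not a Status.io API; just a
# sketch of the progression described above.

class TestIncident:
    TRANSITIONS = {
        "Investigating": {"Identified"},
        "Identified": {"Monitoring"},
        "Monitoring": {"Resolved"},
        "Resolved": set(),
    }

    def __init__(self, title):
        self.title = title
        self.state = "Investigating"
        self.updates = []

    def update(self, new_state, message):
        """Advance the incident and record the status update text."""
        if new_state not in self.TRANSITIONS[self.state]:
            raise ValueError(f"cannot move from {self.state} to {new_state}")
        self.state = new_state
        self.updates.append((new_state, message))

incident = TestIncident("Test: elevated error rates")
incident.update("Identified", "Root cause traced to a recent deploy.")
incident.update("Monitoring", "Fix deployed; watching error rates.")
incident.update("Resolved", "Error rates back to baseline.")
print(incident.state)  # Resolved
```

Writing one status-update message per transition, as above, mirrors the cadence you'll use on the real status page.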
## Completion

- [ ] Ping your manager on this issue to let them know that you've completed this module and are ready to be added to the CMOC schedule rotation in PagerDuty.
- [ ] Send an MR to declare yourself a GitLab.com CMOC on the team page.