Hubert Maraszek - GitLab.com CMOC
module-name: "GitLab-com CMOC"
area: "Customer Service"
maintainers:
- tristan
- faleksic
Goal of this module: Instruct the taker on the duties and responsibilities of being the Communications Manager On Call (CMOC) for an active GitLab.com incident. This includes providing a clear understanding of what an incident is, how to work with reliability engineering during one, and how to use the tools at our disposal to effectively communicate updates to incidents both internally and externally to end-users and stakeholders.
Tackle each stage in sequential order, but first:
- Notify your manager on this issue to let them know that you have started.
- Open an access request for Status.io, PagerDuty (if you don't already have an account) and the Social Media Admin 1Password vault.
Note: Ensure that you request access to the Test Page along with the production page on Status.io. The Test Page is the testing environment, and you'll need it to complete Stage 4 of this module. See this access request for an example. If you register your account and then have problems accepting the email invitation, add a comment to your access request issue requesting a new invitation. It is not possible to register the account twice with the same invitation link.
Introduction
CMOC is a communication-focused role responsible for updating GitLab.com customers about the current state of known GitLab.com incidents. The CMOC interfaces with Engineering during production issues, working with them to characterize the issue, and uses Status.io to update GitLab.com social media and our status page with the current incident status.
The CMOC is responsible for clearly understanding the impact of each incident so that they can communicate it properly on all required channels. This is generally achieved by collecting information from the EOC (Engineer On Call), the IM (Incident Manager), and monitoring tools to provide the most accurate update about what happened and how it is being resolved.
Stage 1: On-Call basics & expectations
- Done with Stage 1
- Review GitLab Support's On-Call Guide, paying special attention to the expectations for being on-call.
- Open and complete the On-Call Basics training module.
Stage 2: Additional Incident Preparation
- Done with Stage 2
In addition to the incident preparation from the On-Call basics, as a CMOC, there are some additional resources to monitor for possible incidents.
Slack Channels
- Note that for each new incident, a dedicated Slack channel is generated automatically for all communication around that incident, for example #incident-3188. The channel handle can be found in the #incident-management Slack channel.
Optionally, consider joining the following channels as well. They aren't necessary to monitor when working through most incidents but they will be useful eventually.
GitLab Community Forum
Reports of problems with GitLab.com will primarily come from end-users through Zendesk but they may come from other sources as well.
Bookmark the following and be prepared to check them for reports of issues with GitLab.com:
Stage 3: Managing Incidents
- Done with Stage 3
The incident management process normally begins with a PagerDuty page to the CMOC from the EOC (Engineer On Call) when they've detected a problem with GitLab.com severe enough to require us.
However, if we or our users notice an issue that we suspect could turn into an incident, we're encouraged to contact the on-call EOC for their opinion.
- Read about how to determine who the on-call EOC is.
In more severe cases, if we've received enough reports from users of a particular issue with GitLab.com that we feel it is indicative of an incident, we can page the on-call EOC through PagerDuty.
- Read about how to page the on-call EOC.
As a general rule, at least three separate reports of the same type of issue within a relatively short timeframe are a strong indicator that GitLab.com may be facing an incident, and a page to the EOC is justified. Be sure to give the EOC links to the Zendesk tickets and any other information they may need to understand the issue clearly and concisely.
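The three-report rule of thumb above can be sketched as a small check. This is only an illustration, not part of any GitLab tooling: the `Report` shape, the issue-type labels, and the 30-minute default window are all assumptions (the guidance only says "a relatively short timeframe").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Report:
    """A user report, e.g. a Zendesk ticket (hypothetical shape)."""
    ticket_url: str
    issue_type: str
    received_at: datetime


def should_page_eoc(reports, issue_type, window=timedelta(minutes=30), threshold=3):
    """Return True if `threshold` or more reports of `issue_type`
    arrived within any span of length `window` (assumed values)."""
    times = sorted(r.received_at for r in reports if r.issue_type == issue_type)
    # Check every run of `threshold` consecutive reports.
    for i in range(len(times) - threshold + 1):
        if times[i + threshold - 1] - times[i] <= window:
            return True
    return False
```

A result of True is only a prompt to gather the ticket links and page the EOC; the judgment call stays with the CMOC.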
- Watch the CMOC training recording where a member of the site reliability team provides guidance on the role and expectations of being on CMOC PagerDuty.
- Read about finding related tickets for an incident.
Once the severity of an ongoing incident has been set and has been posted to our status page, internal stakeholders must be notified by the CMOC.
- Read about notifying stakeholders.
The Status Page - Status.io
Once an incident has been declared, the CMOC will be paged via PagerDuty, and it'll be up to you to start managing Status.io. One of your first tasks should also be to join the incident Zoom call so that you can follow along with the EOC and anyone else involved in working the incident.
Effective communication with production engineering during an incident is crucial as the content of our status updates will largely come from them. Keeping yourself informed on the progress of an incident will allow you to communicate updates quickly and concisely with stakeholders and affected users.
Incidents and maintenance events are managed by the CMOC throughout their entire lifecycle through our status page, powered by Status.io. Updates made to incidents and maintenance events through Status.io are automatically tweeted out via @gitlabstatus through the broadcast feature of Status.io.
- Learn how to create, update, and resolve incidents in Status.io.
- Read about Frequency of Updates to learn how often you should aim to update Status.io depending on the severity of the incident.
- Learn how to Perform a Handover at the end of your on-call shift.
Stage 4: Review
- Done with Stage 4
Review Past Incidents
- Review the following past incidents to get an idea of the proper tone to use and what details to include when posting status updates.
  - [S1] Site outage due to CDN connectivity issues: Connectivity issues for GitLab.com caused by a Cloudflare outage.
  - [S2] 500 Error Returned During Project Security Configuration: Receiving a 500 error when navigating to Security Configuration on projects, caused by an MR which needed to be reverted.
  - [S3] Repository mirroring delays: Investigating unusually high delays in our repository mirroring feature.
CMOC Practice Events
Look in the Google Shared Drive for recordings of previous CMOC Practice Events that you can watch to get an understanding of the process.
If you need to review more examples, browse the incident history section of Status.io.
Status Page Testing
Our Status.io instance gives us access to a testing environment that allows you to create, update, and resolve incidents on our status page in an internal environment.
Once your access request has been fulfilled and you've logged in to Status.io, you can access our testing environment by clicking the GitLab System Status dropdown box in the top right of the window and selecting Test Page. If you continue without doing this, you will be creating an incident on our LIVE status page.
Test your knowledge by creating an incident, guiding it through the entire incident lifecycle, and then providing a link to it. It is safe to use the "Broadcast" feature in the test environment, as it will only broadcast the status to a protected test Twitter account and any GitLab team member emails configured here.
- Create, update, change to monitoring, and then resolve one incident on our test status page.
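As a rough self-check for the practice exercise above, the expected order of steps (create, update, monitoring, resolve) can be written down as a tiny validator. This is a sketch under assumptions: the state names used here are illustrative labels, so check them against what Status.io actually shows, and the function is hypothetical rather than part of any existing tool.

```python
# State names are assumed for illustration; check the labels shown in Status.io.
MONITORING = "Monitoring"
RESOLVED = "Resolved"


def completes_practice_lifecycle(states):
    """Return True if `states` covers create, update, monitoring, and resolve.

    `states` is the ordered list of states posted to the test page; the
    first entry is the state chosen when the incident was created.
    """
    if len(states) < 4 or states[-1] != RESOLVED:
        return False
    if RESOLVED in states[:-1]:
        return False  # resolved before the final post
    # Monitoring must come after the creation post and at least one update.
    return MONITORING in states[2:-1]
```

For example, a practice incident posted as Investigating, Identified, Monitoring, Resolved passes, while one that skips the monitoring step does not.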
Stage 5: Access & Shadowing
This stage ensures that all of the necessary accounts and access you need to perform CMOC duties are provisioned and that you get a preview of what being a CMOC is like before you officially join the rotation.
While tweets are usually made through our Status.io integration, there are times when a CMOC will need direct access to @gitlabstatus to make a tweet. The steps below will allow you to gain access to the account to tweet directly from it.
- Log in to the @gitlabstatus Twitter account using the credentials in the 1Password Social Media Admin vault.
  - If you do not have access to the Social Media Admin vault, submit an individual access request.
- Shadowing
  - Shadow one or more incidents before you join the CMOC rotation. This gives you opportunities to ask questions of the current CMOC and to get a feel for what happens in the incident room during an active incident.
  - Create an issue to have Support-ops add you to the CMOC Shadow rotation for your region/timezone in PagerDuty. Make sure to specify which layer applies to you.
- Please be sure to reach out in #support-team-chat if you run into issues with any of the steps.
Stage 6: Setting up for success
Jumping into an incident and having everything ready can be a challenge. Consider having the following bookmarks in a folder to find what you need faster and easier:
- Escalation Policies & Schedules - PagerDuty
- Status.io
- Servicing Internal Requests
- Sending Notices
- CMOC Workflow
- Issue Tracker CMOC Handover
- Oncall Schedules
- GitLab System Status
- GitLab.com Status (@gitlabstatus) / Twitter
- Who's IM now
Having all of these saved has many uses. For example, if you aren't sure about the next step you need to take as a CMOC, the CMOC Workflow page helps you find out.
It is hard to jump into an incident unprepared or late; having bookmarks helps you get up to speed.
- (Optional) Review the CMOC-Helper tool to support you in your on-call responsibilities. The tool aggregates many helpful resources and helps you find the required information faster from your terminal. Read the list of available features here.
Stage 7: Final Steps
- Have your trainer review your practice incidents. If you do not have a trainer, ask another CMOC or your manager to review.
- (Optional but highly recommended) Work with your trainer through a CMOC practice event. Your trainer should act as IMOC, and you are CMOC.
- Ping your manager on this issue to let them know that you've completed this module and are ready to be added to the CMOC schedule rotation in PagerDuty.
- You may expense the cost of your mobile phone service for any month that you are performing on-call duties.
- Manager: schedule a call (or integrate into 1:1) to review how the module went once you have reviewed this issue.
Documenting your CMOC status
NOTE: Perform these steps ONLY if you have joined the CMOC rotation in PagerDuty.
- Create an MR to declare yourself a GitLab.com CMOC on the team page by updating your YAML file. Detailed instructions can be found here.
- Submit an MR to update your entry in the Support Team YAML file.
  - Add CMOC to the list in modules:

    modules:
      - CMOC

  - Add 'Support Focus: CMOC' to the list of Zendesk groups so you gain access to the GitLab Incidents shared view in Zendesk:

    zendesk:
      main:
        id: 1234
        groups:
          - 'Support Focus: CMOC'

  - Add your CMOC region to the pagerduty rotations array. The format is CMOC <REGION>, for example CMOC EMEA:

    pagerduty:
      id: 1234
      rotations:
        - CMOC EMEA

- Open an access request to have a Support Staff - CMOC role in Zendesk.
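Before opening the MR, the three YAML changes above can be sanity-checked with a few lines of code. The sketch below is illustrative only: it works on an already-parsed entry (a plain dict), the function name is hypothetical, and the nesting of `groups` under `zendesk.main` is an assumption based on the example snippet above.

```python
import re

# Matches rotation names of the form "CMOC <REGION>", e.g. "CMOC EMEA".
CMOC_ROTATION = re.compile(r"^CMOC [A-Z]+$")


def missing_cmoc_fields(entry):
    """Return a list of problems found in a parsed team-file entry (a dict)."""
    problems = []
    if "CMOC" not in entry.get("modules", []):
        problems.append("modules: missing CMOC")
    # Assumed nesting: zendesk -> main -> groups.
    groups = entry.get("zendesk", {}).get("main", {}).get("groups", [])
    if "Support Focus: CMOC" not in groups:
        problems.append("zendesk groups: missing 'Support Focus: CMOC'")
    rotations = entry.get("pagerduty", {}).get("rotations", [])
    if not any(CMOC_ROTATION.match(name) for name in rotations):
        problems.append("pagerduty rotations: missing 'CMOC <REGION>'")
    return problems
```

An empty list means all three additions are present; any returned strings name the section that still needs updating.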