Commit cb51fe4e authored by Anthony Sandoval

WIP: restructuring README to focus on On-Call

The README provided an outline with general
guidance for incident response, but there
were conflicting pages with duplicate information
about the roles and responsibilities
shared during incident management.

I consolidated into the README a couple of pages
that were under the `howto/` section of the
project, and added an `on-call/` section
that will include the checklists to follow
at the start of both shifts and incidents.

fixes gitlab-com/gl-infra/infrastructure#6765
parent e001dd06
# Production Incidents
## Roles
During an incident there are at least two roles, plus one or more optional roles:
* Production engineers will
  * Open a war room on Zoom immediately to have a high bandwidth communication channel.
  * Create a [Google Doc]( to gather the timeline of events.
  * Publish this document using the _File_, _Publish to web..._ function.
  * Make this document GitLab editable by clicking on the `Share` icon and selecting _Advanced_, _Change_, then _On - GitLab_.
  * Tweet ` is having a major outage, we're working on resolving it in a Google Doc LINK` with a link to this document to make the community aware.
  * Redact names to remove blame; only use team-member-1, -2, -3, etc.
  * Document partial findings and working hypotheses as we learn.
  * Write a post-mortem issue when the incident is solved, and label it with `outage`.
* The point person will
  * Handle updating the @gitlabstatus account, explaining what is going on in a simple yet reassuring way.
  * Synchronize efforts across the production engineering team.
  * Pull other people in when consultation is needed.
  * Declare a major outage when we meet the definition of one.
  * Post `@channel, we have a major outage and need help creating a live streaming war room, refer to [runbooks-production-incident]` into the #general slack channel.
  * Post `@channel, we have a major outage and need help reviewing public documents` into the #marketing slack channel.
  * Post `@channel, we have a major outage and are working to solve it, you can find the public doc <here>` into the #devrel slack channel.
  * Move the war room to a paid account so the meeting is not time limited.
  * Coordinate with the security team and the communications manager and use the [breach notification policy]( to determine whether a breach of user data has occurred and notify any affected users.
* The communications manager will
  * Set up a Zoom war room that is not time limited and provide it to the point person so all the production engineers can move there.
  * Set up YouTube Live Streaming in the war room following [this Zoom guide]( (for this you will need access to the GitLab YouTube account; ask someone from People Ops to grant it to you).
* The Marketing representative will
  * Review the Google Doc to provide proper context when needed.
  * Include a note in the document about how this outage is impacting customers.
  * Decide how to handle further communications once the outage is resolved.
# So you got yourself on call
To start off on the right foot, let's define a set of tasks that are nice to do before you go
any further in your week.
By performing these tasks we will keep the [broken window
effect]( under control, preventing future pain
and mess.
## Going on call
Here is a suggested checklist of things to do at the start of an on-call shift:
- *Change Slack Icon*: Click your name. Click `Set status`. Click the grey smiley face. Type `:pagerduty:`. Set `Clear after` to the end of your on-call shift. Click `Save`
- *Add On-Call Feed*: PM yourself in slack `/feed add`
- *Add Production Feed*: PM yourself in slack `/feed add`
- *Join alert channels*: If not already a member, `/join` `#alerts`, `#alerts-general`, `#alerts-prod-abuse`, `#alerts-ops`
- *Turn on slack channel notifications*: Open `#production` Notification Preferences (and optionally `#infra-lounge`). Set Desktop and Mobile to `All new messages`
- *Turn on slack alert notifications*: Open `#alerts` and `#alerts-general` Notification Preferences. Set Desktop only to `All new messages`
- At the start of each on-call day, read all S1 incidents at:✓&state=opened&label_name%5B%5D=incident&label_name%5B%5D=S1 (a small API sketch for pulling this list from a terminal follows this checklist)
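If you prefer the terminal, the same S1 incident list can be pulled from the GitLab issues API. The snippet below is a minimal sketch rather than part of the official checklist: the project path `gitlab-com/gl-infra/production` and the `GITLAB_TOKEN` environment variable are assumptions, so substitute whichever tracker and credentials your rotation actually uses.

```python
#!/usr/bin/env python3
"""Hedged sketch: list open S1 incident issues via the GitLab API."""
import json
import os
import urllib.parse
import urllib.request

# Assumption: incidents are filed in this project; adjust to the real tracker.
project = urllib.parse.quote("gitlab-com/gl-infra/production", safe="")
url = (
    f"https://gitlab.com/api/v4/projects/{project}/issues"
    "?state=opened&labels=incident,S1"
)

# Assumption: a personal access token is available in GITLAB_TOKEN.
req = urllib.request.Request(url, headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]})
with urllib.request.urlopen(req) as resp:
    for issue in json.load(resp):
        print(f"#{issue['iid']}  {issue['title']}  {issue['web_url']}")
```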
At the end of a shift:
- *Remove feeds*: PM yourself in slack `/feed list`, then `/feed remove (number)` for the production and on-call feeds
- *Turn off slack channel notifications*: Open `#production`, `#alerts`, `#alerts-general` Notification Preferences and return notifications to your preferred settings.
- *Leave noisy alert channels*: `/leave` alert channels (It's good to stay in `#alerts` and `#alerts-general`)
- Comment on any open S1 incidents at:✓&state=opened&label_name%5B%5D=incident&label_name%5B%5D=S1
- At the end of each on-call day, post a quick update in slack so the next person is aware of anything ongoing, any false alerts, or anything that needs to be handed over.
## Things to keep an eye on
### On-call issues
First check [the on-call issues][on-call-issues] to familiarize yourself with what has been
happening lately. Also, keep an eye on the [#production][slack-production] and
[#incident-management][slack-incident-management] channels for discussion around any ongoing incidents.
### Useful Dashboard to keep open
- [GitLab Triage](
### Alerts
Start by checking how many alerts are in flight right now (a scripted version of this check is sketched after this list):
- go to the [fleet overview dashboard]( and check the number of Active Alerts; it should be 0. If it is not 0:
- go to the alerts dashboard and check what is being triggered
- [azure][prometheus-azure]
- [gprd prometheus][prometheus-gprd]
- [gprd prometheus-app][prometheus-app-gprd]
- watch the [#alerts][slack-alerts], [#alerts-general][slack-alerts-general], and [#alerts-gstg][slack-alerts-gstg] channels for alert notifications; each alert here should point you to the right [runbook][runbook-repo] to fix it.
- if they don't, you have more work to do.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
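The dashboard check above can also be scripted against the Prometheus HTTP API, which is handy when you want the firing alerts in a terminal or a quick script. A minimal sketch follows; the Prometheus base URL is a placeholder for whichever instance (azure, gprd, gprd app) you are checking.

```python
import json
import urllib.parse
import urllib.request

# Placeholder: point at the Prometheus instance you are checking.
PROM_URL = "https://prometheus.example.gitlab.net"

# The ALERTS series has one sample per active alert; keep only firing ones.
query = urllib.parse.quote('ALERTS{alertstate="firing"}')
with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?query={query}") as resp:
    firing = json.load(resp)["data"]["result"]

print(f"{len(firing)} firing alerts")
for series in firing:
    labels = series["metric"]
    print(f"- {labels.get('alertname')} (severity={labels.get('severity', 'n/a')})")
```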
### Prometheus targets down
Check how many targets are not being scraped at the moment. To do this (a scripted version is sketched after this list):
- go to the [fleet overview dashboard]( and check the number of Targets down. It should be 0. If it is not 0:
- go to the [targets down list] and check which targets are down.
- [azure][prometheus-azure-targets-down]
- [gprd prometheus][prometheus-gprd-targets-down]
- [gprd prometheus-app][prometheus-app-gprd-targets-down]
- try to figure out why there are scraping problems and try to fix them. Note that sometimes there can be temporary scraping problems because of exporter errors.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
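As with alerts, the targets check can be done from the command line via the Prometheus targets API; the sketch below reports unhealthy targets together with their last scrape error. The Prometheus base URL is again a placeholder.

```python
import json
import urllib.request

PROM_URL = "https://prometheus.example.gitlab.net"  # placeholder instance

# /api/v1/targets lists every active scrape target with its health and last error.
with urllib.request.urlopen(f"{PROM_URL}/api/v1/targets") as resp:
    targets = json.load(resp)["data"]["activeTargets"]

down = [t for t in targets if t["health"] != "up"]
print(f"{len(down)} targets down")
for t in down:
    print(f"- job={t['labels'].get('job')} {t['scrapeUrl']}: {t['lastError']}")
```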
## Rotation Schedule
We use [PagerDuty]( to manage our on-call rotation schedule and
alerting for emergency issues. We currently have a split schedule between EMEA and AMER for on-call
rotations in each geographical region; we will also incorporate a rotation for team members in the
APAC region as we continue to grow over time.
The [EMEA][pagerduty-emea] and [AMER][pagerduty-amer] schedules [each have][pagerduty-emea-shadow] a
[shadow schedule][pagerduty-amer-shadow] which we use for on-boarding new engineers to the on-call rotation.
When a new engineer joins the team and is ready to start shadowing for an on-call rotation,
[overrides][pagerduty-overrides] should be enabled for the relevant on-call hours during that
rotation. Once they have completed shadowing and are comfortable/ready to be inserted into the
primary rotations, update the membership list for the appropriate schedule to [add the new team member].
This [pagerduty forum post][pagerduty-shadow-schedule] was referenced when setting up the [blank
shadow schedule][pagerduty-blank-schedule] and initial [overrides][pagerduty-overrides] for
on-boarding new team members.
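As a convenience, you can also check who is currently on call for a given schedule straight from the PagerDuty REST API instead of the web UI. This is a hedged sketch: the schedule ID and the `PAGERDUTY_TOKEN` environment variable are placeholders for whatever your team actually uses.

```python
import json
import os
import urllib.parse
import urllib.request

SCHEDULE_ID = "PXXXXXX"  # placeholder: EMEA, AMER, or shadow schedule ID

# Assumption: a read-only REST API key is available in PAGERDUTY_TOKEN.
params = urllib.parse.urlencode({"schedule_ids[]": SCHEDULE_ID})
req = urllib.request.Request(
    f"https://api.pagerduty.com/oncalls?{params}",
    headers={
        "Authorization": f"Token token={os.environ['PAGERDUTY_TOKEN']}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
)
with urllib.request.urlopen(req) as resp:
    for oncall in json.load(resp)["oncalls"]:
        print(f"{oncall['user']['summary']}: {oncall['start']} -> {oncall['end']}")
```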
# CMOC Checklist
## Shift Start
## Declaring an Incident
Checklist for starting an incident:
- [ ] For the CMOC: post in #incident-management that you are the CMOC; cross-post to #support_gitlab-com if needed.
- [ ] Create the production issue if possible. In Slack: `/start-incident`, or if you have an alert in #alerts-general, click the Open Issue button in the thread.
- [ ] Create an incident in - make sure you check the options to broadcast to Slack, Twitter, etc.
* If you don't have full specifics, get the incident created in and tweet out first with a more generic "We are seeing elevated error rates on". It is better to post sooner that we are investigating than to wait 5 minutes to know more.
- [ ] Create a google doc from the [shared template](
- [ ] Update Slack and the incident with links to the issue number and google docs.
- [ ] Check with the incident team: are they all in the same channel, Google Doc, and Zoom as needed? Coordinate and consolidate communication.
- [ ] Set a timer for 15 minutes to remind yourself to update and tweet
- [ ] Start to gather an overall summary and write up an executive summary in the production issue or Google Doc for others in the company.
- [ ] Check in with incident team:
* Do they need more people or expertise? Broadcast and ask for help as soon as you know it is needed.
* Clear the deck - make sure other changes / teams know an incident is going on
* Clear the deck - cancel other meetings as needed.
## Taking Ownership of an Incident
### Critical Dashboards
1. What alerts are going off? [Prometheus gprd](
1. How do these dashboards look?
- [Triage dashboard](
- [General Triage dashboard](
1. What services are showing availability issues?
1. What components are outside of normal operations?
* [Triage-components](
* [Triage-services](