# Production Incidents
## Roles
During an incident there are at least two roles, and one more is optional.
* Production engineers will
* Open a war room on Zoom immediately to have a high-bandwidth communication channel.
* Create a [Google Doc](https://docs.google.com) to gather the timeline of events.
* Publish this document using the _File_, _Publish to web..._ function.
* Make this document GitLab editable by clicking on the `Share` icon and selecting _Advanced_, _Change_, then _On - GitLab_.
* Tweet `GitLab.com is having a major outage, we're working on resolving it in a Google Doc LINK` with a link to this document to make the community aware.
* Redact names to remove blame; refer to team members only as team-member-1, -2, -3, etc.
* Document partial findings or guesses as we learn.
* Write a post-mortem issue when the incident is solved, and label it with `outage`.
* The point person will
* Handle updating the @gitlabstatus account explaining what is going on in a simple yet reassuring way.
* Synchronize efforts across the production engineering team.
* Pull other people in when consultation is needed.
* Declare a major outage when we meet the definition of one.
* Post `@channel, we have a major outage and need help creating a live streaming war room, refer to [runbooks-production-incident]` into the #general slack channel.
* Post `@channel, we have a major outage and need help reviewing public documents` into the #marketing slack channel.
* Post `@channel, we have a major outage and are working to solve it, you can find the public doc <here>` into the #devrel slack channel.
* Move the war room to a paid account so the meeting is not time limited.
* Coordinate with the security team and the communications manager and use the [breach notification policy](https://about.gitlab.com/security/#data-breach-notification-policy) to determine if a breach of user data has occurred and notify any affected users.
* The communications manager will
* Set up a Zoom war room that is not time limited and provide it to the point person so that all the production engineers can move there.
* Set up YouTube live streaming in the war room following [this Zoom guide](https://support.zoom.us/hc/en-us/articles/115000350446-Streaming-a-Webinar-on-YouTube-Live) (for this you will need access to the GitLab YouTube account; ask someone from People Ops to grant it).
* The Marketing representative will
* Review the Google Doc to provide proper context when needed.
* Include a note in the document about how this outage is impacting customers.
* Decide how to handle further communications once the outage has been handled.
# So you got yourself on call
To start off on the right foot, let's define a set of tasks that are nice to do before you go
any further into your week.
By performing these tasks we will keep the [broken window
effect](https://en.wikipedia.org/wiki/Broken_windows_theory) under control, preventing future pain
and mess.
## Going on call
Here is a suggested checklist of things to do at the start of an on-call shift:
- *Change Slack Icon*: Click your name. Click `Set status`. Click the grey smiley face. Type `:pagerduty:`. Set `Clear after` to the end of your on-call shift. Click `Save`
- *Add On-Call Feed*: PM yourself in slack `/feed add https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues.atom?feed_token=(TOKEN)&label_name%5B%5D=oncall&scope=all&state=opened&utf8=%E2%9C%93`
- *Add Production Feed*: PM yourself in slack `/feed add https://gitlab.com/gitlab-com/gl-infra/production/issues.atom?feed_token=(TOKEN)&label_name%5B%5D=incident&state=opened`
- *Join alert channels*: If not already a member, `/join` `#alerts`, `#alerts-general`, `#alerts-prod-abuse`, `#alerts-ops`
- *Turn on slack channel notifications*: Open `#production` Notification Preferences (and optionally `#infra-lounge`). Set Desktop and Mobile to `All new messages`
- *Turn on slack alert notifications*: Open `#alerts` and `#alerts-general` Notification Preferences. Set Desktop only to `All new messages`
- At the start of each on-call day, read all S1 incidents at: https://gitlab.com/gitlab-com/gl-infra/production/issues?scope=all&utf8=✓&state=opened&label_name%5B%5D=incident&label_name%5B%5D=S1 (see the sketch below for pulling this list via the API)
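The same list of open S1 incidents can also be pulled from the GitLab API. A minimal sketch, assuming a personal access token with `read_api` scope exported as `GITLAB_TOKEN` (the variable name and script are illustrative, not an official tool):

```python
# Illustrative sketch: list open S1 incidents in gitlab-com/gl-infra/production.
# Assumes a personal access token with read_api scope in the GITLAB_TOKEN env var.
import os

import requests

PROJECT = "gitlab-com%2Fgl-infra%2Fproduction"  # URL-encoded project path
url = f"https://gitlab.com/api/v4/projects/{PROJECT}/issues"

resp = requests.get(
    url,
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"labels": "incident,S1", "state": "opened", "per_page": 100},
)
resp.raise_for_status()

for issue in resp.json():
    print(f"#{issue['iid']}: {issue['title']} ({issue['web_url']})")
```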
At the end of a shift:
- *Remove feeds*: PM yourself in slack `/feed list`, then `/feed remove (number)` for the production and on-call feeds
- *Turn off slack channel notifications*: Open `#production`, `#alerts`, `#alerts-general` Notification Preferences and set notifications back to your preferred values.
- *Leave noisy alert channels*: `/leave` alert channels (It's good to stay in `#alerts` and `#alerts-general`)
- Comment on any open S1 incidents at: https://gitlab.com/gitlab-com/gl-infra/production/issues?scope=all&utf8=✓&state=opened&label_name%5B%5D=incident&label_name%5B%5D=S1 (see the sketch below this checklist)
- At the end of each on-call day, post a quick update in slack so the next person is aware of anything ongoing, any false alerts, or anything that needs to be handed over.
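For the S1 incident comments mentioned above, a minimal sketch using the GitLab notes API; the issue iid and comment text are hypothetical placeholders, and it assumes a token with `api` scope in `GITLAB_TOKEN`:

```python
# Illustrative sketch: leave an end-of-shift comment on an incident issue.
# Assumes a personal access token with api scope in the GITLAB_TOKEN env var;
# the issue iid and the comment text are hypothetical placeholders.
import os

import requests

PROJECT = "gitlab-com%2Fgl-infra%2Fproduction"  # URL-encoded project path
ISSUE_IID = 1234  # hypothetical example

resp = requests.post(
    f"https://gitlab.com/api/v4/projects/{PROJECT}/issues/{ISSUE_IID}/notes",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    data={"body": "End-of-shift handover: nothing new during my shift, still monitoring."},
)
resp.raise_for_status()
print(f"Posted note {resp.json()['id']}")
```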
## Things to keep an eye on
### On-call issues
First check [the on-call issues][on-call-issues] to familiarize yourself with what has been
happening lately. Also, keep an eye on the [#production][slack-production] and
[#incident-management][slack-incident-management] channels for discussion around any on-going
issues.
### Useful dashboards to keep open
- [GitLab Triage](https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s)
### Alerts
Start by checking how many alerts are in flight right now (a script sketch for checking this from a terminal follows this list):
- go to the [fleet overview dashboard](https://dashboards.gitlab.net/dashboard/db/fleet-overview) and check the number of Active Alerts; it should be 0. If it is not 0:
- go to the alerts dashboards and check what is being triggered:
- [azure][prometheus-azure]
- [gprd prometheus][prometheus-gprd]
- [gprd prometheus-app][prometheus-app-gprd]
- watch the [#alerts][slack-alerts], [#alerts-general][slack-alerts-general], and [#alerts-gstg][slack-alerts-gstg] channels for alert notifications; each alert here should point you to the right [runbook][runbook-repo] to fix it.
- if they don't, you have more work to do.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
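If you want to check the same thing from a terminal, here is a minimal sketch against the Prometheus HTTP API, assuming you can reach the gprd Prometheus directly and that no extra authentication is required (which may not hold for your environment):

```python
# Illustrative sketch: list currently firing alerts from a Prometheus instance.
# Assumes direct, unauthenticated access to the API, which may not match reality.
import requests

PROMETHEUS = "https://prometheus.gprd.gitlab.net"  # see the links at the bottom

resp = requests.get(f"{PROMETHEUS}/api/v1/alerts", timeout=10)
resp.raise_for_status()

firing = [a for a in resp.json()["data"]["alerts"] if a["state"] == "firing"]
print(f"{len(firing)} firing alert(s)")
for alert in firing:
    labels = alert["labels"]
    print(f"- {labels.get('alertname')} severity={labels.get('severity', 'n/a')}")
```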
### Prometheus targets down
Check how many targets are not being scraped at the moment (a script sketch follows this list). To do this:
- go to the [fleet overview dashboard](https://dashboards.gitlab.net/dashboard/db/fleet-overview) and check the number of Targets down. It should be 0. If it is not 0:
- go to the targets-down lists and check which targets are down:
- [azure][prometheus-azure-targets-down]
- [gprd prometheus][prometheus-gprd-targets-down]
- [gprd prometheus-app][prometheus-app-gprd-targets-down]
- try to figure out why there are scraping problems and try to fix them. Note that sometimes there can be temporary scraping problems because of exporter errors.
- be sure to create an issue, particularly to declare toil so we can work on it and suppress it.
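Similarly, a minimal sketch for listing the targets that are currently down via the Prometheus query API, under the same assumption of direct, unauthenticated access:

```python
# Illustrative sketch: list scrape targets that are currently down (up == 0).
# Assumes direct, unauthenticated access to the query API, which may not match reality.
import requests

PROMETHEUS = "https://prometheus.gprd.gitlab.net"  # see the links at the bottom

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query",
    params={"query": "up == 0"},
    timeout=10,
)
resp.raise_for_status()

down = resp.json()["data"]["result"]
print(f"{len(down)} target(s) down")
for series in down:
    metric = series["metric"]
    print(f"- job={metric.get('job')} instance={metric.get('instance')}")
```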
## Rotation Schedule
We use [PagerDuty](https://gitlab.pagerduty.com) to manage our on-call rotation schedule and
alerting for emergency issues. We currently have a split schedule between EMEA and AMER for on-call
rotations in each geographical region; we will also incorporate a rotation for team members in the
APAC region as we continue to grow over time.
The [EMEA][pagerduty-emea] and [AMER][pagerduty-amer] schedule [each have][pagerduty-emea-shadow] a
[shadow schedule][pagerduty-amer-shadow] which we use for on-boarding new engineers to the on-call
rotations.
When a new engineer joins the team and is ready to start shadowing for an on-call rotation,
[overrides][pagerduty-overrides] should be enabled for the relevant on-call hours during that
rotation. Once they have completed shadowing and are comfortable/ready to be inserted into the
primary rotations, update the membership list for the appropriate schedule to [add the new team
member][pagerduty-add-user].
This [pagerduty forum post][pagerduty-shadow-schedule] was referenced when setting up the [blank
shadow schedule][pagerduty-blank-schedule] and initial [overrides][pagerduty-overrides] for
on-boarding new team members.
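As a convenience, the current on-call engineers can also be looked up via the PagerDuty REST API. A minimal sketch, assuming a read-only API token exported as `PAGERDUTY_TOKEN` and using the EMEA and AMER schedule IDs from the links above:

```python
# Illustrative sketch: show who is currently on call for the EMEA and AMER schedules.
# Assumes a read-only PagerDuty REST API token in the PAGERDUTY_TOKEN env var.
import os

import requests

SCHEDULES = {"EMEA": "PWDTHYI", "AMER": "PKN8L5Q"}  # IDs from the schedule links above

resp = requests.get(
    "https://api.pagerduty.com/oncalls",
    headers={
        "Authorization": f"Token token={os.environ['PAGERDUTY_TOKEN']}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
    params={"schedule_ids[]": list(SCHEDULES.values())},
)
resp.raise_for_status()

for oncall in resp.json()["oncalls"]:
    schedule = oncall["schedule"]["summary"] if oncall["schedule"] else "escalation policy"
    print(f"{schedule}: {oncall['user']['summary']} until {oncall['end']}")
```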
[on-call-issues]: https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=all&label_name[]=oncall
[pagerduty-add-user]: https://support.pagerduty.com/docs/editing-schedules#section-adding-users
[pagerduty-amer]: https://gitlab.pagerduty.com/schedules#PKN8L5Q
[pagerduty-amer-shadow]: https://gitlab.pagerduty.com/schedules#P0HRY7O
[pagerduty-blank-schedule]: https://community.pagerduty.com/t/creating-a-blank-schedule/212
[pagerduty-emea]: https://gitlab.pagerduty.com/schedules#PWDTHYI
[pagerduty-emea-shadow]: https://gitlab.pagerduty.com/schedules#PSWRHSH
[pagerduty-overrides]: https://support.pagerduty.com/docs/editing-schedules#section-create-and-delete-overrides
[pagerduty-shadow-schedule]: https://community.pagerduty.com/t/creating-a-shadow-schedule-to-onboard-new-employees/214
[prometheus-azure]: https://prometheus.gitlab.com/alerts
[prometheus-azure-targets-down]: https://prometheus.gitlab.com/consoles/up.html
[prometheus-gprd]: https://prometheus.gprd.gitlab.net/alerts
[prometheus-gprd-targets-down]: https://prometheus.gprd.gitlab.net/consoles/up.html
[prometheus-app-gprd]: https://prometheus-app.gprd.gitlab.net/alerts
[prometheus-app-gprd-targets-down]: https://prometheus-app.gprd.gitlab.net/consoles/up.html
[runbook-repo]: https://gitlab.com/gitlab-com/runbooks
[slack-alerts]: https://gitlab.slack.com/channels/alerts
[slack-alerts-general]: https://gitlab.slack.com/channels/alerts-general
[slack-alerts-gstg]: https://gitlab.slack.com/channels/alerts-gstg
[slack-incident-management]: https://gitlab.slack.com/channels/incident-management
[slack-production]: https://gitlab.slack.com/channels/production
# CMOC Checklist
## Shift Start
## Declaring an Incident
Checklist for starting an incident:
- [ ] As the CMOC, post in #incident-management that you are the CMOC; cross-post to #support_gitlab-com if needed.
- [ ] Create the production issue if possible. In Slack: `/start-incident`, or if you have an alert in #alerts-general, click the Open Issue button in the thread.
- [ ] Create an incident in http://status.io - make sure you check the options to broadcast to Slack, Twitter, etc.
* If you don't have full specifics, get the incident created in status.io and tweet out a first, more generic message such as "We are seeing elevated error rates on GitLab.com". It is better to post "investigating" sooner than to wait 5 minutes to know more.
- [ ] Create a google doc from the [shared template](https://docs.google.com/document/d/1NMZllwnK70-WLUn_9IiiyMWeXs-JKPEiq-lordxJAig/edit#)
- [ ] Update Slack and the status.io incident with links to the issue and the Google Doc.
- [ ] Check in with the incident team: are they all in the same channel, the Google Doc, and the Zoom call as needed? Coordinate and consolidate communication.
- [ ] Set a timer for 15 minutes to remind yourself to update status.io and tweet
- [ ] Start gathering an overall summary and write up an executive summary in the production issue or Google Doc for others in the company.
- [ ] Check in with incident team:
* Do they need more people or expertise? Broadcast and ask for help as soon as you know it is needed.
* Clear the deck - make sure other changes / teams know an incident is going on
* Clear the deck - cancel other meetings as needed.
## Taking Ownership of an Incident
The CMOC
# IMOC
#### Critical Dashboards
1. What alerts are going off? [Prometheus gprd](https://prometheus.gprd.gitlab.net/alerts#)
1. How do these dashboards look?
- [Triage dashboard](https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s)
- [General Triage dashboard](https://dashboards.gitlab.net/d/1UpWp7dik/general-triage?orgId=1)
1. What services are showing availability issues?
1. What components are outside of normal operations?
* [Triage-components](https://dashboards.gitlab.net/d/VE4pXc1iz/general-triage-components?orgId=1)
* [Triage-services](https://dashboards.gitlab.net/d/WOtyonOiz/general-triage-service?orgId=1)