diff --git a/README.md b/README.md
index 68696af231ba72995225a5bd3fe158bb7b0f1dd4..c4f5de59bd9df532a1ab85681bc7196a19121e94 100644
--- a/README.md
+++ b/README.md
@@ -136,21 +136,22 @@ The aim of this project is to have a quick guide of what to do when an emergency
 
 * Confirm that it is actually an emergency, challenge this: are we losing data? Is GitLab.com not working?
 * [Tweet](howto/tweeting-guidelines.md) in a reassuring but informative way to let the people know what's going on
-* Join the `#alerts` channel
-* Organize
-  * Establish who is taking point on the emergency issue in the `#alerts` channel: "I'm taking point" and pin the message for the duration of the emergency.
-  * open a hangout if it will save time: https://plus.google.com/hangouts/_/gitlab.com?authuser=1
-  * share the link in the alerts channel
-* If the point person needs someone to do something, give a direct command: _@someone: please run `this` command_
+* Join the `#infrastructure` channel
+* Define a _point person_ or _incident owner_: this is the person who will gather all the data and coordinate the efforts.
+* Organize:
+  * Establish who is the point person on the incident in the `#infrastructure` channel: "@here I'm taking point" and pin the message for the duration of the emergency.
+  * Start a war room using Zoom if it will save time.
+  * Share the link in the `#infrastructure` channel.
+  * If the _point person_ needs someone to do something, give a direct command: _@someone: please run `this` command_
 * Be sure to be in sync - if you are going to reboot a service, say so: _I'm bouncing server X_
 * If you have conflicting information, **stop and think**, bounce ideas, escalate
-* Fix first, ask questions later.
-* Gather information when the outage is done - logs, samples of graphs, whatever could help figuring out what happened
-* Open an issue and put `monitoring` label on it, even if you close issue immediately. See [handbook](https://about.gitlab.com/handbook/infrastructure/)
+* Gather information when the incident is done - logs, samples of graphs, whatever could help figure out what happened
+* If we lack monitoring or alerting, open an issue and label it `monitoring`, even if you close the issue immediately. See [handbook](https://about.gitlab.com/handbook/infrastructure/)
 
 ## Guidelines
 
 * [Tweeting Guidelines](howto/tweeting-guidelines.md)
+* [Production Incident Communication Strategy](howto/manage-production-incidents.md)
 
 ## Other Servers and Services
 
diff --git a/howto/manage-production-incidents.md b/howto/manage-production-incidents.md
new file mode 100644
index 0000000000000000000000000000000000000000..0c3496fa99e50b5381830644db07aebdc0c30074
--- /dev/null
+++ b/howto/manage-production-incidents.md
@@ -0,0 +1,54 @@
+# Production Incidents
+
+## Roles
+
+During an incident there are at least two required roles; the others are optional depending on the severity of the outage:
+
+* Engineer: the person in charge of actually solving the technical problem.
+* Point person: the person who coordinates the resolution of the problem at the technical level.
+* Communications manager: the person who manages external communication (setting up the live stream, etc.).
+* Marketing representative: someone from marketing who will be involved to review the outage document.
+
+## Definition of a major outage
+
+A major outage is any outage that has an ETA of more than 1 hour and is disrupting the service.
+
+## Minor and major outages management
+
+During a minor outage all communications will be handled through Twitter using the @gitlabstatus account.
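+
+For reference, the status update can also be sent from a script instead of the Twitter web UI. The following is only a minimal sketch, assuming the @gitlabstatus OAuth1 credentials are exported as environment variables; the variable names and the message text are illustrative, not an established part of this process:
+
+```python
+# Minimal sketch: post a status update as @gitlabstatus through the
+# Twitter v1.1 statuses/update endpoint. The environment variable names
+# below are placeholders for this example.
+import os
+
+import requests
+from requests_oauthlib import OAuth1
+
+auth = OAuth1(
+    os.environ["GITLABSTATUS_API_KEY"],
+    os.environ["GITLABSTATUS_API_SECRET"],
+    os.environ["GITLABSTATUS_ACCESS_TOKEN"],
+    os.environ["GITLABSTATUS_ACCESS_SECRET"],
+)
+
+message = "We are investigating elevated error rates on GitLab.com, updates to follow."
+response = requests.post(
+    "https://api.twitter.com/1.1/statuses/update.json",
+    auth=auth,
+    data={"status": message},
+)
+response.raise_for_status()
+```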
+
+During a major outage the work will be distributed in the following way:
+
+* Production engineers will:
+  * Open a war room on Zoom immediately to have a high-bandwidth communication channel.
+  * Create a [Google Doc](https://docs.google.com) to gather the timeline of events.
+  * Publish this document using the _File_, _Publish to web..._ function.
+  * Make this document editable by GitLab by clicking on the `Share` icon and selecting _Advanced_, _Change_, then _On - GitLab_.
+  * Tweet `GitLab.com is having a major outage, we're working on resolving it in a Google Doc LINK` with a link to this document to make the community aware.
+  * Redact names to remove blame. Only use team-member-1, -2, -3, etc.
+  * Document partial findings or guesses as we learn.
+  * Write a post mortem issue when the incident is solved and label it with `outage` (see the API sketch at the end of this document).
+
+* The point person will:
+  * Handle updating the @gitlabstatus account, explaining what is going on in a simple yet reassuring way.
+  * Synchronize efforts across the production engineering team.
+  * Pull other people in when consultation is needed.
+  * Declare a major outage when we meet the definition above.
+  * Post `@channel, we have a major outage and need help creating a live streaming war room, refer to [runbooks-production-incident]` in the `#general` Slack channel.
+  * Post `@channel, we have a major outage and need help reviewing public documents` in the `#marketing` Slack channel.
+  * Post `@channel, we have a major outage and are working to solve it, you can find the public doc LINK` in the `#devrel` Slack channel.
+  * Move the war room to a paid account so the meeting is not time limited.
+
+* The communications manager will:
+  * Set up a Zoom war room that is not time limited and provide it to the point person so all the production engineers can move there.
+  * Set up YouTube live streaming in the war room following [this Zoom guide](https://support.zoom.us/hc/en-us/articles/115000350446-Streaming-a-Webinar-on-YouTube-Live) (for this you will need access to the GitLab YouTube account; ask someone from People Ops to grant it to you).
+
+* The marketing representative will:
+  * Review the Google Doc to provide proper context when needed.
+  * Include a note in the document about how this outage is impacting customers.
+  * Decide how to handle further communications once the outage has been handled.
+
+## Blameless Post Mortems
+
+Refer to the [infrastructure section](https://about.gitlab.com/handbook/infrastructure/) in the handbook for a description of how to write a good post mortem.
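+
+As a convenience, the post mortem issue mentioned above can also be opened through the GitLab API instead of the web UI. The snippet below is only a sketch under assumptions: the project path and the token variable are placeholders, and the description would normally link to the published Google Doc:
+
+```python
+# Minimal sketch: open the post mortem issue with the `outage` label via
+# the GitLab API. PROJECT_PATH and GITLAB_API_TOKEN are placeholders.
+import os
+import urllib.parse
+
+import requests
+
+PROJECT_PATH = "gitlab-com/infrastructure"  # placeholder: use the real tracker project
+project_id = urllib.parse.quote(PROJECT_PATH, safe="")
+
+response = requests.post(
+    f"https://gitlab.com/api/v4/projects/{project_id}/issues",
+    headers={"PRIVATE-TOKEN": os.environ["GITLAB_API_TOKEN"]},
+    data={
+        "title": "Post mortem: <date> GitLab.com outage",
+        "description": "Timeline and findings: <link to the published Google Doc>",
+        "labels": "outage",
+    },
+)
+response.raise_for_status()
+print(response.json()["web_url"])
+```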