Start using Woodhouse for incident declaration (#2900) · Issues · GitLab.com / GitLab Infrastructure Team / Production

Start using Woodhouse for incident declaration

# Production Change

### Change Summary

We've developed a replacement codebase for `/incident` command

### Change Details

1. **Services Impacted** - https://ops.gitlab.net/gitlab-com/gl-infra/incident-management
1. **Change Technician** - @craigf
1. **Change Criticality** - ~C4
1. **Change Type** - ~"change::scheduled"
1. **Change Reviewer** - @AnthonySandoval
1. **Due Date** - 27 October 2020 14:30 UTC
1. **Time tracking** - 1 min
1. **Downtime Component** - n/a

## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

*Estimated Time to Complete (mins)* - 5 min

- [x] Merge remaining Woodhouse MRs
  - [x] https://gitlab.com/gitlab-com/gl-infra/woodhouse/-/merge_requests/20
  - [x] https://gitlab.com/gitlab-com/gl-infra/woodhouse/-/merge_requests/24
- [x] Compare IMA issue template to Woodhouse's, make sure they match.
- [x] Documentation cutover MRs are approved
  - [x] https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/65022
  - [x] https://ops.gitlab.net/gitlab-com/gl-infra/incident-management/-/merge_requests/19
- [x] Configure the already-deployed Woodhouse with real integrations for GitLab and Pagerduty
  - `/woodhouse incident declare` can now be used, and `/incident declare` will keep working.
- [x] Raise a test incident with `/woodhouse incident declare`, to get confidence in woodhouse before shadowing the `/incident` slash command.

### Change Steps - steps to take to execute the change

*Estimated Time to Complete (mins)* - 1 min

- [x] Configure the `/incident` slash command in Woodhouse as per https://gitlab.com/gitlab-com/gl-infra/woodhouse#installing-slack-app
  - Invocations of this will now be sent to Woodhouse, not IMA
- [x] Configure a production project issue webhook as documented: https://gitlab.com/gitlab-com/gl-infra/woodhouse#gitlab-webhook-integration
  - IMA and Woodhouse will now each report incident issue events, which is noisy

### Post-Change Steps - steps to take to verify the change

*Estimated Time to Complete (mins)* - 15 min

- [x] Disable the IMA's production GitLab webhook by appending "-DELETEME-TO-ENABLE" to the secret token.
  - Now only Woodhouse handles GitLab webhooks
- [x] Test Woodhouse's real integrations
  - [x] Slack the EOC, IMOC, and CMOC, checking if this is a good time for them to get paged.
  - [x] In `#production`: `/woodhouse incident declare`
  - [x] In the modal, tick all pager boxes
  - [x] We should see an incident issue and a Slack channel, and all on-calls should be paged.
  - [x] Close, then reopen the incident issue. Woodhouse should post in Slack about the reopen.
- [x] Merge documentation cutover MRs
- [x] Configure periodic archival of old incident Slack channels: https://gitlab.com/gitlab-com/gl-infra/woodhouse#slack-archive-incident-channels-subcommand
- [x] Write up deprecation schedule for classic IMA (in another issue, link here)
  - Remove now-unused Pagerduty webhooks
  - Remove GitLab webhooks
  - Turn down the IMA application
  - Write issues to replace remaining IMA functionality - like the `@sre-oncall` schedule populator cronjob.

## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

*Estimated Time to Complete (mins)* - 15s

- [ ] Navigate to Woodhouse's app page: https://api.slack.com/apps/A01CRM3E0PJ/slash-commands?
- [ ] Delete the `/incident` slash command from the list

(Optional) Break Woodhouse's incident issue webhook, restore IMA's:

- [ ] Navigate to https://gitlab.com/gitlab-com/gl-infra/production/hooks
- [ ] Edit the classic IMA's webhook, the one that goes to https://incident-management-dot-gitlab-infra-automation.ue.r.appspot.com/handleGitLabIncidentIssue
  - [ ] Remove "-DELETEME-TO-ENABLE" from the webhook token.
- [ ] Edit Woodhouse's webhook, the one that goes to https://woodhouse.ops.gitlab.net/gitlab/incident-issue
  - [ ] Append "-DELETEME-TO-ENABLE" to the webhook token.

## Monitoring

### Key metrics to observe

- Metric: n/a
  - Location: n/a
  - What changes to this metric should prompt a rollback: User-reported errors with Slack `/incident` command usage.

## Summary of infrastructure changes

- [/] Does this change introduce new compute instances? **No**
- [/] Does this change re-size any existing compute instances? **No**
- [/] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? **No**

## Changes checklist

- [x] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled").
- [x] This issue has the change technician as the assignee.
- [x] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [x] Necessary approvals have been completed based on the [Change Management Workflow](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#change-request-workflows).
- [x] Change has been tested in staging and results noted in a comment on this issue.
- [x] A dry-run has been conducted and results noted in a comment on this issue.
- [x] SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue.)
- [x] There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Incident%3A%3AActive).
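
### Note: how the "-DELETEME-TO-ENABLE" toggle works

The post-change and rollback steps disable a webhook by appending a suffix to its secret token rather than deleting the hook. GitLab sends a webhook's secret token in the `X-Gitlab-Token` request header, so a receiver that compares that header against its configured secret will reject deliveries once the two differ. The sketch below illustrates the idea only; the function names and the example secret are hypothetical, and it is not taken from the Woodhouse or IMA code.

```python
import hmac

# Suffix used in this change request to break (and later restore) a webhook token.
DISABLE_SUFFIX = "-DELETEME-TO-ENABLE"


def token_matches(expected_secret: str, headers: dict) -> bool:
    """Return True only if the X-Gitlab-Token header equals the expected secret.

    Assumption: `headers` is a plain dict keyed by the exact header name;
    real web frameworks usually offer case-insensitive header lookup.
    """
    received = headers.get("X-Gitlab-Token", "")
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(received.encode(), expected_secret.encode())


def disable_token(token: str) -> str:
    """Append the suffix (at most once) so the receiver rejects deliveries."""
    if token.endswith(DISABLE_SUFFIX):
        return token
    return token + DISABLE_SUFFIX


def enable_token(token: str) -> str:
    """Strip the suffix to restore the original token."""
    if token.endswith(DISABLE_SUFFIX):
        return token[: -len(DISABLE_SUFFIX)]
    return token


if __name__ == "__main__":
    secret = "example-secret"  # hypothetical value, not a real token
    sent = disable_token(secret)  # what GitLab would send after the edit
    print(token_matches(secret, {"X-Gitlab-Token": sent}))
    print(token_matches(secret, {"X-Gitlab-Token": enable_token(sent)}))
```

Keeping the original token as a prefix means a rollback is a single edit in the project's webhook settings, with no need to look up the secret again.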
