Start using Woodhouse for incident declaration
# Production Change
### Change Summary
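We've developed Woodhouse, a replacement codebase for the `/incident` Slack command. This change routes `/incident` to Woodhouse instead of the classic incident management application (IMA).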
### Change Details
1. **Services Impacted** - https://ops.gitlab.net/gitlab-com/gl-infra/incident-management
1. **Change Technician** - @craigf
1. **Change Criticality** - ~C4
1. **Change Type** - ~"change::scheduled"
1. **Change Reviewer** - @AnthonySandoval
1. **Due Date** - 27 October 2020 14:30 UTC
1. **Time tracking** - 1 min
1. **Downtime Component** - n/a
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
*Estimated Time to Complete (mins)* - 5 min
- [x] Merge remaining Woodhouse MRs
  - [x] https://gitlab.com/gitlab-com/gl-infra/woodhouse/-/merge_requests/20
  - [x] https://gitlab.com/gitlab-com/gl-infra/woodhouse/-/merge_requests/24
- [x] Compare the IMA issue template to Woodhouse's and make sure they match.
- [x] Documentation cutover MRs are approved
  - [x] https://gitlab.com/gitlab-com/www-gitlab-com/-/merge_requests/65022
  - [x] https://ops.gitlab.net/gitlab-com/gl-infra/incident-management/-/merge_requests/19
- [x] Configure the already-deployed Woodhouse with real integrations for GitLab and PagerDuty
  - `/woodhouse incident declare` can now be used, and `/incident declare` will keep working.
- [x] Raise a test incident with `/woodhouse incident declare` to gain confidence in Woodhouse before shadowing the `/incident` slash command.
### Change Steps - steps to take to execute the change
*Estimated Time to Complete (mins)* - 1 min
- [x] Configure the `/incident` slash command in Woodhouse as per https://gitlab.com/gitlab-com/gl-infra/woodhouse#installing-slack-app
  - Invocations of this command will now be sent to Woodhouse, not IMA.
- [x] Configure a production project issue webhook as documented: https://gitlab.com/gitlab-com/gl-infra/woodhouse#gitlab-webhook-integration
  - IMA and Woodhouse will now each report incident issue events, which is noisy.
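
Once the slash command is configured, Slack delivers each invocation as a form-encoded POST that includes `command` and `text` fields. The sketch below illustrates how a receiver could route both `/woodhouse incident declare` and `/incident declare` to the same handler. It is not Woodhouse's actual code: `route_slash_command` and the returned action names are hypothetical.

```python
from urllib.parse import parse_qs


def route_slash_command(body: str) -> str:
    """Map a Slack slash-command payload to an action name (hypothetical)."""
    form = parse_qs(body)
    command = form.get("command", [""])[0]
    words = form.get("text", [""])[0].split()

    # "/woodhouse incident declare" carries the subcommand in its text,
    # while "/incident declare" carries part of it in the command name.
    if command == "/woodhouse":
        tokens = words
    else:
        tokens = [command.lstrip("/")] + words

    if tokens == ["incident", "declare"]:
        return "declare_incident"
    return "unknown"
```

Normalizing both spellings to one token list keeps the old and new commands on a single code path, which is what lets `/incident declare` keep working during the cutover.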
### Post-Change Steps - steps to take to verify the change
*Estimated Time to Complete (mins)* - 15 min
- [x] Disable IMA's production GitLab webhook by appending "-DELETEME-TO-ENABLE" to the secret token.
  - Now only Woodhouse handles GitLab webhooks.
- [x] Test Woodhouse's real integrations
  - [x] Slack the EOC, IMOC, and CMOC to check whether this is a good time for them to get paged.
  - [x] In `#production`: `/woodhouse incident declare`
  - [x] In the modal, tick all pager boxes
  - [x] We should see an incident issue and a Slack channel, and all on-calls should be paged.
  - [x] Close, then reopen the incident issue. Woodhouse should post in Slack about the reopen.
- [x] Merge documentation cutover MRs
- [x] Configure periodic archival of old incident Slack channels: https://gitlab.com/gitlab-com/gl-infra/woodhouse#slack-archive-incident-channels-subcommand
- [x] Write up deprecation schedule for classic IMA (in another issue, link here)
  - Remove now-unused PagerDuty webhooks
  - Remove GitLab webhooks
  - Turn down the IMA application
  - Write issues to replace remaining IMA functionality, like the `@sre-oncall` schedule populator cronjob.
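
Appending a suffix to the secret token disables a webhook because GitLab sends the configured token in the `X-Gitlab-Token` request header, and a receiver that validates it rejects any value that differs from its own copy. Here is a minimal sketch of such a check, assuming the receiver compares the tokens directly. The function name and sample secret are hypothetical and are not taken from IMA or Woodhouse.

```python
import hmac


def token_is_valid(header_token: str, expected_token: str) -> bool:
    """Return True only if the X-Gitlab-Token header matches our secret."""
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(header_token.encode(), expected_token.encode())


secret = "example-secret"  # hypothetical value held by the receiver
accepted = token_is_valid(secret, secret)
rejected = token_is_valid(secret + "-DELETEME-TO-ENABLE", secret)
```

Here `accepted` is `True` and `rejected` is `False`: the hook still fires, but the receiver refuses the delivery until the suffix is removed again.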
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
*Estimated Time to Complete (mins)* - 15s
- [ ] Navigate to Woodhouse's app page: https://api.slack.com/apps/A01CRM3E0PJ/slash-commands?
- [ ] Delete the `/incident` slash command from the list
(Optional) Disable Woodhouse's incident issue webhook and restore IMA's:
- [ ] Navigate to https://gitlab.com/gitlab-com/gl-infra/production/hooks
- [ ] Edit the classic IMA's webhook, the one that goes to https://incident-management-dot-gitlab-infra-automation.ue.r.appspot.com/handleGitLabIncidentIssue
  - [ ] Remove "-DELETEME-TO-ENABLE" from the webhook token.
- [ ] Edit Woodhouse's webhook, the one that goes to https://woodhouse.ops.gitlab.net/gitlab/incident-issue
  - [ ] Append "-DELETEME-TO-ENABLE" to the webhook token.
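
The token edits in this rollback are plain string operations, and the original secret is never lost because it remains a prefix of the disabled value. The helpers below are a hypothetical sketch of that toggle, for illustration only. The real edits are made by hand in the project's webhook settings.

```python
DISABLE_SUFFIX = "-DELETEME-TO-ENABLE"


def disable_token(token: str) -> str:
    """Append the marker suffix unless it is already present."""
    if token.endswith(DISABLE_SUFFIX):
        return token
    return token + DISABLE_SUFFIX


def enable_token(token: str) -> str:
    """Strip the marker suffix if present, restoring the original token."""
    if token.endswith(DISABLE_SUFFIX):
        return token[: -len(DISABLE_SUFFIX)]
    return token
```

Both helpers are idempotent, mirroring the manual procedure: applying the same step twice leaves the token in the same state.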
## Monitoring
### Key metrics to observe
- Metric: n/a
- Location: n/a
- What changes to this metric should prompt a rollback: User-reported errors when using the Slack `/incident` command.
## Summary of infrastructure changes
- [/] Does this change introduce new compute instances? **No**
- [/] Does this change re-size any existing compute instances? **No**
- [/] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? **No**
## Changes checklist
- [x] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled").
- [x] This issue has the change technician as the assignee.
- [x] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [x] Necessary approvals have been completed based on the [Change Management Workflow](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#change-request-workflows).
- [x] Change has been tested in staging and results noted in a comment on this issue.
- [x] A dry-run has been conducted and results noted in a comment on this issue.
- [x] SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue.)
- [x] There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Incident%3A%3AActive).