Review Slack incident functionality of Woodhouse

Background

Woodhouse is the new monolithic codebase for SRE and incident tooling. It's planned to eventually replace most of the reliability teams' internal tools, consolidating and improving them in one place.

This issue concerns the proposed rollout of Woodhouse as a replacement for our /incident Slack command, which is currently backed by https://ops.gitlab.net/gitlab-com/gl-infra/incident-management. I want to use it as a go / no-go to:

Please request changes in the comments of this issue, and/or spin out new issues.

Workflow

Declaring an incident

I ran through Woodhouse's incident functionality in #woodhouse-staging. You can do the same if you like.

Step 1: Run /woodhouse incident declare from a channel that the woodhouse-staging bot has been added to. (To add the bot to a channel, mention it in that channel.) For production, we'll use #production. For staging, we can use the #woodhouse-staging channel.

Screenshot_2020-10-12_at_09.07.23

The bot somewhat redundantly announced that it will declare an incident in this channel. In the long term, we plan to make #incident-management a feed populated only by bots, without messages from humans (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11271#note_411409045). Incidents would be triggered from #production, so it makes sense to leave a message redirecting humans to #incident-management. In staging, Woodhouse is configured to use the same channel we trigger incidents from, for simplicity.

This incident feed channel is configurable in Woodhouse.

Step 2: Fill in the incident modal

Screenshot_2020-10-12_at_09.08.00

The user who triggers /woodhouse incident declare sees this modal. They must enter a title and a severity. Optionally, the user can page the EOC, IMOC, or CMOC. The PagerDuty schedules for these are configurable in Woodhouse. Currently, no schedules are configured for staging, so we'll want to page people for real when we make the first production deploy, to ensure it works. I've tested this functionality in a test workspace.
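The rules the modal enforces can be summarised in a few lines. This is only a sketch of the validation described above, not Woodhouse's code: the severity values and the IncidentDeclaration and validate names are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical severity values; the issue only says a severity is required.
SEVERITIES = ("S1", "S2", "S3", "S4")
# Roles the issue says can optionally be paged from the modal.
PAGEABLE_ROLES = ("EOC", "IMOC", "CMOC")


@dataclass
class IncidentDeclaration:
    title: str
    severity: str
    page: frozenset = frozenset()  # optional roles to page


def validate(decl: IncidentDeclaration) -> list:
    """Return a list of problems; an empty list means the form is valid."""
    errors = []
    if not decl.title.strip():
        errors.append("title is required")
    if decl.severity not in SEVERITIES:
        errors.append("severity is required")
    unknown = decl.page - set(PAGEABLE_ROLES)
    if unknown:
        errors.append("unknown roles: " + ", ".join(sorted(unknown)))
    return errors
```

Paging stays optional: a declaration with an empty page set passes validation as long as the title and severity are present.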

Step 3: Woodhouse creates incident artifacts

A spinner appears (hopefully briefly):

Screenshot_2020-10-12_at_09.36.10

The bot message is updated to indicate success or failure, with a link to the incident issue. The issue it opened: woodhouse-integration-test#1 (closed)

Screenshot_2020-10-12_at_09.08.33

A Slack channel is created for this incident:

Screenshot_2020-10-12_at_09.08.46

The issue is updated with a note linking to the Slack channel:

Screenshot_2020-10-12_at_09.08.56
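As a rough sketch of the sequence in this step, assuming the artifacts are created in the order shown above (issue, then Slack channel, then the linking note). The three callables are hypothetical stand-ins for the GitLab and Slack integrations, and the message wording is invented:

```python
def declare(title, create_issue, create_channel, add_issue_note):
    """Create the incident artifacts in order and return the bot message text.

    create_issue(title) -> issue URL
    create_channel(title) -> Slack channel name
    add_issue_note(issue_url, text) -> None
    All three are hypothetical integration hooks supplied by the caller.
    """
    try:
        issue_url = create_issue(title)
        channel = create_channel(title)
        add_issue_note(issue_url, f"Slack channel: #{channel}")
    except Exception as exc:
        # Any integration failure is reported back in the bot message.
        return f"Failed to declare incident: {exc}"
    return f"Incident declared: {issue_url} (discussion in #{channel})"
```

Passing the integrations in as functions keeps the ordering testable without real GitLab or Slack credentials.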

Updating the on-call with actions performed outside of Slack

Woodhouse can receive GitLab issue event webhooks and send messages to the main incident channel to update the on-call. Its functionality should be the same as our current incident issue webhook receiver, with one notable exception: Woodhouse does not update the channel when incident titles / descriptions are changed. In practice this was very noisy, but if people want it we can add it.

When an active incident is reopened:

Screenshot_2020-10-12_at_09.10.08

When an issue is opened with the "incident" label, but without the labels indicating that Woodhouse/IMA opened it:

Screenshot_2020-10-12_at_09.12.33
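The decision logic for these two cases can be sketched as follows. It reads the object_kind, object_attributes.action, and labels fields of GitLab's issue webhook payload. The marker label names and the message wording are hypothetical, and this is not the actual Woodhouse handler:

```python
from typing import Optional

# Hypothetical label names: the issue only says that an "incident" label
# exists and that other labels mark issues opened by Woodhouse/IMA.
INCIDENT_LABEL = "incident"
BOT_OPENED_LABELS = {"opened-by-woodhouse", "opened-by-ima"}


def channel_update(event: dict) -> Optional[str]:
    """Return the message to post for an issue event, or None to stay quiet."""
    if event.get("object_kind") != "issue":
        return None
    labels = {label["title"] for label in event.get("labels", [])}
    if INCIDENT_LABEL not in labels:
        return None
    attrs = event.get("object_attributes", {})
    action, url = attrs.get("action"), attrs.get("url")
    if action == "reopen":
        return f"Incident reopened: {url}"
    if action == "open" and not labels & BOT_OPENED_LABELS:
        return f"Incident issue opened outside Slack: {url}"
    # "update" events (title or description edits) are deliberately ignored.
    return None
```

Returning None for update events mirrors the exception noted above: title and description changes do not produce channel messages.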

Developing and testing changes

Woodhouse is likely to remain an integration-heavy codebase, with many dependencies on services like gitlab.com, PagerDuty, and Slack, and without much logic of its own. Most changes are really only testable when hooked up to real dependencies. We have a staging deployment of Woodhouse, hooked up to real-but-nonprod integrations, that I've been using to capture these screenshots.

Step 1: Create your branch and merge request. For example: woodhouse!17 (closed). This won't do anything except run unit tests.

Step 2: Announce that you'd like to borrow woodhouse-staging, in the woodhouse-staging Slack channel. If the last person to do this didn't announce they were done, ask them if they're done!

Screenshot_2020-10-12_at_09.48.13

Step 3: Trigger a non-master build pipeline.

  1. Navigate to https://gitlab.com/gitlab-com/gl-infra/woodhouse/-/pipelines/new
  2. Configure a variable: WOODHOUSE_BUILD_BRANCH = 1
  3. Run the pipeline

Screenshot_2020-10-12_at_09.17.11

Screenshot_2020-10-12_at_09.17.26

This will cause a Docker image to be built and pushed, tagged with your branch head's short SHA. Example: https://ops.gitlab.net/gitlab-com/gl-infra/woodhouse/-/pipelines/298486
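The same pipeline could presumably also be started through GitLab's pipelines API (POST /projects/:id/pipeline) rather than the web form. The sketch below only builds the request and does not send it. The build_pipeline_request name and the use of a personal access token are assumptions, and I haven't run this against the Woodhouse project:

```python
import json
import urllib.parse
import urllib.request


def build_pipeline_request(host, project_path, ref, token):
    """Build (but do not send) a request that starts a pipeline on `ref`
    with WOODHOUSE_BUILD_BRANCH set to 1."""
    # The API accepts the URL-encoded project path in place of a numeric ID.
    project = urllib.parse.quote(project_path, safe="")
    body = {
        "ref": ref,
        "variables": [{"key": "WOODHOUSE_BUILD_BRANCH", "value": "1"}],
    }
    return urllib.request.Request(
        f"https://{host}/api/v4/projects/{project}/pipeline",
        data=json.dumps(body).encode(),
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would submit it. The same helper with host set to ops.gitlab.net would cover the deploy pipeline in Step 4.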

Step 4: Trigger a non-master deploy pipeline.

Unfortunately, we're currently constrained to run deployment pipelines on private repositories (on the ops server) which are mirrors of our work on gitlab.com, and so must perform this extra step. This is a general clunkiness issue with our workflows that is larger than the scope of Woodhouse or this issue, but perhaps we can revisit it in the future and improve it.

  1. Ensure your change has been mirrored to ops. This is often instantaneous because we use push mirroring, but mirroring runs at most once every 5 minutes, so you might have to wait up to 5 minutes. Alternatively, push to the ops mirror yourself.
  2. Navigate to https://ops.gitlab.net/gitlab-com/gl-infra/woodhouse/-/pipelines/new
  3. Configure a variable: WOODHOUSE_BUILD_BRANCH = 1
  4. Run the pipeline
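For the first item in the list above, the wait for the mirror can be scripted as a simple poll. This is a sketch under assumptions: get_mirror_head is a hypothetical callable that returns the branch head SHA on the ops mirror (for example via the branches API), and the 300-second default reflects the 5-minute mirroring interval mentioned above.

```python
import time


def wait_for_mirror(expected_sha, get_mirror_head, timeout=300.0, interval=15.0,
                    sleep=time.sleep, now=time.monotonic):
    """Poll until the mirror's branch head equals expected_sha.

    Returns True once the SHAs match, or False if `timeout` seconds pass first.
    `sleep` and `now` are injectable so the loop can be tested without waiting.
    """
    deadline = now() + timeout
    while True:
        if get_mirror_head() == expected_sha:
            return True
        if now() >= deadline:
            return False
        sleep(interval)
```

If it returns False, pushing to the ops mirror yourself remains the fallback.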

Screenshot_2020-10-12_at_09.18.08

Note the extra step, deploy-staging.

This also causes a Docker image to be built and pushed to ops, but this image is not used.

Deploy-staging triggers a multi-project pipeline, with the downstream pipeline being in our tanka monorepo. It's configured with variables such that it only deploys woodhouse (and doesn't run diffs for anything else), without manual gating.

Screenshot_2020-10-12_at_09.21.16

Screenshot_2020-10-12_at_09.21.43

Example: https://ops.gitlab.net/gitlab-com/gl-infra/woodhouse/-/pipelines/298486

Step 5: Try out your new feature

My example MR causes Woodhouse to tip his hat when replying to /woodhouse echo:

Screenshot_2020-10-12_at_09.22.18

Step 6: Announce that you're done with woodhouse-staging, in the Slack channel.

Step 7: If you're happy with your feature, merge your MR. This will cause an automatic deployment to production.

Disparity with current incident-management automation

The current IMA app receives webhooks from our production PagerDuty service, and under certain conditions creates an incident issue: https://ops.gitlab.net/gitlab-com/gl-infra/incident-management/blob/master/pagerduty/pagerduty-webhook-handler.js. As far as I can tell, this is never used, and I've never observed it actually creating an incident issue. Woodhouse doesn't handle PagerDuty webhooks, and I don't propose to add this feature. People who are not on call and want to raise an incident and page the EOC would do so via /woodhouse incident declare, rather than by creating a PagerDuty incident. Most of the company don't have PagerDuty accounts anyway.

In the future we may well rearrange our pager-woodhouse-slack-gitlab workflow (&303). However, in the interests of declaring parity with our current workflow so that we can start iterating on Woodhouse, I propose we do nothing related to PD webhooks right now. If you disagree, please comment!


Pinging everyone in reliability, since this affects everyone on call: @gitlab-com/gl-infra/sre-observability @gitlab-com/gl-infra/sre-datastores @gitlab-com/gl-infra/sre-coreinfra @brentnewton