
Adds an ops fallback for incident management

John Jarvis requested to merge jarv/ops-option into master

For production-engineering#25466 (closed)

This adds a new option for Incident and Change issues to fall back to ops.gitlab.net. It implements the following:

  1. Both /change declare and /incident declare have a new checkbox, "Use ops.gitlab.net instead of gitlab.com".
  2. When declaring an incident or a change issue, a quick API sanity test is run against .com. If it fails, a notice is sent to Slack and the "Use ops.gitlab.net instead of gitlab.com" option is checked by default.
  3. This refactors the checkboxes a bit to put them under "{Incident,Change} Options", which saves on vertical space in the modal dialog
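
The check-and-fallback behavior in item 2 could be sketched roughly as follows. This is an illustration only, not Woodhouse's actual code: the function names, the choice of the authenticated `/api/v4/version` endpoint as the sanity test, and the timeout are all assumptions.

```python
import urllib.error
import urllib.request
from typing import Callable

GITLAB_COM = "https://gitlab.com"


def api_is_healthy(base_url: str, token: str, timeout: float = 3.0) -> bool:
    """Hypothetical sanity test: one authenticated API call must return 200.

    A bogus token yields a 401, which surfaces as HTTPError and counts as a failure.
    """
    req = urllib.request.Request(
        f"{base_url}/api/v4/version",
        headers={"PRIVATE-TOKEN": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS/connection failures, and timeouts.
        return False


def default_use_ops(check: Callable[[], bool], notify: Callable[[str], None]) -> bool:
    """Return the default state of the "use ops" checkbox.

    `check` runs the sanity test against .com; `notify` posts a notice
    (for example to Slack) when the test fails.
    """
    if check():
        return False
    notify("gitlab.com API check failed; defaulting to ops.gitlab.net")
    return True
```

Passing the check and the notifier in as callables keeps the decision logic testable without network access; in a real handler, `check` would wrap something like `api_is_healthy(GITLAB_COM, token)`.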

Testing

We have little (really, no) unit test coverage for Woodhouse. That is something I would like to address at some point, but I decided not to take on that large refactor here. Instead, this was validated locally and deployed to #woodhouse-staging.

  • /woodhouse-staging change declare
  • /woodhouse-staging incident declare

To validate item 2 above, I tested locally with a bogus API token for .com. It looks something like this:

image

Why

This was done for disaster recovery (DR), so that in the case of a regional outage we can still declare change issues without depending on us-east1 (note that we still need to move Woodhouse to the us-central1 cluster, which will happen soon). More generally, this is useful in other scenarios, such as opening an incident issue when gitlab.com is completely down, or a change issue for work that causes downtime on .com (like database work).

New configuration

The only new configuration for this change is a token for ops.gitlab.net, set as the environment variable GITLAB_OPS_API_TOKEN. It is now configured in Vault through Terraform in infra-mgmt, and it will be rotated and updated automatically. I have confirmed that this variable is already present in both the ops and staging deployments.
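
As a rough sketch of how the token might be selected at runtime: only GITLAB_OPS_API_TOKEN comes from this change. The variable name GITLAB_API_TOKEN for the .com token and the function name `client_settings` are hypothetical, used here for illustration.

```python
import os
from typing import Mapping, Tuple

# Maps the "use ops" checkbox state to (base URL, token env variable).
# GITLAB_API_TOKEN is an assumed name for the existing .com token.
INSTANCES = {
    False: ("https://gitlab.com", "GITLAB_API_TOKEN"),
    True: ("https://ops.gitlab.net", "GITLAB_OPS_API_TOKEN"),
}


def client_settings(use_ops: bool, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (base_url, token) for the selected instance.

    Raises KeyError if the token variable is missing or empty, so a
    misconfigured deployment fails loudly instead of sending empty credentials.
    """
    base_url, var = INSTANCES[use_ops]
    token = env.get(var, "")
    if not token:
        raise KeyError(f"{var} is not set")
    return base_url, token
```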

Edited by John Jarvis
