Adds an ops fallback for incident management
For production-engineering#25466 (closed)
This creates a new option for Incident and Change issues to fall back to ops.gitlab.net. It implements the following:
- For both /change declare and /incident declare there is a new checkbox, "Use ops.gitlab.net instead of gitlab.com".
- When declaring an incident or a change issue, a quick API sanity test is done on .com. If that fails, a notice will be sent to Slack and the "Use ops.gitlab.net instead of gitlab.com" option will be checked.
- This refactors the checkboxes a bit to put them under "{Incident,Change} Options", which saves on vertical space in the modal dialog.
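The sanity check and fallback described above could be sketched roughly as follows. This is not the Woodhouse implementation: the function names (`apiHealthy`, `useOpsByDefault`), the `/api/v4/version` endpoint used for the check, and the `GITLAB_COM_API_TOKEN` variable are assumptions chosen for illustration. Only the behaviour (check .com, notify on failure, pre-check the ops option) comes from this description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// errUnhealthy marks a non-2xx answer from the API (hypothetical sentinel).
var errUnhealthy = errors.New("gitlab.com API check failed")

// apiHealthy sends one authenticated GET and returns an error unless the
// response status is 2xx. A bogus token yields a 401 and therefore an error.
func apiHealthy(ctx context.Context, client *http.Client, url, token string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("PRIVATE-TOKEN", token)
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("%w: status %d", errUnhealthy, resp.StatusCode)
	}
	return nil
}

// useOpsByDefault runs the check and reports whether the ops checkbox should
// start out checked. On failure it also emits a notice via notify, which
// stands in for the Slack message.
func useOpsByDefault(check func() error, notify func(msg string)) bool {
	if err := check(); err != nil {
		notify(fmt.Sprintf("%v; defaulting to ops.gitlab.net", err))
		return true
	}
	return false
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	check := func() error {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		// Endpoint and env variable name are assumptions for this sketch.
		return apiHealthy(ctx, client, "https://gitlab.com/api/v4/version",
			os.Getenv("GITLAB_COM_API_TOKEN"))
	}
	notify := func(msg string) { fmt.Println("slack notice:", msg) }
	fmt.Println("ops option pre-checked:", useOpsByDefault(check, notify))
}
```

Passing the check and the notifier in as functions keeps the decision logic testable without a network connection or a Slack client.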
Testing
We don't have much (really, any) unit test coverage for Woodhouse. That is something I would like to add at some point, but I decided not to take on that large refactor here. Instead, this was validated locally and deployed to #woodhouse-staging.
/woodhouse-staging change declare
/woodhouse-staging incident declare
To validate item 2 above (the API sanity test), I ran Woodhouse locally with a bogus API token for .com. It looks something like this:
Why
This was done for DR, so that in the case of a regional outage we can still declare change issues without depending on us-east1 (note that we still need to move Woodhouse to the us-central1 cluster, which will happen soon).
In general, this will be useful for other scenarios where we want to use an incident issue when gitlab.com is completely down, or a change issue that results in downtime on .com (like db work).
New configuration
There is no new configuration for this change other than a new token for ops, set as the env variable GITLAB_OPS_API_TOKEN.
This is now configured through Terraform in infra-mgmt, in Vault, and will be rotated and automatically updated.
I have confirmed that this env variable is present in both the ops and staging deployments already.
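As a rough illustration of how the checkbox could map to an instance and token, here is a minimal sketch. `GITLAB_OPS_API_TOKEN` is the variable named above. The `target` type, the `selectTarget` function, and the `GITLAB_COM_API_TOKEN` name are hypothetical and may not match what Woodhouse actually uses.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// target describes which GitLab instance an issue will be created on.
type target struct {
	BaseURL string
	Token   string
}

// selectTarget picks the instance and its token based on the checkbox value.
// getenv is injected so the lookup can be exercised without real env vars.
func selectTarget(useOps bool, getenv func(string) string) (target, error) {
	// GITLAB_COM_API_TOKEN is an assumed name for the .com token.
	baseURL, envVar := "https://gitlab.com", "GITLAB_COM_API_TOKEN"
	if useOps {
		baseURL, envVar = "https://ops.gitlab.net", "GITLAB_OPS_API_TOKEN"
	}
	token := getenv(envVar)
	if token == "" {
		return target{}, errors.New(envVar + " is not set")
	}
	return target{BaseURL: baseURL, Token: token}, nil
}

func main() {
	tgt, err := selectTarget(true, os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("issues will be created on", tgt.BaseURL)
}
```

Failing early when the token is missing makes a misconfigured deployment visible at declaration time instead of as an opaque API error.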