Make the pager stop melting if the world is on fire.

Right now, if there's a major outage, the EOC gets paged... a lot.

This doesn't help with debugging, and it's very stressful.

This was mentioned in gitlab-com/gl-infra/production#19996 (comment 2561813561)

I see two solutions for this, both of which we probably should have done a while ago.

  1. Actually set up dependencies correctly so that if, say, Patroni is down, we don't page on anything that depends on it.
  2. Create a break-glass switch that says "the world is on fire and we know it" and silences paging for, say, an hour.
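
To make the two ideas concrete, here is a rough sketch of how they could combine into a single paging decision. Everything in it is hypothetical: the `DEPENDS_ON` map, the service names other than Patroni, the `BreakGlass` class, and `should_page` are made up for illustration and are not our actual alerting config. In a real setup this logic would probably live in the alerting system itself (for example inhibition rules and silences, if the stack is Prometheus Alertmanager) rather than in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Set

# Hypothetical dependency map: service -> upstream services it directly depends on.
DEPENDS_ON = {
    "web": {"patroni"},
    "api": {"patroni"},
    "sidekiq": {"patroni"},
}


@dataclass
class BreakGlass:
    """A time-boxed 'we know the world is on fire' override (assumed 1h default)."""
    started_at: datetime
    duration: timedelta = timedelta(hours=1)

    def active(self, now: datetime) -> bool:
        # The override expires on its own so nobody has to remember to lift it.
        return self.started_at <= now < self.started_at + self.duration


def should_page(
    service: str,
    firing: Set[str],
    now: datetime,
    break_glass: Optional[BreakGlass] = None,
) -> bool:
    """Decide whether an alert on `service` should page the EOC.

    `firing` is the set of services that currently have alerts firing.
    """
    # Option 2: while the break-glass is active, nothing pages.
    if break_glass is not None and break_glass.active(now):
        return False
    # Option 1: suppress the page if a direct upstream dependency is already firing.
    upstream = DEPENDS_ON.get(service, set())
    return not (upstream & firing)


if __name__ == "__main__":
    now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    firing = {"patroni", "web", "api"}
    for svc in sorted(firing):
        print(svc, should_page(svc, firing, now))
```

With Patroni, web, and api all firing, only the Patroni alert pages. The sketch only looks at direct dependencies; a real version would need to decide how to handle transitive ones, and whether the break-glass should still let a small set of alerts through.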

I'm curious what @gitlab-org/production-engineering/observability thinks is the best way to do this.