Exclude maintenance windows from SLA calculation
In GitLab.com's legal SLA definition, maintenance periods are excluded from the SLA.
However, our current SLA calculation does not actually do this. The GitLab application now exposes a Prometheus gauge metric that signals when the application has been placed in maintenance mode (for example, during a major PostgreSQL upgrade), which gives us a way to fix that.
The metric was added in gitlab-org/gitlab#387627.
Proposal
We should incorporate the maintenance_mode metric into the SLA calculation, and possibly also into error budget calculations and even SLO alerts (using maintenance_mode to suppress certain SLIs).
This should be done by excluding all activity recorded while maintenance mode is active.
Rather than treating maintenance time as 100% available, maintenance time should be excluded entirely. For example, if there are 10 minutes of maintenance in a 28-day period, that time should be removed from the monthly calculation: the SLA should be calculated over a total time of (28 days - 10 minutes), and values recorded during the maintenance window should not be included.
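To make the difference concrete, here is a minimal sketch of the arithmetic in Python. The function name `availability`, its parameters, and the 20 minutes of downtime are hypothetical and chosen only for illustration; the sketch also assumes the downtime happened outside the maintenance window.

```python
from datetime import timedelta


def availability(period: timedelta, maintenance: timedelta,
                 downtime: timedelta, exclude_maintenance: bool = True) -> float:
    """Return availability as a ratio between 0 and 1.

    Assumes `downtime` occurred outside the maintenance window.
    """
    # Excluding maintenance removes it from the denominator entirely.
    # Otherwise it stays in the period and counts as fully available time.
    measured = period - maintenance if exclude_maintenance else period
    if measured <= timedelta(0):
        raise ValueError("no measured time left after excluding maintenance")
    if downtime > measured:
        raise ValueError("downtime cannot exceed the measured time")
    return (measured - downtime) / measured


period = timedelta(days=28)
maintenance = timedelta(minutes=10)
downtime = timedelta(minutes=20)  # hypothetical figure, for illustration only

# Proposed behaviour: total time is (28 days - 10 minutes).
excluded = availability(period, maintenance, downtime)

# Alternative being rejected: maintenance counted as 100% available.
counted_as_up = availability(period, maintenance, downtime,
                             exclude_maintenance=False)

print(f"maintenance excluded:      {excluded:.6%}")
print(f"maintenance counted as up: {counted_as_up:.6%}")
```

Counting maintenance as available pads the denominator with time that was never really measured, so it reports a slightly higher figure than excluding it does.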
I think this should be easy to implement by making these values absent in Prometheus, but I haven't experimented with it yet.
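As a rough illustration of the "absent values" idea, the toy model below drops SLI samples taken while a maintenance flag is set and then averages only the samples that remain. It is plain Python, not PromQL, and every name and sample value in it is hypothetical. In Prometheus itself something similar might be achievable with a set operator such as `unless`, but that is untested here.

```python
from typing import List, Optional, Sequence


def drop_maintenance_samples(sli: Sequence[float],
                             maintenance_mode: Sequence[int]) -> List[Optional[float]]:
    """Replace each SLI sample taken during maintenance with None (absent)."""
    if len(sli) != len(maintenance_mode):
        raise ValueError("series must have the same number of samples")
    return [None if flag == 1 else value
            for value, flag in zip(sli, maintenance_mode)]


def mean_of_present(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Average only the samples that exist; return None if none are left."""
    present = [s for s in samples if s is not None]
    if not present:
        return None
    return sum(present) / len(present)


# Hypothetical per-interval success ratios and the matching maintenance gauge.
sli = [1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
maintenance_mode = [0, 0, 1, 1, 0, 0]

masked = drop_maintenance_samples(sli, maintenance_mode)

print(mean_of_present(sli))     # maintenance intervals counted as failures
print(mean_of_present(masked))  # maintenance intervals absent from the average
```

Because the dropped samples are missing rather than set to 1.0, they contribute to neither the numerator nor the denominator, which matches the "total time minus maintenance" behaviour described in the proposal.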