refactor(alerts): vault audit log failures
What
- Delete the alert VaultAuditLogRequestFailure.
- Create a new SLI for vault_audit_log_request with an error ratio.
- Create a new SLI for vault_audit_log_response with an error ratio.
Why
The VaultAuditLogRequestFailure alert was too sensitive because it paged the on-call for a small blip of failed requests:
- gitlab-com/gl-infra/production#17172 (closed)
- gitlab-com/gl-infra/production#17106 (closed)
- gitlab-com/gl-infra/production#17056 (closed)
None of these pages were actionable.
We could tune the existing alert to make it less sensitive, but we already have an established pattern for alerting on error spikes.
An argument could be made that we shouldn't look at the ratio; however, I think it's fine in this case since it's a low-request service.
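
To illustrate why an error ratio is less sensitive than the old alert, here is a minimal Python sketch. It is not the actual rule definition: the function names, the 5% threshold, and the assumption that the old alert fired on any failed request are all hypothetical and chosen only to show the difference in behaviour.

```python
def error_ratio(errors: int, total: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_page_on_count(errors: int) -> bool:
    # Hypothetical stand-in for the old alert: any failed request pages.
    return errors > 0


def should_page_on_ratio(errors: int, total: int, threshold: float = 0.05) -> bool:
    # Hypothetical 5% threshold; the real SLI threshold is not stated here.
    return error_ratio(errors, total) > threshold


# A small blip: 2 failures out of 400 audit log requests.
print(should_page_on_count(2))         # True: the count-based check pages
print(should_page_on_ratio(2, 400))    # False: ratio is 0.005, below threshold

# A sustained spike: 120 failures out of 400 requests.
print(should_page_on_ratio(120, 400))  # True: ratio is 0.3, above threshold
```

Under these assumptions, the small blip no longer pages, while a real spike in failures still does.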