
Draft: chore: Add alert for Gitaly and create production incident for 3-day Apdex violation

What

Add a new alert for Gitaly single-node 3-day burn rate violations, which creates an incident issue instead of paging the EOC.
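As a rough illustration of the routing described above, here is a minimal Python sketch. It is not the alerting configuration from this MR: the function names, the window labels (`"3d"`, `"1h"`), the action strings, and every numeric value are assumptions chosen for the example. It assumes the common definition of burn rate as the observed unsatisfactory ratio divided by the ratio the SLO allows.

```python
def burn_rate(apdex: float, slo: float) -> float:
    """Burn rate = observed unsatisfactory ratio / ratio allowed by the SLO.

    Both arguments are fractions in [0, 1); a result of 1.0 means the
    error budget is being consumed exactly at the sustainable pace.
    """
    return (1.0 - apdex) / (1.0 - slo)


def route(window: str, rate: float, threshold: float) -> str:
    """Pick an action for a burn-rate reading (hypothetical routing).

    Slow burns over the 3-day window open an incident issue; any other
    window above its threshold pages the engineer on call (EOC).
    """
    if rate <= threshold:
        return "none"
    return "incident_issue" if window == "3d" else "page_eoc"


# Illustrative values only, not the thresholds used in this MR.
slow = burn_rate(apdex=0.99, slo=0.995)
print(route("3d", slow, threshold=1.0))
```

Keeping the routing decision separate from the burn-rate calculation mirrors the intent of the change: the measurement stays the same, and only the response to a slow burn differs.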

Why

We saw a lot of SingleNode Gitaly events that were unactionable and short-lived. In response, we lowered the Apdex score for component_node and, at the same time, started alerting on the 3-day burn rate. Rather than paging the EOC for that alert, an incident issue is created: a slow burn is not an immediate threat, but it does require someone to look at it.

Relevant discussion in this thread: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23576#note_1379500501
