Create alerts for webhook notification error tracking
Context
In #828 (closed) we exposed the webhook notification metrics in Grafana to get a better understanding of how reliable the notifications are.
Problem
gitlab#386389 (closed) describes Rails starting to fail to process registry notifications with a 400 Bad Request status. Looking at the metrics on the registry side, I do not see evidence of this.
Note: the original issue only happens on Geo installations, which is why the metrics for gitlab.com did not change.
Solution
- Test a scenario where Rails consistently returns 400 Bad Request when receiving notifications and ensure that metrics signal that behavior. Additionally, once the metrics are fixed (if needed), create an alert so that we get notified when persistent failures occur for more than 30 minutes.
- While at it, do the same for other possible errors returned by Rails. 400 Bad Request is just an example.
- Create alerts for the following metrics:
  - registry_notifications_events_total{type="Errors"} XX -> blocked by chore(notifications): enhance metrics with acti... (#851 - closed)
  - registry_notifications_pending_total YY to identify when there are too many events pending to be sent

Optionally, an alert on registry_notifications_status_total when {code!="200 OK"}, which might be tricky to implement (see sample below):
# TYPE registry_notifications_status_total counter
registry_notifications_status_total{code="200 OK"} 4
registry_notifications_status_total{code="202 Accepted"} 6
registry_notifications_status_total{code="400 Bad Request"} 3
registry_notifications_status_total{code="401 Unauthorized"} 10
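
To illustrate why the optional alert may be tricky, here is a minimal Python sketch of the check it would need to perform. It is not registry or alerting code: the function names, the regular expression, and the choice to treat every 2xx code as success are assumptions made for illustration. Because the metric is a counter, the sketch compares two scrapes and looks at the increase rather than the absolute value.

```python
import re

# Matches lines such as: registry_notifications_status_total{code="400 Bad Request"} 3
# Only the numeric part of the code label is captured (assumption: labels start with it).
SAMPLE_RE = re.compile(
    r'^registry_notifications_status_total\{code="(\d{3})[^"]*"\}\s+(\d+(?:\.\d+)?)$'
)


def parse_status_counts(text):
    """Return {numeric_status_code: counter_value} parsed from exposition text."""
    counts = {}
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            code = int(match.group(1))
            counts[code] = counts.get(code, 0.0) + float(match.group(2))
    return counts


def non_success_increase(previous, current):
    """Sum the counter increase of every non-2xx code between two scrapes.

    A decrease (for example after a counter reset) is ignored in this sketch.
    """
    total = 0.0
    for code, value in current.items():
        if 200 <= code < 300:
            continue  # assumption: any 2xx response counts as success
        total += max(value - previous.get(code, 0.0), 0.0)
    return total


# The sample from this issue, used as the "current" scrape.
SAMPLE = """# TYPE registry_notifications_status_total counter
registry_notifications_status_total{code="200 OK"} 4
registry_notifications_status_total{code="202 Accepted"} 6
registry_notifications_status_total{code="400 Bad Request"} 3
registry_notifications_status_total{code="401 Unauthorized"} 10
"""

if __name__ == "__main__":
    current = parse_status_counts(SAMPLE)
    # With an empty previous scrape, every non-2xx count is treated as new.
    print(non_success_increase({}, current))
```

Note that in the sample a plain {code!="200 OK"} matcher would also select "202 Accepted", which is a success response, so the alert probably needs to match on the status class rather than on a single label value.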