Remove handled errors from Sentry to reduce noise
Problem
A number of handled errors show up in Sentry, causing noise.
Examples:
- GitLab API error exception - Sentry link
- Anthropic API error - Sentry link
Desired Outcome
Less noise in the Sentry alerts channel.
Move to using Prometheus alerts over Sentry events for specific errors.
These should include:
- 500 errors from both the GitLab /checkpoints API and the Anthropic API
- Async RuntimeErrors with message "Task was destroyed but it is pending!"
- Async CancelledError with message "async generator raised StopAsyncIteration"
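One way to recognise the errors listed above in code is to match on exception type and message. The sketch below is illustrative only: `classify_error` and its labels are hypothetical names, and it assumes HTTP client errors expose an integer `status_code` attribute, which this issue does not confirm for every client.

```python
import asyncio
from typing import Optional

# Message fragments copied from the issue text.
TASK_DESTROYED = "Task was destroyed but it is pending!"
STOP_ASYNC_ITERATION = "async generator raised StopAsyncIteration"


def classify_error(exc: BaseException) -> Optional[str]:
    """Return a label for errors this issue wants alerted on, else None."""
    # Assumption: HTTP client errors carry an integer `status_code` attribute.
    if getattr(exc, "status_code", None) == 500:
        return "http_500"
    if isinstance(exc, RuntimeError) and TASK_DESTROYED in str(exc):
        return "task_destroyed"
    # asyncio.CancelledError derives from BaseException on Python 3.8+,
    # which is why the parameter is typed as BaseException.
    if isinstance(exc, asyncio.CancelledError) and STOP_ASYNC_ITERATION in str(exc):
        return "stop_async_iteration"
    return None
```

Any error that does not match returns `None` and would keep flowing to Sentry unchanged.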
Implementation Plan (Proposed)
Implement the following approach to reduce noise on the above errors:
- Create a Prometheus counter for the error
- Create an alert in Grafana
- Filter the error from Sentry once the alert is in place
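The count-then-filter steps above could be wired into a Sentry `before_send` hook, which receives `(event, hint)` and drops the event when it returns `None`. This is a minimal sketch, not the approach taken in the linked MRs: `ALERT_IN_PLACE` and `handled_error_counts` are hypothetical, and a plain `collections.Counter` stands in for a real Prometheus counter so the example runs on the standard library alone.

```python
from collections import Counter
from typing import Any, Dict, Optional

# Stand-in for Prometheus counters (assumption: a real service would
# increment its metrics client here instead).
handled_error_counts: Counter = Counter()

# Hypothetical switchboard: exception class name -> Grafana alert exists?
ALERT_IN_PLACE: Dict[str, bool] = {
    "APIStatusError": True,
    "RuntimeError": False,
}


def before_send(event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Count known handled errors; drop the event only if an alert covers it."""
    exc_info = hint.get("exc_info")
    if not exc_info:
        # Not an exception event (for example a plain message): keep it.
        return event
    name = exc_info[0].__name__
    if name not in ALERT_IN_PLACE:
        # Unknown error type: leave it visible in Sentry.
        return event
    handled_error_counts[name] += 1
    # Returning None tells Sentry to discard the event.
    return None if ALERT_IN_PLACE[name] else event
```

The hook would be registered with `sentry_sdk.init(before_send=before_send)`. Keying the drop decision on a per-error flag keeps an error visible in Sentry until its Grafana alert exists.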
Current progress
- Create an alert from the Grafana dashboard in the same #g_duo_workflow_alerts channel to understand whether these errors cross a certain threshold, as they hamper workflow execution.
  - Create an alert threshold for 500 /checkpointing errors observed from the GitLab API:
    - Prometheus counter implemented: MR !2953 (merged)
    - Grafana alert created: MR gitlab-com/runbooks!9185 (merged)
    - Sentry alerts filtered
  - Create an alert threshold for APIStatusErrors returned during model completions:
    - Prometheus counter implemented: MR !2956 (merged)
    - Grafana alert created: MR gitlab-com/runbooks!9243 (merged)
    - Sentry alerts filtered
- Document in troubleshooting docs.
ON HOLD / DROPPED
- ON HOLD: removing 500 /checkpoints status errors from Sentry is on hold until we understand what is causing the 500 status errors.
  - The /checkpoint API endpoint for gitlab.com sometimes returns a 500 error, which causes a JSON decode error. Addressed in !2868 (closed).
  - 500 status error from Anthropic. Currently I can only see one instance of this in the Sentry issue. I propose we create an alert to deal with this instead.
- ON HOLD: Create an alert threshold for RuntimeError with message "Task was destroyed but it is pending!" returned during model completions. As described in issue #1314, this is on hold until we decide whether it is something we want to filter or fix.
- Create an alert threshold for Async CancelledError with message "async generator raised StopAsyncIteration". The issue will be fixed rather than filtered / alerted: #961 (closed).
Edited by Tim Morriss