Group::Utilization Error Budget Investigation
Summary
The Group::Utilization error budget has recently taken a turn for the worse, going from green to red:
As of 2023-10-17:
As of 2023-10-24:
We need to investigate what is causing this problem to get us back to green.
Findings
- As can be seen in the screenshots above, the metrics have remained the same, but the number of failures has dropped
- We had some very slow background migration jobs (screenshot below), which have completed as of 2023-10-23. If they were contributing to the apdex, we should see an improvement over the coming days. However, the dashboards suggest we only have 1 Sidekiq apdex failure, so this is unlikely to be the cause
- Related logs suggested we were only 13 seconds over our apdex target in the last 7 days, which should not have resulted in the metrics we're seeing. Edit: this dashboard might actually be indicating 13 matching documents rather than 13 seconds; see #429205 (comment 1619766526)
- The latest error report issue suggests there were accuracy issues - perhaps ours is still bugged?
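
To make the "13 seconds vs. 13 matching documents" question more concrete, here is a rough sketch of how a count of failed operations could translate into budget time spent. The 99.95% target, the 28-day window, the operation counts, and all function names are assumptions for illustration only; they are not taken from the dashboards or from the actual error budget implementation.

```python
# Rough sketch only: the target, window, and sample numbers are assumptions,
# not values read from our dashboards.

SECONDS_PER_DAY = 24 * 60 * 60


def budget_allowed_seconds(target: float = 0.9995, window_days: int = 28) -> float:
    """Time we may 'spend' on failures in the window for a given target."""
    return (1 - target) * window_days * SECONDS_PER_DAY


def budget_spent_seconds(total_ops: int, failed_ops: int, window_days: int = 28) -> float:
    """Express the failure ratio as an equivalent amount of time in the window."""
    if total_ops <= 0:
        raise ValueError("total_ops must be positive")
    return failed_ops / total_ops * window_days * SECONDS_PER_DAY


def is_within_budget(total_ops: int, failed_ops: int,
                     target: float = 0.9995, window_days: int = 28) -> bool:
    """True when the time spent does not exceed the allowed budget."""
    spent = budget_spent_seconds(total_ops, failed_ops, window_days)
    return spent <= budget_allowed_seconds(target, window_days)


if __name__ == "__main__":
    # Hypothetical: 13 failing operations out of 1,000,000 vs. out of 1,000.
    for total in (1_000_000, 1_000):
        spent = budget_spent_seconds(total, 13)
        print(total, round(spent, 1), is_within_budget(total, 13))
```

Under these assumptions, the same 13 failures are negligible against a large operation count but exceed the budget against a small one, so whether the figure means seconds or documents changes the conclusion.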
Edited by Vijay Hawoldar