Group::Utilization Error Budget Investigation

Summary

https://dashboards.gitlab.net/d/stage-groups-utilization/stage-groups-utilization-group-dashboard?orgId=1

The group::utilization error budget has recently taken a turn for the worse, going from green to red:

As of 2023-10-17:

Screenshot_2023-10-17_at_14.10.52

As of 2023-10-24:

Screenshot_2023-10-24_at_08.39.11

We need to investigate what is causing this so we can get back to 🍏.

Findings

  • As can be seen in the screenshots above, the metrics have remained the same, but the number of failures has dropped
  • We had some very slow background migration jobs (screenshot below), which completed as of 2023-10-23. If they were contributing to the apdex, we should see an improvement over the coming days. However, the dashboards suggest we only have 1 Sidekiq apdex failure, so these jobs are unlikely to be the cause. Screenshot_2023-10-24_at_08.42.18
  • Related logs suggested we were only 13 seconds over our apdex target in the last 7 days, which should not have resulted in the metrics we're seeing. Edit: this dashboard might actually be showing 13 matching documents rather than 13 seconds; see #429205 (comment 1619766526). Screenshot_2023-10-24_at_08.46.52
  • The latest error report issue suggests there were accuracy issues - perhaps ours is still bugged?
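To sanity-check whether a given number of failures could plausibly turn the budget red, the failure ratio can be converted into an equivalent amount of time. The following is a minimal sketch, not the dashboard's actual query. The function names are hypothetical, and the 99.95% target and 28-day window are assumptions about how the budget is configured, so they should be verified before drawing conclusions.

```python
from datetime import timedelta


def budget_allowed(target: float, window: timedelta) -> timedelta:
    """Time the group may spend failing within the window.

    `target` is the availability target as a fraction (assumed: 0.9995).
    """
    return window * (1 - target)


def budget_spent(failed_ops: int, total_ops: int, window: timedelta) -> timedelta:
    """Translate a failure ratio into equivalent time spent from the budget.

    Failures here mean both errors and operations slower than the apdex
    threshold; the counts would come from the dashboard's underlying metrics.
    """
    if total_ops <= 0:
        raise ValueError("total_ops must be positive")
    return window * (failed_ops / total_ops)


# Assumed configuration: 99.95% target over a 28-day window.
WINDOW = timedelta(days=28)
allowed = budget_allowed(0.9995, WINDOW)
print(f"allowed: {allowed.total_seconds() / 60:.1f} minutes")
```

Note that the spend depends on the ratio of failed to total operations rather than on the raw failure count, so the failure count and the traffic volume need to be read together when comparing two snapshots.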
Edited Oct 26, 2023 by Vijay Hawoldar