Post-launch image scaler errors

Now that image scaling is live, we are observing a number of errors.

We frequently breach the 0.15% error budget threshold of our current SLO: https://dashboards.gitlab.net/d/web-main/web-overview?viewPanel=12&orgId=1

The current error rate is fairly low but spiky, hovering between 0.05 err/s and 0.2 err/s: https://thanos-query.ops.gitlab.net/graph?g0.range_input=2d&g0.max_source_resolution=0s&g0.expr=sum%20by%20(status)%20(rate(gitlab_workhorse_image_resize_requests_total%7Benv%3D%22gprd%22%2C%20stage%3D%22main%22%2C%20status!~%22unknown%7Csuccess%22%7D%5B5m%5D))&g0.tab=0
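For readers without access to the dashboard, the query behind that link can be recovered from the URL itself. A minimal sketch using only the Python standard library; the helper name `extract_expr` is made up for illustration, and the URL is copied verbatim from the link above:

```python
from urllib.parse import urlparse, parse_qs

# Graph link copied verbatim from the issue text.
GRAPH_URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=2d"
    "&g0.max_source_resolution=0s"
    "&g0.expr=sum%20by%20(status)%20(rate(gitlab_workhorse_image_resize_requests_total"
    "%7Benv%3D%22gprd%22%2C%20stage%3D%22main%22%2C%20status!~%22unknown%7Csuccess%22%7D%5B5m%5D))"
    "&g0.tab=0"
)


def extract_expr(url: str) -> str:
    """Return the percent-decoded `g0.expr` query parameter of a graph URL.

    Hypothetical helper, not part of any GitLab tooling.
    """
    query = parse_qs(urlparse(url).query)
    return query["g0.expr"][0]


if __name__ == "__main__":
    print(extract_expr(GRAPH_URL))
```

Decoded, the expression is a per-status sum of the 5-minute rate of `gitlab_workhorse_image_resize_requests_total` for `env="gprd"` and `stage="main"`, excluding the `unknown` and `success` statuses, so each series in the graph is one error status in requests per second.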

My understanding is that these thresholds are not configurable per component, only service-wide, i.e. the scaler must meet the same requirements as the main workhorse request path.
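To see why a low absolute error rate can still breach the threshold, it may help to relate err/s to a ratio. The sketch below assumes the 0.15% threshold is applied as errors divided by total scaler requests; that reading, and the request rate of 100 req/s in the usage lines, are assumptions for illustration and not measured values:

```python
# The 0.15% threshold quoted above, expressed as a ratio.
ERROR_RATIO_THRESHOLD = 0.0015


def error_ratio(errors_per_sec: float, requests_per_sec: float) -> float:
    """Fraction of requests that fail, given both rates in events per second."""
    if requests_per_sec <= 0:
        raise ValueError("requests_per_sec must be positive")
    return errors_per_sec / requests_per_sec


def breaches_threshold(errors_per_sec: float, requests_per_sec: float,
                       threshold: float = ERROR_RATIO_THRESHOLD) -> bool:
    """True if the error ratio exceeds the threshold."""
    return error_ratio(errors_per_sec, requests_per_sec) > threshold


def min_request_rate(errors_per_sec: float,
                     threshold: float = ERROR_RATIO_THRESHOLD) -> float:
    """Request rate at which the given error rate sits exactly on the threshold."""
    return errors_per_sec / threshold


if __name__ == "__main__":
    # Hypothetical traffic of 100 req/s, with both ends of the observed error range.
    for err in (0.05, 0.2):
        print(err, breaches_threshold(err, 100.0), round(min_request_rate(err), 1))
```

Under that assumption, a spike of 0.2 err/s stays within the threshold only if the scaler serves more than roughly 133 req/s, whereas 0.05 err/s needs only about 33 req/s. This would explain breaches occurring during spikes even though the absolute numbers look small.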

We should:

  1. Investigate what causes these errors and how frequently each occurs
  2. Decide whether they are worth fixing
  3. If they are fixable, work out how and open follow-up issues
Edited by Matthias Käppler