Puma failing SLI after recent job rescheduling changes

Summary

After #471239 (closed) was rolled out to Dedicated customers, we saw SLO degradation for several of them. The latest update thread is here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/378#note_2104841320

Every hour, the webservice logs hundreds of messages of the form `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker JID-bcc49cb259c34204da2a55c8: deduplicated: until executing`.
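
To check which worker classes account for these deduplication messages, a minimal sketch like the one below could tally them from exported log messages. It assumes every message has the same shape as the single example quoted in this issue (`<WorkerClass> JID-<hex>: deduplicated: <strategy>`). The names `DEDUP_RE` and `count_deduplicated` are hypothetical and not part of GitLab.

```python
import re
from collections import Counter

# Assumed message shape, inferred from the one example in this issue:
#   <WorkerClass> JID-<hex>: deduplicated: <strategy>
DEDUP_RE = re.compile(
    r"(?P<worker>[A-Z]\w*(?:::\w+)*) "
    r"JID-(?P<jid>[0-9a-f]+): "
    r"deduplicated: (?P<strategy>.+)$"
)


def count_deduplicated(messages):
    """Count deduplication messages per (worker class, strategy) pair."""
    counts = Counter()
    for message in messages:
        match = DEDUP_RE.search(message.strip())
        if match:
            counts[(match["worker"], match["strategy"])] += 1
    return counts


if __name__ == "__main__":
    # Sample input: the message quoted in this issue plus an unrelated line.
    sample = [
        "AuthorizedProjectUpdate::UserRefreshFromReplicaWorker "
        "JID-bcc49cb259c34204da2a55c8: deduplicated: until executing",
        "some unrelated log line",
    ]
    for (worker, strategy), total in count_deduplicated(sample).most_common():
        print(f"{total}\t{worker}\t{strategy}")
```

Grouping by strategy as well as worker class would show whether the noise is limited to one worker or spread across several. Bucketing by hour would need timestamps, which the quoted message does not include.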

The degradation began immediately with a 17.2.x release of GitLab, and a change in this area appears to have landed around the same time: `:until_executed` jobs with `reschedule_once` n... (#471239 - closed)

Proposal

Additional details

Some relevant technical details, if applicable, such as:

  • Does this need a feature flag? no
  • Does there need to be an associated instrumentation issue created related to this work? no
  • Is there an example response showing the data structure that should be returned (new endpoints only)? no
  • What permissions should be used? n/a
  • Is this EE or CE?
    • EE
    • CE
  • Additional comments:

Implementation Table

Group | Issue Link
backend | 👈 You are here

Links/References

Edited by Thong Kuah