Puma failing SLI after recent job rescheduling changes

Summary

After #471239 (closed) was rolled out to Dedicated, we saw SLO degradation for several customers. The latest update thread is here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/378#note_2104841320

Every hour, the webservice logs hundreds of messages of the form `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker JID-bcc49cb259c34204da2a55c8: deduplicated: until executing`.

The degradation started immediately with a 17.2.x release of GitLab, and it looks like a change landed in this area around the same time: `:until_executed` jobs with `reschedule_once` n... (#471239 - closed)

Proposal

Could we please set `urgency :low` on the affected endpoint? That raises the target duration to 5s, which is in keeping with what we see here. See https://docs.gitlab.com/ee/development/application_slis/rails_request.html#decreasing-the-urgency-setting-a-higher-target-duration
If this is not an acceptable fix, ideally we'd roll back the change and backport the rollback to 17.3.x for release to Dedicated as soon as we can. It is causing a real-world SLO failure.
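
To illustrate why lowering the urgency would bring the SLI back within target, here is a minimal sketch of how the share of "satisfactory" requests changes with the target duration. It is not GitLab's implementation: `URGENCY_TARGETS`, `apdex_ratio`, and the sample durations are hypothetical names and values made up for illustration. Only the 5s target for low urgency comes from this issue; the other targets are assumed from the linked documentation, and counting a request as satisfactory when its duration is at or below the target is also an assumption.

```ruby
# Illustrative sketch only; not GitLab's SLI code.
# Target durations in seconds per urgency. Only the 5s value for :low is
# stated in this issue; the other values are assumptions.
URGENCY_TARGETS = {
  high: 0.25,
  medium: 0.5,
  default: 1.0,
  low: 5.0
}.freeze

# Hypothetical request durations in seconds (not measured data).
SAMPLE_DURATIONS = [0.4, 0.8, 1.5, 2.2, 3.9, 4.6].freeze

# Fraction of requests that completed within the target for the given urgency.
# An empty sample is treated as fully satisfactory.
def apdex_ratio(durations, urgency)
  target = URGENCY_TARGETS.fetch(urgency)
  return 1.0 if durations.empty?

  durations.count { |d| d <= target }.fdiv(durations.size)
end

# Compare the same traffic against the default and the low-urgency target.
%i[default low].each do |urgency|
  ratio = apdex_ratio(SAMPLE_DURATIONS, urgency)
  puts format("%-8s target=%.2fs ratio=%.2f", urgency, URGENCY_TARGETS[urgency], ratio)
end
```

With these made-up durations, a third of the requests meet the default target while all of them meet the low-urgency one. That is the effect the proposal relies on; it changes what is measured as acceptable, not how long the endpoint actually takes.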

Additional details

Some relevant technical details, if applicable, such as:

Does this need a feature flag? no
Does there need to be an associated instrumentation issue created related to this work? no
Is there an example response showing the data structure that should be returned (new endpoints only)? no
What permissions should be used? n/a
Is this EE or CE?
- EE
- CE
Additional comments:

Implementation Table

Group	Issue Link
backend	👈 You are here

Links/References

Edited Sep 17, 2024 by Thong Kuah