Puma failing SLI after recent job rescheduling changes
Summary
After #471239 (closed) was rolled out to Dedicated customers, we saw an SLO degradation for several of them. The latest update thread is here: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/378#note_2104841320
There are hundreds of webservice log messages of the following type every hour: `AuthorizedProjectUpdate::UserRefreshFromReplicaWorker JID-bcc49cb259c34204da2a55c8: deduplicated: until executing`.
The degradation started immediately with a 17.2.x release of GitLab, and it looks like a change landed in this area around the same time: `:until_executed` jobs with `reschedule_once` n... (#471239 - closed)
Proposal
- Could we please set `urgency: low` on the affected endpoint; that will set the threshold to 5s, which is in keeping with what we see here. See https://docs.gitlab.com/ee/development/application_slis/rails_request.html#decreasing-the-urgency-setting-a-higher-target-duration
- If this is not an acceptable fix, ideally we'd roll back the change and backport the revert to 17.3.x for release to Dedicated as soon as we can. It is causing a real-world SLO failure.
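To illustrate why lowering the urgency would bring the endpoint back within its SLI, here is a minimal sketch of the apdex calculation. It assumes the target durations listed in the linked documentation (`:high` 0.25s, `:medium` 0.5s, `:default` 1s, `:low` 5s). The `apdex_ratio` helper and the sample durations are hypothetical and are not taken from GitLab's code or from the affected tenants.

```ruby
# Hypothetical model of the rails_request apdex thresholds; this is not
# GitLab's implementation. Target durations are in seconds per urgency,
# assumed from the linked documentation.
TARGET_DURATION_S = {
  high: 0.25,
  medium: 0.5,
  default: 1.0,
  low: 5.0
}.freeze

# Returns the fraction of requests that finished within the target
# duration for the given urgency.
def apdex_ratio(durations_s, urgency)
  target = TARGET_DURATION_S.fetch(urgency)
  satisfied = durations_s.count { |duration| duration <= target }
  satisfied.to_f / durations_s.size
end

# Made-up request durations (seconds), for illustration only.
sample = [0.4, 0.9, 1.8, 2.5, 3.1, 4.2, 4.8, 6.0]

default_ratio = apdex_ratio(sample, :default)
low_ratio = apdex_ratio(sample, :low)

puts format('default: %.3f, low: %.3f', default_ratio, low_ratio)
```

With the same request durations, only the threshold changes, so requests that take between 1s and 5s stop counting against the SLI. In the application itself the urgency is declared on the controller action or API route as described in the linked documentation; the exact endpoint is not named here.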
Additional details
Some relevant technical details, if applicable, such as:
- Does this need a feature flag? no
- Does there need to be an associated instrumentation issue created related to this work? no
- Is there an example response showing the data structure that should be returned (new endpoints only)? no
- What permissions should be used? n/a
- Is this EE or CE?
  - EE
  - CE
- Additional comments:
Implementation Table
| Group | Issue Link |
|---|---|
| backend | |
Links/References
Edited by Thong Kuah