WebHookWorker affected by JobReplicaNotUpToDate errors
Sidekiq workers are experiencing a number of `Gitlab::Database::LoadBalancing::SidekiqServerMiddleware::JobReplicaNotUpToDate` errors.
This error has affected the error budget of ~"group::integrations" due to `WebHookWorker`, although it is not exclusive to any one worker.
Affected workers
Kibana table linking to the number of times worker classes have failed due to this error over the past 7 days.
Proposal
@reprazent from ~"group::scalability" has helped explain a proposal (internal, good for 30 days):
It made me look at the sleep logic in the middleware: https://gitlab.com/gitlab-org/gitlab/blob/4725c43542a44cf17350cafd6c87f8677eb223ab[…]lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb.
This `#sleep_if_needed` doesn't seem quite right to me.
It looks like we'll only sleep for a maximum of 0.8s, depending on the scheduling latency.
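For context, a minimal sketch of roughly what that sleep logic does, reconstructed from the description in this quote; the constant name comes from the quote below, but the actual implementation in the linked file may differ:

```ruby
MINIMUM_DELAY_INTERVAL_SECONDS = 0.8

# Sleep only for whatever remains of the 0.8s window after the
# scheduling latency (time between enqueueing and picking up the job).
# Assumes the Sidekiq job payload carries a 'created_at' timestamp.
def sleep_if_needed(job)
  scheduling_latency = Time.now.to_f - job['created_at'].to_f
  remaining_delay = MINIMUM_DELAY_INTERVAL_SECONDS - scheduling_latency

  sleep(remaining_delay) if remaining_delay > 0
end
```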
This seems a bit complicated to me, and I think we could wait a bit longer there, independent of the scheduling latency.
Perhaps we could change that to something like this:
```ruby
3.times do
  sleep 0.5
  break if databases_in_sync?
end
```
That would mean we'd wait less time in happy cases, but longer when we need to.
[...]
If there was already a 0.8s wait since the job was scheduled, we don't wait at all. I think `MINIMUM_DELAY_INTERVAL_SECONDS` is actually a maximum delay.
We're relying on the retry mechanism, which also introduces a delay of about 15s (https://github.com/mperham/sidekiq/wiki/Error-Handling#automatic-job-retry), so the second time around the wait wouldn't do anything. But in practice I don't think we lag that far behind.
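For reference, the linked wiki documents Sidekiq's default retry backoff as roughly the formula below, so the first retry (with `retry_count = 0`) already lands some 15-45s after the failure, well past the 0.8s window:

```ruby
# Sidekiq's documented default retry delay, in seconds.
# retry_count starts at 0, so the first retry waits 15s plus jitter.
def retry_delay(retry_count)
  (retry_count**4) + 15 + (rand(30) * (retry_count + 1))
end

retry_delay(0) # => somewhere between 15 and 44 seconds
```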
I think taking a small penalty in job run time is acceptable. We allow a 5m runtime for low-urgency jobs and a 5s one for high-urgency ones. So I believe a ~1.5s maximum wait could do a great deal, and still give higher throughput because we don't need another roundtrip.
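To make the shape of the change concrete, here is a sketch of how the polled wait might replace the current sleep in the middleware. This is an illustration only: `wait_for_replica_sync` and the two constants are hypothetical names and values based on the quote above, and `databases_in_sync?` stands for the middleware's existing replica check.

```ruby
REPLICA_WAIT_INTERVAL_SECONDS = 0.5 # hypothetical constant
REPLICA_WAIT_ATTEMPTS = 3           # ~1.5s maximum total wait

# Poll the replicas instead of sleeping once: return early as soon as
# they have caught up, and fall back to Sidekiq's retry mechanism only
# if they never do within the polling window.
def wait_for_replica_sync
  REPLICA_WAIT_ATTEMPTS.times do
    sleep REPLICA_WAIT_INTERVAL_SECONDS
    return true if databases_in_sync?
  end

  false
end
```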