WebHookWorker affected by JobReplicaNotUpToDate errors
Sidekiq workers are experiencing a number of `Gitlab::Database::LoadBalancing::SidekiqServerMiddleware::JobReplicaNotUpToDate` errors.
This error has affected the error budget of ~"group::integrations" due to `WebHookWorker`, although it is not exclusive to any one worker.
Affected workers
Kibana table linking to the number of times worker classes have failed due to this error over the past 7 days.
Proposal
@reprazent from ~"group::scalability" has helped explain a proposal (internal, good for 30 days):
It made me look at the sleep logic in the middleware: https://gitlab.com/gitlab-org/gitlab/blob/4725c43542a44cf17350cafd6c87f8677eb223ab[…]lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb.
This `#sleep_if_needed` doesn't seem quite right to me.
It looks like we'll only sleep for a maximum of 0.8s, depending on the scheduling latency.
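For context, a minimal sketch of roughly what that sleep logic does, reconstructed from the description in this quote; the constant name comes from the quote below, but the actual implementation in the linked file may differ:

```ruby
MINIMUM_DELAY_INTERVAL_SECONDS = 0.8

# Sleep only for whatever remains of the 0.8s window after the
# scheduling latency (time between enqueueing and picking up the job).
# Assumes the Sidekiq job payload carries a 'created_at' timestamp.
def sleep_if_needed(job)
  scheduling_latency = Time.now.to_f - job['created_at'].to_f
  remaining_delay = MINIMUM_DELAY_INTERVAL_SECONDS - scheduling_latency

  sleep(remaining_delay) if remaining_delay > 0
end
```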
This seems a bit complicated to me, and I think we could wait a bit longer there, independent of the scheduling latency.
Perhaps we could change that to something like this:
```ruby
3.times do
  sleep 0.5
  break if databases_in_sync?
end
```
That would mean we'd wait less time in happy cases, but longer when we need to.
[...]
If there was already a 0.8s wait since the job was scheduled, we don't wait at all. I think `MINIMUM_DELAY_INTERVAL_SECONDS` is actually a maximum delay.
We're relying on the retry mechanism, which also introduces a delay of about 15s (https://github.com/mperham/sidekiq/wiki/Error-Handling#automatic-job-retry), so the second time around the wait wouldn't do anything. But in practice I don't think we lag that far behind.
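For reference, the linked wiki documents Sidekiq's default retry backoff as roughly the formula below, so the first retry (with `retry_count = 0`) already lands some 15-45s after the failure, well past the 0.8s window:

```ruby
# Sidekiq's documented default retry delay, in seconds.
# retry_count starts at 0, so the first retry waits 15s plus jitter.
def retry_delay(retry_count)
  (retry_count**4) + 15 + (rand(30) * (retry_count + 1))
end

retry_delay(0) # => somewhere between 15 and 44 seconds
```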
I think taking a small penalty in job run time is acceptable. We allow a 5m runtime for low-urgency jobs and a 5s one for high-urgency ones. So I believe a ~1.5s maximum wait could do a great deal, and still give higher throughput because we don't need another roundtrip.
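To make the shape of the change concrete, here is a sketch of how the polled wait might replace the current sleep in the middleware. This is an illustration only: `wait_for_replica_sync` and the two constants are hypothetical names and values based on the quote above, and `databases_in_sync?` stands for the middleware's existing replica check.

```ruby
REPLICA_WAIT_INTERVAL_SECONDS = 0.5 # hypothetical constant
REPLICA_WAIT_ATTEMPTS = 3           # ~1.5s maximum total wait

# Poll the replicas instead of sleeping once: return early as soon as
# they have caught up, and fall back to Sidekiq's retry mechanism only
# if they never do within the polling window.
def wait_for_replica_sync
  REPLICA_WAIT_ATTEMPTS.times do
    sleep REPLICA_WAIT_INTERVAL_SECONDS
    return true if databases_in_sync?
  end

  false
end
```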