
Follow-up: Improve retry mechanism for Sidekiq jobs with delayed data consistency

The following discussions from !55881 (merged) should be addressed:

By failing if the replica is not up-to-date, we decrease the retry count for something that is not the job's "fault". For example, we may try to deliver a webhook 3 times, but 2 of those attempts may be used up waiting for the replica to catch up. This could have some impact on the application, and at the least we should consider it. An alternative that I'm not completely fond of would be to sleep the job for a few seconds and try again, although this carries its own risk.
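To make the concern concrete, here is a small, hypothetical simulation of a fixed attempt budget. It is not Sidekiq or GitLab code: the method `simulate_delivery` and its parameters are invented for illustration, and it assumes that an attempt made while the replica is behind fails immediately and counts against the same budget as a real delivery attempt.

```ruby
# Hypothetical simulation of an attempt budget shared between
# "replica not caught up" failures and real delivery failures.
#
# attempts_allowed:  total attempts the job gets (assumed budget)
# lagging_attempts:  how many initial attempts find the replica behind
# delivery_failures: how many real delivery attempts fail before one succeeds
def simulate_delivery(attempts_allowed:, lagging_attempts:, delivery_failures:)
  remaining_lag = lagging_attempts
  remaining_failures = delivery_failures
  real_attempts = 0

  attempts_allowed.times do
    if remaining_lag > 0
      # Attempt consumed without the job doing any real work.
      remaining_lag -= 1
      next
    end

    real_attempts += 1

    if remaining_failures > 0
      # A genuine delivery failure (for example, the receiver is down).
      remaining_failures -= 1
      next
    end

    return { delivered: true, real_attempts: real_attempts }
  end

  { delivered: false, real_attempts: real_attempts }
end

# Same transient delivery failure, with and without replication lag.
with_lag = simulate_delivery(attempts_allowed: 3, lagging_attempts: 2, delivery_failures: 1)
without_lag = simulate_delivery(attempts_allowed: 3, lagging_attempts: 0, delivery_failures: 1)
```

Under these assumptions, the job that loses two attempts to replication lag gets only one real delivery attempt and gives up, while the same job without lag recovers from the single transient failure on its second attempt.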


Perhaps we're better off just relying on the default retry mechanism of Sidekiq, and adding validation in validate_worker_attributes!: if a worker uses data_consistency :delayed, it cannot have retries set to 0.
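A minimal sketch of what such a validation might look like is shown here. Only `data_consistency :delayed` and the method name `validate_worker_attributes!` come from the discussion above; the `SketchWorker` base class, the `retries` class method, the default of 25, and `WorkerAttributeError` are hypothetical stand-ins so the example runs on its own, and they do not reflect how the actual worker attributes are implemented.

```ruby
# Hypothetical error type for an invalid worker configuration.
class WorkerAttributeError < StandardError; end

# Hypothetical, simplified base class that stores per-worker settings
# in class-level instance variables.
class SketchWorker
  class << self
    # Setter/getter in one: `data_consistency :delayed` sets, no argument reads.
    def data_consistency(value = nil)
      @data_consistency = value if value
      @data_consistency || :always
    end

    # Assumed default of 25 retries when nothing is configured.
    def retries(value = nil)
      @retries = value unless value.nil?
      @retries.nil? ? 25 : @retries
    end

    # Rejects the combination the discussion wants to forbid:
    # delayed data consistency with no retries available.
    def validate_worker_attributes!
      if data_consistency == :delayed && retries == 0
        raise WorkerAttributeError,
              "#{name}: data_consistency :delayed requires at least one retry"
      end

      true
    end
  end
end

class WebhookWorker < SketchWorker
  data_consistency :delayed
  retries 3
end

class NoRetryWorker < SketchWorker
  data_consistency :delayed
  retries 0
end

WebhookWorker.validate_worker_attributes! # passes

begin
  NoRetryWorker.validate_worker_attributes!
rescue WorkerAttributeError => e
  warn e.message
end
```

The point of the design is that the misconfiguration is caught when the worker is validated, rather than at runtime when a job hits replication lag and has no retry left to fall back on.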

When we get back to the default of 25 retries for most workers, the one try spent on replication lag won't make much of a difference.

Edited by Nikola Milojevic