Follow-up: Improve retry mechanism for Sidekiq jobs with delayed data consistency
The following discussions from !55881 (merged) should be addressed:
- @andrewn commented in discussion:
By failing if the replica is not up-to-date, we are decreasing the retry count for something that is not the job's "fault". For example, we may try to deliver a webhook 3 times, but 2 of those attempts may be used up waiting for the replica to catch up. This could potentially have some impact on the application, and at the very least, we should consider it. An alternative that I'm not completely fond of would be to sleep the job for a few seconds and try again, although this carries its own risk.
- @reprazent commented in discussion:
Perhaps we're better off just relying on the default retry mechanism of Sidekiq, and adding validation in `validate_worker_attributes!`: if a worker uses `data_consistency :delayed`, it cannot have retries set to 0.
When we get back to having the default of 25 retries for most workers, this 1 try for replication lag won't make that big of a difference.
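
A minimal sketch of the validation proposed above, written as plain Ruby so it runs without Sidekiq or Rails. Everything here is an assumption for illustration: the keyword-argument signature, the `InvalidWorkerAttributes` error class, and the error message are hypothetical, and the real `validate_worker_attributes!` may be structured quite differently.

```ruby
# Hypothetical, simplified stand-in for the proposed check; not the actual
# implementation. The error class and the signature are assumptions.
class InvalidWorkerAttributes < StandardError; end

# worker_name:      name used in the error message
# data_consistency: e.g. :always or :delayed
# retries:          the worker's retry setting (an Integer, true, or false)
def validate_worker_attributes!(worker_name:, data_consistency:, retries:)
  # Only workers that may fail on a lagging replica need a spare attempt.
  return unless data_consistency == :delayed

  # Neither `false` nor `0` leaves an attempt that replication lag can use up.
  return unless retries == false || retries == 0

  raise InvalidWorkerAttributes,
        "#{worker_name}: data_consistency :delayed requires at least one retry, " \
        "because waiting for the replica can consume an attempt"
end
```

With this shape, a worker declared with `data_consistency: :delayed` and `retries: 0` would be rejected when its attributes are validated, while the same worker with `retries: 3` or with `data_consistency: :always` would pass.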