Sticky failures of "Missing on Primary" attachments
While working on https://gitlab.com/gitlab-com/migration/issues/365 I found a few failed syncs that were fixed when I ran Geo::FileDownloadService manually for them.
What's interesting is that retry_count was set to a pretty big number (more than 10), and the next retry date was far in the future, days away.
Imagine the case where we have a "missing on primary" (MoP) sync, which is marked as successfully synced. We never reset retry_count for it, because we keep trying to sync those "missing on primary" files forever. As a result, all the MoP entries end up with a pretty high retry_count. If at some point we have a network issue, or some other short-lived issue, and the MoP entries happen to be retried at that particular time, they are marked as success: false, but because of the high retry_count their next retry is scheduled days or weeks away. That makes MoP failures "sticky": they spoil the overall sync statistics, but Geo will do nothing about them for days or weeks. I don't think that's optimal. WDYT?
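
To make the effect concrete, here is a minimal Ruby sketch of the scenario. It is not Geo's actual code: the Registry struct, backoff_delay, record_attempt, the 2**retry_count formula and the one-week cap are all assumptions made up for illustration. It only models a retry_count that is never reset, combined with some exponential backoff.

```ruby
# Hypothetical model of a registry row; not GitLab's actual schema.
Registry = Struct.new(:success, :missing_on_primary, :retry_count, :retry_at)

MAX_DELAY = 7 * 24 * 3600 # assumed cap of one week, in seconds

# Assumed backoff: 2**retry_count seconds, capped at MAX_DELAY.
def backoff_delay(retry_count)
  [2**retry_count, MAX_DELAY].min
end

# Records one sync attempt. Note that retry_count only ever grows,
# even when the attempt is recorded as a success.
def record_attempt(registry, success:, missing_on_primary:, now:)
  registry.retry_count += 1
  registry.success = success
  registry.missing_on_primary = missing_on_primary
  registry.retry_at = now + backoff_delay(registry.retry_count)
  registry
end

registry = Registry.new(nil, false, 0, nil)
now = 0 # time in seconds, starting from an arbitrary zero

# 17 attempts where the file is missing on primary: each is recorded as
# "success", but retry_count keeps growing.
17.times do
  record_attempt(registry, success: true, missing_on_primary: true, now: now)
  now = registry.retry_at
end

# One transient failure (e.g. a short network outage) on the 18th attempt.
record_attempt(registry, success: false, missing_on_primary: true, now: now)

delay_days = (registry.retry_at - now) / 86_400.0
puts "retry_count=#{registry.retry_count}, next retry in #{delay_days.round(1)} days"
# prints "retry_count=18, next retry in 3.0 days"
```

With these assumed numbers, a single transient failure leaves the entry marked as failed for about three days before it is looked at again, even though a manual retry right after the outage would succeed.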
/cc @mkozono