Geo: Improve data integrity with reliable Sidekiq queuing
@vsizov started the conversation here: https://gitlab.com/gitlab-org/gitlab-ce/issues/36791. That issue does not propose a comprehensive fix; rather, it aims to improve reliability incrementally with a small change.
A lot of GitLab code assumes Sidekiq queues are reliable, but they are not: a job that a worker has already fetched is lost if that process dies before finishing it. Over time we lose small amounts of data due to lost jobs. The data loss is compounded on Geo secondaries, since Geo uses jobs extensively.
We've done a lot of work to combat this problem in Geo, but Geo is a Disaster Recovery solution and should be built on a reliable queue. It's also in our best interest to minimize data loss before replication.
~Geo should prioritize reliable queuing to improve data integrity everywhere.
Options:
- Use Sidekiq Enterprise: Not possible for CE, and too expensive to incorporate into EE
- Migrate to RabbitMQ or another reliable queue: I assume this would take too much time
- Modify or build on top of Sidekiq to make it reliable
The last option was suggested here: https://gitlab.com/gitlab-org/gitlab-ce/issues/36791#note_38179123 and here: https://gitlab.com/gitlab-org/gitlab-ce/issues/36791#note_73358701
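
To make the third option more concrete, here is a minimal sketch of the general "reliable queue" pattern it would likely rely on: a fetched job is moved to a per-worker in-progress list rather than simply removed, and is only dropped once the worker acknowledges it. This is an in-memory model for discussion, not Sidekiq or Redis code. The class `ReliableQueue`, its methods, and the job and worker names are all hypothetical, and a real implementation would need the move in `fetch` to be a single atomic Redis operation.

```python
from collections import deque


class ReliableQueue:
    """In-memory model of the reliable-queue pattern (hypothetical, not Sidekiq code)."""

    def __init__(self):
        self.pending = deque()   # jobs waiting to be picked up
        self.in_progress = {}    # worker_id -> jobs fetched but not yet acknowledged

    def push(self, job):
        self.pending.append(job)

    def fetch(self, worker_id):
        # Move the job to the worker's in-progress list instead of just popping it.
        # Assumption: in Redis this would have to be one atomic command.
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.in_progress.setdefault(worker_id, []).append(job)
        return job

    def ack(self, worker_id, job):
        # Called only after the job has finished; now it is safe to forget it.
        self.in_progress[worker_id].remove(job)

    def recover(self, dead_worker_id):
        # Requeue everything a crashed worker fetched but never acknowledged.
        orphaned = self.in_progress.pop(dead_worker_id, [])
        self.pending.extend(orphaned)
        return len(orphaned)


queue = ReliableQueue()
queue.push("sync-project-1")
queue.push("sync-project-2")

job = queue.fetch("worker-a")          # worker-a takes a job ...
# ... and crashes before calling ack(), so the job is still tracked.
recovered = queue.recover("worker-a")  # a supervisor requeues the orphaned job
```

The trade-off is that a recovered job may run more than once (at-least-once delivery), so jobs handled this way would need to be idempotent.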