Org Mover: Double check for dropped create or delete events prior to cutover
Release notes
A Rake task was added to help ensure data consistency while migrating an organization. It is now possible to double check for dropped create or delete "replication events" over the last X hours for Y data type. An SRE or a UI can run this task just after Organization Maintenance mode is enabled, and just prior to finalizing an Org move.
Problem to solve
Org Mover's create and delete events are generally reliable, but it is possible for them to be lost or dropped. In that case, we rely on the redundant background job RegistryConsistencyWorker to provide an eventual consistency guarantee. But when Organization Maintenance mode is enabled prior to a cutover, we need a guarantee of consistency for all data, as quickly as possible.
Even if we were to significantly optimize the performance of RegistryConsistencyWorker, it's not scalable to check all replicable job artifact rows during Organization Maintenance mode.
Proposal
- Assertion: Dropped delete events are much less critical than dropped create events.
- Add a Rake task to run on the target cell. E.g. SINCE_HOURS_AGO=72 gitlab-rake gitlab:geo:double_check_for_dropped_create_events
- Open a follow-up issue: We should also give the SRE a way to see when the last iteration of RegistryConsistencyWorker finished for a particular data type. This should be approximately possible manually by using the RegistryBatcher cursor and associating IDs with created_at timestamps. But it would be just as easy to track that information precisely and expose it by Rake task. So if this issue gets implemented, then we should open a follow-up feature request to do that. I think it would be around weight 2-3.
More details
Since RegistryConsistencyWorker is continuously running, we only need to check the rows that were created since RegistryConsistencyWorker last finished processing a whole table in order to recover from all dropped "job artifact was created" events. If it takes 7 days to process a whole table, then at most we need to check the rows that were created in the last 7 days in order to guarantee that all dropped create events are resolved. That is a tiny fraction of the table.
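The windowed check described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: SourceRow and missing_registry_ids are hypothetical names, the rows and registry IDs are in-memory stand-ins for database tables, and the real Rake task may work quite differently.

```ruby
require "set"

# Hypothetical stand-in for a replicable row on the source side.
SourceRow = Struct.new(:id, :created_at)

# Returns the IDs of rows created within the last `since_hours_ago` hours
# that have no matching registry entry on the target side.
def missing_registry_ids(source_rows, registry_ids, since_hours_ago:, now: Time.now)
  cutoff = now - (since_hours_ago * 3600)
  source_rows
    .select { |row| row.created_at >= cutoff }
    .reject { |row| registry_ids.include?(row.id) }
    .map(&:id)
end

now = Time.utc(2024, 1, 10, 12, 0, 0)
rows = [
  SourceRow.new(1, now - 200 * 3600), # older than the window; left to the background worker
  SourceRow.new(2, now - 48 * 3600),  # in the window, already has a registry entry
  SourceRow.new(3, now - 2 * 3600)    # in the window, create event was dropped
]
registry_ids = Set[1, 2]

p missing_registry_ids(rows, registry_ids, since_hours_ago: 72, now: now) # => [3]
```

Only rows inside the window are compared against the registry, so the cost scales with the number of recently created rows rather than with the size of the table.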
This solution does not recover from dropped delete events, because they may have IDs that RegistryConsistencyWorker just finished processing.
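One way to read this limitation: a row deleted on the source can have any ID, old or new, so a window over recently created rows does not bound where the orphaned registry entry sits. A toy illustration with hypothetical in-memory ID sets (not the actual registry tables):

```ruby
require "set"

# Hypothetical data: IDs still present on the source, and IDs tracked by the registry.
source_ids   = Set[1, 2, 4]
registry_ids = Set[1, 2, 3, 4] # 3 was deleted on the source, but the delete event was dropped

# Orphaned registry entries: still tracked on the target, gone from the source.
orphaned = registry_ids - source_ids
p orphaned.to_a # => [3]
```

Finding the orphan here takes a comparison of the full ID sets. No created_at filter on the source can surface it, because the source row no longer exists.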
Dropped update events should already be recovered via Org Mover's "verification" processes. And for mutable replicables, subsequent updates recover from past dropped update events.