Transposing the Geo event log into registry tables is generally racy: Discovery

Problem

This is the more general-purpose version of https://gitlab.com/gitlab-org/gitlab-ee/issues/5489

In our architecture, the log cursor pulls events off of the event log and updates the various *_registry tables with mutable state. Some time later, a sidekiq job will run and update that mutable state in some manner.

In #5489 (closed), I observed that two Geo::RepositoryUpdatedEvent events sent in quick succession might not do the right thing, as the sidekiq job for the first event will overwrite the state entered by the log cursor.

I've now observed that a Geo::RepositoryUpdateEvent followed by a Geo::RepositoryDeletedEvent may also be subject to a similar problem. It goes:

  • Geo::RepositoryUpdatedEvent received
    • project_registry row updated / created
    • enqueues a sidekiq job
  • sidekiq job for update begins to run
  • Geo::RepositoryDeletedEvent received
    • enqueues a sidekiq job (irrelevant)
    • project_registry row deleted
  • sidekiq job for update completes, creates project_registry row

This case may also explain some of the more-confusing "orphaned project_registry row" cases we've seen on gprd \o/

Where there are two cases, there are probably more, so I think this issue should focus on a more general rethink of how we're managing this mutable state, and how to avoid these kinds of races in general. Every update to the *_registry tables needs to take into account the fact that state may have changed since it last looked.

Ultimately, this is a consequence of having the log cursor execute its work asynchronously, out-of-order, etc, by pushing long-running tasks into sidekiq. This means we concurrently update the mutable state, and introduces the requirement for some sort of awareness or synchronization between the two updaters.

Since these jobs can be very long-running (it could take over an hour to process a RepositoryUpdatedEvent, for instance), I don't think we can reasonably drop the asynchronous execution part of the puzzle. I'll puzzle some more in the issue comments.

Proposals

We have a number of ideas here. I'll attempt to list them:

  1. Optimistic lock registry tables
  2. Version registry resyncs
  3. Use same ExclusiveLease across create, update, delete
  4. Soft delete XRegistry + soft delete XDeletedEvent + clean up

It seems like we've settled on doing 4) which should solve update-delete races.

But we also need something like 1) or 2) to eliminate update-update races.

Perhaps we should open a separate issue for 4) and schedule that?

And we could hold off scheduling this issue until we come to a decision on update-update races.

Edited Oct 08, 2018 by Andreas Kämmerle
Assignee Loading
Time tracking Loading