Skip to content

Geo: Fix SSF repository replication race condition on updates

Michael Kozono requested to merge mk/fix-dedup-race-condition into master

What does this MR do and why?

Describe in detail what your merge request does and why.

#391387 (closed):

When a mutable data type is updated, for example multiple times in quick succession, then the replication work is deduplicated by using an exclusive lease. But we need to ensure that at least one replication job runs from start to finish after the last known update.

One way to make that happen is to:

  1. Atomically (in a single, indivisible action in Postgres): Move a registry to synced, if it's current state is started.
  2. If the atomic update finds that the current state is not started, then reschedule the replication job.
  3. When a secondary site processes a new "update event", move the registry to pending.

We already do 1. and 2.. We don't currently do 3. for Git repositories in SSF.

This MR does 3..

This also fixes a problem with 2. in which: The rescheduled job starts running before the current one releases its lease, so the rescheduled job just exits.

This also fixes the race condition for Container repository replication in which: While another sync job is running, then pending! raises an error due to not being able to transition from started.

Minor downside

Technically this MR creates its own minor race condition in that it can, in rare cases, cause more replication jobs than strictly needed:

When a running sync reschedules itself, it reuses the "update event" mechanism, for convenience and dryness. But this change makes it so that that rescheduled job will first mark the registry pending. If another sync job started in the small window after the originally running sync, and before the rescheduled one, then the newly running sync will reschedule itself. Therefore it's possible to cause more syncs than needed.

We could fix this by making "reschedule" schedule a job that doesn't begin with marking the registry pending. But I don't think this is important, since the interloping sync job would have to be enqueued within what I expect would be a small window. And the consequence is small.

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

  1. Set up Geo
  2. Sidekiq concurrency must be allowed, i.e. you aren't forcing there to be a single worker at a time (unusual).
  3. Create a snippet if there are none.
  4. In the secondary site, tail -f log/geo.log | grep "FrameworkRepositorySyncService".
  5. In the secondary site, enqueue 5 "snippet updated" jobs all at once: bin/rails runner "5.times { Snippet.last.snippet_repository.replicator.reschedule_sync }".

Expected log output:

{"severity":"INFO","time":"2023-02-11T01:08:43.240Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Started snippet_repository sync"}
{"severity":"INFO","time":"2023-02-11T01:08:43.241Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Marking snippet_repository sync as started"}
{"severity":"INFO","time":"2023-02-11T01:08:43.259Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Trying to fetch snippet_repository"}
{"severity":"ERROR","time":"2023-02-11T01:08:43.263Z","correlation_id":"e563cbbcdb5e05c59d3922df13b938f9","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Cannot obtain an exclusive lease for geo_sync_ssf_service:snippet_repository:21. There must be another instance already in execution."}
{"severity":"ERROR","time":"2023-02-11T01:08:43.274Z","correlation_id":"6b8e3ff6d64884585999b79c0ee4e93b","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Cannot obtain an exclusive lease for geo_sync_ssf_service:snippet_repository:21. There must be another instance already in execution."}
{"severity":"ERROR","time":"2023-02-11T01:08:43.285Z","correlation_id":"aca567aa3e6f9d87c3471689047859b6","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Cannot obtain an exclusive lease for geo_sync_ssf_service:snippet_repository:21. There must be another instance already in execution."}
{"severity":"ERROR","time":"2023-02-11T01:08:43.288Z","correlation_id":"4520baefa0fc33b7756e815233b3e61e","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Cannot obtain an exclusive lease for geo_sync_ssf_service:snippet_repository:21. There must be another instance already in execution."}
{"severity":"INFO","time":"2023-02-11T01:08:43.753Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Marking snippet_repository sync as successful"}
{"severity":"INFO","time":"2023-02-11T01:08:43.757Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Finished snippet_repository sync","download_time_s":0.509}
{"severity":"INFO","time":"2023-02-11T01:08:43.757Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Expiring caches for repository"}
{"severity":"INFO","time":"2023-02-11T01:08:43.763Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Finished snippet_repository sync"}
{"severity":"INFO","time":"2023-02-11T01:08:43.763Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Reschedule the sync because an updated event was processed during the sync"}
{"severity":"INFO","time":"2023-02-11T01:08:43.776Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Started snippet_repository sync"}
{"severity":"INFO","time":"2023-02-11T01:08:43.776Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Marking snippet_repository sync as started"}
{"severity":"INFO","time":"2023-02-11T01:08:43.779Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Trying to fetch snippet_repository"}
{"severity":"INFO","time":"2023-02-11T01:08:44.247Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Marking snippet_repository sync as successful"}
{"severity":"INFO","time":"2023-02-11T01:08:44.252Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Finished snippet_repository sync","download_time_s":0.475}
{"severity":"INFO","time":"2023-02-11T01:08:44.252Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Expiring caches for repository"}
{"severity":"INFO","time":"2023-02-11T01:08:44.257Z","correlation_id":"084689c2b67d15f0a4975e903731bc74","class":"Geo::FrameworkRepositorySyncService","host":"gdk2.test","message":"Finished snippet_repository sync"}

Notice:

  • There are 4 exclusive lease errors "deduplicating" the replication work.
  • There is a single "Reschedule the sync".
  • There is a successful sync after that.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Michael Kozono

Merge request reports