Geo repository replication using SSF may get stuck in a `started` state
Summary
It seems possible for snippet repository replication to get stuck in a `started` state. Subsequent attempts to replicate then fail with:
StateMachines::InvalidTransition (Cannot transition state via :start from :started (Reason(s): State cannot transition via "start"))
Steps to reproduce
It is unclear at this point how we ended up in this state. The working assumption is that an error other than `Gitlab::Git::Repository::NoRepository`, `Gitlab::Shell::Error`, or `Gitlab::Git::BaseError` was raised, and that we do not rescue it.
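To illustrate the assumption above, here is a minimal plain-Ruby sketch of how a narrow `rescue` could leave a registry in `started`. It is not the real sync service: `ToyRegistry`, `KnownSyncError`, `InvalidTransition`, and `sync_with_narrow_rescue` are hypothetical stand-ins, and the actual cause has not been confirmed.

```ruby
# Hypothetical stand-ins; none of these are GitLab's real classes.
class InvalidTransition < StandardError; end
class KnownSyncError < StandardError; end # stands in for the rescued error classes

class ToyRegistry
  attr_reader :state

  def initialize
    @state = :pending
  end

  # Refuses to start a sync that is already marked as started.
  def start!
    if @state == :started
      raise InvalidTransition, "Cannot transition state via :start from :started"
    end

    @state = :started
  end

  def synced!
    @state = :synced
  end

  def failed!
    @state = :failed
  end
end

# Rescues only the known error class. Any other error propagates
# after start! has already run, so the state stays :started.
def sync_with_narrow_rescue(registry)
  registry.start!
  yield
  registry.synced!
rescue KnownSyncError
  registry.failed!
end

registry = ToyRegistry.new

begin
  sync_with_narrow_rescue(registry) { raise ArgumentError, "unexpected failure" }
rescue ArgumentError
  # Nothing moved the registry out of :started.
end

puts registry.state

begin
  sync_with_narrow_rescue(registry) { }
rescue InvalidTransition => e
  puts e.message
end
```

In this sketch the second attempt fails at `start!`, which matches the shape of the error in the traceback, though it does not prove that this is what happened on the affected node.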
What is the current bug behavior?
Snippet repository sync never succeeds because the registry is stuck in a `started` state.
What is the expected correct behavior?
SSF Repository sync should function as normal.
Relevant logs and/or screenshots
irb(main):004:0> unsynced_snippets = Geo::SnippetRepositoryRegistry.all - Geo::SnippetRepositoryRegistry.synced
=> [#<Geo::SnippetRepositoryRegistry id: 4, retry_at: nil, last_synced_at: "2021-04-14 18:12:24", created_at: "2021-04-14 18:11:59", snippet_repository_id: 10, state: 1, re...
# notice the state: 1
irb(main):005:0> unsynced_snippets.first.replicator.send(:sync_repository)
Traceback (most recent call last):
7: from (irb):5
6: from ee/app/models/concerns/geo/repository_replicator_strategy.rb:38:in `sync_repository'
5: from ee/app/services/geo/framework_repository_sync_service.rb:29:in `execute'
4: from app/services/concerns/exclusive_lease_guard.rb:29:in `try_obtain_lease'
3: from ee/app/services/geo/framework_repository_sync_service.rb:32:in `block in execute'
2: from ee/app/services/geo/framework_repository_sync_service.rb:39:in `sync_repository'
1: from ee/app/services/geo/framework_repository_sync_service.rb:158:in `start_registry_sync!'
StateMachines::InvalidTransition (Cannot transition state via :start from :started (Reason(s): State cannot transition via "start"))
Possible fixes
Potentially ensure that we transition out of `started` if any type of error occurs, and/or look for long-running `started` registries that may still be stuck and reset them.
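Both ideas can be sketched in plain Ruby. This is only an illustration under stated assumptions, not a proposed patch: `SketchRegistry`, `sync_with_catch_all`, `stale_started`, `reset_stale!`, and the four-hour timeout are all hypothetical, and the real change would need to go through the registry's state machine.

```ruby
# Hypothetical stand-in for a registry row; not GitLab's real model.
SketchRegistry = Struct.new(:id, :state, :last_synced_at, keyword_init: true)

# Marks the registry as failed for any StandardError, then re-raises
# so the error is still reported.
def sync_with_catch_all(registry, now: Time.now)
  registry.state = :started
  registry.last_synced_at = now
  yield
  registry.state = :synced
rescue StandardError
  registry.state = :failed
  raise
end

# Returns registries that have been :started for longer than `timeout` seconds.
def stale_started(registries, timeout:, now: Time.now)
  registries.select do |r|
    r.state == :started && r.last_synced_at && (now - r.last_synced_at) > timeout
  end
end

# Marks stale :started registries as :failed so a later sync can start again.
def reset_stale!(registries, timeout:, now: Time.now)
  stale_started(registries, timeout: timeout, now: now).each { |r| r.state = :failed }
end

now = Time.now
registries = [
  SketchRegistry.new(id: 1, state: :started, last_synced_at: now - 8 * 3600),
  SketchRegistry.new(id: 2, state: :started, last_synced_at: now - 60),
  SketchRegistry.new(id: 3, state: :synced, last_synced_at: now - 8 * 3600)
]

# Assumed timeout of four hours; the right value is an open question.
reset_stale!(registries, timeout: 4 * 3600, now: now)
puts registries.map(&:state).inspect
```

The catch-all rescue alone would not cover a worker that is killed mid-sync, since no Ruby exception handler runs in that case. That is why the sketch pairs it with a sweep for stale `started` registries.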