Geo: Cannot transition state via :pending from :failed due to "Verification failure can't be blank"
Summary
Reported in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/88#note_1663030306.
Some Geo::EventWorker Sidekiq jobs with args project_repository, created are exiting with this error:
StateMachines::InvalidTransition: Cannot transition state via :pending from :failed (Reason(s): Verification failure can't be blank)
At face value, this transition looks like it should be valid.
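To illustrate why a sync transition can be rejected for a verification-related reason, here is a toy model in plain Ruby. It is not GitLab's code: `ToyRegistry`, its `errors` method, and the `InvalidTransition` class defined here are hypothetical stand-ins. The rule that a failed verification must carry a failure message is an assumption inferred from the error text. The sketch only shows the general pattern where a transition is refused when the record as a whole is invalid.

```ruby
# Toy model, NOT GitLab's code: plain Ruby standing in for a registry record
# whose state transitions are refused while the record is invalid.
class InvalidTransition < StandardError; end

ToyRegistry = Struct.new(:state, :verification_state, :verification_failure) do
  # Assumed rule: a failed verification must carry a failure message.
  def errors
    if verification_state == :verification_failed && verification_failure.to_s.strip.empty?
      ["Verification failure can't be blank"]
    else
      []
    end
  end

  # Sync event :pending. It is refused when any validation fails, even one
  # that concerns verification rather than sync.
  def pending!
    unless errors.empty?
      raise InvalidTransition,
            "Cannot transition state via :pending from #{state.inspect} " \
            "(Reason(s): #{errors.join(', ')})"
    end
    self.state = :pending
  end
end

# A record that is sync failed AND verification failed without a message.
stuck = ToyRegistry.new(:failed, :verification_failed, nil)
begin
  stuck.pending!
rescue InvalidTransition => e
  puts e.message
end
```

In this model the sync transition itself is allowed; it is the leftover verification data that blocks it.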
Steps to reproduce
I expected the steps to reproduce to be straightforward. However, the steps below did not reproduce the error for me locally as of 29 Nov.
- In a secondary Geo site, get a Rails console
    sudo gitlab-rails console
  - Get a project repo registry record:
      r = Geo::ProjectRepositoryRegistry.last
  - Take note of the project_id.
  - Put it in sync failed state for a day
      r.failed!(message: "Mike did it"); r.update!(retry_at: 1.day.from_now)
- In the primary Geo site, get a Rails console
    sudo gitlab-rails console
  - Publish a created event for that repo (replace project_id with the actual integer):
      Project.find(project_id).replicator.geo_handle_after_create
What is the current bug behavior?
These Sidekiq jobs exit with this error (which is currently muddying GitLab Dedicated's metrics and alerts).
Apparently, affected projects may eventually recover on their own.
What is the expected correct behavior?
If the transition is indeed valid (we should double-check that this is true), then these Sidekiq jobs should exit successfully.
Relevant logs and/or screenshots
Possible fixes
Proposal:
- Allow transition from sync pending or sync failed, to sync failed. => !143158 (merged)
- When registry transitions to sync failed, transition to verification disabled. Won't do, see #433182 (comment 1748579436)
- When registry transitions to sync pending, transition to verification disabled. => !143158 (merged)
- Adjust tests
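
The two merged proposal items can be sketched with a toy state model in plain Ruby. This is not the code from !143158: the class `ToySyncRegistry`, the `ALLOWED_FROM` table, and the `valid?` rule are all hypothetical, and the transition rows other than "pending or failed to failed" are assumptions made only so the example runs.

```ruby
# Toy model of the proposal, NOT the merged GitLab change.
class ToySyncRegistry
  # Source states accepted per sync event. The proposal item modeled here is
  # that :pending and :failed are valid sources for the :failed event.
  # The remaining entries are illustrative assumptions.
  ALLOWED_FROM = {
    pending: %i[pending synced failed],
    failed: %i[pending started failed]
  }.freeze

  attr_reader :state, :verification_state, :verification_failure

  def initialize(state:, verification_state:, verification_failure: nil)
    @state = state
    @verification_state = verification_state
    @verification_failure = verification_failure
  end

  # Assumed validation: a failed verification needs a failure message.
  def valid?
    verification_state != :verification_failed || !verification_failure.to_s.empty?
  end

  # Proposal item: moving to sync pending also moves verification to disabled,
  # so stale verification data can no longer invalidate the record.
  def pending!
    transition!(:pending)
    @verification_state = :verification_disabled
  end

  def failed!
    transition!(:failed)
  end

  private

  def transition!(event)
    unless ALLOWED_FROM.fetch(event).include?(@state)
      raise ArgumentError, "Cannot transition state via #{event.inspect} from #{@state.inspect}"
    end
    @state = event
  end
end

registry = ToySyncRegistry.new(state: :failed, verification_state: :verification_failed)
registry.failed!  # failed -> failed is accepted
registry.pending! # failed -> pending also disables verification
puts "#{registry.state} / #{registry.verification_state} / valid: #{registry.valid?}"
```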
Workaround to unstick currently stuck records
Paste this into Rails console in the secondary site:
def fix_verification_failure_cant_be_blank_for_registry(registry_class)
  # Affected records: verification failed (3) but with no failure message
  affected = registry_class.where(verification_state: 3, verification_failure: nil)
  puts "Found #{affected.count} affected #{registry_class} records"

  # Mark the records as sync failed (3) and due for retry, with
  # verification disabled (4), so that they get resynced
  update_attrs = {
    state: 3,
    last_sync_failure: "Manually setting sync failed",
    retry_count: 1,
    retry_at: registry_class.next_retry_time(1),
    verification_state: 4,
    verification_retry_count: 1,
    verification_retry_at: nil,
    verified_at: nil
  }

  affected.each_batch do |batch|
    num_updated = batch.update_all(update_attrs)
    puts "Manually set sync failed for #{num_updated} #{registry_class} records"
  end
end

def fix_verification_failure_cant_be_blank
  registry_classes = ::Gitlab::Geo.verification_enabled_replicator_classes.map(&:registry_class)

  # Fix the affected records of every registry class that supports verification
  registry_classes.each do |registry_class|
    fix_verification_failure_cant_be_blank_for_registry(registry_class)
  end

  # Poll every 3 seconds until no manually failed records are left waiting
  loop do
    total_waiting = registry_classes.reduce(0) do |sum, registry_class|
      num_waiting = registry_class.where(state: 3, verification_state: 4).count
      puts "#{num_waiting} #{registry_class} affected records waiting to be resynced" if num_waiting > 0
      sum + num_waiting
    end

    break if total_waiting.zero?

    sleep 3
  end

  puts "Done"
end

fix_verification_failure_cant_be_blank
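
The polling loop in the workaround waits indefinitely, so it never returns if some records never get resynced. A bounded variant might look like the sketch below. `wait_until_drained` is a hypothetical helper, not part of GitLab, and the example feeds it a plain array instead of real registry counts so that it runs outside a Rails console.

```ruby
# Hypothetical helper: poll a block until it returns zero or the deadline passes.
# Returns true when the count reached zero, false on timeout.
def wait_until_drained(timeout:, interval: 3)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    remaining = yield
    return true if remaining.zero?
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# Stand-in for the registry counts: 2 waiting, then 1, then none.
counts = [2, 1, 0]
drained = wait_until_drained(timeout: 5, interval: 0.01) { counts.shift || 0 }
puts drained ? "Done" : "Timed out; some records are still waiting"
```

In a Rails console, the block would instead sum `registry_class.where(state: 3, verification_state: 4).count` over the registry classes, as the workaround does.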