Geo: Cannot transition state via :pending from :failed due to "Verification failure can't be blank"

Summary

Reported in https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/88#note_1663030306.

Some Geo::EventWorker Sidekiq jobs with the args "project_repository" and "created" are exiting with this error:

StateMachines::InvalidTransition: Cannot transition state via :pending from :failed (Reason(s): Verification failure can't be blank)

This transition sounds like it should be valid, at face value.
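
One plausible reading of the error, sketched here in plain Ruby rather than with the actual GitLab models: a state transition saves the record, saving runs every validation, and so a record that is already invalid for an unrelated reason (a verification failed state with a blank failure message) cannot transition at all. ToyRegistry, its attributes, and TransitionError are hypothetical stand-ins, not GitLab's registry classes or the state_machines gem's API.

```ruby
# Toy model only: ToyRegistry and TransitionError are hypothetical stand-ins,
# not GitLab's Geo registry classes or the state_machines gem.
class TransitionError < StandardError; end

class ToyRegistry
  attr_reader :state, :verification_state, :verification_failure

  def initialize(state:, verification_state:, verification_failure: nil)
    @state = state
    @verification_state = verification_state
    @verification_failure = verification_failure
  end

  # Validation covers the whole record, regardless of which attribute
  # a given transition is trying to change.
  def errors
    errs = []
    if verification_state == :verification_failed && verification_failure.to_s.strip.empty?
      errs << "Verification failure can't be blank"
    end
    errs
  end

  # Change the sync state, then validate the whole record.
  # On validation failure, roll the state back and raise.
  def pending!
    previous = @state
    @state = :pending
    return true if errors.empty?

    @state = previous
    raise TransitionError,
      "Cannot transition state via :pending from #{previous.inspect} " \
      "(Reason(s): #{errors.join(', ')})"
  end
end

# Sync failed, verification failed, but no verification failure message.
stuck = ToyRegistry.new(state: :failed, verification_state: :verification_failed)

begin
  stuck.pending!
rescue TransitionError => e
  puts e.message
end
```

In this toy model the failed-to-pending transition itself is allowed; it is the stale verification data on the same record that blocks the save.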

Steps to reproduce

I expected the steps to reproduce to be straightforward. However, the steps below did not reproduce the error for me locally as of 29 Nov.

  1. In a secondary Geo site, open a Rails console: sudo gitlab-rails console
    1. Get a project repo registry record: r = Geo::ProjectRepositoryRegistry.last
    2. Take note of the project_id.
    3. Put it in the sync failed state for a day: r.failed!(message: "Mike did it"); r.update!(retry_at: 1.day.from_now)
  2. In the primary Geo site, open a Rails console: sudo gitlab-rails console
    1. Publish a created event for that repo (replace project_id with the actual integer): Project.find(project_id).replicator.geo_handle_after_create

What is the current bug behavior?

These Sidekiq jobs exit with this error (which is currently muddying GitLab Dedicated's metrics and alerts).

Apparently, affected projects may eventually recover on their own.

What is the expected correct behavior?

If the transition is indeed valid (we should double-check that this is true), then these Sidekiq jobs should exit successfully.

Relevant logs and/or screenshots

Possible fixes

Proposal:

  • Allow transition from sync pending or sync failed to sync failed. => !143158 (merged)
  • When registry transitions to sync failed, transition to verification disabled. => Won't do, see #433182 (comment 1748579436)
  • When registry transitions to sync pending, transition to verification disabled. => !143158 (merged)
  • Adjust tests
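
The third bullet can be illustrated with a minimal plain-Ruby sketch. It is not the code from !143158: ToyRegistryWithFix and its attributes are hypothetical, and it only assumes that moving to sync pending also resets verification to disabled, so stale verification data can no longer fail validation.

```ruby
# Toy model only: ToyRegistryWithFix is a hypothetical stand-in,
# not the change merged in GitLab.
class ToyRegistryWithFix
  attr_reader :state, :verification_state, :verification_failure

  def initialize(state:, verification_state:, verification_failure: nil)
    @state = state
    @verification_state = verification_state
    @verification_failure = verification_failure
  end

  # Same rule as the reported error: a verification failed record
  # must carry a failure message.
  def valid?
    !(verification_state == :verification_failed && verification_failure.to_s.strip.empty?)
  end

  # Moving to sync pending also moves verification to disabled and clears
  # the failure message, so the record validates before it is "saved".
  def pending!
    @verification_state = :verification_disabled
    @verification_failure = nil
    @state = :pending
    raise "record is invalid" unless valid?

    true
  end
end

stuck = ToyRegistryWithFix.new(state: :failed, verification_state: :verification_failed)
stuck.pending!
puts "#{stuck.state} / #{stuck.verification_state}"
```

The design choice sketched here is to treat a resync as invalidating any earlier verification result, which matches the direction of the proposal but leaves out retry counters and timestamps.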

Workaround to unstick currently stuck records

Paste this into a Rails console on the secondary site:

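def fix_verification_failure_cant_be_blank_for_registry(registry_class)
  # verification_state 3 is verification failed; a blank verification_failure
  # on such a record is what triggers the validation error.
  affected = registry_class.where(verification_state: 3, verification_failure: nil)
  puts "Found #{affected.count} affected #{registry_class} records"
  
  # Mark the records as sync failed (state 3) and due for retry, and set
  # verification to disabled (verification_state 4) so they can be resynced.
  update_attrs = {
    state: 3,
    last_sync_failure: "Manually setting sync failed",
    retry_count: 1,
    retry_at: registry_class.next_retry_time(1),
    verification_state: 4,
    verification_retry_count: 1,
    verification_retry_at: nil,
    verified_at: nil
  }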

  affected.each_batch do |batch|
    num_updated = batch.update_all(update_attrs)
    puts "Manually set sync failed for #{num_updated} #{registry_class} records"
  end
end

def fix_verification_failure_cant_be_blank
  registry_classes = ::Gitlab::Geo.verification_enabled_replicator_classes.map(&:registry_class)

  registry_classes.each do |registry_class|
    fix_verification_failure_cant_be_blank_for_registry(registry_class)
  end

  loop do
    total_waiting = registry_classes.reduce(0) do |sum, registry_class|
      num_waiting = registry_class.where(state: 3, verification_state: 4).count
      puts "#{num_waiting} #{registry_class} affected records waiting to be resynced" if num_waiting > 0
      sum + num_waiting
    end

    break if total_waiting.zero?

    sleep 3
  end

  puts "Done"
end

fix_verification_failure_cant_be_blank
Edited by Michael Kozono