Draft: Fix negative Geo count

What does this MR do and why?

Fixes a race condition in Geo secondary metric collection that can produce negative object counts (most commonly seen in the Job Artifacts "queued" count).

Root cause

load_secondary_ssf_replicable_data in GeoNodeStatus collects registry_count, synced_count, and failed_count via separate, non-atomic batch_count queries. Between queries, registries can be created and synced, causing synced_count + failed_count to exceed the previously-snapshotted registry_count. The frontend computes queued = total - synced - failed, which goes negative.

A second vector is BatchCounter returning its FALLBACK = -1 on failure, which also results in a negative derived count.
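Both vectors come down to the same subtraction. The sketch below is illustrative only: derived_queued and FALLBACK_SKETCH are hypothetical names standing in for the frontend calculation and the BatchCounter fallback value described above, and the numbers are made up.

```ruby
# Hypothetical stand-in for the frontend formula: queued = total - synced - failed.
def derived_queued(total, synced, failed)
  total - synced - failed
end

# Vector 1 (race): the total was snapshotted at 100, then more registries
# were created and synced before the synced/failed queries ran.
stale_total = 100
synced = 98
failed = 7
puts derived_queued(stale_total, synced, failed) # prints -5

# Vector 2 (fallback): the total query fails and reports -1.
# FALLBACK_SKETCH mirrors the FALLBACK = -1 value mentioned above.
FALLBACK_SKETCH = -1
puts derived_queued(FALLBACK_SKETCH, 0, 0) # prints -1
```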

Fix

After collecting all metrics for each replicator, ensure_consistent_counts adjusts count and registry_count upward to max(synced + failed, 0) when the race causes them to be lower. This guarantees the derived "queued" count is never negative.
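A minimal sketch of the clamping rule described above, assuming plain integer inputs. consistent_total is a hypothetical helper written for illustration; the actual ensure_consistent_counts in this MR operates on GeoNodeStatus attributes and may differ in detail.

```ruby
# Hypothetical helper: raise the total to at least max(synced + failed, 0).
# A total that is already large enough is returned unchanged.
def consistent_total(total, synced, failed)
  floor = [synced.to_i + failed.to_i, 0].max
  [total.to_i, floor].max
end

# Race case: a stale total of 100 is lifted to 105, so queued becomes 0.
total = consistent_total(100, 98, 7)
puts total - 98 - 7 # prints 0

# Fallback case: a total of -1 is lifted to 0.
puts consistent_total(-1, 0, 0) # prints 0

# Healthy case: the total is left alone and queued stays positive.
puts consistent_total(120, 98, 7) # prints 120
```

Because the total is only ever adjusted upward, a consistent snapshot passes through untouched and only the inconsistent cases are corrected.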

References

Screenshots or screen recordings

-

How to set up and validate locally

  1. Start a GDK Geo secondary and open a Rails console

  2. Simulate the race condition by monkey-patching registry_count to return a stale value:

    replicator = Geo::JobArtifactReplicator
    real_synced = replicator.synced_count
    real_failed = replicator.failed_count
    stale_count = [real_synced + real_failed - 10, 0].max
    
    replicator.define_singleton_method(:registry_count) { stale_count }
    
    status = GeoNodeStatus.current_node_status
    name = replicator.replicable_name_plural
    count = status.send("#{name}_count").to_i
    synced = status.send("#{name}_synced_count").to_i
    failed = status.send("#{name}_failed_count").to_i
    queued = count - synced - failed
    
    puts "queued: #{queued}"

  3. Before this MR, queued prints a negative number (e.g. -10)

  4. After this MR, queued prints 0

  5. Run the specs:

    bundle exec rspec ee/spec/models/geo_node_status_spec.rb -e "load_data_from_current_node"

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #439592
