Draft: Fix negative Geo count
What does this MR do and why?
Fixes a race condition in Geo secondary metric collection that can produce negative object counts (most commonly seen for Job Artifacts "queued" count).
Root cause
load_secondary_ssf_replicable_data in GeoNodeStatus collects registry_count, synced_count, and failed_count via separate, non-atomic batch_count queries. Between queries, registries can be created and synced, causing synced_count + failed_count to exceed the previously-snapshotted registry_count. The frontend computes queued = total - synced - failed, which goes negative.
A second vector is BatchCounter returning its FALLBACK = -1 on failure, which also results in a negative derived count.
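Both vectors can be reproduced with plain arithmetic. The following is a minimal sketch, not GitLab code: the Registry struct and the in-memory array are hypothetical stand-ins for the registry table, and the three counts are read in the same order the description gives (total first, then synced and failed).

```ruby
# Hypothetical in-memory stand-in for a registry table (not GitLab code).
Registry = Struct.new(:state)

registries = Array.new(100) { Registry.new(:synced) }

# Query 1: the total is snapshotted first.
registry_count = registries.size

# Between queries, 10 more registries are created and synced.
10.times { registries << Registry.new(:synced) }

# Queries 2 and 3 see the newer state of the table.
synced_count = registries.count { |r| r.state == :synced }
failed_count = registries.count { |r| r.state == :failed }

# Same derivation the description attributes to the frontend.
queued = registry_count - synced_count - failed_count
puts "queued after race: #{queued}"

# Second vector: the total falls back to -1 while the other counts are 0.
fallback_queued = -1 - 0 - 0
puts "queued after fallback: #{fallback_queued}"
```

With these numbers the stale total is 100 while synced is 110, so the derived value is -10; the fallback case yields -1.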
Fix
After collecting all metrics for each replicator, ensure_consistent_counts adjusts count and registry_count upward to max(synced + failed, 0) when the race causes them to be lower. This guarantees the derived "queued" count is never negative.
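The adjustment can be sketched as a standalone function. This is an illustration under assumptions, not the method from this MR: the hash-based signature and the key names are hypothetical, and the real ensure_consistent_counts presumably operates on GeoNodeStatus attributes instead.

```ruby
# Hypothetical standalone version of the adjustment described above.
# Takes a hash of counts and returns an adjusted copy.
def ensure_consistent_counts(counts)
  # Lower bound for the totals: max(synced + failed, 0).
  floor = [counts[:synced_count].to_i + counts[:failed_count].to_i, 0].max

  adjusted = counts.dup
  %i[count registry_count].each do |key|
    # Only ever adjust upward; consistent totals are left untouched.
    adjusted[key] = floor if adjusted[key].to_i < floor
  end
  adjusted
end

stale = { count: 100, registry_count: 100, synced_count: 105, failed_count: 5 }
fixed = ensure_consistent_counts(stale)
queued = fixed[:count] - fixed[:synced_count] - fixed[:failed_count]
puts "queued: #{queued}"
```

Because the totals are raised to at least synced + failed, the subtraction cannot go below zero. In this sketch a total of -1 (the BatchCounter fallback) is raised the same way.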
References
Screenshots or screen recordings
-
How to set up and validate locally
- Start a GDK Geo secondary and open a Rails console.
- Simulate the race condition by monkey-patching registry_count to return a stale value:

    replicator = Geo::JobArtifactReplicator
    real_synced = replicator.synced_count
    real_failed = replicator.failed_count
    stale_count = [real_synced + real_failed - 10, 0].max
    replicator.define_singleton_method(:registry_count) { stale_count }

    status = GeoNodeStatus.current_node_status
    name = replicator.replicable_name_plural
    count = status.send("#{name}_count").to_i
    synced = status.send("#{name}_synced_count").to_i
    failed = status.send("#{name}_failed_count").to_i
    queued = count - synced - failed
    puts "queued: #{queued}"

- Before this MR: queued prints a negative number (e.g. -10).
- After this MR: queued prints 0.
- Run the specs:

    bundle exec rspec ee/spec/models/geo_node_status_spec.rb -e "load_data_from_current_node"
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #439592