Add gauge reconciliation for metrics

Active-state gauges (new, queued, running, cancel-requested) can drift due to race conditions in increment/decrement operations. This adds a reconciliation mechanism that corrects gauge values by querying CockroachDB for the actual counts.

Changes:

  • Add reconcile_gauges() in middleware/metrics.py, which reconciles the by_state, by_state_ranch, and by_state_token gauges using the DB queries in crud/metrics.py
  • Add reconcile_metrics() middleware wrapper with RBAC enforcement
  • Add POST /v0.1/metrics/reconcile admin-only endpoint with RBAC (AccessObject.METRICS, AccessAction.RECONCILE)
  • Add ValkeyCache.update_fields() for partial hash updates via HSET
  • Add background periodic reconciliation via FastAPI lifespan (interval configurable via METRICS_RECONCILE_INTERVAL_SECONDS, default 300s), skipped when METRICS_ENABLED is false
  • Store reconciliation metadata (duration, timestamp) in a single tf:metrics:reconcile:metadata Valkey hash, exposed as Prometheus gauges
  • Add MetricsReconcileOut and ReconcileGaugeDiff response schemas using NucleusBaseModel
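
The reconciliation pass and the periodic loop described in the list above can be sketched as follows. This is a minimal illustration, not the code in this merge request: GaugeDiff, reconcile_once(), periodic_reconcile(), and the plain dicts standing in for the Valkey cache, the metadata hash, and the CockroachDB counts are all hypothetical names chosen for the example.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class GaugeDiff:
    """Hypothetical stand-in for one reconciled gauge entry."""
    name: str
    cached: int
    actual: int

    @property
    def drift(self) -> int:
        # Positive drift means the cached gauge was too high.
        return self.cached - self.actual


def reconcile_once(cache: dict, db_counts: dict) -> list:
    """Overwrite cached gauges with DB counts; return only entries that changed."""
    diffs = []
    # Cover gauges known to either side; a state absent from the DB counts as 0.
    for name in sorted(set(cache) | set(db_counts)):
        cached = cache.get(name, 0)
        actual = db_counts.get(name, 0)
        if cached != actual:
            diffs.append(GaugeDiff(name, cached, actual))
        cache[name] = actual
    return diffs


async def periodic_reconcile(cache, fetch_counts, interval_seconds, metadata, stop):
    """Run reconcile_once every interval_seconds until the stop event is set."""
    while not stop.is_set():
        started = time.monotonic()
        reconcile_once(cache, fetch_counts())
        # Partial update: only these two fields are written, similar to HSET on a hash.
        metadata.update(duration=time.monotonic() - started, timestamp=time.time())
        try:
            # Sleep for the interval, but wake up early if shutdown is requested.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass
```

In a FastAPI application the loop would typically be started as a task inside the lifespan context and stopped on shutdown; that wiring is omitted here to keep the sketch dependency-free.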

Architecture:

  • DB queries in crud/metrics.py (crud layer)
  • Business logic + RBAC in middleware/metrics.py (middleware layer)
  • Thin delegation in routers/metrics.py (router layer)
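
The three layers above could be arranged roughly like this. It is a sketch under stated assumptions: AccessObject.METRICS and AccessAction.RECONCILE are named in this description, but the enum definitions, the Forbidden exception, the permission set, and every function signature below are hypothetical and only show where each responsibility sits.

```python
from enum import Enum


class AccessObject(Enum):
    METRICS = "metrics"


class AccessAction(Enum):
    RECONCILE = "reconcile"


class Forbidden(Exception):
    """Hypothetical error raised when RBAC denies the request."""


# crud layer: only talks to the database (faked here with a list of states).
def count_requests_by_state(rows):
    counts = {}
    for state in rows:
        counts[state] = counts.get(state, 0) + 1
    return counts


# middleware layer: RBAC check plus the business logic.
def reconcile_metrics(permissions, rows, cache):
    if (AccessObject.METRICS, AccessAction.RECONCILE) not in permissions:
        raise Forbidden("metrics reconcile requires admin access")
    cache.clear()
    cache.update(count_requests_by_state(rows))
    return dict(cache)


# router layer: thin delegation, no logic of its own.
def post_metrics_reconcile(permissions, rows, cache):
    return reconcile_metrics(permissions, rows, cache)
```

Keeping the RBAC check in the middleware function means it applies regardless of which caller reaches the business logic, while the router stays a one-line pass-through.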

Note: Reconciliation has an inherent race condition. A state transition that occurs between reading the cached values and overwriting them with the DB counts is lost, because the overwrite discards it. The affected gauge self-corrects on the next reconciliation cycle (default 300s).
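
The window can be illustrated with a small simulation. This is a simplified model that assumes the counts are captured first and written to the cache afterwards; the dicts stand in for the database and the cache, and the helper names are hypothetical.

```python
def snapshot(db):
    """Capture the current DB counts (the start of the race window)."""
    return dict(db)


def overwrite(cache, counts):
    """Replace the cached gauges with the given counts (the end of the window)."""
    cache.clear()
    cache.update(counts)


def transition(db, cache, src, dst):
    """Move one request from src to dst in both the DB and the cache."""
    for store in (db, cache):
        store[src] -= 1
        store[dst] = store.get(dst, 0) + 1


db = {"queued": 2, "running": 1}
cache = {"queued": 2, "running": 1}

# Cycle 1: counts are captured, then a transition lands before the overwrite.
stale = snapshot(db)
transition(db, cache, "queued", "running")
overwrite(cache, stale)
after_cycle_1 = dict(cache)  # the transition has been overwritten

# Cycle 2: nothing happens inside the window, so the cache matches the DB again.
overwrite(cache, snapshot(db))
after_cycle_2 = dict(cache)
```

After the first cycle the cache lags the database by one transition; after the second cycle the two agree, which is the self-correction the note describes.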

Assisted-by: Claude Code

Signed-off-by: Miroslav Vadkerti <mvadkert@redhat.com>

Edited by Miroslav Vadkerti
