Add gauge reconciliation for metrics
Active-state gauges (new, queued, running, cancel-requested) can
drift due to race conditions in increment/decrement operations. This adds
a reconciliation mechanism that corrects gauge values by querying
CockroachDB for the actual counts.
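
The correction step can be pictured with a minimal sketch. This is not the code in this change: `GaugeDiff`, `reconcile()`, and the plain dicts standing in for the cache and the DB counts are all hypothetical names used for illustration only.

```python
from dataclasses import dataclass

# Active states named in the description above.
ACTIVE_STATES = ("new", "queued", "running", "cancel-requested")


@dataclass(frozen=True)
class GaugeDiff:
    """Hypothetical record of one corrected gauge."""
    state: str
    cached: int
    actual: int


def reconcile(cached: dict[str, int], actual: dict[str, int]) -> list[GaugeDiff]:
    """Overwrite cached gauge values with DB counts and report what changed."""
    diffs = []
    for state in ACTIVE_STATES:
        before = cached.get(state, 0)
        after = actual.get(state, 0)  # a state with no DB rows counts as 0
        if before != after:
            diffs.append(GaugeDiff(state, before, after))
        cached[state] = after
    return diffs
```

A gauge that drifted below zero (for example, a double decrement) is reset to the DB count like any other mismatch.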
Changes:
- Add `reconcile_gauges()` in `middleware/metrics.py` that reconciles `by_state`, `by_state_ranch`, and `by_state_token` gauges via `crud/metrics.py` for DB queries
- Add `reconcile_metrics()` middleware wrapper with RBAC enforcement
- Add `POST /v0.1/metrics/reconcile` admin-only endpoint with RBAC (`AccessObject.METRICS`, `AccessAction.RECONCILE`)
- Add `ValkeyCache.update_fields()` for partial hash updates via `HSET`
- Add background periodic reconciliation via FastAPI lifespan (interval configurable via `METRICS_RECONCILE_INTERVAL_SECONDS`, default 300s), skipped when `METRICS_ENABLED` is false
- Store reconciliation metadata (duration, timestamp) in a single `tf:metrics:reconcile:metadata` Valkey hash, exposed as Prometheus gauges
- Add `MetricsReconcileOut` and `ReconcileGaugeDiff` response schemas using `NucleusBaseModel`
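
For reviewers unfamiliar with the lifespan pattern, the background loop could look roughly like the sketch below. It uses only the standard library, so the `lifespan` function here is a plain async context manager of the shape FastAPI accepts, not the code in this change. `reconcile_once`, `metrics_enabled`, and the `env`/`reconcile` parameters are hypothetical, and treating an unset `METRICS_ENABLED` as true is an assumption.

```python
import asyncio
import contextlib
import os
from collections.abc import Awaitable, Callable, Mapping
from contextlib import asynccontextmanager


async def reconcile_once() -> None:
    """Placeholder for the real reconciliation call (hypothetical)."""


def metrics_enabled(env: Mapping[str, str]) -> bool:
    # Assumption: metrics are enabled unless explicitly switched off.
    return env.get("METRICS_ENABLED", "true").lower() not in ("0", "false", "no")


async def reconcile_forever(
    interval: float, reconcile: Callable[[], Awaitable[None]]
) -> None:
    """Run reconciliation every `interval` seconds until cancelled."""
    while True:
        await asyncio.sleep(interval)
        try:
            await reconcile()
        except Exception as exc:  # one failed cycle must not stop the loop
            print(f"metrics reconciliation failed: {exc!r}")


@asynccontextmanager
async def lifespan(app, env: Mapping[str, str] = os.environ,
                   reconcile: Callable[[], Awaitable[None]] = reconcile_once):
    task = None
    if metrics_enabled(env):
        interval = float(env.get("METRICS_RECONCILE_INTERVAL_SECONDS", "300"))
        task = asyncio.create_task(reconcile_forever(interval, reconcile))
    try:
        yield  # application serves requests here
    finally:
        if task is not None:
            task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await task
```

Cancelling the task on shutdown and awaiting it keeps the event loop from reporting a pending task when the application exits.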
Architecture:
- DB queries in `crud/metrics.py` (crud layer)
- Business logic + RBAC in `middleware/metrics.py` (middleware layer)
- Thin delegation in `routers/metrics.py` (router layer)
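
The layering above can be summarised in a small, framework-free sketch. Only `AccessObject.METRICS`, `AccessAction.RECONCILE`, and the `reconcile_metrics()` name come from this change; the signatures, `Forbidden`, `count_by_state()`, `post_metrics_reconcile()`, and the permission-set representation are hypothetical stand-ins.

```python
from collections import Counter
from enum import Enum


class AccessObject(Enum):
    METRICS = "metrics"


class AccessAction(Enum):
    RECONCILE = "reconcile"


class Forbidden(Exception):
    """Hypothetical error raised when the RBAC check fails."""


# crud layer: talks to the database (a list of row states stands in for it here)
def count_by_state(rows: list[str]) -> dict[str, int]:
    return dict(Counter(rows))


# middleware layer: RBAC check plus business logic
def reconcile_metrics(permissions: set, rows: list[str]) -> dict[str, int]:
    if (AccessObject.METRICS, AccessAction.RECONCILE) not in permissions:
        raise Forbidden("metrics reconcile requires admin access")
    return count_by_state(rows)


# router layer: thin delegation, no logic of its own
def post_metrics_reconcile(permissions: set, rows: list[str]) -> dict[str, int]:
    return reconcile_metrics(permissions, rows)
```

Keeping the permission check in the middleware layer means the router stays a one-line delegation and the crud function can be reused by the background task without an RBAC context.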
Note: Reconciliation has an inherent race between reading the cached gauge values and overwriting them with DB counts. A state transition that lands in that window is overwritten, and the resulting drift is corrected on the next cycle (default 300s).
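
The window can be replayed deterministically with plain dicts. This is an illustrative timeline, not code from this change; `db`, `gauge`, and the specific counts are made up.

```python
db = {"running": 2}      # source of truth
gauge = {"running": 5}   # cached gauge, already drifted

# Cycle 1: reconciliation reads the DB count...
snapshot = dict(db)

# ...and a request starts running before the overwrite happens.
db["running"] += 1
gauge["running"] += 1

# The overwrite uses the stale snapshot, so the in-flight increment is lost.
gauge["running"] = snapshot["running"]
after_first_cycle = (gauge["running"], db["running"])

# Cycle 2 (default 300s later): no concurrent transition, gauge matches again.
gauge["running"] = db["running"]
after_second_cycle = (gauge["running"], db["running"])
```

After the first cycle the gauge is off by one instead of three, and after the second it matches the DB, so the error is bounded by the transitions that land inside a single reconciliation window.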
Assisted-by: Claude Code
Signed-off-by: Miroslav Vadkerti <mvadkert@redhat.com>
Edited by Miroslav Vadkerti