Geo: Unblock the updating of site status even if some metrics are slow to collect
Release notes
The status of a Geo site is now updated every minute with the latest known metrics, instead of waiting for all metrics to be collected at one time.
Problem to solve
- The updating of site statuses is blocked by the collection of metrics, so if metrics collection takes 10 minutes or more, then site status is not updated for 10 minutes or more, which shows up as "Unhealthy".
- If all metrics are fast except one that is slow to collect near the end of the job, we continue to display the old status with all the old metrics, even though we have already re-collected most of them.
Proposal
Decouple the collection of metrics from the updating of site status.
Implementation Guide
Geo::StatusUpdateWorker, formerly named Geo::MetricsUpdateWorker, runs 1 time per minute.
- Generate site status using the latest collected metrics
- Timeout dead metrics if needed
- Insert wanted metrics as needed
- Enqueue Geo::MetricsCollectionWorker with capacity
- Prune old metrics if stored as a log
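The first three steps of that list could be sketched roughly as below. This is plain Ruby over an in-memory array, not the real worker: the `Metric` struct, the `update_status` method, and the `HEARTBEAT_TIMEOUT` value are all hypothetical names made up for illustration. It also assumes that "timing out" a dead metric means resetting it so it can be picked up again, which the proposal does not specify. Enqueueing and pruning are left out.

```ruby
# Hypothetical in-memory stand-in for a row of the proposed metrics table.
Metric = Struct.new(:name, :value, :started_at, :completed_at, :updated_at, keyword_init: true)

HEARTBEAT_TIMEOUT = 300 # seconds; an assumed value, not part of the proposal

# One pass of the status update job over an array of Metric rows.
# Returns the site status as a hash of metric name => latest completed value.
def update_status(rows, metric_names, now: Time.now)
  # Time out dead metrics: started, never completed, heartbeat gone stale.
  # Assumption: a timed-out metric is reset so a worker can claim it again.
  rows.each do |m|
    next unless m.started_at && m.completed_at.nil?
    next unless now - m.updated_at > HEARTBEAT_TIMEOUT

    m.started_at = nil
    m.updated_at = now
  end

  # Insert a wanted row for every metric that has no incomplete row pending.
  metric_names.each do |name|
    pending = rows.any? { |m| m.name == name && m.completed_at.nil? }
    rows << Metric.new(name: name, updated_at: now) unless pending
  end

  # Build the status from the most recently completed row of each metric.
  rows.select(&:completed_at)
      .group_by(&:name)
      .transform_values { |ms| ms.max_by(&:completed_at).value }
end
```

The point of the sketch is that this pass never waits on collection: it only reads whatever has already completed, so it stays cheap enough to run every minute.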
Geo::MetricsCollectionWorker fills wanted metrics. It is a LimitedCapacity::Worker with max concurrency defaulting to 1 (we can make it configurable later).
- Pick up an unstarted metric atomically (for example by setting started_at using FOR UPDATE SKIP LOCKED)
- Collect metric
- Update metric heartbeat between loops if needed (for example by touching updated_at)
- Update metric value and mark complete
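A rough sketch of that claim-and-collect loop, again in plain Ruby with no database. The `MetricStore` class, `collect_all` method, and `collectors` hash are hypothetical. A `Mutex` stands in for the row lock that `FOR UPDATE SKIP LOCKED` would give us in PostgreSQL, so this only illustrates the shape of the loop, not the real query.

```ruby
# Hypothetical in-memory stand-in for a row of the proposed metrics table.
Metric = Struct.new(:name, :value, :started_at, :completed_at, :updated_at, keyword_init: true)

# Hypothetical store. The Mutex plays the role of the database row lock.
class MetricStore
  def initialize(rows)
    @rows = rows
    @lock = Mutex.new
  end

  # Atomically claim one unstarted metric by setting started_at.
  # Returns the claimed row, or nil when nothing is left to claim.
  def claim(now: Time.now)
    @lock.synchronize do
      row = @rows.find { |m| m.started_at.nil? }
      if row
        row.started_at = now
        row.updated_at = now
      end
      row
    end
  end
end

# Drain the store: claim a metric, collect it, then mark it complete.
# Each collector is a callable that receives a heartbeat lambda, which it
# may call between loops to touch updated_at on long-running collections.
def collect_all(store, collectors)
  while (metric = store.claim)
    heartbeat = -> { metric.updated_at = Time.now }
    metric.value = collectors.fetch(metric.name).call(heartbeat)
    now = Time.now
    metric.completed_at = now
    metric.updated_at = now
  end
end
```

Because each claim is atomic, running several of these loops concurrently should not collect the same metric twice, which is what makes raising the max concurrency safe.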
New metrics table, in main DB if writable, else in Geo Tracking DB. Side note, I notice these fields have similarities to ml_candidate_metrics and observability_metrics_issues_connections.
- name
- type
- value
- started_at
- completed_at
- created_at
- updated_at
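To show how these fields could encode the lifecycle without a separate status column, here is a small sketch. `MetricRow` and its `state` method are hypothetical, and the state names `:wanted`, `:running`, and `:complete` are my own labels, not something the table needs to store.

```ruby
# Hypothetical in-memory stand-in for one row of the proposed metrics table.
MetricRow = Struct.new(
  :name, :type, :value, :started_at, :completed_at, :created_at, :updated_at,
  keyword_init: true
) do
  # Derive the lifecycle state from the timestamps alone.
  def state
    return :complete if completed_at
    return :running if started_at

    :wanted
  end
end

now = Time.now
row = MetricRow.new(name: 'example_count', type: 'integer', created_at: now, updated_at: now)
row.started_at = now   # a collection worker claims the row
row.value = 42         # illustrative value only
row.completed_at = now # the worker marks it complete
```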
Weight 6.
Alternatively, we could store just "current" and "wanted" metrics in Redis, no migrations needed, no worries about writable DB or Tracking DB, no pruning needed. (It'd be harder to debug problems with "current" and "wanted" state though.)
Weight 4.
Benefits:
- Site statuses would be reliably updated every minute (no more "Unhealthy" due to 10 min old statuses).
- Displayed metrics would receive intermediate updates with all of the latest collected metrics, even if one slow metric is in the middle of taking 20 minutes to run.
- A customer could increase metric collection concurrency (unless DB load is too high) so that fast metrics can be collected and reported more often, without being blocked by slow metrics.
- You can view a history of metrics (if you store them as a log in a table as described above).
- We could later make the frequency of collection of individual metrics configurable.
Intended users
Feature Usage Metrics
I don't think we should track this.
But we could track number of requests for the Admin > Geo > Sites view, or number of requests for Geo statuses via the API.
Does this feature require an audit event?
No.