Geo: Job artifact metrics collection must scale
Problem
Geo metrics collection fundamentally doesn't scale because it repeatedly counts unchanged data.
The `Geo::MetricsUpdateWorker` cron job runs every minute by default, executing queries that count primary verification states across 17+ data types. The number of job artifacts alone can be in the hundreds of millions. Even with query optimizations, this approach has inherent scalability limits:
- Repeated work: Every metrics update rescans entire tables, even when most rows haven't changed. On large instances, this means reading gigabytes of data every minute to recalculate the same counts.
- Memory bandwidth saturation: Fast queries that fit in memory can still consume excessive memory bandwidth (observed at ~15GB/s on dedicated instances), putting pressure on the entire database server.
- Won't scale to GitLab.com: For Cells migration, Geo will need to handle GitLab.com-scale data. Repeatedly counting hundreds of millions of rows is not viable, even with optimized queries.
Impact: Even with immediate query improvements, the fundamental design prevents Geo metrics from scaling to support organization migrations to Cells or very large self-managed instances.
For example, see this customer support request: https://gitlab.com/gitlab-com/core-sub-department/section-core-request-for-help/-/issues/86

Some of the discussion from that request:
Could these queries be responsible for consuming all IO capacity on the DB?
If you continue counting the table over and over in a loop, yes. Is that what Geo is trying to do? How often do we need these counts, and how accurate do they need to be? What triggers a counting operation to start?
Is our goal of counting the table reasonable?
I think it's reasonable to do every once in a while at self-managed scales, though it's not ideal. I'm curious if we're running in a tight loop somehow and re-counting as soon as we're done.
For gitlab.com, I don't think it's reasonable to count the total number of rows in a table on a regular basis. We could get counts once or twice if needed, but not in a loop.
Yes, all metrics are gathered once every minute by default. If the job duration increases to, say, 10 minutes, then the job will run every 10-11 minutes (due to job deduplication).
Thinking of the use-cases:
- During backfill (immediately after initial setup of Geo), these counts can be pretty stale and rough with little consequence
- During normal operation, the timeliness and accuracy of these metrics are important mostly for the purposes of alerting (e.g. alert if > 1% of items are "failed to sync" or "failed to verify"). Infrequent updates are relatively ok, but the metrics need to be as internally consistent as possible, or else we hear "How do we have 102% uploads synced, and -2% queued??".
- During an incident, frequent updates become more valuable. E.g. the sysadmin does an intervention, and then watches metrics intently to understand the effects. They want accurate updates as frequently as possible.
- During the lead-in to a planned failover, you are actively working to drive failures to 0%, so you also want accurate updates as frequently as possible.
How should we do it?
If you have to count, I think you're doing it in the best way possible. If you only need a (pretty bad) estimate, you could immediately query pg_class.reltuples to get the statistics collector's estimate for free. I'd expect this to be off by a few percent, more if the table is seeing a large volume of changes.

Depending on how exact you need to be, and how exotic you're willing to get, I can think of a few other ways to try to maintain a running count with varying degrees of precision and less overall disk usage (trying to skip re-counts based on the visibility map, trying to use some sort of trigger process to count as you go). But I don't like any of them. Counting large data sets is a fundamentally difficult operation.
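A minimal sketch of that estimate, assuming the `ci_job_artifact_states` table name:

```sql
-- Planner's row-count estimate; no table scan. Expect it to be off by a few
-- percent, more if the table has churned a lot since the last ANALYZE.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'ci_job_artifact_states';
```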
Here are some statistics about the customer's database by disk-io-per-query over a period of about a month:

| queryid | hours_of_io_since_old | hours_of_io_old | hours_io_new | calls | gb_disk_read |
|---|---|---|---|---|---|
| -1572464638826216947 | 647.8011465205045 | 152.07045882223503 | 799.8716053427395 | 23100433 | 95358.736696 |
| 5199618098885104504 | 60.632645844590186 | 19.277093098644222 | 79.9097389432344 | 4876537 | 2822.48636 |
| -8878183716930224215 | 1.8757748511440506 | 2.1782135439325274 | 4.053988395076578 | 112684943 | 72.707528 |
| -3719403258649733832 | 1.6618142243293845 | 2.173576163870663 | 3.8353903882000475 | 84134636 | 64.14332 |
| 5024858779723305146 | 1.6375619172419038 | 2.2364127058838985 | 3.8739746231258025 | 3504521 | 71.53516 |

You can see that the top query - the geo one we're discussing - uses over 10x as much io capacity as the next one, and the numbers fall off very quickly from there. So I expect it's responsible for the vast majority of disk reads in their database.
This is super helpful, thank you so much @stomlinson 🙇. It sounds like the best path will be to selectively reduce metrics update frequency in order to reduce load.
Proposal
Implement a summary-based counting system that only recalculates metrics for data that has actually changed. Start with job artifacts; leave other tables for follow-ups if needed.
Core approach: Maintain pre-calculated counts in a summaries table organized by "buckets" (batches of rows). When data changes, invalidate only the affected buckets and recalculate them incrementally, rather than rescanning entire tables.
Key components:
- Summaries table: Store counts grouped by verification state for each bucket of rows. Use modulo or some hash function to distribute rows across buckets, avoiding hotspots from concurrent updates.
- Incremental invalidation: When a row's verification state changes, mark its bucket as "dirty". A background worker recalculates only dirty buckets, leaving unchanged buckets untouched.
- Initial population: When first enabling this system or adding a new data type, populate the summaries table by calculating all buckets. Add throttling (e.g., 100ms sleep between buckets) to manage memory pressure during initial population.
- Metrics retrieval: Reading metrics becomes a simple aggregation of the summaries table, with no expensive joins or table scans (see the sketch after this list).
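A minimal sketch of that read path, using the geo_verification_state_summaries table proposed under Ideas below:

```sql
-- Metrics read path: aggregate the pre-calculated bucket counts
-- instead of scanning the source tables.
SELECT replicable_type,
       SUM(total_count)    AS total,
       SUM(verified_count) AS verified,
       SUM(failed_count)   AS failed
FROM geo_verification_state_summaries
GROUP BY replicable_type;
```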
Benefits:
- Scales to any data size since work is proportional to changes, not total rows
- Reduces Geo metrics collection database load by orders of magnitude for stable data
- Maintains metric accuracy and consistency
Reference: Based on the bucketed counting benchmark proof of concept described in these calls 1, 2 (internal links).
Ideas:
```ruby
create_table :geo_verification_state_summaries do |t|
  t.text :replicable_type, null: false
  t.integer :bucket_number, null: false
  t.integer :total_count, default: 0
  t.integer :verified_count, default: 0
  t.integer :failed_count, default: 0
  t.integer :state, limit: 2, default: 0, null: false # enum: clean (0), dirty (1), calculating (2)
  t.datetime :last_calculated_at
  t.datetime :state_changed_at
  t.timestamps

  t.index [:replicable_type, :bucket_number], unique: true
  t.index [:replicable_type, :state], where: "state IN (1, 2)" # dirty or calculating
end
```
Worker:
```ruby
# Atomically claim the oldest dirty bucket. SKIP LOCKED lets concurrent
# workers claim different buckets instead of blocking on the same row.
summary = GeoVerificationStateSummary.transaction do
  claimed = GeoVerificationStateSummary
    .where(state: :dirty)
    .order(:state_changed_at) # oldest first
    .lock("FOR UPDATE SKIP LOCKED")
    .first

  claimed&.update!(state: :calculating)
  claimed
end

return unless summary

# ... do calculation ...

summary.update!(state: :clean, last_calculated_at: Time.current, counts: ...)
```
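Where `# ... do calculation ...` is, the recalculation of one bucket could look roughly like this (a sketch, assuming 100000 buckets for `Ci::JobArtifact`, that `ci_job_artifact_states` keys rows by `job_artifact_id`, and the usual verification_state values where 2 = succeeded and 3 = failed):

```sql
-- Recalculate one dirty bucket: only rows whose ID falls in this bucket are read.
SELECT COUNT(*)                                       AS total_count,
       COUNT(*) FILTER (WHERE verification_state = 2) AS verified_count,
       COUNT(*) FILTER (WHERE verification_state = 3) AS failed_count
FROM ci_job_artifact_states
WHERE job_artifact_id % 100000 = :bucket_number; -- :bucket_number comes from the claimed summary row
```

Keeping this cheap likely requires either an expression index on `job_artifact_id % 100000` or a range-based bucket scheme; otherwise the modulo predicate still scans the table.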
A DB trigger per data type (one per verification state table), for example:
```sql
-- For ci_job_artifacts
CREATE TRIGGER geo_job_artifact_state_summary_trigger
AFTER INSERT OR UPDATE OF verification_state OR DELETE ON ci_job_artifact_states
FOR EACH ROW
EXECUTE FUNCTION mark_geo_verification_summary_dirty('Ci::JobArtifact', 100000);

-- For uploads
CREATE TRIGGER geo_upload_state_summary_trigger
AFTER INSERT OR UPDATE OF verification_state OR DELETE ON upload_states
FOR EACH ROW
EXECUTE FUNCTION mark_geo_verification_summary_dirty('Upload', 30000);

-- For lfs_objects
CREATE TRIGGER geo_lfs_object_state_summary_trigger
AFTER INSERT OR UPDATE OF verification_state OR DELETE ON lfs_object_states
FOR EACH ROW
EXECUTE FUNCTION mark_geo_verification_summary_dirty('LfsObject', 30000);

-- Trigger function with parameters
CREATE OR REPLACE FUNCTION mark_geo_verification_summary_dirty()
RETURNS TRIGGER AS $$
DECLARE
  v_bucket_number integer;
  v_replicable_type text := TG_ARGV[0];           -- replicable type, passed from the trigger definition
  v_bucket_count integer := TG_ARGV[1]::integer;  -- number of buckets, passed from the trigger definition
  v_id bigint;
BEGIN
  -- Get the ID from the appropriate column
  IF TG_OP = 'DELETE' THEN
    v_id := OLD.id; -- or job_artifact_id, upload_id, etc., depending on the state table
  ELSE
    v_id := NEW.id;
  END IF;

  v_bucket_number := v_id % v_bucket_count;

  -- Mark the row's bucket dirty, unless it is currently being recalculated
  INSERT INTO geo_verification_state_summaries
    (replicable_type, bucket_number, state, state_changed_at, created_at, updated_at)
  VALUES
    (v_replicable_type, v_bucket_number, 1, NOW(), NOW(), NOW())
  ON CONFLICT (replicable_type, bucket_number)
  DO UPDATE SET
    state = 1,
    state_changed_at = NOW(),
    updated_at = NOW()
  WHERE geo_verification_state_summaries.state != 2;

  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```
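For the initial population step in the proposal, a background job could walk the buckets one at a time and sleep briefly between them. A rough sketch of the per-bucket upsert, under the same assumptions as above (`:bucket_number` is supplied by the job; races with the trigger would still need handling in the real implementation):

```sql
-- Populate or refresh a single bucket, marking it clean (state 0).
INSERT INTO geo_verification_state_summaries
  (replicable_type, bucket_number, total_count, verified_count, failed_count,
   state, last_calculated_at, created_at, updated_at)
SELECT 'Ci::JobArtifact', :bucket_number,
       COUNT(*),
       COUNT(*) FILTER (WHERE verification_state = 2), -- succeeded
       COUNT(*) FILTER (WHERE verification_state = 3), -- failed
       0, NOW(), NOW(), NOW()
FROM ci_job_artifact_states
WHERE job_artifact_id % 100000 = :bucket_number
ON CONFLICT (replicable_type, bucket_number) DO UPDATE SET
  total_count        = EXCLUDED.total_count,
  verified_count     = EXCLUDED.verified_count,
  failed_count       = EXCLUDED.failed_count,
  state              = 0,
  last_calculated_at = NOW(),
  updated_at         = NOW();
```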
Workaround
Note that if the cron job runs at intervals longer than 10 minutes, the Geo node status will always show as Unhealthy. You should ignore the status in this case.