Geo: Create geo_ci_job_artifact_verification_summaries table

What does this MR do and why?

Adds the geo_ci_job_artifact_verification_summaries table as the foundation for more scalable verification metrics collection.

Problem

Geo currently collects job artifact verification metrics by scanning the entire ci_job_artifact_states table every minute. This approach doesn't scale for tables with 100M+ rows, causing timeouts and excessive database load.

Solution

This MR introduces a bucketed verification summaries table as the foundation for a more scalable metrics collection approach. Instead of counting the full table, we'll pre-calculate counts per bucket and only recalculate buckets that have changed.

How it works:

  • Job artifacts are assigned to buckets using job_artifact_id % 100,000
  • Each bucket stores pre-calculated total_count, verified_count, and failed_count
  • Buckets have a state machine: clean → dirty → calculating → clean
  • A future DB trigger (MR2/3) will mark buckets as dirty when artifacts change
  • A calculator worker will recalculate only dirty buckets
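The flow above can be sketched in plain Ruby. This is an illustrative sketch, not the actual model API — the names (bucket_for, mark_dirty, recalculate) are hypothetical, and the real implementation will live in the ActiveRecord model and the calculator worker from MR 2/3:

```ruby
BUCKET_COUNT = 100_000

# Assign a job artifact to its summary bucket.
def bucket_for(job_artifact_id)
  job_artifact_id % BUCKET_COUNT
end

# Simplified bucket record following the state machine described above:
# clean -> dirty (artifact changed) -> calculating (worker running) -> clean.
Bucket = Struct.new(:number, :state, :total_count, :verified_count, :failed_count)

# The DB trigger (MR 2/3) would do the equivalent of this when an artifact
# in the bucket changes.
def mark_dirty(bucket)
  bucket.state = :dirty
  bucket
end

# The calculator worker recounts only the rows belonging to this one bucket.
def recalculate(bucket, rows_in_bucket)
  bucket.state = :calculating
  bucket.total_count    = rows_in_bucket.size
  bucket.verified_count = rows_in_bucket.count { |r| r[:verification_state] == :succeeded }
  bucket.failed_count   = rows_in_bucket.count { |r| r[:verification_state] == :failed }
  bucket.state = :clean
  bucket
end
```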

Why 100k buckets:

  • With 100M artifacts, each bucket averages ~1000 rows - fast to recount
  • Modulo distribution spreads new records evenly (unlike range-based which creates hotspots)
  • Bucket count is fixed; no need to add buckets as data grows
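The even-distribution claim is easy to check with a quick plain-Ruby simulation (not project code): sequential ids under modulo bucketing land in every bucket at the same rate, whereas range-based bucketing would concentrate all new inserts in the newest bucket.

```ruby
BUCKET_COUNT = 100_000

# Simulate 1 million sequential artifact ids; with modulo bucketing every
# bucket receives exactly the same number of rows (no hotspots).
counts = Hash.new(0)
(1..1_000_000).each { |id| counts[id % BUCKET_COUNT] += 1 }

puts counts.values.uniq.inspect # => [10] — every bucket holds exactly 10 rows
```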

Why CI database (gitlab_ci_cell_local):

  • The source table ci_job_artifact_states is on the CI database
  • PostgreSQL triggers cannot write across databases
  • Model inherits from Ci::ApplicationRecord to route correctly

This is MR 1 of 3. Follow-up MRs will:

  • MR 2 - add the backfill service & calculator worker (behind a feature flag) - #590853 (closed)
  • MR 3 - add the DB trigger and a feature-flagged switch in the replicator - #590854

Query plans

Seeded data in postgres.ai (CI database):

Created the table, indexes, and sequence matching the migration, then seeded 100k rows across 3 states:

  • 90k clean (state = 0)
  • 5k dirty (state = 1)
  • 5k calculating (state = 2)

Raw SQL and query plans:

Dirty scope:

SELECT "geo_ci_job_artifact_verification_summaries".*
FROM "geo_ci_job_artifact_verification_summaries"
WHERE "geo_ci_job_artifact_verification_summaries"."state" = 1
ORDER BY "geo_ci_job_artifact_verification_summaries"."state_changed_at" ASC;

Plan (1.6ms execution): https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/49285/commands/147088

Calculating scope:

SELECT "geo_ci_job_artifact_verification_summaries".*
FROM "geo_ci_job_artifact_verification_summaries"
WHERE "geo_ci_job_artifact_verification_summaries"."state" = 2;

Plan (0.8ms execution): https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/49285/commands/147089

Table size/growth:

  • Fixed maximum of 100k rows (one per bucket) (data will be backfilled in follow-up issue - #590853 (closed))
  • Reads: periodic queries for dirty/calculating buckets + SUM aggregations for metrics collection - follow-up issue - #590853 (closed)
  • Writes: triggered by verification state changes on ci_job_artifact_states, plus calculator worker updates - follow-up issue - #590854
  • Availability risk: low - small fixed-size table
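With the summaries in place, the metrics read reduces to summing the pre-calculated columns across at most 100k small rows instead of scanning ci_job_artifact_states. A hedged plain-Ruby sketch of that aggregation (in practice this would be a single SQL SUM over the summaries table; the Summary struct and method name here are illustrative only):

```ruby
# Illustrative stand-in for one summaries row (one bucket).
Summary = Struct.new(:total_count, :verified_count, :failed_count)

# Roll up per-bucket counts into the site-wide verification metrics.
def verification_metrics(summaries)
  {
    total:    summaries.sum(&:total_count),
    verified: summaries.sum(&:verified_count),
    failed:   summaries.sum(&:failed_count)
  }
end
```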

How to set up and validate locally

  1. Ensure your GDK is set up and running
  2. Run the migration:
    cd gitlab && bin/rails db:migrate
  3. Verify the table exists:
    bin/rails runner "puts Geo::CiJobArtifactVerificationSummary.table_exists?"
    Expected output: true
  4. Run the specs:
    bin/rspec ee/spec/models/geo/ci_job_artifact_verification_summary_spec.rb
    Expected output: 6 examples, 0 failures

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #590852 (closed)

Edited by Scott Murray
