Skip to content

Backfill CI queuing tables

Marius Bobin requested to merge mb-ci-backfill-queuing-tables into master

What does this MR do and why?

Related to #340585 (closed)

This migration is not needed on .com because we have rolled out the feature using feature flags. But if self-managed users skip some the releases in which we enabled those feature flags by default, they will have stuck pending builds.

At this moment, on .com we have ~14k pending builds, so I don't think we will see more for self-managed, that's why I'm using smaller batch sizes.

gitlabhq_production=> select count(*) from ci_pending_builds;
 count
-------
 13825
(1 row)

Sample queries from my local env:

-- post migration batches
SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1 
SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 2 ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1 OFFSET 1000 
SELECT MIN("ci_builds"."id"), MAX("ci_builds"."id") FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 2 AND "ci_builds"."id" < 21129 
INSERT INTO "background_migration_jobs" ("created_at", "updated_at", "class_name", "arguments") VALUES ('2022-02-10 15:29:12.956883', '2022-02-10 15:29:12.956883', 'BackfillCiQueuingTables', '[2,21128]') RETURNING "id" 
SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 21129 ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1 OFFSET 1000 
SELECT MIN("ci_builds"."id"), MAX("ci_builds"."id") FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 21129 
INSERT INTO "background_migration_jobs" ("created_at", "updated_at", "class_name", "arguments") VALUES ('2022-02-10 15:29:12.973778', '2022-02-10 15:29:12.973778', 'BackfillCiQueuingTables', '[21129,23059]') RETURNING "id" 

-- background migration batching 
SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."id" BETWEEN 2 AND 21128 AND "ci_builds"."id" >= 3800 ORDER BY "ci_builds"."id" ASC LIMIT 1 OFFSET 100 
SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."id" BETWEEN 2 AND 21128 AND "ci_builds"."id" >= 3800 AND "ci_builds"."id" < 4005 AND (NOT EXISTS (SELECT 1 FROM "ci_pending_builds" WHERE (ci_builds.id = ci_pending_builds.build_id))) 

-- batch preloads  
SELECT "projects".* FROM "projects" WHERE "projects"."id" = 17 
SELECT "namespaces".* FROM "namespaces" WHERE "namespaces"."id" = 1 
SELECT "project_ci_cd_settings".* FROM "project_ci_cd_settings" WHERE "project_ci_cd_settings"."project_id" = 17 

-- queries for each upsert
SELECT "taggings"."tag_id" FROM "taggings" WHERE "taggings"."taggable_id" = 3800 AND "taggings"."taggable_type" = 'CommitStatus' 
INSERT INTO "ci_pending_builds" ("build_id","project_id","protected","instance_runners_enabled","namespace_id","minutes_exceeded","tag_ids","namespace_traversal_ids") VALUES (3800, 17, TRUE, TRUE, 1, FALSE, '{}', '{1}') ON CONFLICT ("build_id") DO UPDATE SET "project_id"=excluded."project_id","protected"=excluded."protected","instance_runners_enabled"=excluded."instance_runners_enabled","namespace_id"=excluded."namespace_id","minutes_exceeded"=excluded."minutes_exceeded","tag_ids"=excluded."tag_ids","namespace_traversal_ids"=excluded."namespace_traversal_ids" RETURNING "build_id" 
  
SELECT "taggings"."tag_id" FROM "taggings" WHERE "taggings"."taggable_id" = 3896 AND "taggings"."taggable_type" = 'CommitStatus' 
INSERT INTO "ci_pending_builds" ("build_id","project_id","protected","instance_runners_enabled","namespace_id","minutes_exceeded","tag_ids","namespace_traversal_ids") VALUES (3896, 17, TRUE, TRUE, 1, FALSE, '{}', '{1}') ON CONFLICT ("build_id") DO UPDATE SET "project_id"=excluded."project_id","protected"=excluded."protected","instance_runners_enabled"=excluded."instance_runners_enabled","namespace_id"=excluded."namespace_id","minutes_exceeded"=excluded."minutes_exceeded","tag_ids"=excluded."tag_ids","namespace_traversal_ids"=excluded."namespace_traversal_ids" RETURNING "build_id" 

Execution plans from production


-- finding lower limit for batching
gitlabhq_production=> explain (analyze, buffers) SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1;
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..0.59 rows=1 width=12) (actual time=20.287..20.289 rows=1 loops=1)
   Buffers: shared hit=14 read=29 dirtied=1 written=6
   I/O Timings: read=14.437 write=5.101
   ->  Index Only Scan using index_ci_builds_runner_id_pending_covering on ci_builds  (cost=0.41..58249.24 rows=315923 width=12) (actual time=20.285..20.286 rows=1 loops=1)
         Index Cond: (runner_id IS NULL)
         Heap Fetches: 35
         Buffers: shared hit=14 read=29 dirtied=1 written=6
         I/O Timings: read=14.437 write=5.101
 Planning Time: 77.890 ms
 Execution Time: 20.332 ms
(10 rows)

-- finding the lower limit for next query
gitlabhq_production=> SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1;
     id
------------
 1801920260
(1 row)

--- finding the upper limit for the batch
gitlabhq_production=> explain (analyze, buffers) SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 1801920260 ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1 OFFSET 1000;
                                                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1266.02..1267.28 rows=1 width=12) (actual time=1859.122..1859.124 rows=1 loops=1)
   Buffers: shared hit=2223 read=4429 dirtied=8 written=3632
   I/O Timings: read=1450.010 write=315.193
   ->  Index Only Scan using index_ci_builds_runner_id_pending_covering on ci_builds  (cost=0.41..54258.18 rows=42871 width=12) (actual time=1.200..1858.913 rows=1001 loops=1)
         Index Cond: ((runner_id IS NULL) AND (id >= 1801920260))
         Heap Fetches: 5870
         Buffers: shared hit=2223 read=4429 dirtied=8 written=3632
         I/O Timings: read=1450.010 write=315.193
 Planning Time: 0.451 ms
 Execution Time: 1859.152 ms
(10 rows)

-- the upper limit value is:
gitlabhq_production=> SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 1801920260 ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1 OFFSET 1000;
     id
------------
 2074410874
(1 row)

-- yielded batching parameters
gitlabhq_production=> explain (analyze, buffers) SELECT MIN("ci_builds"."id"), MAX("ci_builds"."id") FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 1801920260 AND "ci_builds"."id" < 2074410874 ;
                                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=8705.80..8705.81 rows=1 width=16) (actual time=3.789..3.790 rows=1 loops=1)
   Buffers: shared hit=6 read=5 written=3
   I/O Timings: read=3.318 write=0.333
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.58..4352.90 rows=1 width=8) (actual time=1.872..1.873 rows=1 loops=1)
           Buffers: shared hit=2 read=3 written=2
           I/O Timings: read=1.698 write=0.101
           ->  Index Scan using ci_builds_pkey on ci_builds  (cost=0.58..184769115.01 rows=42453 width=8) (actual time=1.871..1.871 rows=1 loops=1)
                 Index Cond: ((id IS NOT NULL) AND (id >= 1801920260) AND (id < 2074410874))
                 Filter: ((runner_id IS NULL) AND ((status)::text = 'pending'::text) AND ((type)::text = 'Ci::Build'::text))
                 Buffers: shared hit=2 read=3 written=2
                 I/O Timings: read=1.698 write=0.101
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.58..4352.90 rows=1 width=8) (actual time=1.910..1.910 rows=1 loops=1)
           Buffers: shared hit=4 read=2 written=1
           I/O Timings: read=1.620 write=0.232
           ->  Index Scan Backward using ci_builds_pkey on ci_builds ci_builds_1  (cost=0.58..184769115.01 rows=42453 width=8) (actual time=1.909..1.909 rows=1 loops=1)
                 Index Cond: ((id IS NOT NULL) AND (id >= 1801920260) AND (id < 2074410874))
                 Filter: ((runner_id IS NULL) AND ((status)::text = 'pending'::text) AND ((type)::text = 'Ci::Build'::text))
                 Rows Removed by Filter: 1
                 Buffers: shared hit=4 read=2 written=1
                 I/O Timings: read=1.620 write=0.232
 Planning Time: 1.551 ms
 Execution Time: 3.838 ms
(24 rows)

-- finding the lower limit for the next batch
gitlabhq_production=> explain (analyze, buffers) SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."runner_id" IS NULL AND "ci_builds"."id" >= 2074410874 ORDER BY "ci_builds"."runner_id" ASC, "ci_builds"."id" ASC LIMIT 1 OFFSET 1000 ;
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1730.48..1734.61 rows=1 width=12) (actual time=449.784..449.786 rows=1 loops=1)
   Buffers: shared hit=92 read=1516 written=1380
   I/O Timings: read=366.399 write=69.881
   ->  Index Only Scan using index_ci_builds_runner_id_pending_covering on ci_builds  (cost=0.41..1730.48 rows=419 width=12) (actual time=14.997..449.627 rows=1001 loops=1)
         Index Cond: ((runner_id IS NULL) AND (id >= 2074410874))
         Heap Fetches: 1414
         Buffers: shared hit=92 read=1516 written=1380
         I/O Timings: read=366.399 write=69.881
 Planning Time: 0.636 ms
 Execution Time: 449.816 ms
(10 rows)


-- background migration queries
-- finding the new upper limit for the batch
gitlabhq_production=> explain (analyze, buffers)  SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."id" BETWEEN 1801920260 AND 2074410874 AND "ci_builds"."id" >= 1801920260 ORDER BY "ci_builds"."id" ASC LIMIT 1 OFFSET 100 ;
                                                                                     QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=60283.82..60283.83 rows=1 width=8) (actual time=100.657..100.660 rows=1 loops=1)
   Buffers: shared hit=17350
   ->  Sort  (cost=60283.57..60595.77 rows=124877 width=8) (actual time=100.644..100.653 rows=101 loops=1)
         Sort Key: id
         Sort Method: top-N heapsort  Memory: 29kB
         Buffers: shared hit=17350
         ->  Index Only Scan using index_ci_builds_runner_id_pending_covering on ci_builds  (cost=0.41..55501.90 rows=124877 width=8) (actual time=0.017..100.497 rows=1000 loops=1)
               Index Cond: ((id >= 1801920260) AND (id <= 2074410874) AND (id >= 1801920260))
               Heap Fetches: 5896
               Buffers: shared hit=17350
 Planning Time: 0.560 ms
 Execution Time: 100.710 ms
(12 rows)

-- Upper limit value is:
gitlabhq_production=> SELECT "ci_builds"."id" FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."id" BETWEEN 1801920260 AND 2074410874 AND "ci_builds"."id" >= 1801920260 ORDER BY "ci_builds"."id" ASC LIMIT 1 OFFSET 100 ;
     id
------------
 2072839025
(1 row)

-- loading usable data:
gitlabhq_production=> explain (analyze, buffers) SELECT "ci_builds".* FROM "ci_builds" WHERE "ci_builds"."status" = 'pending' AND "ci_builds"."type" = 'Ci::Build' AND "ci_builds"."id" BETWEEN 1801920260 AND 2074410874 AND "ci_builds"."id" >= 1801920260 AND "ci_builds"."id" < 2072839025 AND (NOT EXISTS (SELECT 1 FROM "ci_pending_builds" WHERE (ci_builds.id = ci_pending_builds.build_id)));
                                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=3477.16..219498.15 rows=124152 width=1290) (actual time=456.001..456.005 rows=0 loops=1)
   Hash Cond: (ci_builds.id = ci_pending_builds.build_id)
   Buffers: shared hit=1117 read=13965 dirtied=265 written=12841
   I/O Timings: read=101.554 write=230.701
   ->  Index Scan using index_ci_builds_runner_id_pending_covering on ci_builds  (cost=0.41..214453.98 rows=124153 width=1290) (actual time=0.539..408.643 rows=100 loops=1)
         Index Cond: ((id >= 1801920260) AND (id <= 2074410874) AND (id >= 1801920260) AND (id < 2072839025))
         Buffers: shared hit=674 read=13631 written=12527
         I/O Timings: read=82.728 write=216.176
   ->  Hash  (cost=3271.89..3271.89 rows=16389 width=8) (actual time=47.042..47.044 rows=13361 loops=1)
         Buckets: 32768  Batches: 1  Memory Usage: 778kB
         Buffers: shared hit=443 read=334 dirtied=265 written=314
         I/O Timings: read=18.826 write=14.525
         ->  Seq Scan on ci_pending_builds  (cost=0.00..3271.89 rows=16389 width=8) (actual time=0.054..44.415 rows=13361 loops=1)
               Buffers: shared hit=443 read=334 dirtied=265 written=314
               I/O Timings: read=18.826 write=14.525
 Planning Time: 109.865 ms
 Execution Time: 456.222 ms
(17 rows)

Screenshots or screen recordings

Migration output:

marius@roast-carrot:gitlab$ VERSION=20220208115439 bin/rails db:migrate:redo
== 20220208115439 StartBackfillCiQueuingTables: reverting =====================
== 20220208115439 StartBackfillCiQueuingTables: reverted (0.0000s) ============

== 20220208115439 StartBackfillCiQueuingTables: migrating =====================
-- Scheduled 3 BackfillCiQueuingTables jobs with a maximum of 1000 records per batch and an interval of 120 seconds.

The migration is expected to take at least 360 seconds. Expect all jobs to have completed after 2022-02-10 13:49:11 UTC."
== 20220208115439 StartBackfillCiQueuingTables: migrated (0.1072s) ============

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Marius Bobin

Merge request reports