Skip to content

Improve EnqueuerWorker query performance

Steve Abrams requested to merge fix-enqueuer-query into master

What does this MR do and why?

We are in the process of migrating all container repositories to the new container registry. GitLab rails drives this process with the EnqueuerWorker. The EnqueuerWorker is responsible for finding the next qualified container repository for import, and then starting the import by making an API request to the registry.

In #362544 (closed), we discovered a strange performance issue where the main query used by this worker is using a Seq Scan rather than an Index Scan despite having a well defined index for the query. We discovered that the postgresql planner is mistakenly calculating the cost of using an index as being higher than using a sequential scan due to startup costs.

The planner expects it will find a matching record faster if it just starts scanning the table from the beginning based on its internal statistics, but if we ask it for more records, it will decide to use the index instead.

In the previous MR that fixed this problem, !87733 (merged), we updated from LIMIT 1 to LIMIT 2. But it seems that the planner now thinks it will find the first 2 records faster with a Seq Scan. So in this MR, we increase to LIMIT 25 to ensure we do not run into the problem again during the duration of the registry migration (just a few more weeks).

The migration has a fairly strict due date and is continuously running, so in order to keep risk low, we are choosing to make this minimal change to the query rather than explore other options such as setting the seq_page_cost higher in the transaction that the query is performed in.

The entire system driving the migration is behind a feature flag so I have not included a changelog.

Database

The Postgres query planner uses a variety of statistics and weights to decide on a query plan. So it will weight a Seq Scan as a certain cost based on the number of rows it expects to have to scan. The number of rows it expects to scan is determined by the table statistics, which include rough estimates of things like row (tuple) counts. What is happening here is the cost calculated for the Seq Scan is less than the cost of an index scan, so the Seq Scan is used. We can see the cost difference by looking at the explain plan, and then disabling Seq Scans and looking at the explain plan again:

-- This was tested on a postgres.ai cloned instance
gitlabhq_dblab# explain SELECT "container_repositories".* FROM "container_repositories"                                                                                                                                                              INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id"                                                                                                                                                                     INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id"                                                                                                                                                                             WHERE "container_repositories"."migration_state" = 'default'                                                                                                                                                                                         AND "container_repositories"."created_at" < '2022-01-23 00:00:00'                                                                                                                                                                                    AND (NOT EXISTS (                                                                                                                                                                                                                                            
  SELECT 1                                                                                                                                                                                                                                             
  FROM feature_gates                                                                                                                                                                                                                                   
  WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'                                                                                                                                                                             
  AND feature_gates.key = 'actors'                                                                                                                                                                                                                     
  AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])                                                                                                                                                                            )) LIMIT 2;

                                                                                    QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Limit  (cost=1.28..7.86 rows=2 width=168)
   ->  Nested Loop Anti Join  (cost=1.28..2009839.49 rows=611156 width=168)
         ->  Nested Loop  (cost=1.00..1812567.19 rows=650825 width=196)
               ->  Nested Loop  (cost=0.44..1395617.51 rows=650825 width=172)
                     ->  Seq Scan on container_repositories  (cost=0.00..218572.13 rows=650825 width=168)
                           Filter: ((created_at < '2022-01-23 00:00:00'::timestamp without time zone) AND (migration_state = 'default'::text))
                     ->  Index Scan using projects_pkey on projects  (cost=0.44..1.81 rows=1 width=8)
                           Index Cond: (id = container_repositories.project_id)
               ->  Index Scan using namespaces_pkey on namespaces  (cost=0.56..0.64 rows=1 width=32)
                     Index Cond: (id = projects.namespace_id)
         ->  Index Only Scan using index_feature_gates_on_feature_key_and_key_and_value on feature_gates  (cost=0.28..0.30 rows=1 width=8)
               Index Cond: ((feature_key = 'container_registry_phase_2_deny_list'::text) AND (key = 'actors'::text) AND (value = concat('Group:', (namespaces.traversal_ids)[1])))
(12 rows)

----------------------
-- Disable Seq Scan --
----------------------
gitlabhq_dblab# set enable_seqscan to off;
SET

gitlabhq_dblab# explain SELECT "container_repositories".* FROM "container_repositories"                                                                                                                                                              INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id"                                                                                                                                                                     INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id"                                                                                                                                                                             WHERE "container_repositories"."migration_state" = 'default'                                                                                                                                                                                         AND "container_repositories"."created_at" < '2022-01-23 00:00:00'                                                                                                                                                                                    AND (NOT EXISTS (                                                                                                                                                                                                                                            
  SELECT 1                                                                                                                                                                                                                                             
  FROM feature_gates                                                                                                                                                                                                                                   
  WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'                                                                                                                                                                             
  AND feature_gates.key = 'actors'                                                                                                                                                                                                                     
  AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])                                                                                                                                                                            )) LIMIT 2;
                                                                                    QUERY PLAN
═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
 Limit  (cost=1.70..7.87 rows=2 width=168)
   ->  Nested Loop Anti Join  (cost=1.70..1883835.70 rows=611156 width=168)
         ->  Nested Loop  (cost=1.42..1686563.40 rows=650825 width=196)
               ->  Nested Loop  (cost=0.86..1269613.71 rows=650825 width=172)
                     ->  Index Scan using tmp_idx_container_repos_on_non_migrated on container_repositories  (cost=0.42..92568.34 rows=650825 width=168)
                           Filter: (migration_state = 'default'::text)
                     ->  Index Scan using projects_pkey on projects  (cost=0.44..1.81 rows=1 width=8)
                           Index Cond: (id = container_repositories.project_id)
               ->  Index Scan using namespaces_pkey on namespaces  (cost=0.56..0.64 rows=1 width=32)
                     Index Cond: (id = projects.namespace_id)
         ->  Index Only Scan using index_feature_gates_on_feature_key_and_key_and_value on feature_gates  (cost=0.28..0.30 rows=1 width=8)
               Index Cond: ((feature_key = 'container_registry_phase_2_deny_list'::text) AND (key = 'actors'::text) AND (value = concat('Group:', (namespaces.traversal_ids)[1])))
(12 rows)
  • With Seq Scan: cost=1.28..7.86
  • Without Seq Scan: cost=1.70..7.87

The Sec Scan has a lower estimated cost, so Postgres chooses that plan.

To decide on the best limit to give us enough buffer, I looked at a few other limits and compared the current cost with the cost of the Seq Scan. Right now, the Index Scan has better cost, so to get the Seq Scan cost, I removed the various indexes that the query might use until a Seq Scan was chosen:

Limit Cost Cost Difference
1 Index Scan: cost=1.70..4.79 Seq Scan: cost=1.28..4.57 -0.22
2 Index Scan: cost=1.70..7.87 Seq Scan: cost=1.28..7.86 -0.01
10 Index Scan: cost=1.70..32.53 Seq Scan: cost=1.28..34.17 1.64
25 Index Scan: cost=1.70..78.77 Seq Scan: cost=1.28..83.50 4.73
50 Index Scan: cost=1.70..155.83 Seq Scan: cost=1.28..165.71 9.88
100 Index Scan: cost=1.70..309.95 Seq Scan: cost=1.28..330.14 20.19
1000 Index Scan: cost=1005.74..2343.71 Seq Scan: cost=1005.32..2523.71 180 (this uses a parallel index scan leading to higher cost and query time)

We know that in the past, LIMIT 1 worked with the index but we had to transition to LIMIT 2 because we crossed the threshold where the Seq Scan cost less. Since then, we have hit the same threshold for LIMIT 2 and now have to transition to at least LIMIT 3. Both costs increase as the limit increases, but we can speculate that we have decreased the gap by close to 0.22 after switching to LIMIT 2.

Based on this, I recommend we add additional padding by going with LIMIT 25.

The query time is still reasonable at 168ms cold, 7ms warm. This query runs in a loop, so we can expect most of the time we will see the warm query.

Screenshots or screen recordings

With no returned values:

[28] pry(main)> ContainerRepository.ready_for_import.limit(25)[0]
  ContainerRepository Load (1.6ms)  SELECT "container_repositories".* FROM "container_repositories" INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."created_at" < '2022-01-23 00:00:00' AND (NOT EXISTS (
        SELECT 1
        FROM feature_gates
        WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
        AND feature_gates.key = 'actors'
        AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])
      )) LIMIT 25 /*application:console,db_config_name:main,line:(pry):28:in `__pry__'*/
=> nil

With multiple returned values (I've ommitted .ready_for_import from these to show how the LIMIT is affected and for easier setup):

[29] pry(main)> ContainerRepository.limit(25)[0]
  ContainerRepository Load (0.5ms)  SELECT "container_repositories".* FROM "container_repositories" LIMIT 10 /*application:console,db_config_name:main,line:(pry):29:in `__pry__'*/
=> #<ContainerRepository:0x00007fc98cf642c8

How to set up and validate locally

  1. Create some container_repositories locally:

    26.times { FactoryBot.create(:container_repository, project: Project.first) }
  2. Try running the following to ensure they return the same results:

    Example with no results returned (you may have something returned if you have older container_repositories records stored locally):

    ContainerRepository.ready_for_import.limit(2)[0] # old
    ContainerRepository.ready_for_import.limit(25)[0] # new

    Example with results returned:

    ContainerRepository.limit(2)[0] # old
    ContainerRepository.limit(25)[0] # new

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #365893 (closed);

Edited by Steve Abrams

Merge request reports