Improve performance of registry enqueuer query

Steve Abrams requested to merge 362544-ready-for-import-perf into master

🔎 What does this MR do and why?

We are in the process of migrating all container repositories to the new container registry. GitLab rails drives this process with the EnqueuerWorker. The EnqueuerWorker is responsible for finding the next qualified container repository for import, and then starting the import by making an API request to the registry.

We discovered the main query being used to find the next qualified container repository is underperforming, using a Seq Scan instead of an Index Scan on container_repositories. This was rather confusing because we already have a few indexes that looked like the right index for the job:

"idx_container_repos_on_migration_state_migration_plan_created" btree (migration_state, migration_plan, created_at)
"tmp_idx_container_repos_on_non_migrated" btree (project_id, id) WHERE migration_state <> 'import_done'::text AND created_at < '2022-01-23 00:00:00'::timestamp without time zone

After some digging, we discovered that the PostgreSQL planner is mistakenly estimating the cost of using an index as higher than the cost of a sequential scan, due to startup costs. This means that for this particular query, no index on a related condition will ever be chosen over the sequential scan.

Based on its internal statistics, the planner expects to find a single matching record faster by scanning the table from the beginning; but if we ask it for two records, it decides to use the index instead.
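This behavior can be reproduced locally with `EXPLAIN`. The snippet below is an illustrative sketch (not taken from the MR), using a simplified `WHERE` clause rather than the full production query:

```sql
-- With LIMIT 1 the planner may estimate a Seq Scan as cheaper,
-- expecting to hit a matching row early in the table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM container_repositories
WHERE migration_state = 'default'
LIMIT 1;

-- Raising the limit to 2 tips the cost estimate toward the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM container_repositories
WHERE migration_state = 'default'
LIMIT 2;
```

The exact plans depend on table statistics, so the crossover point may differ in other environments.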

So in this MR, we update from using .take (which applies LIMIT 1) to using .limit(2)[0]. We cannot use .limit(1).first because Rails will translate that into LIMIT 1. Using [0] forces Rails to execute the query with LIMIT 2 and then use the first record returned. If no records are found, nil is returned, just as it would be with .take.
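The semantics of the new lookup can be sketched in plain Ruby (no Rails involved). Here `candidates` stands in for the relation's loaded result set, and `next_for_import` is a hypothetical helper name, not code from the MR:

```ruby
# Load up to two records, return the first, or nil when none match —
# mirroring how relation.limit(2)[0] behaves versus .take.
def next_for_import(candidates)
  candidates.first(2)[0]
end

next_for_import([])                 # nil, same as .take on an empty relation
next_for_import(%w[repo_a repo_b])  # "repo_a"; the second record is discarded
```

The extra record costs one additional row fetch but changes the plan the database chooses, which is the whole point of the workaround.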


You may be asking: "why are we adding what looks like a band-aid to the query rather than fixing this at a deeper level?" The registry migration is continuously running on production, and this query currently executes around 2000 times per hour. We are on a tight timeline to complete the migration, so we want to introduce as few changes and as little risk as possible. Updating the limit and ignoring the extra returned record is the "boring solution" in this case. It allows us to keep moving forward without risking further bugs in the system.

💿 Database

The only update to this query is changing `LIMIT 1` to `LIMIT 2`:

SELECT "container_repositories".* FROM "container_repositories" 
INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" 
INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" 
WHERE "container_repositories"."migration_state" = 'default' 
AND "container_repositories"."created_at" < '2022-01-23 00:00:00' 
AND (
  "container_repositories"."migration_plan" IN ('free', 'early_adopter') 
  OR "container_repositories"."migration_plan" IS NULL
)
AND (NOT EXISTS (
        SELECT 1
        FROM feature_gates
        WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
        AND feature_gates.key = 'actors'
        AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])
-      )) LIMIT 1;
+      )) LIMIT 2;

Note also the large reduction in buffer usage.

Screenshots or screen recordings

With no returned values:

[28] pry(main)> ContainerRepository.ready_for_import.limit(2)[0]
  ContainerRepository Load (1.6ms)  SELECT "container_repositories".* FROM "container_repositories" INNER JOIN "projects" ON "projects"."id" = "container_repositories"."project_id" INNER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" WHERE "container_repositories"."migration_state" = 'default' AND "container_repositories"."created_at" < '2022-01-23 00:00:00' AND "container_repositories"."migration_plan" IN ('ultimate', 'gold', 'ultimate_trial') AND (NOT EXISTS (
        SELECT 1
        FROM feature_gates
        WHERE feature_gates.feature_key = 'container_registry_phase_2_deny_list'
        AND feature_gates.key = 'actors'
        AND feature_gates.value = concat('Group:', namespaces.traversal_ids[1])
      )) LIMIT 2 /*application:console,db_config_name:main,line:(pry):28:in `__pry__'*/
=> nil

With multiple returned values (I've omitted .ready_for_import from these examples to show how the LIMIT is affected and for easier setup):

[29] pry(main)> ContainerRepository.limit(2)[0]
  ContainerRepository Load (0.5ms)  SELECT "container_repositories".* FROM "container_repositories" LIMIT 2 /*application:console,db_config_name:main,line:(pry):29:in `__pry__'*/
=> #<ContainerRepository:0x00007fc98cf642c8

How to set up and validate locally

  1. Create some container_repositories locally:

    10.times { FactoryBot.create(:container_repository, project: Project.first) }
  2. Try running the following to ensure they return the same results:

    Example with no results returned (you may have something returned if you have older container_repositories records stored locally):

    ContainerRepository.ready_for_import.take # old
    ContainerRepository.ready_for_import.limit(2)[0] # new

    Example with results returned:

    ContainerRepository.take # old
    ContainerRepository.limit(2)[0] # new

📏 MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related: #362544 (closed)
