Resolve "Geo: High number of sequential scans in project_repository_states" (!5348) · Merge requests · GitLab.org / GitLab

What does this MR do?

This forces the query to find unverified projects to be verified on the primary noge to use the appropriated indexes instead of sequential scans in project_repository_states table.

This is the query plan against a database with ~400K rows in projects table, ~322K rows in the project_repository_states table:

Query

Geo::RepositoryVerificationFinder.new.find_unverified_projects(batch_size: 1000).where(repository_storage: 'default')

Before

=> EXPLAIN for: SELECT  "projects"."id" FROM "projects" LEFT OUTER JOIN "project_repository_states" ON "projects"."id" = "project_repository_states"."project_id" WHERE "project_repository_states"."id" IS NULL AND "projects"."repository_storage" = $1  ORDER BY projects.last_repository_updated_at ASC NULLS LAST LIMIT 1000 [["repository_storage", "default"]]
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=42104.93..42104.94 rows=1 width=12)
   ->  Sort  (cost=42104.93..42104.94 rows=1 width=12)
         Sort Key: projects.last_repository_updated_at
         ->  Hash Left Join  (cost=16133.00..42104.92 rows=1 width=12)
               Hash Cond: (projects.id = project_repository_states.project_id)
               Filter: (project_repository_states.id IS NULL)
               ->  Seq Scan on projects  (cost=0.00..16452.33 rows=398026 width=12)
                     Filter: ((repository_storage)::text = 'default'::text)
               ->  Hash  (cost=11250.00..11250.00 rows=297600 width=8)
                     ->  Seq Scan on project_repository_states  (cost=0.00..11250.00 rows=297600 width=8)

After

=> EXPLAIN for: SELECT  "projects"."id" FROM "projects" LEFT OUTER JOIN "project_repository_states" ON "projects"."id" = "project_repository_states"."project_id" WHERE "project_repository_states"."project_id" IS NULL AND "projects"."repository_storage" = $1  ORDER BY projects.last_repository_updated_at ASC NULLS LAST LIMIT 1000 [["repository_storage", "default"]]
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.84..8482.74 rows=1000 width=12)
   ->  Nested Loop Anti Join  (cost=0.84..225712.68 rows=26611 width=12)
         ->  Index Scan using index_projects_on_last_repository_updated_at on projects  (cost=0.42..31747.09 rows=398026 width=12)
               Filter: ((repository_storage)::text = 'default'::text)
         ->  Index Only Scan using index_project_repository_states_on_project_id on project_repository_states  (cost=0.42..0.48 rows=1 width=4)
               Index Cond: (project_id = projects.id)

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Closes #5513 (closed)

Resolve "Geo: High number of sequential scans in project_repository_states"

What does this MR do?

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

Merge request reports