Skip to content

Resolve "Geo: High number of sequential scans in project_repository_states"

What does this MR do?

This forces the query to find unverified projects to be verified on the primary noge to use the appropriated indexes instead of sequential scans in project_repository_states table.

This is the query plan against a database with ~400K rows in projects table, ~322K rows in the project_repository_states table:

  • Query
Geo::RepositoryVerificationFinder.new.find_unverified_projects(batch_size: 1000).where(repository_storage: 'default')
  • Before
=> EXPLAIN for: SELECT  "projects"."id" FROM "projects" LEFT OUTER JOIN "project_repository_states" ON "projects"."id" = "project_repository_states"."project_id" WHERE "project_repository_states"."id" IS NULL AND "projects"."repository_storage" = $1  ORDER BY projects.last_repository_updated_at ASC NULLS LAST LIMIT 1000 [["repository_storage", "default"]]
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=42104.93..42104.94 rows=1 width=12)
   ->  Sort  (cost=42104.93..42104.94 rows=1 width=12)
         Sort Key: projects.last_repository_updated_at
         ->  Hash Left Join  (cost=16133.00..42104.92 rows=1 width=12)
               Hash Cond: (projects.id = project_repository_states.project_id)
               Filter: (project_repository_states.id IS NULL)
               ->  Seq Scan on projects  (cost=0.00..16452.33 rows=398026 width=12)
                     Filter: ((repository_storage)::text = 'default'::text)
               ->  Hash  (cost=11250.00..11250.00 rows=297600 width=8)
                     ->  Seq Scan on project_repository_states  (cost=0.00..11250.00 rows=297600 width=8)
  • After
=> EXPLAIN for: SELECT  "projects"."id" FROM "projects" LEFT OUTER JOIN "project_repository_states" ON "projects"."id" = "project_repository_states"."project_id" WHERE "project_repository_states"."project_id" IS NULL AND "projects"."repository_storage" = $1  ORDER BY projects.last_repository_updated_at ASC NULLS LAST LIMIT 1000 [["repository_storage", "default"]]
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.84..8482.74 rows=1000 width=12)
   ->  Nested Loop Anti Join  (cost=0.84..225712.68 rows=26611 width=12)
         ->  Index Scan using index_projects_on_last_repository_updated_at on projects  (cost=0.42..31747.09 rows=398026 width=12)
               Filter: ((repository_storage)::text = 'default'::text)
         ->  Index Only Scan using index_project_repository_states_on_project_id on project_repository_states  (cost=0.42..0.48 rows=1 width=4)
               Index Cond: (project_id = projects.id)

Does this MR meet the acceptance criteria?

  • Changelog entry added, if necessary
  • Documentation created/updated
  • API support added
  • Tests added for this feature/bug
  • Review
    • Has been reviewed by UX
    • Has been reviewed by Frontend
    • Has been reviewed by Backend
    • Has been reviewed by Database
  • EE specific content should be in the top level /ee folder
  • Conform by the merge request performance guides
  • Conform by the style guides
  • Squashed related commits together
  • Internationalization required/considered
  • If paid feature, have we considered GitLab.com plan and how it works for groups and is there a design for promoting it to users who aren't on the correct plan
  • End-to-end tests pass (package-qa manual pipeline job)

What are the relevant issue numbers?

Closes #5513 (closed)

Edited by Douglas Barbosa Alexandre

Merge request reports