Skip to content

Resolve "Geo - Revisit the indexes on the project_repository_states table"

What does this MR do?

Improve the query returned by Geo::RepositoryVerificationFinder#find_outdated_projects to find projects where the checksum needs to be recalculated. Follow the query plans were run locally against a database with ~400k rows in the projects, and project_repository_states tables.

EXPLAIN ANALYZE
SELECT
    "projects"."id"
FROM
    "projects"
    INNER JOIN "project_repository_states" ON "project_repository_states"."project_id" = "projects"."id"
    WHERE ("project_repository_states"."repository_verification_checksum" IS NULL
        AND "project_repository_states"."last_repository_verification_failure" IS NULL
        OR "project_repository_states"."wiki_verification_checksum" IS NULL
        AND "project_repository_states"."last_wiki_verification_failure" IS NULL)
    AND "projects"."repository_storage" = 'default'
LIMIT 1000;
  • Before:
Limit  (cost=3.58..302.13 rows=1000 width=4) (actual time=330.079..335.280 rows=1000 loops=1)
  ->  Merge Join  (cost=3.58..85720.56 rows=287110 width=4) (actual time=330.078..335.220 rows=1000 loops=1)
        Merge Cond: (projects.id = project_repository_states.project_id)
        ->  Index Scan using projects_pkey on projects  (cost=0.42..34092.75 rows=400011 width=4) (actual time=0.006..94.979 rows=213383 loops=1)
              Filter: ((repository_storage)::text = 'default'::text)
        ->  Index Scan using index_project_repository_states_on_project_id on project_repository_states  (cost=0.42..48412.66 rows=287110 width=4) (actual time=224.091..226.711 rows=1000 loops=1)
              Filter: (((repository_verification_checksum IS NULL) AND (last_repository_verification_failure IS NULL)) OR ((wiki_verification_checksum IS NULL) AND (last_wiki_verification_failure IS NULL)))
              Rows Removed by Filter: 212383
Planning time: 0.485 ms
Execution time: 335.348 ms
  • After:
Limit  (cost=1.40..167.31 rows=1000 width=4) (actual time=135.072..136.107 rows=1000 loops=1)
  ->  Merge Join  (cost=1.40..47635.61 rows=287110 width=4) (actual time=135.072..136.024 rows=1000 loops=1)
        Merge Cond: (projects.id = project_repository_states.project_id)
        ->  Index Scan using projects_pkey on projects  (cost=0.42..34092.75 rows=400011 width=4) (actual time=0.005..120.190 rows=213383 loops=1)
              Filter: ((repository_storage)::text = 'default'::text)
        ->  Index Only Scan using idx_repository_states_outdated_checksums on project_repository_states  (cost=0.42..9232.16 rows=287110 width=4) (actual time=0.031..0.182 rows=1000 loops=1)
              Heap Fetches: 298
Planning time: 0.470 ms
Execution time: 136.184 ms

Does this MR meet the acceptance criteria?

  • Changelog entry added, if necessary
  • Conform by the code review guidelines
    • Has been reviewed by a Backend maintainer
    • Has been reviewed by a Database specialist
  • EE specific content should be in the top level /ee folder
  • Conform by the merge request performance guides
  • Conform by the style guides
  • If you have multiple commits, please combine them into a few logically organized commits by squashing them
  • Internationalization required/considered
  • If paid feature, have we considered GitLab.com plan and how it works for groups and is there a design for promoting it to users who aren't on the correct plan
  • End-to-end tests pass (package-qa manual pipeline job)

What are the relevant issue numbers?

Closes #6014 (closed)

Edited by Douglas Barbosa Alexandre

Merge request reports