Skip to content

Use more-efficient indexing for the MergeRequestDiff storage migration

What does this MR do?

Alters the query we use to select merge request diffs to migrate to rely on the newly-added files_count column.

This is populated for old records via a background migration merged recently: !38936 (merged) . However, if the files_count column is nil for any case, it doesn't really matter - the scheduler is stateless and runs forever, so records will be added to the list to migrate as the files_count column gets filled.

Selecting records with the new query (when: always)

Here's the new query we use when plucking a list of 1000 records to migrate:

SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
WHERE (NOT "merge_request_diffs"."stored_externally" OR "merge_request_diffs"."stored_externally" IS NULL)
AND (files_count > 0)

With seq scan disabled (I only have ~100 rows), this gives me the following plan locally:

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using index_merge_request_diffs_by_id_partial on merge_request_diffs  (cost=0.14..10.14 rows=133 width=4)
(1 row)

Previously, we'd have a WHERE (SELECT ... from merge_request_diff_files WHERE merge_request_diffs.id = merge_request_diff_files.id), which started timing out once the migration process got ~50% through the 72 million rows we have in the merge_reuqest_diffs table.

Selecting records with the new query (when: outdated)

We modify these queries too, but we don't use them on GitLab.com and there's no expectation that they scale to .com levels at the moment. Still, here they are:

Old merged diffs

SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
INNER JOIN "merge_request_metrics"
  ON "merge_request_diffs"."merge_request_id" = "merge_request_metrics"."merge_request_id"
  AND "merge_request_metrics"."merged_at" IS NOT NULL
INNER JOIN "merge_requests" 
  ON "merge_request_diffs"."merge_request_id" = "merge_requests"."id"
WHERE (
  "merge_request_diffs"."stored_externally" = FALSE OR
  "merge_request_diffs"."stored_externally" IS NULL
)
AND "merge_request_diffs"."files_count" > 0 
AND "merge_requests"."state_id" = 3
AND "merge_request_metrics"."merged_at" <= '2020-08-10 13:41:55.499024'
AND "merge_request_metrics"."merged_at" IS NOT NULL
LIMIT 1000
 Limit  (cost=0.42..15.05 rows=1 width=4)
   ->  Nested Loop  (cost=0.42..15.05 rows=1 width=4)
         ->  Nested Loop  (cost=0.27..14.07 rows=1 width=12)
               ->  Index Only Scan using index_merge_request_metrics_on_merge_request_id_and_merged_at on merge_request_metrics  (cost=0.12..5.89 rows=1 width=4)
                     Index Cond: (merged_at <= '2020-08-10 13:44:14.198804'::timestamp without time zone)
               ->  Index Scan using index_merge_request_diffs_on_merge_request_id_and_id on merge_request_diffs  (cost=0.14..8.16 rows=1 width=8)
                     Index Cond: (merge_request_id = merge_request_metrics.merge_request_id)
                     Filter: (((NOT stored_externally) OR (stored_externally IS NULL)) AND (files_count > 0))
         ->  Index Scan using merge_requests_pkey on merge_requests  (cost=0.14..0.56 rows=1 width=4)
               Index Cond: (id = merge_request_diffs.merge_request_id)
               Filter: (state_id = 3)
(11 rows)

Old closed diffs

SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
INNER JOIN "merge_requests"
  ON "merge_requests"."id" = "merge_request_diffs"."merge_request_id"
INNER JOIN "merge_request_metrics"
  ON "merge_request_metrics"."merge_request_id" = "merge_requests"."id"
WHERE (
  "merge_request_diffs"."stored_externally" = 'f' 
  OR "merge_request_diffs"."stored_externally" IS NULL
)
AND "merge_request_diffs"."files_count" > 0
AND "merge_requests"."state_id" = 2
AND "merge_request_metrics"."latest_closed_at" <= '2020-08-10 13:47:16.224416'
LIMIT 1000
 Limit  (cost=0.42..17.30 rows=1 width=4)
   ->  Nested Loop  (cost=0.42..17.30 rows=1 width=4)
         ->  Nested Loop  (cost=0.27..16.32 rows=1 width=12)
               ->  Index Scan using index_merge_request_metrics_on_latest_closed_at on merge_request_metrics  (cost=0.12..8.14 rows=1 width=4)
                     Index Cond: (latest_closed_at <= '2020-08-10 13:47:16.224416+00'::timestamp with time zone)
               ->  Index Scan using index_merge_request_diffs_on_merge_request_id_and_id on merge_request_diffs  (cost=0.14..8.16 rows=1 width=8)
                     Index Cond: (merge_request_id = merge_request_metrics.merge_request_id)
                     Filter: (((NOT stored_externally) OR (stored_externally IS NULL)) AND (files_count > 0))
         ->  Index Scan using merge_requests_pkey on merge_requests  (cost=0.14..0.56 rows=1 width=4)
               Index Cond: (id = merge_request_diffs.merge_request_id)
               Filter: (state_id = 2)

Not latest diffs

SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
INNER JOIN "merge_requests"
  ON "merge_requests"."id" = "merge_request_diffs"."merge_request_id"
  AND "merge_request_diffs"."id" != "merge_requests"."latest_merge_request_diff_id"
WHERE (
  "merge_request_diffs"."stored_externally" = 'f'
  OR "merge_request_diffs"."stored_externally" IS NULL
)
AND "merge_request_diffs"."files_count" > 0
LIMIT 1000
 Limit  (cost=47.30..57.35 rows=200 width=4)
   ->  Hash Join  (cost=47.30..57.35 rows=200 width=4)
         Hash Cond: (merge_request_diffs.merge_request_id = merge_requests.id)
         Join Filter: (merge_request_diffs.id <> merge_requests.latest_merge_request_diff_id)
         ->  Bitmap Heap Scan on merge_request_diffs  (cost=9.20..18.71 rows=201 width=8)
               Recheck Cond: ((files_count > 0) AND ((NOT stored_externally) OR (stored_externally IS NULL)))
               ->  Bitmap Index Scan on index_merge_request_diffs_by_id_partial  (cost=0.00..9.15 rows=201 width=0)
         ->  Hash  (cost=35.45..35.45 rows=212 width=8)
               ->  Index Scan using merge_requests_pkey on merge_requests  (cost=0.14..35.45 rows=212 width=8)

rails db:migrate

lupine@t470p:~/dev/gdk/gitlab$ bin/rails db:migrate
== 20200813143304 AddNewExternalDiffMigrationIndex: migrating =================
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:merge_request_diffs, :id, {:name=>"index_merge_request_diffs_by_id_partial", :where=>"files_count > 0 AND ((NOT stored_externally) OR (stored_externally IS NULL))", :algorithm=>:concurrently})
   -> 0.0023s
-- add_index(:merge_request_diffs, :id, {:name=>"index_merge_request_diffs_by_id_partial", :where=>"files_count > 0 AND ((NOT stored_externally) OR (stored_externally IS NULL))", :algorithm=>:concurrently})
   -> 0.0080s
== 20200813143304 AddNewExternalDiffMigrationIndex: migrated (0.0109s) ========

== 20200813143356 RemoveOldExternalDiffMigrationIndex: migrating ==============
-- transaction_open?()
   -> 0.0000s
-- indexes(:merge_request_diffs)
   -> 0.0021s
== 20200813143356 RemoveOldExternalDiffMigrationIndex: migrated (0.0027s) =====

Adding the new index seems fast: https://gitlab.slack.com/archives/CLJMDRD8C/p1597671440196300

The query has been executed. Duration: 7.359 min

This might be because most of the rows still have files_count: nil, though 🤷

rails db:migrate:down

lupine@t470p:~/dev/gdk/gitlab$ VERSION=20200813143356 bin/rails db:migrate:down
== 20200813143356 RemoveOldExternalDiffMigrationIndex: reverting ==============
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:merge_request_diffs, [:merge_request_id, :id], {:where=>{:stored_externally=>[nil, false]}, :algorithm=>:concurrently})
   -> 0.0045s
== 20200813143356 RemoveOldExternalDiffMigrationIndex: reverted (0.0048s) =====

lupine@t470p:~/dev/gdk/gitlab$ VERSION=20200813143304 bin/rails db:migrate:down
== 20200813143304 AddNewExternalDiffMigrationIndex: reverting =================
-- transaction_open?()
   -> 0.0000s
-- indexes(:merge_request_diffs)
   -> 0.0033s
-- remove_index(:merge_request_diffs, {:algorithm=>:concurrently, :name=>"index_merge_request_diffs_by_id_partial"})
   -> 0.0038s
== 20200813143304 AddNewExternalDiffMigrationIndex: reverted (0.0075s) ========

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team

Closes #227570 (closed)

Edited by Nick Thomas

Merge request reports