Add index on merge_request_diffs for id and project_id
What does this MR do and why?
This index will eventually be used to fix a timeout:
PG::QueryCanceled: ERROR: canceling statement due to statement timeout
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:111:in `public_send'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:111:in `block in read_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:65:in `read'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:110:in `read_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:48:in `select_all'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `public_send'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `block in write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:141:in `block in read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:228:in `retry_with_backoff'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:130:in `read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:126:in `write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:96:in `method_missing'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `public_send'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `block in write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:141:in `block in read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:228:in `retry_with_backoff'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:130:in `read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:126:in `write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:96:in `method_missing'
/opt/gitlab/embedded/service/gitlab-rails/gems/activerecord-gitlab/lib/active_record/gitlab_patches/rescue_from.rb:31:in `exec_queries'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/keep_around.rake:61:in `add_merge_request_diff_shas'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/keep_around.rake:28:in `block (4 levels) in <top (required)>'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/keep_around.rake:112:in `block in create_csv'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/keep_around.rake:111:in `open'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/keep_around.rake:111:in `create_csv'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/keep_around.rake:16:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:25:in `load'
/opt/gitlab/embedded/bin/bundle:25:in `<main>'
Tasks: TOP => gitlab:keep_around:orphaned
(See full trace by running task with --trace)
In the MR where this query was added. We already know that there are no available indexes to improve the performance.
The objective here is to add an index and change this query. The best solution appears to be to use the denormalized merge_request_diffs.project_id. We can't use this currently because the script enumerates over all merge_request_diffs for a given project and this requires matching id ranges and this appears to make the proposed query, without the index, worse than the current query. So the order of operations is:
- Async add the index (this MR)
- Sync add the index
- Update the query to use
merge_request_diffs.project_idinstead of joining throughmerge_requests
Functionally this would complete this project, but because this is a huge table we likely have to remove the existing index on project_id called index_merge_request_diffs_on_project_id. This would take place over two releases also.
Queries
Current query
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/36335/commands/111817
SELECT
"merge_request_diffs"."id",
"merge_request_diffs"."start_commit_sha",
"merge_request_diffs"."head_commit_sha"
FROM
"merge_request_diffs"
INNER JOIN "merge_requests" ON "merge_requests"."id" = "merge_request_diffs"."merge_request_id"
WHERE
"merge_requests"."target_project_id" = 278964
ORDER BY
"merge_request_diffs"."merge_request_id" ASC, "merge_request_diffs"."id" ASC
LIMIT 1000;
Limit (cost=15112.02..23609.74 rows=1000 width=88) (actual time=2659.472..3791.300 rows=1000 loops=1)
Buffers: shared hit=3071 read=11332 dirtied=1373
WAL: records=1526 fpi=1373 bytes=6976413
I/O Timings: read=8607.009 write=0.000
-> Incremental Sort (cost=15112.02..3678776.26 rows=431135 width=88) (actual time=2659.470..3791.192 rows=1000 loops=1)
Sort Key: merge_request_diffs.merge_request_id, merge_request_diffs.id
Buffers: shared hit=3071 read=11332 dirtied=1373
WAL: records=1526 fpi=1373 bytes=6976413
I/O Timings: read=8607.009 write=0.000
-> Nested Loop (cost=15090.36..3667094.93 rows=431135 width=88) (actual time=2601.205..3789.707 rows=1023 loops=1)
Buffers: shared hit=3071 read=11332 dirtied=1373
WAL: records=1526 fpi=1373 bytes=6976413
I/O Timings: read=8607.009 write=0.000
-> Gather Merge (cost=15089.78..34741.64 rows=168734 width=4) (actual time=2597.774..2605.222 rows=735 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=607 read=9841 dirtied=1355
WAL: records=1505 fpi=1355 bytes=6832124
I/O Timings: read=7436.132 write=0.000
-> Sort (cost=14089.76..14265.52 rows=70306 width=4) (actual time=2551.588..2551.714 rows=1427 loops=3)
Sort Key: merge_requests.id
Sort Method: quicksort Memory: 1537kB
Buffers: shared hit=607 read=9841 dirtied=1355
WAL: records=1505 fpi=1355 bytes=6832124
I/O Timings: read=7436.132 write=0.000
-> Parallel Index Only Scan using index_on_merge_requests_for_latest_diffs on public.merge_requests (cost=0.57..8429.65 rows=70306 width=4) (actual time=4.122..2540.593 rows=58437 loops=3)
Index Cond: (merge_requests.target_project_id = 278964)
Heap Fetches: 13495
Buffers: shared hit=589 read=9841 dirtied=1355
WAL: records=1505 fpi=1355 bytes=6832124
I/O Timings: read=7436.132 write=0.000
-> Index Scan using index_merge_request_diffs_on_merge_request_id_and_id on public.merge_request_diffs (cost=0.57..21.28 rows=25 width=88) (actual time=1.296..1.610 rows=1 loops=735)
Index Cond: (merge_request_diffs.merge_request_id = merge_requests.id)
Buffers: shared hit=2464 read=1491 dirtied=18
WAL: records=21 fpi=18 bytes=144289
I/O Timings: read=1170.876 write=0.000
Settings: work_mem = '100MB', effective_cache_size = '472585MB', jit = 'off', random_page_cost = '1.5', seq_page_cost = '4'
Proposed query
SELECT
"merge_request_diffs"."id",
"merge_request_diffs"."start_commit_sha",
"merge_request_diffs"."head_commit_sha"
FROM
"merge_request_diffs"
WHERE
"merge_request_diffs"."project_id" = 278964
ORDER BY
"merge_request_diffs"."id" ASC
LIMIT 1000;
Without index
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/36335/commands/111818
Limit (cost=1000.60..88857.81 rows=1000 width=84) (actual time=43296.752..57091.541 rows=1000 loops=1)
Buffers: shared hit=282785 read=210071 dirtied=1300
WAL: records=1608 fpi=1300 bytes=10434171
I/O Timings: read=114679.328 write=0.000
-> Gather Merge (cost=1000.60..100707243.35 rows=1146249 width=84) (actual time=43296.750..57091.456 rows=1000 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=282785 read=210071 dirtied=1300
WAL: records=1608 fpi=1300 bytes=10434171
I/O Timings: read=114679.328 write=0.000
-> Parallel Index Scan using merge_request_diffs_pkey on public.merge_request_diffs (cost=0.57..100573937.75 rows=477604 width=84) (actual time=2771.333..38966.978 rows=438 loops=3)
Filter: (merge_request_diffs.project_id = 278964)
Rows Removed by Filter: 339183
Buffers: shared hit=282785 read=210071 dirtied=1300
WAL: records=1608 fpi=1300 bytes=10434171
I/O Timings: read=114679.328 write=0.000
Settings: effective_cache_size = '472585MB', jit = 'off', random_page_cost = '1.5', seq_page_cost = '4', work_mem = '100MB'
With index
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/36335/commands/111820
Limit (cost=0.57..1229.46 rows=1000 width=84) (actual time=4.177..831.189 rows=1000 loops=1)
Buffers: shared hit=25 read=940
I/O Timings: read=822.867 write=0.000
-> Index Scan using index_merge_request_diffs_on_project_id_and_id on public.merge_request_diffs (cost=0.57..1409024.91 rows=1146588 width=84) (actual time=4.175..830.757 rows=1000 loops=1)
Index Cond: (merge_request_diffs.project_id = 278964)
Buffers: shared hit=25 read=940
I/O Timings: read=822.867 write=0.000
Settings: seq_page_cost = '4', work_mem = '100MB', effective_cache_size = '472585MB', jit = 'off', random_page_cost = '1.5'
References
Please include cross links to any resources that are relevant to this MR. This will give reviewers and future readers helpful context to give an efficient review of the changes introduced.
- Follow up add index sync issue #520027 (closed)
- Follow up to remove existing index issue #521444 (closed)
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
Screenshots are required for UI changes, and strongly recommended for all other merge requests.
| Before | After |
|---|---|
How to set up and validate locally
Numbered steps to set up and validate the change are strongly suggested.