Optimize StuckCiJobsWorker running builds query
What does this MR do?
In !62239 (merged) we optimized queries related to selecting pending stuck builds. This applies the same optimization to running builds selected by the StuckCiJobsWorker. Running builds were not initially included in the last MR because the initial solution generated a different and inefficient query plan when the build status was 'running'.
Related Issues: #334399 (closed), #291010 (closed)
I found that the running query is also causing the most frequent timeouts(with 634 events):
Database
Existing Query (Existing indexes)
explain SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."status" IN ('running'))
AND ci_builds.updated_at < '2021-06-23 14:07:20.385873' limit 100;
Limit (cost=0.42..145.19 rows=100 width=1282)
-> Index Scan using index_ci_builds_runner_id_running on ci_builds (cost=0.42..429826.47 rows=296912 width=1282)
Filter: (updated_at < '2021-06-23 14:07:20.385873'::timestamp without time zone)
Time: 9.658 min
- planning: 24.088 ms
- execution: 9.658 min
- I/O read: 9.489 min
- I/O write: 0.000 ms
Shared buffers:
- hits: 129456 (~1011.40 MiB) from the buffer pool
- reads: 433393 (~3.30 GiB) from the OS file cache, including disk I/O
- dirtied: 159093 (~1.20 GiB)
- writes: 0
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/4766/commands/16979
New Query
explain SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."status" IN ('running'))
AND ((ci_builds.created_at BETWEEN '2021-05-30 15:07:20.385754' AND '2021-06-04 14:07:20.385873')
AND (ci_builds.updated_at BETWEEN '2021-05-30 15:07:20.385754' AND '2021-06-04 14:07:20.385873'))
limit 100;
Explain Plan
Limit (cost=0.58..3560.61 rows=26 width=1281) (actual time=0.078..4.541 rows=100 loops=1)
Buffers: shared hit=1070
I/O Timings: read=0.000 write=0.000
-> Index Scan using ci_builds_gitlab_monitor_metrics on public.ci_builds (cost=0.58..3560.61 rows=26 width=1281) (actual time=0.076..4.519 rows=100 loops=1)
Index Cond: (((ci_builds.status)::text = 'running'::text) AND (ci_builds.created_at >= '2021-05-30 15:07:20.385754'::timestamp without time zone) AND (ci_builds.created_at <= '2021-06-04 14:07:20.385873'::timestamp without time zone))
Filter: ((ci_builds.updated_at >= '2021-05-30 15:07:20.385754'::timestamp without time zone) AND (ci_builds.updated_at <= '2021-06-04 14:07:20.385873'::timestamp without time zone))
Rows Removed by Filter: 0
Buffers: shared hit=1070
I/O Timings: read=0.000 write=0.000
Cold Cache Perf
Time: 735.886 ms
- planning: 24.521 ms
- execution: 711.365 ms
- I/O read: 633.322 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 1198 (~9.40 MiB) from the buffer pool
- reads: 13338 (~104.20 MiB) from the OS file cache, including disk I/O
- dirtied: 2540 (~19.80 MiB)
- writes: 0
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/4460/commands/15565
Warm Cache Perf
Time: 5.155 ms
- planning: 0.568 ms
- execution: 4.587 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 1070 (~8.40 MiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 0
- writes: 0
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/4460/commands/15566
Does this MR meet the acceptance criteria?
Conformity
-
I have included a changelog entry, or it's not needed. (Does this MR need a changelog?) -
I have added/updated documentation, or it's not needed. (Is documentation required?) -
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides.
Availability and Testing
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.) -
I have tested this MR in all supported browsers, or it's not needed. -
I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.