Optimize StuckCiJobsWorker running builds query
What does this MR do?
Optimizes the performance of a query that times out finding running builds with the stuck ci jobs worker.
Issue: #291010 (closed)
Thanks to @mbobin for discovering this way of optimizing the query. Which is based off !68880 (merged).
Database
New Query
Final Query implemented in this MR
explain SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND "ci_builds"."id"
IN (
SELECT "ci_builds"."id"
FROM "ci_builds"
WHERE
"ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."status" IN ('running'))
AND (ci_builds.created_at < '2021-08-24 17:41:01.993524')
order by created_at, project_id
) AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------
Limit (cost=472222.26..472594.62 rows=100 width=1289) (actual time=2.670..2.672 rows=0 loops=1)
Buffers: shared hit=821
-> Nested Loop (cost=472222.26..1629886.42 rows=310895 width=1289) (actual time=2.669..2.670 rows=0 loop
s=1)
Buffers: shared hit=821
-> HashAggregate (cost=472221.68..475475.50 rows=325382 width=4) (actual time=0.828..2.394 rows=30
loops=1)
Group Key: ci_builds_1.id
Buffers: shared hit=665
-> Index Scan using ci_builds_gitlab_monitor_metrics on ci_builds ci_builds_1 (cost=0.58..46
8154.40 rows=325382 width=16) (actual time=0.031..0.801 rows=30 loops=1)
Index Cond: (((status)::text = 'running'::text) AND (created_at < '2021-08-24 17:41:01.9
93524'::timestamp without time zone))
Buffers: shared hit=665
-> Index Scan using ci_builds_pkey on ci_builds (cost=0.58..3.55 rows=1 width=1289) (actual time=0
.009..0.009 rows=0 loops=30)
Index Cond: (id = ci_builds_1.id)
Filter: ((updated_at < '2021-08-24 17:41:01.994189'::timestamp without time zone) AND ((type):
:text = 'Ci::Build'::text))
Rows Removed by Filter: 1
Buffers: shared hit=156
Planning Time: 0.818 ms
Execution Time: 10.773 ms
(17 rows)
Time: 13.633 ms
Cold cache (81.192 ms):
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6179/commands/20850 Warm cache (4.062 ms):
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6179/commands/20851 Although running this later in the afternoon(close to when the other ones were run) I get (1.523 s cold and 28ms warm). cold: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20884 warm: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20885Original Query implemented in this MR
SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND "ci_builds"."id"
IN (
SELECT "ci_builds"."id"
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."status" IN ('running'))
AND (ci_builds.created_at < '2021-08-24 17:41:01.993524')
)
AND (updated_at < '2021-08-24 17:41:01.994189')
limit 100;
Existing Query
SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."status" IN ('running'))
AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;
cold cache(1.140s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20878
warm cache(28.366ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20879
Alternate Query(explored in a past MR):
explain SELECT "ci_builds".*
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build'
AND ("ci_builds"."status" IN ('running'))
AND (created_at < '2021-08-24 17:41:01.993524')
AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;
cold cache(1.086s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20881
warm cache(25.876ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20882
How to setup and validate locally (strongly suggested)
- Enable the feature flag
Feature.enable(:ci_new_query_for_running_stuck_jobs) - Add some
runningbuilds that were created_at or updated_at over an hour ago.earlier_than_timeout = Time.current - 2.hours pipeline = FactoryBot.create(:ci_pipeline, project: Project.first) build_1 = FactoryBot.create(:ci_build, :running, pipeline: pipeline) build_2 = FactoryBot.create(:ci_build, :running, pipeline: pipeline) build_1.update_columns(created_at: earlier_than_timeout, updated_at: earlier_than_timeout) build_2.update_columns(created_at: earlier_than_timeout, updated_at: earlier_than_timeout) - Run the StuckCiJobsWorker
StuckCiJobsWorker.new.perform - Ensure the builds were marked as stuck/failed
Ci::Build.find(build_1.id) => #<Ci::Build:0x00007fb8e9c15620 id: 947, status: "failed", ... failure_reason: "stuck_or_timeout_failure",
Does this MR meet the acceptance criteria?
Conformity
-
I have included changelog trailers, or none are needed. (Does this MR need a changelog?) -
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?) -
I have self-reviewed this MR per code review guidelines. -
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines) -
I have followed the style guides.
Availability and Testing
I just added tests that it behaves the same before and after the feature flag.
-
I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)