Optimize StuckCiJobsWorker running builds query (!68891) · Merge requests · GitLab.org / GitLab

What does this MR do?

Optimizes the performance of a query that times out finding running builds with the stuck ci jobs worker.

Thanks to @mbobin for discovering this way of optimizing the query. Which is based off !68880 (merged).

Database

New Query

Final Query implemented in this MR

explain SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND "ci_builds"."id" 
IN (
    SELECT "ci_builds"."id" 
    FROM "ci_builds" 
    WHERE 
    "ci_builds"."type" = 'Ci::Build' 
    AND ("ci_builds"."status" IN ('running')) 
    AND (ci_builds.created_at < '2021-08-24 17:41:01.993524')
    order by created_at, project_id
) AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;

                                                                                      QUERY PLAN

-------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------
 Limit  (cost=472222.26..472594.62 rows=100 width=1289) (actual time=2.670..2.672 rows=0 loops=1)
   Buffers: shared hit=821
   ->  Nested Loop  (cost=472222.26..1629886.42 rows=310895 width=1289) (actual time=2.669..2.670 rows=0 loop
s=1)
         Buffers: shared hit=821
         ->  HashAggregate  (cost=472221.68..475475.50 rows=325382 width=4) (actual time=0.828..2.394 rows=30
 loops=1)
               Group Key: ci_builds_1.id
               Buffers: shared hit=665
               ->  Index Scan using ci_builds_gitlab_monitor_metrics on ci_builds ci_builds_1  (cost=0.58..46
8154.40 rows=325382 width=16) (actual time=0.031..0.801 rows=30 loops=1)
                     Index Cond: (((status)::text = 'running'::text) AND (created_at < '2021-08-24 17:41:01.9
93524'::timestamp without time zone))
                     Buffers: shared hit=665
         ->  Index Scan using ci_builds_pkey on ci_builds  (cost=0.58..3.55 rows=1 width=1289) (actual time=0
.009..0.009 rows=0 loops=30)
               Index Cond: (id = ci_builds_1.id)
               Filter: ((updated_at < '2021-08-24 17:41:01.994189'::timestamp without time zone) AND ((type):
:text = 'Ci::Build'::text))
               Rows Removed by Filter: 1
               Buffers: shared hit=156
 Planning Time: 0.818 ms
 Execution Time: 10.773 ms
(17 rows)

Time: 13.633 ms

Original Query implemented in this MR

SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND "ci_builds"."id" 
IN (
    SELECT "ci_builds"."id" 
    FROM "ci_builds" 
    WHERE "ci_builds"."type" = 'Ci::Build' 
    AND ("ci_builds"."status" IN ('running')) 
    AND (ci_builds.created_at < '2021-08-24 17:41:01.993524')
) 
AND (updated_at < '2021-08-24 17:41:01.994189')
limit 100;

Cold cache (81.192 ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6179/commands/20850

Warm cache (4.062 ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6179/commands/20851

Although running this later in the afternoon(close to when the other ones were run) I get (1.523 s cold and 28ms warm).

cold: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20884

warm: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20885

Existing Query

SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND ("ci_builds"."status" IN ('running')) 
AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;

cold cache(1.140s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20878

warm cache(28.366ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20879

Alternate Query(explored in a past MR):

explain SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND ("ci_builds"."status" IN ('running')) 
AND (created_at < '2021-08-24 17:41:01.993524')
AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;

cold cache(1.086s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20881

warm cache(25.876ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20882

How to setup and validate locally (strongly suggested)

Enable the feature flag

Feature.enable(:ci_new_query_for_running_stuck_jobs)

Add some running builds that were created_at or updated_at over an hour ago.

earlier_than_timeout = Time.current - 2.hours
pipeline = FactoryBot.create(:ci_pipeline, project: Project.first)
build_1 = FactoryBot.create(:ci_build, :running, pipeline: pipeline)
build_2 = FactoryBot.create(:ci_build, :running, pipeline: pipeline)
build_1.update_columns(created_at: earlier_than_timeout, updated_at: earlier_than_timeout)
build_2.update_columns(created_at: earlier_than_timeout, updated_at: earlier_than_timeout)

Run the StuckCiJobsWorker
```
StuckCiJobsWorker.new.perform
```

Ensure the builds were marked as stuck/failed

Ci::Build.find(build_1.id)
=> #<Ci::Build:0x00007fb8e9c15620
id: 947,
status: "failed",
...
failure_reason: "stuck_or_timeout_failure",

Does this MR meet the acceptance criteria?

Conformity

I have included changelog trailers, or none are needed. (Does this MR need a changelog?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.

Availability and Testing

I just added tests that it behaves the same before and after the feature flag.

I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)

Edited Aug 26, 2021 by Allison Browne

Optimize StuckCiJobsWorker running builds query