Skip to content

Optimize StuckCiJobsWorker running builds query

What does this MR do?

Optimizes the performance of a query that times out finding running builds with the stuck ci jobs worker.

Issue: #291010 (closed)

Thanks to @mbobin for discovering this way of optimizing the query. Which is based off !68880 (merged).

Database

New Query

Final Query implemented in this MR

explain SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND "ci_builds"."id" 
IN (
    SELECT "ci_builds"."id" 
    FROM "ci_builds" 
    WHERE 
    "ci_builds"."type" = 'Ci::Build' 
    AND ("ci_builds"."status" IN ('running')) 
    AND (ci_builds.created_at < '2021-08-24 17:41:01.993524')
    order by created_at, project_id
) AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;
                                                                                      QUERY PLAN

-------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------
 Limit  (cost=472222.26..472594.62 rows=100 width=1289) (actual time=2.670..2.672 rows=0 loops=1)
   Buffers: shared hit=821
   ->  Nested Loop  (cost=472222.26..1629886.42 rows=310895 width=1289) (actual time=2.669..2.670 rows=0 loop
s=1)
         Buffers: shared hit=821
         ->  HashAggregate  (cost=472221.68..475475.50 rows=325382 width=4) (actual time=0.828..2.394 rows=30
 loops=1)
               Group Key: ci_builds_1.id
               Buffers: shared hit=665
               ->  Index Scan using ci_builds_gitlab_monitor_metrics on ci_builds ci_builds_1  (cost=0.58..46
8154.40 rows=325382 width=16) (actual time=0.031..0.801 rows=30 loops=1)
                     Index Cond: (((status)::text = 'running'::text) AND (created_at < '2021-08-24 17:41:01.9
93524'::timestamp without time zone))
                     Buffers: shared hit=665
         ->  Index Scan using ci_builds_pkey on ci_builds  (cost=0.58..3.55 rows=1 width=1289) (actual time=0
.009..0.009 rows=0 loops=30)
               Index Cond: (id = ci_builds_1.id)
               Filter: ((updated_at < '2021-08-24 17:41:01.994189'::timestamp without time zone) AND ((type):
:text = 'Ci::Build'::text))
               Rows Removed by Filter: 1
               Buffers: shared hit=156
 Planning Time: 0.818 ms
 Execution Time: 10.773 ms
(17 rows)

Time: 13.633 ms

Original Query implemented in this MR
SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND "ci_builds"."id" 
IN (
    SELECT "ci_builds"."id" 
    FROM "ci_builds" 
    WHERE "ci_builds"."type" = 'Ci::Build' 
    AND ("ci_builds"."status" IN ('running')) 
    AND (ci_builds.created_at < '2021-08-24 17:41:01.993524')
) 
AND (updated_at < '2021-08-24 17:41:01.994189')
limit 100;

Cold cache (81.192 ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6179/commands/20850

Warm cache (4.062 ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6179/commands/20851

Although running this later in the afternoon(close to when the other ones were run) I get (1.523 s cold and 28ms warm).

cold: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20884

warm: https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20885

Existing Query

SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND ("ci_builds"."status" IN ('running')) 
AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;

cold cache(1.140s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20878

warm cache(28.366ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20879

Alternate Query(explored in a past MR):

explain SELECT "ci_builds".* 
FROM "ci_builds" 
WHERE "ci_builds"."type" = 'Ci::Build' 
AND ("ci_builds"."status" IN ('running')) 
AND (created_at < '2021-08-24 17:41:01.993524')
AND (updated_at < '2021-08-24 17:41:01.994189') limit 100;

cold cache(1.086s): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20881

warm cache(25.876ms): https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/6191/commands/20882

How to setup and validate locally (strongly suggested)

  1. Enable the feature flag
    Feature.enable(:ci_new_query_for_running_stuck_jobs)
  2. Add some running builds that were created_at or updated_at over an hour ago.
    earlier_than_timeout = Time.current - 2.hours
    pipeline = FactoryBot.create(:ci_pipeline, project: Project.first)
    build_1 = FactoryBot.create(:ci_build, :running, pipeline: pipeline)
    build_2 = FactoryBot.create(:ci_build, :running, pipeline: pipeline)
    build_1.update_columns(created_at: earlier_than_timeout, updated_at: earlier_than_timeout)
    build_2.update_columns(created_at: earlier_than_timeout, updated_at: earlier_than_timeout)
  3. Run the StuckCiJobsWorker
    StuckCiJobsWorker.new.perform
  4. Ensure the builds were marked as stuck/failed
    Ci::Build.find(build_1.id)
    => #<Ci::Build:0x00007fb8e9c15620
    id: 947,
    status: "failed",
    ...
    failure_reason: "stuck_or_timeout_failure",

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

I just added tests that it behaves the same before and after the feature flag.

Edited by Allison Browne

Merge request reports

Loading