Skip to content

GraphQL: Expose jobs statistics in CiRunnersConnection

What does this MR do and why?

Describe in detail what your merge request does and why.

This MR exposes statistics for queued times of jobs executed by runners in a CiRunnersConnection:

{
  runners(type: INSTANCE_TYPE) {
    count
    jobsStatistics {
      queuedDuration {
        p50
        p75
        p90
        p95
        p99
      }
    }
  }
}

It resorts to the technique described in Efficient IN operator queries to avoid query timeouts against ci_builds.

To avoid a huge MR at the end, I'm breaking up the work into several MRs to make reviews easier for everyone. The work is divided as follows:

step MR
1 GraphQL: Add CI jobs statistics types (!107743 - merged)
2 GraphQL: Use specific connection type for runne... (!107744 - merged)
3 Add :read_jobs_statistics to GlobalPolicy (!107747 - merged)
4 GraphQL: Expose jobs statistics in CiRunnersCon... (!107694 - merged)
5 Do not consider ci_runners ordering for job que... (!107961 - merged)

Part of #377963 (closed)

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

header header
admin on EE image
admin on CE image

How to set up and validate locally

Numbered steps to set up and validate the change are strongly suggested.

  1. Go to the shell in your GDK gitlab directory and run bundle exec rake "gitlab:seed:runner_fleet[root]". This will seed your GDK with some runners and jobs required for testing this MR.

  2. Log into http://gdk.test:3000 as an administrator. Make sure you have an Ultimate license enabled.

  3. Visit http://gdk.test:3000/-/graphql-explorer and enter the following query:

    {
      runners(type: INSTANCE_TYPE) {
        count
        jobsStatistics {
          queuedDuration {
            p50
            p75
            p90
            p95
            p99
          }
        }
      }
    }
  4. Upon running the query, you should see the different percentile values.

  5. On the terminal in your GDK gitlab directory, follow the instructions to Simulate a CE instance with a licensed GDK:

    export FOSS_ONLY=1
    gdk restart rails && gdk restart webpack
  6. Re-run the query. It should note that the statistics object is not available.

Database query plans

SQL query
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY builds.duration) AS p50, PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER
  BY builds.duration) AS p75, PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY builds.duration) AS p90,
  PERCENTILE_DISC(0.95) WITHIN GROUP (ORDER BY builds.duration) AS p95,
  PERCENTILE_DISC(0.99) WITHIN GROUP (ORDER BY builds.duration) AS p99
FROM (
  SELECT ("ci_builds"."started_at" - "ci_builds"."queued_at") AS duration
  FROM (WITH RECURSIVE "array_cte" AS MATERIALIZED (
      SELECT "ci_runners"."id"
      FROM "ci_runners"
      WHERE "ci_runners"."runner_type" = 1), "recursive_keyset_cte" AS ((
          SELECT NULL::ci_builds AS records, array_cte_id_array, ci_builds_runner_id_array,
            ci_builds_id_array, 0::bigint AS count
          FROM (
	    SELECT ARRAY_AGG("array_cte"."id") AS array_cte_id_array, ARRAY_AGG("ci_builds"."runner_id") AS
	      ci_builds_runner_id_array, ARRAY_AGG("ci_builds"."id") AS ci_builds_id_array
          FROM (
            SELECT "array_cte"."id"
            FROM array_cte) array_cte
          LEFT JOIN LATERAL (
            SELECT "ci_builds"."runner_id" AS runner_id, "ci_builds"."id" AS id
            FROM "ci_builds"
            WHERE "ci_builds"."runner_id" = "array_cte"."id"
            ORDER BY "ci_builds"."runner_id" DESC, "ci_builds"."id" DESC
            LIMIT 1) ci_builds ON TRUE
        WHERE "ci_builds"."runner_id" IS NOT NULL
          AND "ci_builds"."id" IS NOT NULL) array_scope_lateral_query
      LIMIT 1)
  UNION ALL (
    SELECT (
        SELECT ci_builds
        FROM "ci_builds"
        WHERE "ci_builds"."id" = recursive_keyset_cte.ci_builds_id_array[position]
	LIMIT 1), array_cte_id_array, recursive_keyset_cte.ci_builds_runner_id_array[:position_query.position - 1] ||
	  next_cursor_values.runner_id || recursive_keyset_cte.ci_builds_runner_id_array[position_query.position + 1:],
	  recursive_keyset_cte.ci_builds_id_array[:position_query.position - 1] || next_cursor_values.id ||
	  recursive_keyset_cte.ci_builds_id_array[position_query.position + 1:],
      recursive_keyset_cte.count + 1
    FROM recursive_keyset_cte, LATERAL (
        SELECT runner_id, id, position
        FROM UNNEST(ci_builds_runner_id_array, ci_builds_id_array)
          WITH ORDINALITY AS u (runner_id, id, position)
        WHERE runner_id IS NOT NULL
          AND id IS NOT NULL
        ORDER BY 1 DESC, 2 DESC
        LIMIT 1) AS position_query, LATERAL (
        SELECT "record"."runner_id", "record"."id"
        FROM (
          VALUES (NULL, NULL)) AS nulls
        LEFT JOIN (
          SELECT "ci_builds"."runner_id" AS runner_id, "ci_builds"."id" AS id
          FROM "ci_builds"
          WHERE "ci_builds"."runner_id" = recursive_keyset_cte.array_cte_id_array[position]
	    AND (("ci_builds"."runner_id", "ci_builds"."id") <
	      (recursive_keyset_cte.ci_builds_runner_id_array[position],
	      recursive_keyset_cte.ci_builds_id_array[position]))
          ORDER BY "ci_builds"."runner_id" DESC, "ci_builds"."id" DESC
          LIMIT 1) record ON TRUE
      LIMIT 1) AS next_cursor_values))
SELECT (records).*
FROM "recursive_keyset_cte" AS "ci_builds"
WHERE (count <> 0)) ci_builds
WHERE "ci_builds"."started_at" IS NOT NULL
LIMIT 100) builds
Execution plan

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/14068/commands/49515

 Aggregate  (cost=627.94..627.95 rows=1 width=80) (actual time=942.373..942.387 rows=1 loops=1)
   Buffers: shared hit=1230 read=477 dirtied=62
   I/O Timings: read=914.891 write=0.000
   ->  Limit  (cost=622.16..625.68 rows=100 width=16) (actual time=538.522..941.438 rows=100 loops=1)
         Buffers: shared hit=1221 read=476 dirtied=62
         I/O Timings: read=914.327 write=0.000
         ->  Subquery Scan on ci_builds  (cost=622.16..625.68 rows=100 width=16) (actual time=538.520..941.399 rows=100 loops=1)
               Buffers: shared hit=1221 read=476 dirtied=62
               I/O Timings: read=914.327 write=0.000
               ->  CTE Scan on recursive_keyset_cte ci_builds_1  (cost=622.16..624.43 rows=100 width=3967) (actual time=538.461..941.212 rows=100 loops=1)
                     Filter: (((ci_builds_1.records).started_at IS NOT NULL) AND (ci_builds_1.count <> 0))
                     Rows Removed by Filter: 1
                     Buffers: shared hit=1221 read=476 dirtied=62
                     I/O Timings: read=914.327 write=0.000
                     CTE array_cte
                       ->  Index Scan using index_ci_runners_on_runner_type on public.ci_runners  (cost=0.43..81.64 rows=96 width=4) (actual time=4.942..136.997 rows=73 loops=1)
                             Index Cond: (ci_runners.runner_type = 1)
                             Buffers: shared hit=5 read=75 dirtied=26
                             I/O Timings: read=134.382 write=0.000
                     CTE recursive_keyset_cte
                       ->  Recursive Union  (cost=75.55..540.52 rows=101 width=136) (actual time=526.379..940.333 rows=101 loops=1)
                             Buffers: shared hit=1221 read=476 dirtied=62
                             I/O Timings: read=914.327 write=0.000
                             ->  Limit  (cost=75.55..75.57 rows=1 width=136) (actual time=526.373..526.378 rows=1 loops=1)
                                   Buffers: shared hit=255 read=262 dirtied=51
                                   I/O Timings: read=518.192 write=0.000
                                   ->  Subquery Scan on array_scope_lateral_query  (cost=75.55..75.57 rows=1 width=136) (actual time=526.372..526.377 rows=1 loops=1)
                                         Buffers: shared hit=255 read=262 dirtied=51
                                         I/O Timings: read=518.192 write=0.000
                                         ->  Aggregate  (cost=75.55..75.56 rows=1 width=96) (actual time=526.371..526.374 rows=1 loops=1)
                                               Buffers: shared hit=255 read=262 dirtied=51
                                               I/O Timings: read=518.192 write=0.000
                                               ->  Nested Loop  (cost=0.70..74.82 rows=96 width=16) (actual time=18.264..525.968 rows=60 loops=1)
                                                     Buffers: shared hit=255 read=262 dirtied=51
                                                     I/O Timings: read=518.192 write=0.000
                                                     ->  CTE Scan on array_cte  (cost=0.00..1.92 rows=96 width=4) (actual time=4.946..137.314 rows=73 loops=1)
                                                           Buffers: shared hit=5 read=75 dirtied=26
                                                           I/O Timings: read=134.382 write=0.000
                                                     ->  Subquery Scan on ci_builds_3  (cost=0.70..0.75 rows=1 width=12) (actual time=5.319..5.319 rows=1 loops=73)
                                                           Filter: ((ci_builds_3.runner_id IS NOT NULL) AND (ci_builds_3.id IS NOT NULL))
                                                           Rows Removed by Filter: 0
                                                           Buffers: shared hit=250 read=187 dirtied=25
                                                           I/O Timings: read=383.811 write=0.000
                                                           ->  Limit  (cost=0.70..0.74 rows=1 width=12) (actual time=5.316..5.316 rows=1 loops=73)
                                                                 Buffers: shared hit=250 read=187 dirtied=25
                                                                 I/O Timings: read=383.811 write=0.000
                                                                 ->  Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_4  (cost=0.70..1566.28 rows=45567 width=12) (actual time=5.313..5.313 rows=1 loops=73)
                                                                       Index Cond: (ci_builds_4.runner_id = array_cte.id)
                                                                       Heap Fetches: 36
                                                                       Buffers: shared hit=250 read=187 dirtied=20
                                                                       I/O Timings: read=383.811 write=0.000
                             ->  Nested Loop  (cost=0.87..46.29 rows=10 width=136) (actual time=4.133..4.135 rows=1 loops=100)
                                   Buffers: shared hit=966 read=214 dirtied=11
                                   I/O Timings: read=396.135 write=0.000
                                   ->  Nested Loop  (cost=0.16..1.98 rows=10 width=112) (actual time=0.066..0.067 rows=1 loops=100)
                                         Buffers: shared hit=6
                                         I/O Timings: read=0.000 write=0.000
                                         ->  WorkTable Scan on recursive_keyset_cte  (cost=0.00..0.20 rows=10 width=104) (actual time=0.000..0.001 rows=1 loops=100)
                                               I/O Timings: read=0.000 write=0.000
                                         ->  Limit  (cost=0.16..0.16 rows=1 width=20) (actual time=0.063..0.063 rows=1 loops=100)
                                               Buffers: shared hit=6
                                               I/O Timings: read=0.000 write=0.000
                                               ->  Sort  (cost=0.16..0.18 rows=10 width=20) (actual time=0.062..0.062 rows=1 loops=100)
                                                     Sort Key: u.runner_id DESC, u.id DESC
                                                     Sort Method: top-N heapsort  Memory: 25kB
                                                     Buffers: shared hit=6
                                                     I/O Timings: read=0.000 write=0.000
                                                     ->  Function Scan on  u  (cost=0.01..0.11 rows=10 width=20) (actual time=0.018..0.035 rows=60 loops=100)
                                                           Filter: ((u.runner_id IS NOT NULL) AND (u.id IS NOT NULL))
                                                           Rows Removed by Filter: 0
                                                           I/O Timings: read=0.000 write=0.000
                                   ->  Limit  (cost=0.71..0.78 rows=1 width=12) (actual time=2.539..2.539 rows=1 loops=100)
                                         Buffers: shared hit=559 read=113 dirtied=9
                                         I/O Timings: read=248.235 write=0.000
                                         ->  Nested Loop Left Join  (cost=0.71..0.78 rows=1 width=12) (actual time=2.538..2.538 rows=1 loops=100)
                                               Buffers: shared hit=559 read=113 dirtied=9
                                               I/O Timings: read=248.235 write=0.000
                                               ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.000 rows=1 loops=100)
                                                     I/O Timings: read=0.000 write=0.000
                                               ->  Limit  (cost=0.71..0.75 rows=1 width=12) (actual time=2.536..2.536 rows=1 loops=100)
                                                     Buffers: shared hit=559 read=113 dirtied=9
                                                     I/O Timings: read=248.235 write=0.000
                                                     ->  Index Only Scan using index_ci_builds_on_runner_id_and_id_desc on public.ci_builds ci_builds_5  (cost=0.71..560.47 rows=15189 width=12) (actual time=2.533..2.533 rows=1 loops=100)
                                                           Index Cond: ((ROW(ci_builds_5.runner_id, ci_builds_5.id) < ROW((recursive_keyset_cte.ci_builds_runner_id_array)[u."position"], (recursive_keyset_cte.ci_builds_id_array)[u."position"])) AND (ci_builds_5.runner_id = (recursive_keyset_cte.array_cte_id_array)[u."position"]))
                                                           Heap Fetches: 255
                                                           Buffers: shared hit=559 read=113 dirtied=7
                                                           I/O Timings: read=248.235 write=0.000
                                   SubPlan 2
                                     ->  Limit  (cost=0.58..3.60 rows=1 width=1272) (actual time=1.510..1.510 rows=1 loops=100)
                                           Buffers: shared hit=401 read=101 dirtied=2
                                           I/O Timings: read=147.900 write=0.000
                                           ->  Index Scan using index_ci_builds_on_id_partition_id_unique on public.ci_builds ci_builds_2  (cost=0.58..3.60 rows=1 width=1272) (actual time=1.506..1.506 rows=1 loops=100)
                                                 Index Cond: (ci_builds_2.id = (recursive_keyset_cte.ci_builds_id_array)[u."position"])
                                                 Buffers: shared hit=399 read=101
                                                 I/O Timings: read=147.900 write=0.000

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Pedro Pombeiro

Merge request reports