
New cycle analytics query backend

Adam Hegyi requested to merge new-cycle-analytics-query-backend into master

What does this MR do?

What is Cycle Analytics

  • TL;DR: a Stage finds all Issue or MergeRequest records matching a date range query defined by a start_event and an end_event.
  • Calculate the duration (end_event_time - start_event_time); see the sketch after this list.
  • Extract the median duration.
  • Extract the list of records relevant to the date range.
  • In EE, cycle analytics stages will be customizable. In CE we only provide the default 7 stages.
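
Purely as a conceptual sketch (in the real backend all of this happens in a single SQL query, see the Query Plans section below), the duration and median calculation boils down to:

require 'time'

# Hypothetical in-memory illustration; real records come from the database.
records = [
  { start_event_time: Time.parse('2019-06-01 10:00'), end_event_time: Time.parse('2019-06-02 10:00') },
  { start_event_time: Time.parse('2019-06-03 09:00'), end_event_time: Time.parse('2019-06-03 21:00') },
  { start_event_time: Time.parse('2019-06-05 08:00'), end_event_time: Time.parse('2019-06-08 08:00') }
]

# duration = end_event_time - start_event_time (in seconds)
durations = records.map { |r| r[:end_event_time] - r[:start_event_time] }.sort

# median of the durations
middle = durations.size / 2
median = durations.size.odd? ? durations[middle] : (durations[middle - 1] + durations[middle]) / 2.0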

High Level Overview

  • DataCollector is the high-level interface for the feature.
  • BaseQueryBuilder is responsible for providing the base query, joining only the absolutely necessary tables and doing high-level filtering.
  • An Event (start, end) can alter the query (join additional tables when needed, apply_query_customization). It also defines a timestamp expression that is used for the duration calculation (see the sketch after the diagram).
  • Median and Records use the base query provided by the DataCollector and do additional query manipulation.

Stages are defined for a Group (EE) or for a Project (CE):

  +---------------------+
  | Stages              |
  |   +-------------+   |
  |   |   Stage A   |   |                                 +--------+
  |   |             |   |      +---------------+   +----> | Median |
  |   +-------------+   |      |               |   |      +--------+
  |   | Start Event | +------> | DataCollector | +-+
  |   +-------------+   |      |               |   |      +---------+
  |   | End Event   |   |      +---------------+   +----> | Records |
  |   +-------------+   |                                 +---------+
  |                     |
  |   +-------------+   |
  |   |   Stage B   |   |
  |   |             |   |
  |   +-------------+   |
  |   | Start Event |   |
  |   +-------------+   |
  |   | End Event   |   |
  |   +-------------+   |
  |                     |
  |   ...               |
  +---------------------+
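
A heavily simplified sketch of how these pieces fit together; the class layout and the timestamp_projection name are illustrative (only apply_query_customization is taken from the description above):

# Simplified, standalone version of the event classes
# (the real ones live under Gitlab::CycleAnalytics::StageEvents).
class MergeRequestCreated                # start event
  def timestamp_projection               # hypothetical name for the "timestamp expression"
    MergeRequest.arel_table[:created_at]
  end

  def apply_query_customization(query)
    query                                # no extra join needed for this event
  end
end

class MergeRequestMerged                 # end event
  def timestamp_projection
    MergeRequest::Metrics.arel_table[:merged_at]
  end

  def apply_query_customization(query)
    query.joins(:metrics)                # this event needs the merge_request_metrics table
  end
end

# The DataCollector takes the base query from BaseQueryBuilder, lets both
# events customize it, then hands the result to Median / RecordsFetcher.
# The duration is the SQL expression:
#   end_event.timestamp_projection - start_event.timestamp_projection
query = end_event.apply_query_customization(
  start_event.apply_query_customization(base_query)
)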

Database Related Changes (compared to the old CA feature, lib/gitlab/cycle_analytics)

  • Using ActiveRecord instead of Arel for building the query
  • Using a percentile function (percentile_cont in the query below) for the median calculation; see the sketch after this list
  • Not using CTEs
  • Avoiding unnecessary table joins (tables are joined only when the Stage actually requires them)
  • Loading records with one query (the IssuableFinder query is merged in), which makes it easier to implement pagination later
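
A minimal ActiveRecord sketch of the percentile-based median (not the exact production code; that lives in lib/gitlab/analytics/cycle_analytics/median.rb):

# Assumes the Issue -> metrics association and the project id from the
# console example below; the duration expression matches the Issue stage.
duration_expression =
  'COALESCE(issue_metrics.first_associated_with_milestone_at, issue_metrics.first_added_to_board_at) - issues.created_at'

median_in_seconds = Issue
  .joins(:metrics)
  .where(project_id: 19)
  .where('issues.created_at >= ?', 30.days.ago)
  .pluck(Arel.sql("EXTRACT(EPOCH FROM percentile_cont(0.5) WITHIN GROUP (ORDER BY #{duration_expression}))"))
  .first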

Try it in the Console

Review stage definition:

  • Start: merge request created (merge_requests.created_at column)
  • End: merge request merged (merge_request_metrics.merged_at column)

stage = Analytics::CycleAnalytics::ProjectStage.new(
  start_event_identifier: :merge_request_created,
  end_event_identifier: :merge_request_merged,
  project_id: 19
)

# each supported event is represented as a Class
# stage.start_event.class => Gitlab::CycleAnalytics::StageEvents::MergeRequestCreated
# stage.end_event.class => Gitlab::CycleAnalytics::StageEvents::MergeRequestMerged

dc = Gitlab::Analytics::CycleAnalytics::DataCollector.new(stage: stage, params: { from: 30.days.ago, current_user: User.first })
puts dc.median.seconds
puts dc.records_fetcher.serialized_records

Query Plans

In CE we have only 7 stages; each of them fires a different query (the queries are quite similar).

Issue Stage (Median query)

This is one of the most complex ones because it involves multiple timestamp columns (COALESCE).

Old CA Backend (to be replaced with the new backend)

WITH "cte_table_for_issue" AS
  (SELECT "issues"."project_id" AS project_id,
          "projects"."path" AS project_path,
          "routes"."path" AS namespace_path,
          COALESCE("issue_metrics"."first_associated_with_milestone_at", "issue_metrics"."first_added_to_board_at")-COALESCE("issues"."created_at") AS issue
   FROM "issues"
   INNER JOIN "issue_metrics" ON "issues"."id" = "issue_metrics"."issue_id"
   INNER JOIN "projects" ON "issues"."project_id" = "projects"."id"
   INNER JOIN "routes" ON "projects"."namespace_id" = "routes"."source_id"
   WHERE "issues"."project_id" IN (13083)
     AND "routes"."source_type" = 'Namespace'
     AND "issues"."created_at" >= '2019-06-01 10:37:29.419650'
     AND ("issue_metrics"."first_added_to_board_at" IS NOT NULL
          OR "issue_metrics"."first_associated_with_milestone_at" IS NOT NULL)),
     "ordered_records" AS
  (SELECT "cte_table_for_issue"."issue" AS issue,
          row_number() OVER (
                             ORDER BY "cte_table_for_issue"."issue") AS row_id,

     (SELECT COUNT(1)
      FROM "cte_table_for_issue"
      WHERE "cte_table_for_issue"."issue" >= CAST('0' AS INTERVAL)) ct
   FROM "cte_table_for_issue"
   WHERE "cte_table_for_issue"."issue" >= CAST('0' AS INTERVAL))
SELECT AVG(EXTRACT(EPOCH
                   FROM "ordered_records"."issue")) AS median
FROM "ordered_records"
WHERE "ordered_records"."row_id" BETWEEN "ordered_records"."ct" / 2.0 AND ("ordered_records"."ct" / 2.0 + 1)

Plan:

depesz

Aggregate  (cost=11566.78..11566.79 rows=1 width=8) (actual time=1255.783..1255.783 rows=1 loops=1)
  Buffers: shared hit=13943 read=5796
  I/O Timings: read=1111.193
  CTE cte_table_for_issue
    ->  Nested Loop  (cost=1.99..11446.08 rows=1491 width=54) (actual time=2.165..1240.749 rows=3420 loops=1)
          Buffers: shared hit=13938 read=5796
          I/O Timings: read=1111.193
          ->  Nested Loop  (cost=1.56..548.48 rows=3283 width=50) (actual time=1.270..24.483 rows=3909 loops=1)
                Buffers: shared hit=4046 read=1
                I/O Timings: read=0.725
                ->  Nested Loop  (cost=1.00..7.04 rows=1 width=38) (actual time=0.204..0.208 rows=1 loops=1)
                      Buffers: shared hit=9
                      ->  Index Scan using projects_pkey on projects  (cost=0.43..3.45 rows=1 width=21) (actual time=0.120..0.123 rows=1 loops=1)
                            Index Cond: (id = 13083)
                            Buffers: shared hit=4
                      ->  Index Scan using index_routes_on_source_type_and_source_id on routes  (cost=0.56..3.58 rows=1 width=25) (actual time=0.083..0.085 rows=1 loops=1)
                            Index Cond: (((source_type)::text = 'Namespace'::text) AND (source_id = projects.namespace_id))
                            Buffers: shared hit=5
                ->  Index Only Scan using index_issues_on_project_id_and_created_at_and_id_and_state on issues  (cost=0.56..475.71 rows=6573 width=16) (actual time=1.065..23.366 rows=3909 loops=1)
                      Index Cond: ((project_id = 13083) AND (created_at >= '2019-06-01 10:37:29.41965+00'::timestamp with time zone))
                      Heap Fetches: 857
                      Buffers: shared hit=4037 read=1
                      I/O Timings: read=0.725
          ->  Index Scan using index_issue_metrics on issue_metrics  (cost=0.43..3.31 rows=1 width=20) (actual time=0.305..0.309 rows=1 loops=3909)
                Index Cond: (issue_id = issues.id)
                Filter: ((first_added_to_board_at IS NOT NULL) OR (first_associated_with_milestone_at IS NOT NULL))
                Rows Removed by Filter: 0
                Buffers: shared hit=9892 read=5795
                I/O Timings: read=1110.468
  CTE ordered_records
    ->  WindowAgg  (cost=90.61..99.30 rows=497 width=32) (actual time=1249.088..1251.129 rows=3420 loops=1)
          Buffers: shared hit=13943 read=5796
          I/O Timings: read=1111.193
          InitPlan 2 (returns $3)
            ->  Aggregate  (cost=34.79..34.80 rows=1 width=8) (actual time=1.072..1.072 rows=1 loops=1)
                  ->  CTE Scan on cte_table_for_issue  (cost=0.00..33.55 rows=497 width=0) (actual time=0.013..0.678 rows=3420 loops=1)
                        Filter: (issue >= '00:00:00'::interval)
          ->  Sort  (cost=55.81..57.05 rows=497 width=16) (actual time=1248.005..1248.754 rows=3420 loops=1)
                Sort Key: cte_table_for_issue_1.issue
                Sort Method: quicksort  Memory: 257kB
                Buffers: shared hit=13943 read=5796
                I/O Timings: read=1111.193
                ->  CTE Scan on cte_table_for_issue cte_table_for_issue_1  (cost=0.00..33.55 rows=497 width=16) (actual time=2.171..1245.282 rows=3420 loops=1)
                      Filter: (issue >= '00:00:00'::interval)
                      Buffers: shared hit=13938 read=5796
                      I/O Timings: read=1111.193
  ->  CTE Scan on ordered_records  (cost=0.00..21.12 rows=55 width=16) (actual time=1251.665..1255.730 rows=2 loops=1)
        Filter: (((row_id)::numeric >= ((ct)::numeric / 2.0)) AND ((row_id)::numeric <= (((ct)::numeric / 2.0) + '1'::numeric)))
        Rows Removed by Filter: 3418
        Buffers: shared hit=13943 read=5796
        I/O Timings: read=1111.193
Planning time: 15.523 ms
Execution time: 1256.194 ms

New

Relevant file: lib/gitlab/analytics/cycle_analytics/median.rb

SELECT EXTRACT(EPOCH
               FROM percentile_cont(0.5) WITHIN GROUP(
                                                      ORDER BY COALESCE("issue_metrics"."first_associated_with_milestone_at", "issue_metrics"."first_added_to_board_at") - "issues"."created_at")) AS median
FROM "issues"
INNER JOIN "issue_metrics" ON "issue_metrics"."issue_id" = "issues"."id"
WHERE "issues"."project_id" = 13083
  AND "issues"."created_at" >= '2019-06-01 10:37:29.419650'
  AND COALESCE("issue_metrics"."first_associated_with_milestone_at", "issue_metrics"."first_added_to_board_at") - "issues"."created_at" > CAST('0' AS INTERVAL);

Plan:

depesz

Aggregate  (cost=22312.96..22312.97 rows=1 width=8) (actual time=25.479..25.479 rows=1 loops=1)
  Buffers: shared hit=19731
  ->  Nested Loop  (cost=1.00..22312.08 rows=174 width=24) (actual time=0.089..25.193 rows=354 loops=1)
        Buffers: shared hit=19731
        ->  Index Only Scan using index_issues_on_project_id_and_created_at_and_id_and_state on issues  (cost=0.56..475.71 rows=6573 width=4) (actual time=0.025..3.435 rows=3909 loops=1)
              Index Cond: ((project_id = 13083) AND (created_at >= '2019-06-01 10:37:29.41965+00'::timestamp with time zone))
              Heap Fetches: 864
              Buffers: shared hit=4044
        ->  Index Scan using index_issue_metrics on issue_metrics  (cost=0.43..3.31 rows=1 width=28) (actual time=0.005..0.005 rows=0 loops=3909)
              Index Cond: (issue_id = issues.id)
              Filter: (((first_added_to_board_at IS NOT NULL) OR (first_associated_with_milestone_at IS NOT NULL)) AND (first_mentioned_in_commit_at IS NOT NULL) AND ((first_mentioned_in_commit_at - COALESCE(first_associated_with_milestone_at, first_added_to_board_at)) > '00:00:00'::interval))
              Rows Removed by Filter: 1
              Buffers: shared hit=15687
Planning time: 0.430 ms
Execution time: 25.559 ms

Issue Stage (Records query)

Note: the query loads only the records that the given user (in this example ahegyi) is allowed to see. We "merge" the IssuesFinder or MergeRequestsFinder query; most of the complexity comes from there.
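
Conceptually, the permission filtering comes from merging the finder's relation into the stage's base query. A hedged sketch (the exact call sites in records_fetcher.rb differ):

# `stage`, `base_query` and `current_user` are provided by the DataCollector context.
finder = IssuesFinder.new(current_user, project_id: stage.project_id)

records = base_query
  .merge(finder.execute) # pulls in the confidentiality / authorization conditions
  .order(Arel.sql('COALESCE(issue_metrics.first_associated_with_milestone_at, issue_metrics.first_added_to_board_at) DESC'))
  .limit(20)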

Relevant file: lib/gitlab/analytics/cycle_analytics/records_fetcher.rb

 SELECT     "issues"."title",
           "issues"."iid",
           "issues"."id",
           "issues"."created_at",
           "issues"."author_id",
           "issues"."project_id",
           Extract(epoch FROM COALESCE("issue_metrics"."first_associated_with_milestone_at", "issue_metrics"."first_added_to_board_at") - "issues"."created_at") AS total_time
FROM       "issues"
INNER JOIN "issue_metrics"
ON         "issue_metrics"."issue_id" = "issues"."id"
WHERE      (
                      issues.confidential IS NOT true
           OR         (
                                 issues.confidential = true
                      AND        (
                                            issues.author_id = 4156052
                                 OR         EXISTS
                                            (
                                                   SELECT true
                                                   FROM   issue_assignees
                                                   WHERE  user_id = 4156052
                                                   AND    issue_id = issues.id)
                                 OR         EXISTS
                                            (
                                                   SELECT 1
                                                   FROM   "project_authorizations"
                                                   WHERE  "project_authorizations"."user_id" = 4156052
                                                   AND    (
                                                                 project_authorizations.project_id = issues.project_id)
                                                   AND    (
                                                                 project_authorizations.access_level >= 20)))))
AND        "issues"."project_id" = 13083
AND        "issues"."created_at" >= '2019-06-01 10:37:29.419650'
AND        COALESCE("issue_metrics"."first_associated_with_milestone_at", "issue_metrics"."first_added_to_board_at") - "issues"."created_at" > cast('0' AS interval)
ORDER BY   COALESCE("issue_metrics"."first_associated_with_milestone_at", "issue_metrics"."first_added_to_board_at") DESC limit 20 

Plan:

depesz

Limit  (cost=71149.63..71149.68 rows=20 width=85) (actual time=2705.348..2705.355 rows=20 loops=1)
  Buffers: shared hit=14542 read=5924
  I/O Timings: read=2520.487
  ->  Sort  (cost=71149.63..71153.87 rows=1695 width=85) (actual time=2705.346..2705.349 rows=20 loops=1)
        Sort Key: (COALESCE(issue_metrics.first_associated_with_milestone_at, issue_metrics.first_added_to_board_at)) DESC
        Sort Method: top-N heapsort  Memory: 28kB
        Buffers: shared hit=14542 read=5924
        I/O Timings: read=2520.487
        ->  Nested Loop  (cost=1.00..71104.53 rows=1695 width=85) (actual time=1.784..2700.890 rows=3453 loops=1)
              Buffers: shared hit=14539 read=5924
              I/O Timings: read=2520.487
              ->  Index Scan using index_issues_on_project_id_and_created_at_and_id_and_state on issues  (cost=0.56..51149.07 rows=5975 width=69) (actual time=0.203..63.344 rows=3944 loops=1)
                    Index Cond: ((project_id = 13083) AND (created_at >= '2019-06-01 10:37:29.41965+00'::timestamp with time zone))
                    Filter: ((confidential IS NOT TRUE) OR (confidential AND ((author_id = 4156052) OR (alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4))))
                    Buffers: shared hit=4622 read=13
                    I/O Timings: read=5.213
                    SubPlan 1
                      ->  Index Only Scan using index_issue_assignees_on_issue_id_and_user_id on issue_assignees  (cost=0.43..3.45 rows=1 width=0) (never executed)
                            Index Cond: ((issue_id = issues.id) AND (user_id = 4156052))
                            Heap Fetches: 0
                    SubPlan 2
                      ->  Index Scan using index_issue_assignees_on_user_id on issue_assignees issue_assignees_1  (cost=0.43..66.79 rows=62 width=4) (actual time=0.128..0.696 rows=22 loops=1)
                            Index Cond: (user_id = 4156052)
                            Buffers: shared hit=25
                    SubPlan 3
                      ->  Index Only Scan using index_project_authorizations_on_user_id_project_id_access_level on project_authorizations  (cost=0.56..3.59 rows=1 width=0) (never executed)
                            Index Cond: ((user_id = 4156052) AND (project_id = issues.project_id) AND (access_level >= 20))
                            Heap Fetches: 0
                    SubPlan 4
                      ->  Index Only Scan using index_project_authorizations_on_user_id_project_id_access_level on project_authorizations project_authorizations_1  (cost=0.56..863.51 rows=2744 width=4) (actual time=0.138..10.496 rows=1662 loops=1)
                            Index Cond: ((user_id = 4156052) AND (access_level >= 20))
                            Heap Fetches: 561
                            Buffers: shared hit=591 read=12
                            I/O Timings: read=5.201
              ->  Index Scan using index_issue_metrics on issue_metrics  (cost=0.43..3.33 rows=1 width=20) (actual time=0.662..0.666 rows=1 loops=3944)
                    Index Cond: (issue_id = issues.id)
                    Filter: (((COALESCE(first_associated_with_milestone_at, first_added_to_board_at))::timestamp with time zone - issues.created_at) > '00:00:00'::interval)
                    Rows Removed by Filter: 0
                    Buffers: shared hit=9917 read=5911
                    I/O Timings: read=2515.274
Planning time: 9.114 ms
Execution time: 2705.529 ms

Does this MR meet the acceptance criteria?

Conformity

Performance and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team