Collect Dora Daily Metrics [RUN ALL RSPEC] [RUN AS-IF-FOSS]

Shinya Maeda requested to merge dora-metrics-modeling into master

What does this MR do?

This MR introduces a model to calculate and persist DORA key metrics at a daily interval. When a deployment finishes, the metrics are updated for its environment.

The main reason to persist pre-aggregated data is performance and scalability. The current DORA metrics APIs scale poorly because they perform intensive aggregation in Ruby backed by expensive database queries. This sometimes hits Puma's 60-second timeout and harms the SLA. The pre-aggregation approach should resolve these concerns and let users query the DORA metrics in a reasonable time (e.g. <100 msec).

The metrics refresh process (Dora::DailyMetrics.refresh!) is idempotent by design, meaning it produces the same result every time it runs. This makes it safe and easy to reuse for multiple purposes. Thanks to this characteristic, the refresh worker (::Dora::DailyMetrics::RefreshWorker) can apply a deduplication strategy so that the refresh process executes only when necessary.
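The idempotency described above can be sketched in plain Ruby. This is a hypothetical stand-in for the real Dora::DailyMetrics.refresh! (which issues the single INSERT ... ON CONFLICT shown below): an in-memory hash keyed by (environment_id, date) plays the role of the table's unique index, and the point is only that re-running the refresh for the same key always converges to the same row.

```ruby
require 'date'

# Illustrative sketch only: an in-memory store mimicking the
# dora_daily_metrics table and its (environment_id, date) unique index.
class DailyMetricsStore
  def initialize
    @rows = {} # keyed by [environment_id, date]
  end

  # Recomputes the metric from the source deployments and upserts the row.
  # Running this twice with the same inputs produces the same row, which is
  # what makes deduplicating the refresh worker safe.
  def refresh!(environment_id, date, deployments)
    finished = deployments.select do |d|
      d[:environment_id] == environment_id &&
        d[:status] == :success &&
        d[:finished_at] == date
    end

    @rows[[environment_id, date]] = { deployment_frequency: finished.size }
  end

  def fetch(environment_id, date)
    @rows[[environment_id, date]]
  end
end
```

Because the result depends only on the source data, any number of queued refresh jobs for the same (environment, date) pair can be collapsed into one.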

The collected metrics will be visualized in graphs in the future. For example, you can see how deployment frequency is visualized today in the documentation.

With the worker, we can also backfill daily metrics for past dates. This is not a concern at the moment because the graph visualization's date range is "last 90 days" at most, so we can simply roll out this change and, after 3 months, the collected data will cover the full range.
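A backfill along these lines could be sketched as follows. The helper name is illustrative (not part of this MR); each produced date would be handed to the refresh worker, one job per environment per day.

```ruby
require 'date'

# Hypothetical helper: the dates the graph can show are at most the last
# 90 days, so enumerating them bounds the backfill work.
def backfill_dates(today, days: 90)
  (1..days).map { |n| today - n }
end

# Each date would then be enqueued for the refresh worker, e.g. (sketch):
# backfill_dates(Date.today).each do |date|
#   Dora::DailyMetrics::RefreshWorker.perform_async(environment_id, date.iso8601)
# end
```

Because the refresh is idempotent, overlapping backfill and live refresh jobs for the same date are harmless.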

You can see the full PoC in this MR.

This feature is behind dora_daily_metrics feature flag, which is disabled by default.

Related #291746 (closed)

Database query performance

Tested on busy environments https://gitlab.com/gitlab-com/www-gitlab-com/-/environments/137 and https://gitlab.com/gitlab-org/gitlab/-/environments/1178942.

Dora::DailyMetrics.refresh!(environment, date) (The full query for updating the metrics)

INSERT INTO
    dora_daily_metrics (
        environment_id,
        date,
        deployment_frequency,
        lead_time_for_changes
    )
VALUES
    (
        137,
        '2021-03-02',
        (
            SELECT
                COUNT("deployments"."id")
            FROM
                "deployments"
            WHERE
                "deployments"."environment_id" = 137
                AND "deployments"."finished_at" >= '2021-03-02 00:00:00'
                AND "deployments"."finished_at" <= '2021-03-02 23:59:59.999999'
                AND "deployments"."status" = 2
        ),
        (
            SELECT
                PERCENTILE_CONT(0.5) WITHIN GROUP(
                    ORDER BY
                        EXTRACT(
                            EPOCH
                            FROM
                                (
                                    deployments.finished_at - merge_request_metrics.merged_at
                                )
                        )
                )
            FROM
                "deployments"
                INNER JOIN "deployment_merge_requests" ON "deployment_merge_requests"."deployment_id" = "deployments"."id"
                INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "deployment_merge_requests"."merge_request_id"
            WHERE
                "deployments"."environment_id" = 137
                AND "deployments"."finished_at" >= '2021-03-02 00:00:00'
                AND "deployments"."finished_at" <= '2021-03-02 23:59:59.999999'
                AND "deployments"."status" = 2
        )
    ) ON CONFLICT (environment_id, date) DO
UPDATE
SET
    deployment_frequency = (
        SELECT
            COUNT("deployments"."id")
        FROM
            "deployments"
        WHERE
            "deployments"."environment_id" = 137
            AND "deployments"."finished_at" >= '2021-03-02 00:00:00'
            AND "deployments"."finished_at" <= '2021-03-02 23:59:59.999999'
            AND "deployments"."status" = 2
    ),
    lead_time_for_changes = (
        SELECT
            PERCENTILE_CONT(0.5) WITHIN GROUP(
                ORDER BY
                    EXTRACT(
                        EPOCH
                        FROM
                            (
                                deployments.finished_at - merge_request_metrics.merged_at
                            )
                    )
            )
        FROM
            "deployments"
            INNER JOIN "deployment_merge_requests" ON "deployment_merge_requests"."deployment_id" = "deployments"."id"
            INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "deployment_merge_requests"."merge_request_id"
        WHERE
            "deployments"."environment_id" = 137
            AND "deployments"."finished_at" >= '2021-03-02 00:00:00'
            AND "deployments"."finished_at" <= '2021-03-02 23:59:59.999999'
            AND "deployments"."status" = 2
    )

Timing

Time: 88.035 ms
  - planning: 8.223 ms
  - execution: 79.812 ms
    - I/O read: 0.123 ms
    - I/O write: N/A

Shared buffers:
  - hits: 7906 (~61.80 MiB) from the buffer pool
  - reads: 5 (~40.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 5 (~40.00 KiB)
  - writes: 0

Plan

 ModifyTable on public.dora_daily_metrics  (cost=1825.76..1825.78 rows=1 width=28) (actual time=39.511..39.518 rows=0 loops=1)
   Buffers: shared hit=7906 read=5 dirtied=5
   I/O Timings: read=0.123
   InitPlan 1 (returns $0)
     ->  Aggregate  (cost=129.57..129.58 rows=1 width=8) (actual time=21.123..21.124 rows=1 loops=1)
           Buffers: shared hit=1152
           ->  Index Scan using succeeded_deployments_environment_finished_at on public.deployments  (cost=0.57..129.29 rows=114 width=4) (actual time=0.522..20.745 rows=1625 loops=1)
                 Index Cond: ((deployments.environment_id = 137) AND (deployments.status = 2) AND (deployments.finished_at >= '2021-03-02 00:00:00+00'::timestamp with time zone) AND (deployments.finished_at <= '2021-03-02 23:59:59.999999+00'::timestamp with time zone))
                 Buffers: shared hit=1152
   InitPlan 2 (returns $3)
     ->  Aggregate  (cost=783.29..783.30 rows=1 width=8) (actual time=15.285..15.288 rows=1 loops=1)
           Buffers: shared hit=6735
           ->  Nested Loop  (cost=1.57..783.11 rows=18 width=16) (actual time=0.439..15.066 rows=118 loops=1)
                 Buffers: shared hit=6726
                 ->  Nested Loop  (cost=1.00..771.07 rows=18 width=12) (actual time=0.337..9.688 rows=118 loops=1)
                       Buffers: shared hit=6136
                       ->  Index Scan using succeeded_deployments_environment_finished_at on public.deployments deployments_1  (cost=0.57..129.29 rows=114 width=12) (actual time=0.046..1.366 rows=1625 loops=1)
                             Index Cond: ((deployments_1.environment_id = 137) AND (deployments_1.status = 2) AND (deployments_1.finished_at >= '2021-03-02 00:00:00+00'::timestamp with time zone) AND (deployments_1.finished_at <= '2021-03-02 23:59:59.999999+00'::timestamp with time zone))
                             Buffers: shared hit=1149
                       ->  Index Only Scan using deployment_merge_requests_pkey on public.deployment_merge_requests  (cost=0.44..5.17 rows=46 width=8) (actual time=0.005..0.005 rows=0 loops=1625)
                             Index Cond: (deployment_merge_requests.deployment_id = deployments_1.id)
                             Heap Fetches: 103
                             Buffers: shared hit=4987
                 ->  Index Scan using unique_merge_request_metrics_by_merge_request_id on public.merge_request_metrics  (cost=0.57..0.67 rows=1 width=12) (actual time=0.045..0.045 rows=1 loops=118)
                       Index Cond: (merge_request_metrics.merge_request_id = deployment_merge_requests.merge_request_id)
                       Buffers: shared hit=590
   InitPlan 3 (returns $4)
     ->  Aggregate  (cost=129.57..129.58 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=0)
           ->  Index Scan using succeeded_deployments_environment_finished_at on public.deployments deployments_2  (cost=0.57..129.29 rows=114 width=4) (actual time=0.000..0.000 rows=0 loops=0)
                 Index Cond: ((deployments_2.environment_id = 137) AND (deployments_2.status = 2) AND (deployments_2.finished_at >= '2021-03-02 00:00:00+00'::timestamp with time zone) AND (deployments_2.finished_at <= '2021-03-02 23:59:59.999999+00'::timestamp with time zone))
   InitPlan 4 (returns $7)
     ->  Aggregate  (cost=783.29..783.30 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=0)
           ->  Nested Loop  (cost=1.57..783.11 rows=18 width=16) (actual time=0.000..0.000 rows=0 loops=0)
                 ->  Nested Loop  (cost=1.00..771.07 rows=18 width=12) (actual time=0.000..0.000 rows=0 loops=0)
                       ->  Index Scan using succeeded_deployments_environment_finished_at on public.deployments deployments_3  (cost=0.57..129.29 rows=114 width=12) (actual time=0.000..0.000 rows=0 loops=0)
                             Index Cond: ((deployments_3.environment_id = 137) AND (deployments_3.status = 2) AND (deployments_3.finished_at >= '2021-03-02 00:00:00+00'::timestamp with time zone) AND (deployments_3.finished_at <= '2021-03-02 23:59:59.999999+00'::timestamp with time zone))
                       ->  Index Only Scan using deployment_merge_requests_pkey on public.deployment_merge_requests deployment_merge_requests_1  (cost=0.44..5.17 rows=46 width=8) (actual time=0.000..0.000 rows=0 loops=0)
                             Index Cond: (deployment_merge_requests_1.deployment_id = deployments_3.id)
                             Heap Fetches: 0
                 ->  Index Scan using unique_merge_request_metrics_by_merge_request_id on public.merge_request_metrics merge_request_metrics_1  (cost=0.57..0.67 rows=1 width=12) (actual time=0.000..0.000 rows=0 loops=0)
                       Index Cond: (merge_request_metrics_1.merge_request_id = deployment_merge_requests_1.merge_request_id)
   ->  Result  (cost=0.00..0.02 rows=1 width=28) (actual time=36.555..36.556 rows=1 loops=1)
         Buffers: shared hit=7901

Additional index for query optimization

CREATE INDEX succeeded_deployments_environment_finished_at ON deployments (environment_id, status, finished_at);

Timing

% time      seconds wait_event
------ ------------ -----------------------------
59.55    409.090590 IO.DataFileRead
19.47    133.760290 Running
7.21      49.503546 IO.DataFileExtend
7.14      49.042061 IO.BufFileWrite
3.30      22.703937 LWLock.WALWriteLock
2.22      15.258762 IO.BufFileRead
0.41       2.823533 IPC.ParallelCreateIndexScan
0.39       2.681513 IO.WALWrite
0.21       1.422950 IO.DataFileWrite
0.06       0.437876 LWLock.WALBufMappingLock
0.03       0.239136 IO.SLRURead
------ ------------ -----------------------------
100.00   686.964194

Individual query plans

Dora::DailyMetrics.deployment_frequency(environment, date) (The query for deployment frequency)

SELECT
    COUNT("deployments"."id")
FROM
    "deployments"
WHERE
    "deployments"."environment_id" = 137
    AND "deployments"."finished_at" >= '2021-03-02 00:00:00'
    AND "deployments"."finished_at" <= '2021-03-02 23:59:59.999999'
    AND "deployments"."status" = 2
Time: 6.772 ms
  - planning: 2.273 ms
  - execution: 4.499 ms
    - I/O read: 0.665 ms
    - I/O write: N/A

Shared buffers:
  - hits: 1142 (~8.90 MiB) from the buffer pool
  - reads: 10 (~80.00 KiB) from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
 Aggregate  (cost=129.57..129.58 rows=1 width=8) (actual time=4.438..4.439 rows=1 loops=1)
   Buffers: shared hit=1142 read=10
   I/O Timings: read=0.665
   ->  Index Scan using succeeded_deployments_environment_finished_at on public.deployments  (cost=0.57..129.29 rows=114 width=4) (actual time=0.604..4.204 rows=1625 loops=1)
         Index Cond: ((deployments.environment_id = 137) AND (deployments.status = 2) AND (deployments.finished_at >= '2021-03-02 00:00:00+00'::timestamp with time zone) AND (deployments.finished_at <= '2021-03-02 23:59:59.999999+00'::timestamp with time zone))
         Buffers: shared hit=1142 read=10
         I/O Timings: read=0.665

Dora::DailyMetrics.lead_time_for_changes(environment, date) (The query for lead time for changes)

SELECT
    PERCENTILE_CONT(0.5) WITHIN GROUP(
        ORDER BY
            EXTRACT(
                EPOCH
                FROM
                    (
                        deployments.finished_at - merge_request_metrics.merged_at
                    )
            )
    )
FROM
    "deployments"
    INNER JOIN "deployment_merge_requests" ON "deployment_merge_requests"."deployment_id" = "deployments"."id"
    INNER JOIN "merge_request_metrics" ON "merge_request_metrics"."merge_request_id" = "deployment_merge_requests"."merge_request_id"
WHERE
    "deployments"."environment_id" = 137
    AND "deployments"."finished_at" >= '2021-03-02 00:00:00'
    AND "deployments"."finished_at" <= '2021-03-02 23:59:59.999999'
    AND "deployments"."status" = 2
Time: 9.515 ms
  - planning: 1.270 ms
  - execution: 8.245 ms
    - I/O read: N/A
    - I/O write: N/A

Shared buffers:
  - hits: 6738 (~52.60 MiB) from the buffer pool
  - reads: 0 from the OS file cache, including disk I/O
  - dirtied: 0
  - writes: 0
 Aggregate  (cost=783.29..783.30 rows=1 width=8) (actual time=8.159..8.160 rows=1 loops=1)
   Buffers: shared hit=6738
   ->  Nested Loop  (cost=1.57..783.11 rows=18 width=16) (actual time=0.208..8.014 rows=118 loops=1)
         Buffers: shared hit=6729
         ->  Nested Loop  (cost=1.00..771.07 rows=18 width=12) (actual time=0.188..6.938 rows=118 loops=1)
               Buffers: shared hit=6139
               ->  Index Scan using succeeded_deployments_environment_finished_at on public.deployments  (cost=0.57..129.29 rows=114 width=12) (actual time=0.128..3.247 rows=1625 loops=1)
                     Index Cond: ((deployments.environment_id = 137) AND (deployments.status = 2) AND (deployments.finished_at >= '2021-03-02 00:00:00+00'::timestamp with time zone) AND (deployments.finished_at <= '2021-03-02 23:59:59.999999+00'::timestamp with time zone))
                     Buffers: shared hit=1152
               ->  Index Only Scan using deployment_merge_requests_pkey on public.deployment_merge_requests  (cost=0.44..5.17 rows=46 width=8) (actual time=0.002..0.002 rows=0 loops=1625)
                     Index Cond: (deployment_merge_requests.deployment_id = deployments.id)
                     Heap Fetches: 103
                     Buffers: shared hit=4987
         ->  Index Scan using unique_merge_request_metrics_by_merge_request_id on public.merge_request_metrics  (cost=0.57..0.67 rows=1 width=12) (actual time=0.008..0.008 rows=1 loops=118)
               Index Cond: (merge_request_metrics.merge_request_id = deployment_merge_requests.merge_request_id)
               Buffers: shared hit=590
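
The PERCENTILE_CONT(0.5) aggregate above is a continuous median: for an even number of values it linearly interpolates between the two middle values. An illustrative Ruby equivalent, operating on lead times already expressed in seconds (as the EXTRACT(EPOCH ...) expression produces), is:

```ruby
# Illustrative median with linear interpolation, matching how PostgreSQL's
# PERCENTILE_CONT(0.5) behaves for the lead-time-for-changes query.
def median_lead_time_seconds(lead_times)
  return nil if lead_times.empty?

  sorted = lead_times.sort
  mid = sorted.size / 2
  if sorted.size.odd?
    sorted[mid].to_f
  else
    # Even count: interpolate halfway between the two middle values.
    (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end
```

Persisting this single number per environment per day is what lets the API avoid re-joining deployments, deployment_merge_requests, and merge_request_metrics on every request.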

Database migration

Here is a sanity check of the migrations (migrate, revert, and re-migrate):

shinya@shinya-B550-VISION-D:~/workspace/thin-gdk/services/rails/src$ tre bin/rails db:migrate:redo VERSION=20210225090801
INFO: This script is a predefined script in devkitkat.
== 20210225090801 CreateDoraDailyMetrics: reverting ===========================
-- drop_table(:dora_daily_metrics)
   -> 0.0068s
== 20210225090801 CreateDoraDailyMetrics: reverted (0.0117s) ==================

== 20210225090801 CreateDoraDailyMetrics: migrating ===========================
-- create_table(:dora_daily_metrics, {:if_not_exists=>true})
   -> 0.0135s
-- transaction_open?()
   -> 0.0000s
-- current_schema()
   -> 0.0001s
-- execute("ALTER TABLE dora_daily_metrics\nADD CONSTRAINT dora_daily_metrics_deployment_frequency_positive\nCHECK ( deployment_frequency >= 0 )\nNOT VALID;\n")
   -> 0.0007s
-- current_schema()
   -> 0.0001s
-- execute("ALTER TABLE dora_daily_metrics VALIDATE CONSTRAINT dora_daily_metrics_deployment_frequency_positive;")
   -> 0.0010s
-- transaction_open?()
   -> 0.0000s
-- current_schema()
   -> 0.0001s
-- execute("ALTER TABLE dora_daily_metrics\nADD CONSTRAINT dora_daily_metrics_lead_time_for_changes_in_seconds_positive\nCHECK ( lead_time_for_changes_in_seconds >= 0 )\nNOT VALID;\n")
   -> 0.0006s
-- current_schema()
   -> 0.0001s
-- execute("ALTER TABLE dora_daily_metrics VALIDATE CONSTRAINT dora_daily_metrics_lead_time_for_changes_in_seconds_positive;")
   -> 0.0010s
== 20210225090801 CreateDoraDailyMetrics: migrated (0.0304s) ==================
shinya@shinya-B550-VISION-D:~/workspace/thin-gdk/services/rails/src$ tre bin/rails db:migrate:redo VERSION=20210303165201
INFO: This script is a predefined script in devkitkat.
== 20210303165201 AddIndexForSucceededDeployments: reverting ==================
-- transaction_open?()
   -> 0.0000s
-- indexes(:deployments)
   -> 0.0062s
-- remove_index(:deployments, {:algorithm=>:concurrently, :name=>"index_deployments_on_environment_id_status_and_finished_at"})
   -> 0.0018s
== 20210303165201 AddIndexForSucceededDeployments: reverted (0.0087s) =========

== 20210303165201 AddIndexForSucceededDeployments: migrating ==================
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:deployments, [:environment_id, :status, :finished_at], {:name=>"index_deployments_on_environment_id_status_and_finished_at", :algorithm=>:concurrently})
   -> 0.0051s
-- add_index(:deployments, [:environment_id, :status, :finished_at], {:name=>"index_deployments_on_environment_id_status_and_finished_at", :algorithm=>:concurrently})
   -> 0.0039s
== 20210303165201 AddIndexForSucceededDeployments: migrated (0.0096s) =========

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team