Skip to content

Loose index scan distinct count [RUN ALL RSPEC] [RUN AS-IF-FOSS]

Adam Hegyi requested to merge loose-index-scan-distinct-count into master

What does this MR do?

This MR introduces a new utility class to build batched distinct count queries using the loose index scan technique.

The utility class is integrated with BatchCount behind a feature flag: loose_index_scan_for_distinct_values


When counting distinct (distinct(column)) elements the DB reads all related rows from the disk. Example:

select count(distinct(project_id)) from issues where project_id in (278964, 2789645, 278966);
 Aggregate  (cost=3703.17..3703.18 rows=1 width=8) (actual time=45.722..45.723 rows=1 loops=1)
   ->  Index Only Scan using index_issues_on_project_id_and_iid on issues  (cost=0.57..3493.77 rows=83758 width=4) (actual time=0.032..30.022 rows=80808 loops=1)
         Index Cond: (project_id = ANY ('{278964,2789645,278966}'::integer[]))
         Heap Fetches: 3347
 Planning Time: 0.575 ms
 Execution Time: 45.773 ms

This query returns 1 as the result. To produce this result , the DB reads about 80000 items from the index.

The loose index scan minimalizes the number of rows read per distinct item by utilizing a recursive loop:

Gitlab::Database::DistinctCount.new(Issue, :project_id).count(from: 278964, to: 278967)
WITH RECURSIVE "counter_cte" AS (
                                   (SELECT "issues"."project_id" AS project_id
                                    FROM "issues"
                                    WHERE "issues"."project_id" >= 278964
                                      AND "issues"."project_id" < 278967
                                    ORDER BY "issues"."project_id" ASC
                                    LIMIT 1)
                                 UNION
                                   (SELECT
                                      (SELECT "issues"."project_id"
                                       FROM "issues"
                                       WHERE "issues"."project_id" > "counter_cte"."project_id"
                                         AND "issues"."project_id" < 278967
                                       ORDER BY "issues"."project_id" ASC
                                       LIMIT 1) AS project_id
                                    FROM "counter_cte"
                                    WHERE "counter_cte"."project_id" < 278967))
SELECT COUNT(project_id)
FROM "counter_cte" AS "issues";
 Aggregate  (cost=22.52..22.53 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
   CTE counter_cte
     ->  Recursive Union  (cost=0.57..21.82 rows=31 width=4) (actual time=0.020..0.036 rows=2 loops=1)
           ->  Limit  (cost=0.57..0.61 rows=1 width=4) (actual time=0.019..0.019 rows=1 loops=1)
                 ->  Index Only Scan using index_issues_on_project_id_and_iid on issues issues_2  (cost=0.57..3679.39 rows=83257 width=4) (actual time=0.018..0.018 rows=1 loops=1)
                       Index Cond: ((project_id >= 278964) AND (project_id < 278967))
                       Heap Fetches: 0
           ->  WorkTable Scan on counter_cte  (cost=0.00..2.06 rows=3 width=4) (actual time=0.007..0.007 rows=0 loops=2)
                 Filter: (project_id < 278967)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.57..0.61 rows=1 width=4) (actual time=0.008..0.009 rows=0 loops=1)
                         ->  Index Only Scan using index_issues_on_project_id_and_iid on issues issues_1  (cost=0.57..15101.54 rows=344968 width=4) (actual time=0.008..0.008 rows=0 loops=1)
                               Index Cond: ((project_id > counter_cte.project_id) AND (project_id < 278967))
                               Heap Fetches: 0
   ->  CTE Scan on counter_cte issues  (cost=0.00..0.62 rows=31 width=4) (actual time=0.021..0.037 rows=2 loops=1)
 Planning Time: 0.979 ms
 Execution Time: 0.088 ms
(18 rows)

The Gitlab::Database::DistinctCount works with Usage Data with one limitation: GROUP BY is not supported.

Performance

I ran the distinct_count(::Ci::Build.where(time_period), :user_id) with the new code on DB lab and I got a bit better total runtime: 27min vs 32min (measured on PG.ai where I have about 160ms latency)

Results:

Query Distinct count Optimized distinct count
Slowest CI::Build batch 10s - https://explain.depesz.com/s/RWup 6s - https://explain.depesz.com/s/fubX
Random CI::Build batch 2.1s - https://explain.depesz.com/s/sRnv 0.8s - https://explain.depesz.com/s/Idpy
Random deployment batch 1s - https://explain.depesz.com/s/Cg9I 0.4s https://explain.depesz.com/s/CbcX

Each measurements are uncached, waited several hours between running the queries on a PRD replica.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Adam Hegyi

Merge request reports