Loose index scan distinct count [RUN ALL RSPEC] [RUN AS-IF-FOSS] (!55985) · Merge requests · GitLab.org / GitLab

Adam Hegyi requested to merge loose-index-scan-distinct-count into master Mar 08, 2021

What does this MR do?

This MR introduces a new utility class to build batched distinct count queries using the loose index scan technique.

The utility class is integrated with BatchCount behind a feature flag: loose_index_scan_for_distinct_values

When counting distinct (distinct(column)) elements the DB reads all related rows from the disk. Example:

select count(distinct(project_id)) from issues where project_id in (278964, 2789645, 278966);

 Aggregate  (cost=3703.17..3703.18 rows=1 width=8) (actual time=45.722..45.723 rows=1 loops=1)
   ->  Index Only Scan using index_issues_on_project_id_and_iid on issues  (cost=0.57..3493.77 rows=83758 width=4) (actual time=0.032..30.022 rows=80808 loops=1)
         Index Cond: (project_id = ANY ('{278964,2789645,278966}'::integer[]))
         Heap Fetches: 3347
 Planning Time: 0.575 ms
 Execution Time: 45.773 ms

This query returns 1 as the result. To produce this result , the DB reads about 80000 items from the index.

The loose index scan minimalizes the number of rows read per distinct item by utilizing a recursive loop:

Gitlab::Database::DistinctCount.new(Issue, :project_id).count(from: 278964, to: 278967)

WITH RECURSIVE "counter_cte" AS (
                                   (SELECT "issues"."project_id" AS project_id
                                    FROM "issues"
                                    WHERE "issues"."project_id" >= 278964
                                      AND "issues"."project_id" < 278967
                                    ORDER BY "issues"."project_id" ASC
                                    LIMIT 1)
                                 UNION
                                   (SELECT
                                      (SELECT "issues"."project_id"
                                       FROM "issues"
                                       WHERE "issues"."project_id" > "counter_cte"."project_id"
                                         AND "issues"."project_id" < 278967
                                       ORDER BY "issues"."project_id" ASC
                                       LIMIT 1) AS project_id
                                    FROM "counter_cte"
                                    WHERE "counter_cte"."project_id" < 278967))
SELECT COUNT(project_id)
FROM "counter_cte" AS "issues";

 Aggregate  (cost=22.52..22.53 rows=1 width=8) (actual time=0.040..0.041 rows=1 loops=1)
   CTE counter_cte
     ->  Recursive Union  (cost=0.57..21.82 rows=31 width=4) (actual time=0.020..0.036 rows=2 loops=1)
           ->  Limit  (cost=0.57..0.61 rows=1 width=4) (actual time=0.019..0.019 rows=1 loops=1)
                 ->  Index Only Scan using index_issues_on_project_id_and_iid on issues issues_2  (cost=0.57..3679.39 rows=83257 width=4) (actual time=0.018..0.018 rows=1 loops=1)
                       Index Cond: ((project_id >= 278964) AND (project_id < 278967))
                       Heap Fetches: 0
           ->  WorkTable Scan on counter_cte  (cost=0.00..2.06 rows=3 width=4) (actual time=0.007..0.007 rows=0 loops=2)
                 Filter: (project_id < 278967)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.57..0.61 rows=1 width=4) (actual time=0.008..0.009 rows=0 loops=1)
                         ->  Index Only Scan using index_issues_on_project_id_and_iid on issues issues_1  (cost=0.57..15101.54 rows=344968 width=4) (actual time=0.008..0.008 rows=0 loops=1)
                               Index Cond: ((project_id > counter_cte.project_id) AND (project_id < 278967))
                               Heap Fetches: 0
   ->  CTE Scan on counter_cte issues  (cost=0.00..0.62 rows=31 width=4) (actual time=0.021..0.037 rows=2 loops=1)
 Planning Time: 0.979 ms
 Execution Time: 0.088 ms
(18 rows)

The Gitlab::Database::DistinctCount works with Usage Data with one limitation: GROUP BY is not supported.

Performance

I ran the distinct_count(::Ci::Build.where(time_period), :user_id) with the new code on DB lab and I got a bit better total runtime: 27min vs 32min (measured on PG.ai where I have about 160ms latency)

Results:

Query	Distinct count	Optimized distinct count
Slowest CI::Build batch	10s - https://explain.depesz.com/s/RWup	6s - https://explain.depesz.com/s/fubX
Random CI::Build batch	2.1s - https://explain.depesz.com/s/sRnv	0.8s - https://explain.depesz.com/s/Idpy
Random deployment batch	1s - https://explain.depesz.com/s/Cg9I	0.4s https://explain.depesz.com/s/CbcX

Each measurements are uncached, waited several hours between running the queries on a PRD replica.

Does this MR meet the acceptance criteria?

Conformity

📋 Does this MR need a changelog?
- I have included a changelog entry.
- I have not included a changelog entry because _____.
Documentation (if required)
Code review guidelines
Merge request performance guidelines
Style guides
Database guides
Separation of EE specific content

Availability and Testing

Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process.
Tested in all supported browsers
Informed Infrastructure department of a default or new setting change, if applicable per definition of done

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

Label as security and @ mention @gitlab-com/gl-security/appsec
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
Security reports checked/validated by a reviewer from the AppSec team

Edited Mar 08, 2021 by Adam Hegyi

Loose index scan distinct count [RUN ALL RSPEC] [RUN AS-IF-FOSS]