Skip to content

Fix usage ping timeouts with batch counters

1. What does this MR do?

Fixes usage ping timeouts with batch counters

Usage pings time out on large instances with large tables. This change adds batch counters to fix the issue #193075

Next steps

  1. Validate the approach and design
  2. Add batch counters to usage_activity_by_stage as well as large selected tables
  3. Add indexes for column and filters to optimize each batch

2. Performance

  1. batch_count method enables two modes of operation for regular and distinct counting
  2. batch_size is 10K for distinct and 100K for ordinary counts so that each query is < 500 ms (pessimistic for gitlab.com scale)
  3. We don't need to run usage ping on gitlab.com as we have already tap the postgres db to snowflake
  4. Smart retry with halved batch size and warmer cache
  5. Min & max queries are cheap with an index
  6. We need an index for the where conditions and range queries for each counter

2.1 Ordinary count

  • Duration of the query depends on the size of the table
  • Default batch size is 100K
  • On largest self-hosted instance with 60M events, 600 SQL queries x 200 ms = 120 seconds (2mins)
  • On gitlab.com with 600M events, 6000 SQL queries x 200 ms = 1200 seconds pessimistic (20mins)
 Gitlab::UsageData.batch_count(::Event)
gitlabhq_production=> \timing
Timing is on.
gitlabhq_production=> SELECT MIN("events"."id") FROM "events";
   min    
----------
 49239589
Time: 73.463 ms

gitlabhq_production=> SELECT MAX("events"."id") FROM "events";
    max    
-----------
 637883458
Time: 0.799 ms

gitlabhq_production=> SELECT COUNT(*) FROM (SELECT "events"."id" FROM "events" WHERE (id >= 223000123 and id < 223100123)) as t;
 count 
-------
 92883
Time: 330.060 ms
next batch Time: 84.080 ms
next batch Time: 162.814 ms
next batch Time: 138.671 ms

2.2 Distinct count

  • Duration of the query depends on the count of the distinct field (usually users so user_id, or author_id) and size of the table
  • Default batch size is 10K to yield to ensure each select query < 500ms
  • On largest self-hosted instance with 500K users, 50 SQL queries x 500 ms = 25seconds safe maximum
  • On gitlab.com with 6M users, 600 SQL queries x 500 ms = 300 seconds maximum (5mins)
Gitlab::UsageData.batch_count(::Issue, column: :author_id, mode: :distinct),
Cold cached run on gitlab.com
\timing
SELECT MIN("issues"."author_id") FROM "issues";
SELECT MAX("issues"."author_id") FROM "issues";

gitlabhq_production=> \timing
Timing is on.
gitlabhq_production=> SELECT MIN("issues"."author_id") FROM "issues";
 min 
-----
   1
Time: 67.517 ms

gitlabhq_production=> SELECT MAX("issues"."author_id") FROM "issues";
   max   
---------
 5377617
Time: 2.323 ms

gitlabhq_production=> SELECT COUNT(*) FROM (SELECT DISTINCT author_id FROM issues WHERE (author_id >= 100001 and author_id < 110001)) as t;
 count 
-------
  2152
Time: 503.101 ms
next batch Time: 363.973 ms
next batch Time: 406.356 ms
next batch Time: 267.688 ms
next batch Time: 217.886 ms

2.3 Top 20 Self-hosted instances by active user counts

screenshot-2020-02-06-13-50-22

2.4 Research Links

  1. Faster PostgreSQL Counting https://www.citusdata.com/blog/2016/10/12/count-performance/
  2. PostgreSQL count(*) made fast https://www.cybertec-postgresql.com/en/postgresql-count-made-fast/
  3. Use Subqueries to Count Distinct 50X Faster https://www.sisense.com/blog/use-subqueries-to-count-distinct-50x-faster/
  4. Getting count of distinct elements, per group, in PostgreSQL https://www.depesz.com/2014/01/29/getting-count-of-distinct-elements-per-group-in-postgresql/
  5. Approximate table counts based on TABLESAMPLE gitlab-foss!22650 (merged)
  6. Use approximate counts for big tables gitlab-foss!23599 (merged)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Yorick Peterse

Merge request reports