Batch counting
For large instances, we cannot count full relations anymore. For GitLab.com, a select count(*) from ci_builds
times out due to the size of the table, for example.
There are alternatives that produce an approximate count in an acceptable amount of time. An example for that is approximate counting based on TABLESAMPLE
or using database statistics to figure out the size of a relation.
The usage ping feature is a contender here because it counts all the things. For example, we already use approximate counting (example MR: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/23599) here.
In order to retrieve an exact count for large tables, we can also use a batched approach: We would simply iterate the relation in batches and maintain a counter for the number of records seen.
A bit simplified:
User.select(:id).find_in_batches(batch_size: 100000).reduce(0) { |counter, batch| counter += batch.size }
This produces queries like below. It is important to include the primary key to be able to apply a filter based on the primary key index for the batching.
Secondly, since this may take a long time to produce a result, we must make sure we don't run it in a read/write transaction (to make sure those are short and sweet and don't hold locks forever).
[ gprd ] production> User.select(:id).find_in_batches(batch_size: 100000).reduce(0) { |counter, batch| counter += batch.size }
D, [2019-07-25T12:42:34.168420 #4251] DEBUG -- : User Load (142.2ms) SELECT "users"."id" FROM "users" ORDER BY "users"."id" ASC LIMIT 100000
D, [2019-07-25T12:42:47.397972 #4251] DEBUG -- : SQL (0.9ms) SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY
D, [2019-07-25T12:42:47.532739 #4251] DEBUG -- : User Load (134.0ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > 104868 ORDER BY "users"."id" ASC LIMIT 100000
D, [2019-07-25T12:42:55.278908 #4251] DEBUG -- : SQL (0.7ms) SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY
D, [2019-07-25T12:42:55.419353 #4251] DEBUG -- : User Load (139.9ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > 208155 ORDER BY "users"."id" ASC LIMIT 100000
D, [2019-07-25T12:43:09.513571 #4251] DEBUG -- : User Load (79.5ms) SELECT "users"."id" FROM "users" WHERE "users"."id" > 313939 ORDER BY "users"."id" ASC LIMIT 100000