Usage ping timing out for larger instances (Step 0)
Parent issue: https://gitlab.com/gitlab-org/telemetry/issues/308
Problem
Larger instances are timing out and sending a `-1` for some usage ping queries.
Investigation
- Top candidates for timeouts are large tables (over 10 million rows) such as `ci_builds`.
- For self-hosted instances the latest Omnibus statement timeout is 60s, but larger instances at larger companies tend to have database administrators who could tune this down.
- Postgres is slow at counting, and unique and filtered counting will be worse.
- `select count(*) from ci_builds` takes 5 minutes 30 seconds on a production replica and returns about 365 million.
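As a cheap cross-check (a sketch, not something the ping currently does): Postgres keeps a planner estimate of each table's row count in `pg_class.reltuples`, which is refreshed by `VACUUM`/`ANALYZE` and can be read instantly:

```sql
-- Planner's row-count estimate; refreshed by VACUUM / ANALYZE,
-- so it can lag the true count, but it reads in O(1).
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'ci_builds';
```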
Query against the analytics warehouse to find the instances reporting the largest counts:

```sql
SELECT license_id, max(merge_requests), max(hostname), array_agg(ci_builds), array_agg(merge_requests)
FROM analytics.version_usage_data_unpacked
WHERE created_at > '2019-09-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
```
Solution alternatives
Counting: https://www.citusdata.com/blog/2016/10/12/count-performance/
- Boring solutions:
- Increase the timeout for each count, per session (check transaction-wrapping and PgBouncer pooling risks)
- Optimize queries and indexes
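The per-session timeout bump could look like the sketch below (the 15-minute value is an assumption, not a recommendation):

```sql
-- Raise the timeout for this session only. SET LOCAL would scope it to
-- the current transaction instead, which is safer behind PgBouncer in
-- transaction-pooling mode.
SET statement_timeout = '15min';
SELECT count(*) FROM ci_builds;
RESET statement_timeout;
```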
- Exotic solutions:
- Rough counters with (for example) 95% accuracy are possible by selecting max(id) and surveying what percentage of IDs between 1 and max(id) are missing.
- Get the table storage size, and run a survey on deletions within a confidence level.
- Mark any value above N million (the timeout point) as simply 'Huge' and just return the result as "bigger than 1 million".
- Sample size calculator for surveying a table quickly: https://www.calculator.net/sample-size-calculator.html?type=1&cl=95&ci=5&pp=50&ps=1000000000&x=49&y=17
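The first and third ideas above could be sketched as follows (table name, probe count, and the 1 million threshold are assumptions for illustration):

```sql
-- Idea 1: estimate the count from id density. Probe 10,000 random ids
-- in [1, max(id)] and scale the hit rate by max(id); accuracy depends
-- on how uniformly deletions are spread across the id range.
WITH bounds AS (
  SELECT max(id) AS max_id FROM ci_builds
),
probes AS (
  SELECT (random() * b.max_id)::bigint + 1 AS probe_id
  FROM bounds b, generate_series(1, 10000)
)
SELECT (b.max_id * avg((c.id IS NOT NULL)::int))::bigint AS estimated_rows
FROM probes p
CROSS JOIN bounds b
LEFT JOIN ci_builds c ON c.id = p.probe_id
GROUP BY b.max_id;

-- Idea 3: cap the scan instead of counting everything. This reads at
-- most 1,000,001 rows, so a result of 1000001 can be reported as
-- "bigger than 1 million" without ever running a full count.
SELECT count(*) AS capped_count
FROM (SELECT 1 FROM ci_builds LIMIT 1000001) t;
```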
Edited by Jerome Z Ng