Usage ping timing out for larger instances (Step 0)
Parent issue: https://gitlab.com/gitlab-org/telemetry/issues/308
Problem
Larger instances are timing out and sending a `-1` for some usage ping queries.
Investigation
- Top candidates for timeouts are large tables (over 10 million rows) such as `ci_builds`.
- For self-hosted instances the latest Omnibus statement timeout is 60s, but larger instances at larger companies tend to have database administrators who could tune this down.
- Postgres is slow at counting, and unique and filtered counting will be worse.
- `select count(*) from ci_builds` takes 5 minutes 30 seconds on a production replica and returns about 365 million.
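As a cheap cross-check (a sketch, not something the ping currently does): Postgres keeps a planner estimate of each table's row count in `pg_class.reltuples`, which is refreshed by `VACUUM`/`ANALYZE` and can be read instantly:

```sql
-- Planner's row-count estimate; refreshed by VACUUM / ANALYZE,
-- so it can lag the true count, but it reads in O(1).
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'ci_builds';
```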
Query against the analytics warehouse to find the instances reporting the largest counts:

```sql
SELECT license_id, max(merge_requests), max(hostname), array_agg(ci_builds), array_agg(merge_requests)
FROM analytics.version_usage_data_unpacked
WHERE created_at > '2019-09-01'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 100
```
Solution alternatives
Counting: https://www.citusdata.com/blog/2016/10/12/count-performance/
- Boring solutions:
- Increase the timeout for each count, per session (check transaction-wrapping and PgBouncer pooling risks)
- Optimize queries and indexes
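The per-session timeout bump could look like the sketch below (the 15-minute value is an assumption, not a recommendation):

```sql
-- Raise the timeout for this session only. SET LOCAL would scope it to
-- the current transaction instead, which is safer behind PgBouncer in
-- transaction-pooling mode.
SET statement_timeout = '15min';
SELECT count(*) FROM ci_builds;
RESET statement_timeout;
```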
- Exotic solutions:
- Rough counters with (for example) 95% accuracy are possible by selecting max(id) and surveying what percentage of IDs between 1 and max(id) are missing.
- Get the table storage size, and run a survey on deletions within a confidence level.
- Mark any value above N million (the timeout point) as simply 'Huge' and just return the result as "bigger than 1 million".
- Sample size calculator for surveying a table quickly: https://www.calculator.net/sample-size-calculator.html?type=1&cl=95&ci=5&pp=50&ps=1000000000&x=49&y=17
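The first and third ideas above could be sketched as follows (table name, probe count, and the 1 million threshold are assumptions for illustration):

```sql
-- Idea 1: estimate the count from id density. Probe 10,000 random ids
-- in [1, max(id)] and scale the hit rate by max(id); accuracy depends
-- on how uniformly deletions are spread across the id range.
WITH bounds AS (
  SELECT max(id) AS max_id FROM ci_builds
),
probes AS (
  SELECT (random() * b.max_id)::bigint + 1 AS probe_id
  FROM bounds b, generate_series(1, 10000)
)
SELECT (b.max_id * avg((c.id IS NOT NULL)::int))::bigint AS estimated_rows
FROM probes p
CROSS JOIN bounds b
LEFT JOIN ci_builds c ON c.id = p.probe_id
GROUP BY b.max_id;

-- Idea 3: cap the scan instead of counting everything. This reads at
-- most 1,000,001 rows, so a result of 1000001 can be reported as
-- "bigger than 1 million" without ever running a full count.
SELECT count(*) AS capped_count
FROM (SELECT 1 FROM ci_builds LIMIT 1000001) t;
```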
Edited by Jerome Z Ng