Skip to content

Usage ping timing out for larger instances (Step 0)

Parent Issue : https://gitlab.com/gitlab-org/telemetry/issues/308

Problem

Larger instances are timing out and sending a -1 for some usage ping queries

Investigation

  • Top candidates for timeouts are large tables (with over 10millions row) such as ci_builds.
  • For self-hosted instances latest omnibus statement timeout is 60s but larger instances in larger companies tend to have database administrators who could tweak this down.
  • Postgres is bad at counting, and unique & filtered counting will be worse
  • select count(*) from ci_builds takes 5mins30 seconds on a production replica to give 365millions
SELECT license_id, max(merge_requests), max(hostname),  array_agg(ci_builds), array_agg(merge_requests)
FROM analytics.version_usage_data_unpacked
WHERE  created_at > '2019-09-01'
group by 1
order by 2 desc
limit 100

Solution alternatives

Counting: https://www.citusdata.com/blog/2016/10/12/count-performance/

  1. Boring solutions:
  • Increase timeouts for each count per session. (check transaction wrapping or pgbouncer pooling risks)
  • Optimize queries and indexes
  1. Exotic solutions
Edited by Jerome Z Ng