
WIP: Resolve "Usage ping timing out for larger instances"

What does this MR do?

We have given up on increasing statement timeouts, even for usage ping queries, because of the issues below:

Issues

  • Table modifications used by migrations could be blocked. We have no control over when users upgrade so this could randomly result in migrations taking much longer than necessary or even timing out.
  • Selects in a transaction block only ALTER statements, but because of the lock queue in PostgreSQL, any queries that arrive after the waiting ALTER are blocked as well. Long selects can therefore block everything else in a no-downtime deployment environment.
  • Vacuuming will suffer, resulting in degradation.
  • Replicas will accumulate dead tuples, as long-running queries prevent them from cleaning up removed tuples. See https://www.2ndquadrant.com/en/blog/when-autovacuum-does-not-vacuum/ for some details.
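
The lock-queue effect in the second bullet can be sketched with a toy model. This is only an illustration, not PostgreSQL's actual lock manager: the class name `TableLockQueue`, the `CONFLICTS` table, and the session names are all hypothetical, and only the two lock modes relevant here (the one taken by a SELECT and the one taken by most ALTER TABLE forms) are modeled.

```ruby
# Toy model of the lock queue on a single table. Hypothetical names throughout;
# this is a simplification, not PostgreSQL's real implementation.
CONFLICTS = {
  access_share: [:access_exclusive],                    # taken by SELECT
  access_exclusive: [:access_share, :access_exclusive]  # taken by most ALTER TABLE forms
}.freeze

class TableLockQueue
  attr_reader :granted, :waiting

  def initialize
    @granted = []
    @waiting = []
  end

  # A request is granted only if it conflicts with no lock already held
  # and with no request already waiting ahead of it; otherwise it queues.
  def request(name, mode)
    ahead = @granted + @waiting
    if ahead.any? { |_, held_mode| CONFLICTS[mode].include?(held_mode) }
      @waiting << [name, mode]
      :waiting
    else
      @granted << [name, mode]
      :granted
    end
  end
end

queue = TableLockQueue.new
queue.request("usage ping SELECT", :access_share)         # => :granted
queue.request("migration ALTER TABLE", :access_exclusive) # => :waiting
queue.request("web request SELECT", :access_share)        # => :waiting
```

In this model the third request would not conflict with the long-running SELECT on its own, yet it still waits because it queues behind the ALTER. That is the mechanism by which one long select can stall unrelated traffic during a migration.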

General issues which will always happen with long-running calculations

  • Writes on the rows locked for the count queries may be blocked until the count queries finish. This would result in a service degradation every time the usage ping data is calculated.
  • Depending on the amount of CPU and disk resources used, other queries may end up being slower while the count queries are running.
  • The queries can still time out, depending on what the rest of the system is doing, how much cache is available, etc.

Implementation details

  1. Use a statement timeout of 900 seconds to count ~1 billion rows (see https://gitlab.com/gitlab-org/telemetry/issues/264#note_265826121 and the comments around it).
  2. Use a transaction block and SET LOCAL statement_timeout so that the setting is scoped to the transaction. This works around pgbouncer transaction pooling, which does not preserve session-level SET parameters.
  3. Return a timeout_fallback of -2 to differentiate timeouts from other types of ActiveRecord errors.
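
The three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code in this MR: the helper name `count_with_statement_timeout`, the `ERROR_FALLBACK` value of -1, and the stand-in `QueryCanceled` and `FakeConnection` classes are hypothetical. In a Rails application the connection would presumably be `ActiveRecord::Base.connection` and the rescued error `ActiveRecord::QueryCanceled`.

```ruby
# Stand-in for the error raised when PostgreSQL cancels a statement on timeout.
# Assumption: in Rails this would be ActiveRecord::QueryCanceled.
class QueryCanceled < StandardError; end

TIMEOUT_FALLBACK = -2 # value named in this MR
ERROR_FALLBACK = -1   # hypothetical value for all other errors

# connection: any object responding to #transaction (yielding a block)
# and #execute(sql). The caller's block runs the actual count query.
def count_with_statement_timeout(connection, timeout_seconds: 900)
  connection.transaction do
    # SET LOCAL only lasts until the end of the current transaction, so the
    # longer timeout cannot leak to other clients sharing the pooled connection.
    connection.execute("SET LOCAL statement_timeout = '#{Integer(timeout_seconds)}s'")
    yield
  end
rescue QueryCanceled
  TIMEOUT_FALLBACK
rescue StandardError
  ERROR_FALLBACK
end

# Minimal fake connection so the sketch runs without a database.
class FakeConnection
  attr_reader :statements

  def initialize
    @statements = []
  end

  def transaction
    yield
  end

  def execute(sql)
    @statements << sql
  end
end

conn = FakeConnection.new
count_with_statement_timeout(conn) { 1_000_000 }           # => 1000000
count_with_statement_timeout(conn) { raise QueryCanceled } # => -2
```

Returning distinct sentinel values keeps the usage ping payload numeric while still letting the receiving side tell a timed-out count apart from a count that failed for another reason.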

Steps before removing WIP

  1. Validate the whole idea of increased statement timeout for counts
    • Long timeout selects would block migrations: most migrations disable timeouts and could wait for up to 15 minutes.
    • pgbouncer is used in transaction pooling mode, which forces the use of SET LOCAL. Is it possible to skip pgbouncer?
  2. Add specs and wrap more usage ping counters in with_statement_timeout
Edited by Alper Akgun
