Reduce database connection pool metric cardinality

Partially revert !223981 (merged), which added per-thread-name labels to the gitlab_database_connection_pool_busy and gitlab_database_connection_pool_dead metrics. The extra labels caused a cardinality explosion that made these metrics impossible to query and contributed to Mimir ingester OOMs.

Changes:

  • Remove per-thread splitting of busy and dead from the default metrics, reverting to scalar values from connection_pool.stat
  • Add multiprocess_mode to all DatabaseSampler gauges so metrics are aggregated across Puma worker processes (min for size, max for all others)
  • Add optional per-thread metrics under separate gauge names gitlab_database_extended_connection_pool_{busy,dead} gated behind the per_thread_db_connection_pool_metrics ops feature flag scoped to Feature.current_pod, allowing operators to enable detailed metrics for a percentage of pods via chatops
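As a rough illustration of the aggregation described in the second bullet, the sketch below collapses per-process gauge readings into one value per metric, using min for size and max for everything else. It is plain Ruby and does not call the Prometheus client; the PIDs, readings, and the `aggregate` helper are hypothetical and exist only for this example.

```ruby
# Hypothetical gauge readings for one pod and one database host.
# Keys of the inner hashes are made-up Puma worker PIDs.
SAMPLES = {
  size: { 101 => 20, 102 => 20, 103 => 20 },
  busy: { 101 => 3, 102 => 7, 103 => 5 },
  dead: { 101 => 0, 102 => 1, 103 => 0 }
}.freeze

# Aggregation mode per gauge: min for size, max for all others (the fallback).
MODES = { size: :min }.freeze

# Collapse the per-process readings of each gauge into a single value.
def aggregate(samples, modes)
  samples.to_h do |name, per_process|
    values = per_process.values
    mode = modes.fetch(name, :max)
    [name, mode == :min ? values.min : values.max]
  end
end

AGGREGATED = aggregate(SAMPLES, MODES)
puts AGGREGATED.inspect
```

With max, the reported busy and dead values reflect the most loaded worker process in the pod, which is the conservative choice for saturation alerts.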

Important caveat for the idle metric: it counts connections that have already been initialized but are not in use. This means that busy + dead + idle <= connections. For saturation monitoring, use dead + busy.
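A minimal sketch of that saturation calculation follows. The hash mimics the shape of connection_pool.stat, but the numbers are invented, and dividing by size (the configured pool maximum) is an assumption of this example rather than something the description specifies.

```ruby
# Invented sample in the shape of connection_pool.stat (values are hypothetical).
STAT = { size: 20, connections: 12, busy: 6, dead: 2, idle: 3 }.freeze

# Saturation based on dead + busy, ignoring idle.
# Assumption: the denominator is the configured pool size.
def saturation(stat)
  (stat[:busy] + stat[:dead]).to_f / stat[:size]
end

# The caveat from the text: initialized connections may exceed busy + dead + idle.
ACCOUNTED = STAT[:busy] + STAT[:dead] + STAT[:idle]

puts format("saturation: %.2f", saturation(STAT))
puts "accounted #{ACCOUNTED} of #{STAT[:connections]} connections"
```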

The default metrics reduce cardinality from pods * processes * threads * db_hosts to pods * db_hosts.
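To make the reduction concrete, the arithmetic can be sketched with placeholder fleet numbers. All four counts below are hypothetical and are not taken from any real deployment.

```ruby
# Hypothetical fleet dimensions, chosen only to illustrate the formula.
PODS = 500
PROCESSES = 4
THREADS = 8
DB_HOSTS = 10

# Series per metric with per-thread labels: pods * processes * threads * db_hosts.
BEFORE = PODS * PROCESSES * THREADS * DB_HOSTS

# Series per metric after this change: pods * db_hosts.
AFTER = PODS * DB_HOSTS

# The reduction factor is processes * threads, independent of pod count.
FACTOR = BEFORE / AFTER

puts "before: #{BEFORE}, after: #{AFTER}, reduction: #{FACTOR}x"
```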

gitlab-com/gl-infra/observability/team#4488

Edited by Bob Van Landuyt
