Measure sidekiq worker database usage
Client-side durations are only an estimate of each sidekiq worker's actual database usage.
client-side db duration = patroni active duration + pgbouncer queue wait duration + network round-trip time (small enough to ignore)
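As a rough illustration of the decomposition above, the sketch below plugs in made-up numbers. The function name and all values are hypothetical; the point is only that the same workload looks more expensive from the client side once pgbouncer queue wait grows.

```python
def client_side_db_duration(patroni_active_s, pgbouncer_wait_s, network_s=0.0):
    """Sum the components of the duration observed by the Sidekiq client.

    All arguments are in seconds. network_s defaults to 0 because the
    network component is treated as negligible.
    """
    return patroni_active_s + pgbouncer_wait_s + network_s


# Hypothetical job doing 0.25s of real database work.
uncongested = client_side_db_duration(patroni_active_s=0.25, pgbouncer_wait_s=0.0)
# Same job, same database work, but waiting 1.5s in the pgbouncer queue.
congested = client_side_db_duration(patroni_active_s=0.25, pgbouncer_wait_s=1.5)
```

In the congested case the client-side figure is several times the actual database time, which is why it is only an estimate.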
Using pg_stat_activity
In https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2111338131 and https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2111620696, we discuss using the pg_stat_activity view to approximate each worker's database usage.
Technical considerations to decide on:
- How to approximate max total connections (sum of all pgbouncer's backend pool size): not necessary if we use a combination of both db duration and non-idle backends.
- Authorising the database user to read the pg_stat_activity view: we can use pg_stat_get_activity(-1).
- Connection to database: postgres exporter connects to patroni directly, whereas rails connects through a pgbouncer. How should the sampler connect?
- Who should measure? gitlab-rails/external service (ongoing discussion in https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2120098720): for starters, gitlab rails should measure it
- Sampling frequency
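A minimal sketch of what one sampling pass could compute, under stated assumptions: it assumes each query carries a trailing comment containing an `endpoint_id:<WorkerClass>` field that identifies the worker, and that the sampler has already fetched `(state, query)` rows. The SQL string, the comment format, and all names here are illustrative, not a settled design.

```python
import re
from collections import Counter

# Query a sampler might run; shown for context only, never executed here.
SAMPLE_SQL = "SELECT state, query FROM pg_stat_activity WHERE state <> 'idle'"

# Assumption: queries are annotated like
# /*application:sidekiq,endpoint_id:SomeWorker*/
ENDPOINT_RE = re.compile(r"/\*.*?endpoint_id:([\w:]+).*?\*/")


def non_idle_backends_by_worker(rows):
    """Count non-idle backends per worker class for one sample.

    `rows` is an iterable of (state, query) tuples. Any state other than
    'idle' (including 'idle in transaction') counts as holding a backend.
    """
    counts = Counter()
    for state, query in rows:
        if state == "idle":
            continue
        match = ENDPOINT_RE.search(query or "")
        worker = match.group(1) if match else "unknown"
        counts[worker] += 1
    return counts


# Hypothetical sample of four backends.
rows = [
    ("active", "SELECT 1 /*application:sidekiq,endpoint_id:Ci::BuildWorker*/"),
    ("idle in transaction", "UPDATE x /*application:sidekiq,endpoint_id:Ci::BuildWorker*/"),
    ("idle", "SELECT 2 /*application:sidekiq,endpoint_id:MailWorker*/"),
    ("active", "SELECT 3"),
]
sample = non_idle_backends_by_worker(rows)
```

Summing such samples over time gives a per-worker approximation of backend occupancy; how often to sample is the open question listed above.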
Using db_{main/ci}_duration_s
This approach is simple and does not require configuring the database user with special roles such as pg_read_all_stats. However, it is a fuzzy signal: it is accurate only in the initial window of a bad workload. At first, only the offending workload observes high db duration; once the pgbouncer queue builds up, other workloads start to see elevated durations too.
If we throttle on db duration quota depletion, we should expect a low-precision signal that throttles a handful of suspected workers, whose jobs are buffered and enqueued at a controlled rate. The intended effect is to relieve pgbouncer pool congestion while EOCs/incident responders investigate.
Technical considerations to decide on:
- How do we determine what a good quota is?
- Sampling frequency and bucketing window size
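To make the quota and bucketing questions concrete, here is a minimal in-memory sketch of a windowed db duration quota. The class name, the 60-second window, the 10-second buckets, and the quota values are all hypothetical placeholders; a real implementation would likely need shared storage across Sidekiq processes.

```python
from collections import defaultdict


class DbDurationTracker:
    """Track per-worker db duration in fixed buckets over a rolling window."""

    def __init__(self, quota_s, window_s=60, bucket_s=10):
        self.quota_s = quota_s      # allowed db seconds per window
        self.window_s = window_s    # length of the rolling window
        self.bucket_s = bucket_s    # granularity of each bucket
        # worker class -> {bucket start timestamp -> accumulated db seconds}
        self.buckets = defaultdict(dict)

    def _bucket_start(self, now):
        return int(now // self.bucket_s) * self.bucket_s

    def record(self, worker, db_duration_s, now):
        """Add one job's db duration to the bucket containing `now`."""
        start = self._bucket_start(now)
        per_worker = self.buckets[worker]
        per_worker[start] = per_worker.get(start, 0.0) + db_duration_s

    def usage(self, worker, now):
        """Return db seconds used within the window, dropping expired buckets."""
        oldest = self._bucket_start(now) - self.window_s + self.bucket_s
        per_worker = self.buckets.get(worker, {})
        for start in [s for s in per_worker if s < oldest]:
            del per_worker[start]
        return sum(per_worker.values())

    def over_quota(self, worker, now):
        return self.usage(worker, now) > self.quota_s


tracker = DbDurationTracker(quota_s=30)
tracker.record("HeavyWorker", 20, now=100)
tracker.record("HeavyWorker", 15, now=125)
tracker.record("LightWorker", 2, now=125)
heavy_throttled = tracker.over_quota("HeavyWorker", now=130)
light_throttled = tracker.over_quota("LightWorker", now=130)
```

Smaller buckets track the window more precisely at the cost of more state per worker; the quota itself would have to be derived from observed usage.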
Status 2024-11-20
[Phase 1] Set up sidekiq middleware to track re... (#3919 - moved) is done, and we are now tracking resource usage of db_ci_duration_s and db_main_duration_s for all worker classes. See the feature flag rollout issue at gitlab-org/gitlab#501502 (comment 2217438742).
[Phase 2] Set up a way to configure application... (#3920 - moved) is being reviewed and will enable us to refine the resource usage limit rules.
Due to the change in team structure, the 3rd task [Phase 3] Integrate sidekiq resource limiter mi... (#3921 - moved) will be assigned to @marcogreg.