Add pg_stat_activity sampler into gitlab rails
## Context
This is a follow-up to the discussions in:
- #3818 (comment 2123853926)
- https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2111620696
## Problem
Database durations measured on the client side are not a sufficient indicator of whether a job's worker class is the cause of a resource bottleneck. Jobs with long DB durations could be victims of head-of-line blocking at pgbouncer, or they could actually be spending a long time executing on the database.
## Approach

### Phase 1: Introduce a background sampler
We can sample the pg_stat_activity table through the SQL function pg_stat_get_activity(-1), in a similar fashion to how the postgres exporters do.
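As a rough sketch, the sampling query could look like the following. The column list and filtering here are assumptions modeled loosely on the postgres exporter queries, not the final implementation:

```ruby
# Hypothetical sketch of the query run on each sampling tick.
# pg_stat_get_activity(-1) returns one row per backend; unprivileged users
# see the full query text only for backends owned by the same database user.
SAMPLE_QUERY = <<~SQL.freeze
  SELECT datid,
         pid,
         application_name,
         state,
         query
  FROM pg_stat_get_activity(-1)
  WHERE state IS NOT NULL
SQL

# In the Rails app this would be executed on a database connection,
# e.g. something along the lines of:
#   ApplicationRecord.connection.select_all(SAMPLE_QUERY)
```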
The gitlab user is able to view the queries made by backends belonging to the same user, which works out for us since the Sidekiq and webservice pods connect as the same database user. This has been verified using a console: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2123814996
Sidekiq processes can use their sampler threads to collect these samples periodically. We store the past x minutes of samples in Redis for easy reference, since sampling may lag behind during a sudden spike in consumption. Using the sampler threads also provides reliability: all Sidekiq processes participate in the sampling effort, avoiding a single point of failure.
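A minimal sketch of the retention behavior, using an in-memory buffer as a stand-in for the Redis storage (in production this would likely be a Redis structure such as a sorted set scored by timestamp; the class, method names, and retention value are illustrative):

```ruby
# Simplified in-memory stand-in for the Redis sample store: keep only
# samples from the past RETENTION_SECONDS window.
class SampleBuffer
  RETENTION_SECONDS = 5 * 60 # "past x minutes" -- the value is an assumption

  def initialize
    @samples = [] # [timestamp, payload] pairs, appended in order
  end

  def add(payload, now: Time.now.to_f)
    prune(now)
    @samples << [now, payload]
  end

  # Samples taken within the last window_seconds, oldest first.
  def recent(window_seconds, now: Time.now.to_f)
    @samples.select { |ts, _| ts >= now - window_seconds }.map { |_, p| p }
  end

  private

  def prune(now)
    @samples.reject! { |ts, _| ts < now - RETENTION_SECONDS }
  end
end
```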
We avoid oversampling by using a time-released exclusive lease.
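In GitLab, `Gitlab::ExclusiveLease` with a TTL provides this kind of lease, coordinated across processes via Redis. The single-process sketch below only illustrates the time-released behavior; the class and its names are not the real implementation:

```ruby
# Minimal sketch of a time-released exclusive lease: once obtained, the
# lease cannot be re-obtained until its TTL elapses, so at most one sample
# is taken per interval regardless of how many sampler threads try.
class TimedLease
  def initialize(ttl_seconds)
    @ttl = ttl_seconds
    @expires_at = nil
  end

  # Returns true if the caller obtained the lease, false otherwise.
  def try_obtain(now: Time.now.to_f)
    return false if @expires_at && now < @expires_at

    @expires_at = now + @ttl
    true
  end
end
```

Each sampler thread would call `try_obtain` before sampling and skip the tick when it returns false.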
### Phase 2: Add functionality to read and aggregate the samples

**The exact logic has yet to be decided.**
When the DB duration limits are exceeded for a worker class, this is indicative of a potential upstream issue on pgbouncer or Patroni. The throttling middleware could check the recent activity samples to determine whether a particular worker class is consuming a majority of the backends. The worker class could then be throttled (the exact throttling logic is to be discussed separately in #3815 (closed)).
Considerations:
- How far back should we aggregate? 1 minute sounds fair, or we could use an interval that matches the application rate limit.
- What proportion of the backends does a worker need to be consuming to be deemed a target for throttling? i.e. what exactly constitutes a majority?
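The phase 2 check could be sketched as follows. The 50% threshold and the shape of the input are placeholders for the open questions above, not decided values:

```ruby
# Sketch: aggregate recent samples and flag worker classes holding more
# than a threshold share of the sampled backends.
#
# samples: array of per-backend worker class names collected from recent
# pg_stat_activity snapshots, e.g. ["FooWorker", "FooWorker", "BarWorker"].
def throttle_candidates(samples, threshold: 0.5)
  return [] if samples.empty?

  counts = samples.tally
  counts.select { |_worker, n| n.to_f / samples.size > threshold }.keys
end
```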
## Summary 2024-11-20
The feature flag has been enabled: gitlab-org/gitlab#503486 (closed). Samples are collected as seen in gitlab-org/gitlab#503486 (comment 2217724885).
Note: The feature flag is a long-lived ops flag, hence the feature flag rollout issue was closed after the flag was enabled.