Add telemetry for Sidekiq resource usage limits and throttling
Telemetry that may be required, based on #3818 (moved) and https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3775#note_2089502716:
1. Concurrency limits per worker (or other limits we are using): these values can change, so ease of tracking is important.
2. Job buffer size: this helps to estimate the backlog of work.
3. Worker concurrency: to understand the efficacy of throttling.
4. Measurement values (i.e. what we use to dictate throttling): this may not be important, since we already have logs and metrics for DB duration and marginalia active count dashboards.
5. Counts of the limit being exceeded.
6. Throttling state (disabled/enabled) and the reason for throttling (which resource exceeded its limit).
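To make the list above concrete, here is a minimal Python sketch of the per-worker signals as a single in-memory record. It is illustrative only and assumes a simple "run if under the limit, otherwise buffer" model; it does not reflect the Ruby code in the linked MRs. `WorkerThrottlingTelemetry`, `ThrottleReason`, and every field and method name are hypothetical. Point 4 (measurement values) is left out, since the list notes it may already be covered by existing dashboards.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ThrottleReason(Enum):
    """Hypothetical resources whose limits could trigger throttling."""
    DB_DURATION = "db_duration"
    ACTIVE_QUERY_COUNT = "active_query_count"


@dataclass
class WorkerThrottlingTelemetry:
    """Hypothetical per-worker record of the signals in points 1-3, 5 and 6."""
    worker: str
    concurrency_limit: int                      # point 1: current limit
    buffered_jobs: int = 0                      # point 2: jobs waiting in the buffer
    running_jobs: int = 0                       # point 3: jobs currently executing
    limit_exceeded: Counter = field(default_factory=Counter)  # point 5
    throttle_reason: Optional[ThrottleReason] = None          # point 6

    @property
    def throttled(self) -> bool:
        # Point 6: throttling is "enabled" whenever a reason is recorded.
        return self.throttle_reason is not None

    def try_start_job(self) -> bool:
        """Run the job if under the concurrency limit, otherwise buffer it."""
        if self.running_jobs < self.concurrency_limit:
            self.running_jobs += 1
            return True
        self.buffered_jobs += 1
        self.limit_exceeded["concurrency_limit"] += 1
        return False

    def finish_job(self) -> None:
        """Free a slot and, if jobs are buffered, move one into it."""
        if self.running_jobs == 0:
            return
        self.running_jobs -= 1
        if self.buffered_jobs > 0 and self.running_jobs < self.concurrency_limit:
            self.buffered_jobs -= 1
            self.running_jobs += 1

    def snapshot(self) -> dict:
        """Flatten the state into labelled values, as a metrics exporter might."""
        return {
            "worker": self.worker,
            "concurrency_limit": self.concurrency_limit,
            "buffered_jobs": self.buffered_jobs,
            "running_jobs": self.running_jobs,
            "limit_exceeded_total": sum(self.limit_exceeded.values()),
            "throttled": self.throttled,
            "throttle_reason": (
                self.throttle_reason.value if self.throttle_reason else None
            ),
        }


if __name__ == "__main__":
    # Example: a limit of 2 with 3 incoming jobs buffers one job
    # and records one "limit exceeded" event.
    telemetry = WorkerThrottlingTelemetry(worker="ExampleWorker", concurrency_limit=2)
    for _ in range(3):
        telemetry.try_start_job()
    telemetry.throttle_reason = ThrottleReason.DB_DURATION
    print(telemetry.snapshot())
```

Keeping the limit, buffer size, concurrency, exceeded count, and throttling state in one record means a single snapshot answers both "how much work is waiting" and "why is this worker being held back", which is the combination the list asks to track.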
Actionable tasks
- Points 1 and 2 already exist but need refinement (#3823 (comment 2136899955)). The MR for this is gitlab-org/gitlab!168926 (merged).
- Point 3 is being added in gitlab-org/gitlab!167027 (merged).
- Point 5 can be implemented within the ApplicationRateLimiter (gitlab-org/gitlab!169655 (merged)).
- Point 6 is being implemented in gitlab-org/gitlab!168193 (closed) (no actual throttling yet).
Closing summary
We have improved telemetry in the application rate limiter and the concurrency limit middleware, which forms the foundation for understanding the impact of worker throttling.
Work for throttling metrics can be tracked alongside throttling work in Implement Sidekiq throttling middleware in gitl... (#3815 - closed).
Edited by Sylvester Chin