
Explore EWMA to determine throttling events dynamically from client-side DB duration

In #116 (closed), we use client-side DB duration as one of the metrics for deciding on a throttling event, while "Determining thresholds for database duration li..." (gitlab-com/gl-infra/observability/team#3859 - closed) discusses what the per-minute DB duration limit should be for each worker.

However, gitlab-com/gl-infra/observability/team#3859 sets a static limit based on the upper bound of each worker's DB duration during normal periods versus incident periods. This works, but in the long run it would require us to retune the limit repeatedly as traffic grows. Another limitation is that we would not catch drift in DB duration when traffic is low, because the static limit is currently set based on peak traffic.

Instead of setting a per-minute quota, we could explore implementing an Exponentially Weighted Moving Average (EWMA) that automatically catches DB duration drift for any worker by tracking its moving average (Est) and standard deviation (stddev). If the worker's latest per-minute sum of DB duration exceeds Est + X * stddev, for some multiplier X, we can treat it as a throttling event and multiplicatively decrease the concurrency limit.
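To make the idea concrete, here is a minimal sketch of the check described above, written in Python for illustration only (not the actual implementation). Everything in it is an assumption: the names `EwmaDetector` and `next_concurrency_limit`, the smoothing factor `alpha`, the multiplier `k` (the X above), the warm-up length, and the decrease factor are all hypothetical and would need tuning against real worker data.

```python
import math
from dataclasses import dataclass


@dataclass
class EwmaDetector:
    """Tracks an EWMA (Est) and exponentially weighted variance of one worker's
    per-minute DB duration. All defaults are illustrative assumptions."""

    alpha: float = 0.2   # weight given to the newest sample (assumed)
    k: float = 3.0       # the "X" in Est + X * stddev (assumed)
    warmup: int = 5      # samples to collect before flagging anything (assumed)
    mean: float = 0.0    # Est
    var: float = 0.0     # stddev ** 2
    samples: int = 0

    def observe(self, value: float) -> bool:
        """Return True if `value` exceeds Est + k * stddev, then fold it in."""
        exceeded = (
            self.samples >= self.warmup
            and value > self.mean + self.k * math.sqrt(self.var)
        )
        if self.samples == 0:
            # Seed the estimate with the first observation.
            self.mean = value
        else:
            # Standard incremental update of EW mean and EW variance.
            diff = value - self.mean
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
        self.samples += 1
        return exceeded


def next_concurrency_limit(current: int, throttled: bool,
                           factor: float = 0.8, floor: int = 1) -> int:
    """Multiplicatively decrease the limit on a throttling event (assumed factor)."""
    if not throttled:
        return current
    return max(floor, int(current * factor))


if __name__ == "__main__":
    detector = EwmaDetector()
    limit = 100
    # Hypothetical per-minute DB duration sums (seconds) for one worker.
    for duration in [10, 11, 9, 10, 12, 10, 11, 40]:
        throttled = detector.observe(duration)
        limit = next_concurrency_limit(limit, throttled)
        print(duration, throttled, limit)
```

One open design question in this sketch: the flagged sample is still folded into the moving average, so a sustained spike gradually becomes the new baseline. Whether throttled samples should be excluded or down-weighted is something we would need to decide.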

EWMA also helps smooth out the curve because it takes historical data into account, which matters since Sidekiq workloads tend to be bursty (especially at the top of the hour).

References

https://www.scs.stanford.edu/08sp-cs144/notes/l6.pdf (slide 14)

Credits to @qmnguyen0711 for the idea

Edited by Marco Gregorius