Spike: Remove or improve dependency on FlushCounterIncrementWorker/BufferedCounter for CI-related stats
Problems
Issues with FlushCounterIncrementWorker
- Sidekiq jobs can be killed or raise errors, and the Redis step is non-transactional, leading to inaccurate statistics. #438565
- Workers remained idle inside a database transaction, saturating PgBouncer connections. https://gitlab.com/gitlab-org/gitlab/-/issues/482785
- Storm of `FlushCounterIncrementWorker` jobs that blocks the throughput of other workers - https://gitlab.com/gitlab-org/gitlab/-/issues/482785.
- The issue is that the worker gets scheduled on every increment, even though by design it should only run once every 10 minutes. We have concurrency limits and deduplication, but since the worker only takes 0.5 seconds on average, it almost never actually gets deduplicated, and we get alerts when far too many jobs are blocked by the concurrency limit.
- Alerts firing that the `FlushCounterIncrementsWorker` Sidekiq worker has too many jobs being deferred by the Concurrency Limit: "The FlushCounterIncrementsWorker worker has 1.20026e+06 jobs queued in the concurrency limit queue, exceeding the threshold for over %(thresholdDuration)s."
- https://gitlab.com/gitlab-com/support/internal-requests/-/issues/26758
- The worker has been disabled in some cases and has a small lookback window; it's possible some stats need to be recalculated. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19461#note_2871435539
- The worker doesn't have clear ownership when it causes issues, and CI often gets caught supporting repo size statistics, which we are not domain experts in. I've recently moved ownership of the worker to Source Code, but it would be good for us to own the components for the stats that are in our domain. !211119 (merged)
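To make the failure mode above concrete, here is a minimal sketch of the buffered-counter pattern the problems describe: every increment bumps a Redis key and also enqueues a delayed flush job, and the flush itself is a non-atomic Redis read followed by a database write. The class and method names (`FakeRedis`, `BufferedCounterSketch`) are illustrative stand-ins, not GitLab's actual `BufferedCounter`/`FlushCounterIncrementsWorker` implementation.

```ruby
# Illustrative sketch only; a hash-backed FakeRedis stands in for Redis.
class FakeRedis
  def initialize
    @data = Hash.new(0)
  end

  def incrby(key, amount)
    @data[key] += amount
  end

  # Read the buffered value and clear it, like Redis GETDEL.
  def getdel(key)
    @data.delete(key) || 0
  end
end

class BufferedCounterSketch
  FLUSH_DELAY = 600 # seconds; the flush is intended to run every 10 minutes

  def initialize(redis, queue)
    @redis = redis
    @queue = queue
  end

  # Every increment bumps the Redis key AND enqueues a delayed flush job.
  # Because each flush job finishes in ~0.5s, deduplication rarely applies,
  # so heavy increment traffic produces a storm of near-duplicate jobs.
  def increment(key, amount)
    @redis.incrby(key, amount)
    @queue << { key: key, perform_in: FLUSH_DELAY }
  end

  # The flush drains the buffered delta into the database row. If the job
  # is killed between the GETDEL and the DB write, the delta is lost:
  # the Redis read and the SQL update are not one transaction.
  def flush(key, db_row)
    delta = @redis.getdel(key)
    db_row[:count] += delta
  end
end
```

Note how three increments enqueue three jobs even though a single flush would suffice, which is exactly the scheduling behavior the options below try to eliminate.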
Solution
Options
- Small iteration - We can consider storing the last flush time for that Redis key (in Redis) and skipping enqueuing the worker entirely if it has been enqueued in the last 10 minutes. This would mean we no longer enqueue the worker with a delay. !211815 (diffs)
- Use an OLAP database like ClickHouse for storage and retrieval, OR use Postgres as a queue and periodically flush that queue to ProjectStatistics and an audit trail in object storage.
- Scheduled cron worker to flush - Since stats can be eventually consistent, flush all the data in one worker every 10 minutes. Maintain a Redis set of projects that need flushing.
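The "small iteration" option could look roughly like the sketch below: record the last enqueue time per counter key and skip scheduling a flush if one was enqueued within the interval. This is a hypothetical illustration, not the code in !211815; the names (`ThrottledFlushScheduler`, `last_flush_key`) are invented, and a plain Hash stands in for Redis. A real implementation would want an atomic set-if-absent with TTL (e.g. Redis `SET key value NX EX 600`) so two racing increments can't both decide to enqueue.

```ruby
# Hypothetical sketch of throttling flush enqueues via a last-flush timestamp.
class ThrottledFlushScheduler
  FLUSH_INTERVAL = 600 # 10 minutes, in seconds

  # redis: Hash standing in for Redis; queue: Array standing in for Sidekiq.
  # clock is injectable so the throttle window is easy to test.
  def initialize(redis, queue, clock: -> { Time.now.to_i })
    @redis = redis
    @queue = queue
    @clock = clock
  end

  # Returns true if a flush job was enqueued, false if throttled.
  def schedule_flush(key)
    now = @clock.call
    last = @redis[last_flush_key(key)]
    return false if last && now - last < FLUSH_INTERVAL

    # NOTE: against real Redis this read-then-write should be a single
    # atomic SET ... NX EX to avoid duplicate enqueues under concurrency.
    @redis[last_flush_key(key)] = now
    @queue << key # enqueue immediately; no delayed job needed
    true
  end

  private

  def last_flush_key(key)
    "counter:#{key}:last_flush_enqueued_at"
  end
end
```

With this gate in place, a burst of increments inside one 10-minute window enqueues exactly one flush job instead of one job per increment.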