Optimize database queries in a number of sidekiq workers in order to improve their scalability


In a number of incidents on gitlab.com we noticed that, during bursts of traffic, a small subset of sidekiq workers can saturate the pgbouncer connection pool. This results in increased latency for all sidekiq jobs, higher error rates, and increased resource utilization.
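To make the saturation mechanism concrete, here is a rough back-of-the-envelope sketch based on Little's law (average concurrency = arrival rate × time in system). It assumes each job holds one pooled connection for the duration of its transactions; the function names, parameters, and numbers are hypothetical and are not taken from the incidents.

```python
def average_connections_in_use(jobs_per_second, hold_ms):
    """Little's law: average number of connections held at any moment.

    jobs_per_second: rate at which jobs start (hypothetical input).
    hold_ms: how long each job holds a pooled connection, in milliseconds.
    """
    return jobs_per_second * hold_ms / 1000


def pool_is_saturated(jobs_per_second, hold_ms, pool_size):
    """True when average demand meets or exceeds the pool size."""
    return average_connections_in_use(jobs_per_second, hold_ms) >= pool_size


if __name__ == "__main__":
    # Illustrative numbers only: the same pool copes with normal traffic
    # but saturates when a burst multiplies the job rate by ten.
    print(pool_is_saturated(jobs_per_second=200, hold_ms=50, pool_size=100))
    print(pool_is_saturated(jobs_per_second=2000, hold_ms=50, pool_size=100))
```

Under this model, demand on the pool grows with both the job rate and the time each job spends in transactions, which is why shortening transactions in the busiest workers can relieve the pool even if the database itself is not slow.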

Here are a few examples of recent incidents:

An increase in sidekiq latency can have many causes. However, this issue is concerned only with the performance of the sidekiq workers themselves. For example, in this case:

Screenshot_from_2022-03-04_12-00-12

src: grafana panel

there was no slowdown on the Postgres side: gitlab-com/gl-infra/production#6466 (comment 862424100)

Here is the total time spent in DB transactions, broken down by sidekiq worker, during the latest occurrence:

Screenshot_from_2022-03-04_12-07-15

src: thanos-query

Top workers that spent the most time in transactions:

  • PipelineProcessWorker
  • Ci::ArchiveTraceWorker
  • ProjectImportScheduleWorker
  • RepositoryUpdateMirrorWorker
  • Ci::InitialPipelineProcessWorker
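A ranking like the one above can be reproduced from per-job measurements by summing transaction time per worker. The following is a minimal sketch, assuming each record is a dictionary with hypothetical `worker` and `txn_seconds` keys; these field names are illustrative and do not come from the Grafana or Thanos sources used for the screenshots.

```python
from collections import defaultdict


def rank_workers_by_txn_time(records, top_n=5):
    """Sum transaction time per worker and return the top_n as (worker, total) pairs.

    records: iterable of dicts with hypothetical keys "worker" and "txn_seconds".
    """
    totals = defaultdict(float)
    for record in records:
        totals[record["worker"]] += record["txn_seconds"]
    # Sort by total time in transactions, largest first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Placeholder records, not real measurements.
    sample = [
        {"worker": "WorkerA", "txn_seconds": 1.5},
        {"worker": "WorkerB", "txn_seconds": 4.0},
        {"worker": "WorkerA", "txn_seconds": 3.0},
    ]
    for worker, total in rank_workers_by_txn_time(sample):
        print(worker, total)
```

Ranking by the sum rather than the per-job average highlights workers that matter most for pool usage, since a fast but very frequent worker can hold connections for more total time than a slow but rare one.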