Sidekiq high-urgency-cpu-bound job BuildQueueWorker and Ci::BuildFinishedWorker not meeting performance targets
Summary
On GitLab Dedicated, running 18.9.x and 18.10.x at the time of writing (mid-upgrade week), BuildQueueWorker and Ci::BuildFinishedWorker are consistently failing to meet the execution time requirements for urgency :high workers:
The median job execution time should be less than 1 second, and 99% of jobs should complete within 10 seconds.
Both workers are classified as high urgency and CPU bound, and run on the urgent-cpu-bound shard. However, even with no infrastructure saturation, they regularly exceed the 10-second threshold, primarily because of time spent in DB calls.
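To make the requirement quoted above concrete, here is a minimal Ruby sketch of the check for a batch of job durations. The helper names (`percentile`, `meets_high_urgency_targets?`) and the nearest-rank percentile method are my own assumptions for illustration; this is not how GitLab's monitoring computes the apdex.

```ruby
# Hypothetical helpers; thresholds mirror the urgency :high requirement quoted above.
MEDIAN_TARGET_S = 1.0   # median must be below this
P99_TARGET_S = 10.0     # 99% of jobs must finish within this

# Nearest-rank percentile over a list of durations in seconds.
# pct is an integer percentage (for example 50 or 99).
def percentile(durations, pct)
  raise ArgumentError, "no samples" if durations.empty?

  sorted = durations.sort
  rank = (pct * sorted.size).fdiv(100).ceil
  sorted[[rank, 1].max - 1]
end

# True when both the median and the 99th percentile targets are met.
def meets_high_urgency_targets?(durations)
  percentile(durations, 50) < MEDIAN_TARGET_S &&
    percentile(durations, 99) <= P99_TARGET_S
end
```

With this definition, a worker whose median is well under 1 second still fails the check as soon as more than 1% of its jobs run longer than 10 seconds, which is the pattern described here.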
Impact
Over a 7-day observation window on an example GitLab Dedicated tenant (on 18.9.x):
- ~33% of slow jobs (>10s) on the urgent-cpu-bound shard are from Ci::BuildFinishedWorker
- ~27% of slow jobs (>10s) on the urgent-cpu-bound shard are from BuildQueueWorker
This is causing repeated SLO violations and incidents on GitLab Dedicated, including: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/3293
The good news is this is not yet affecting the queueing SLO, only the execution apdex for this queue.
Recommendation
Per the worker attributes documentation, if a worker cannot meet the high urgency execution requirements, the options are:
- Redesign the worker to reduce execution time
- Split the work: a fast urgency :high portion and a slower urgency :low portion
- Reduce the urgency classification if the user-facing latency impact is acceptable
I also request a backport to release N-1 when this is resolved.
Verification
This is mostly assessed by looking at the duration fields for these Sidekiq workers in the logs. I have not yet found a PromQL query that measures this accurately.
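As a starting point for the log-based check, the sketch below groups slow jobs by worker class and reports each worker's share of slow jobs and the fraction of slow-job time spent in the database. It assumes JSON-formatted Sidekiq log lines that have already been filtered to the shard of interest and that carry the fields `class`, `job_status`, `duration_s` and `db_duration_s`; those field names, the `"done"` status value, and the helper names are assumptions to be checked against the actual log schema.

```ruby
require "json"

SLOW_THRESHOLD_S = 10.0 # jobs slower than this count as slow

# Returns the parsed entry, or nil for lines that are not valid JSON.
def parse_entry(line)
  JSON.parse(line)
rescue JSON::ParserError
  nil
end

# Returns { worker_class => { slow_jobs:, share_of_slow:, db_fraction: } }
# for completed jobs slower than SLOW_THRESHOLD_S.
def slow_job_breakdown(lines)
  stats = Hash.new { |hash, key| hash[key] = { slow_jobs: 0, total_s: 0.0, db_s: 0.0 } }

  lines.each do |line|
    entry = parse_entry(line)
    # Assumed schema: only completion events carry a final duration.
    next unless entry.is_a?(Hash) && entry["job_status"] == "done"

    duration = entry["duration_s"].to_f
    next unless duration > SLOW_THRESHOLD_S

    worker = stats[entry["class"]]
    worker[:slow_jobs] += 1
    worker[:total_s] += duration
    worker[:db_s] += entry["db_duration_s"].to_f
  end

  total_slow = stats.values.sum { |worker| worker[:slow_jobs] }
  stats.transform_values do |worker|
    {
      slow_jobs: worker[:slow_jobs],
      share_of_slow: worker[:slow_jobs].fdiv(total_slow),
      db_fraction: worker[:db_s] / worker[:total_s]
    }
  end
end
```

It could be run over an exported log file with something like `slow_job_breakdown(File.foreach("sidekiq.log"))`, where the file name is a placeholder. A high `db_fraction` for a worker would support the observation that the slow executions are dominated by DB time rather than CPU.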
