Sidekiq high-urgency, CPU-bound workers BuildQueueWorker and Ci::BuildFinishedWorker not meeting execution time targets

Summary

On GitLab Dedicated, running 18.9.x and 18.10.x at the time of writing (mid-upgrade week), BuildQueueWorker and Ci::BuildFinishedWorker are consistently failing to meet the execution time requirements for urgency :high workers:

The median job execution time should be less than 1 second, and 99% of jobs should complete within 10 seconds.
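As a rough illustration of what these two targets mean for a set of measured job durations, here is a minimal sketch. The helper names and the nearest-rank percentile method are assumptions made for this example; this is not how GitLab computes its apdex.

```ruby
# Targets quoted above for urgency :high workers, in seconds.
MEDIAN_TARGET_S = 1.0
P99_TARGET_S = 10.0

# Nearest-rank percentile over a list of durations (seconds).
# `pct` is an integer percentage such as 50 or 99.
def percentile(durations, pct)
  raise ArgumentError, "no samples" if durations.empty?

  sorted = durations.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# True when the median is under 1 second and 99% of jobs
# finish within 10 seconds.
def meets_high_urgency_targets?(durations)
  percentile(durations, 50) < MEDIAN_TARGET_S &&
    percentile(durations, 99) <= P99_TARGET_S
end
```

With 100 samples, a single job slower than 10 seconds still passes, while two such jobs push the 99th percentile over the threshold.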

Both workers are classified as high urgency and CPU bound, and run on the urgent-cpu-bound shard. However, even without any infrastructure saturation, they regularly exceed the 10-second threshold, primarily because of time spent in DB calls.

Impact

Over a 7-day observation window on an example GitLab Dedicated tenant (on 18.9.x):

  • ~33% of slow jobs (>10s) on the urgent-cpu-bound shard are from Ci::BuildFinishedWorker
  • ~27% of slow jobs (>10s) on the urgent-cpu-bound shard are from BuildQueueWorker

[Screenshot: Screenshot_2026-04-22_at_2.16.21_PM]

This is causing repeated SLO violations and incidents on GitLab Dedicated, including: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/3293

The good news is that this is not yet affecting the queueing SLO, only the execution apdex for this queue.

Recommendation

Per the worker attributes documentation, if a worker cannot meet the high urgency execution requirements, the options are:

  1. Redesign the worker to reduce execution time
  2. Split the work: a fast urgency :high portion and a slower urgency :low portion
  3. Reduce the urgency classification if the user-facing latency impact is acceptable
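Option 2 could look roughly like the sketch below. It is an illustration only. `SketchWorker` is a stand-in for the real worker DSL so that the example runs on its own, and `perform_async` runs the job inline instead of enqueueing it. The worker names and the division of work are hypothetical, not a proposal for how either worker should actually be split.

```ruby
# Stand-in for the real worker DSL so this sketch runs on its own.
# Only `urgency` and an inline `perform_async` are modelled.
module SketchWorker
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Sets the urgency when given an argument, otherwise returns it.
    def urgency(level = nil)
      @urgency = level if level
      @urgency
    end

    # The real method enqueues a Sidekiq job; here it runs inline.
    def perform_async(*args)
      new.perform(*args)
    end
  end
end

# Records which steps ran, in place of real side effects.
STEPS = []

# Hypothetical slow half: work that no user is waiting on.
class BuildFinishedDeferredWorker
  include SketchWorker
  urgency :low

  def perform(build_id)
    STEPS << [:deferred, build_id]
  end
end

# Hypothetical fast half: only the latency-sensitive step, then hand off.
class BuildFinishedUrgentWorker
  include SketchWorker
  urgency :high

  def perform(build_id)
    STEPS << [:urgent, build_id]
    BuildFinishedDeferredWorker.perform_async(build_id)
  end
end

BuildFinishedUrgentWorker.perform_async(42)
```

The point of the split is that only the first step counts against the urgency :high execution targets. Whatever moves to the urgency :low worker no longer does, at the cost of completing later.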

I also request a backport to release N-1 when this is resolved.

Verification

This is mostly assessed by looking at the duration fields for these Sidekiq workers in the logs. I have struggled to find an accurate PromQL query for this.
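For the log-based check, a minimal sketch of the kind of summary used in the Impact section might look like this. It assumes JSON-formatted Sidekiq log lines with `class`, `job_status`, `duration_s` and `db_duration_s` fields. Those field names are an assumption here and should be checked against the tenant's actual log schema. The method name and return shape are made up for this example.

```ruby
require "json"

SLOW_THRESHOLD_S = 10.0

# Returns the parsed log entry, or nil if the line is not a JSON object.
def parse_log_line(line)
  entry = JSON.parse(line)
  entry.is_a?(Hash) ? entry : nil
rescue JSON::ParserError
  nil
end

# Summarises completed jobs slower than `threshold` per worker class:
# count, share of all slow jobs, and fraction of their time spent in the DB.
def slow_job_summary(lines, threshold: SLOW_THRESHOLD_S)
  totals = Hash.new { |h, k| h[k] = { slow: 0, db_s: 0.0, total_s: 0.0 } }

  lines.each do |line|
    entry = parse_log_line(line)
    next unless entry && entry["job_status"] == "done"

    duration = entry["duration_s"].to_f
    next unless duration > threshold

    row = totals[entry["class"]]
    row[:slow] += 1
    row[:total_s] += duration
    row[:db_s] += entry["db_duration_s"].to_f
  end

  all_slow = totals.values.sum { |row| row[:slow] }
  totals.to_h do |worker, row|
    [worker, {
      slow_jobs: row[:slow],
      share_of_slow: row[:slow].to_f / all_slow,
      db_fraction: row[:db_s] / row[:total_s]
    }]
  end
end
```

The method accepts any enumerable of lines, so it can be fed from a file or from standard input. A high `db_fraction` for a worker would support the observation that DB calls dominate its slow executions.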
