Ci::DeleteObjectsWorker jobs saturate queues when object storage unavailable
### Summary

Issue raised in response to a customer ticket ([internal link](https://gitlab.zendesk.com/agent/tickets/519579)).

The issue was first noticed during/after an unplanned network outage, but we've also been able to reproduce it with simple `iptables` firewall rules.

When there is a sustained disruption to object storage, `Ci::DeleteObjectsWorker` jobs run for hours, and the longer the disruption lasts, the more `Ci::DeleteObjectsWorker` jobs are "stuck" concurrently. The affected customer experienced delays queueing and executing other jobs while Sidekiq had multiple concurrent busy `Ci::DeleteObjectsWorker` jobs. In some cases, the jobs completed only after 6 hours. This occurred when just 100 artifacts were expired, so the batch to delete (`ready_for_destruction`) was only 100 records.

When I reproduced this scenario in my test environment, I saw up to 17 "busy" jobs, the vast majority of which were `Ci::DeleteObjectsWorker` jobs created hours earlier.

The worker should fail more quickly when the underlying storage is unavailable, or we should prevent multiple concurrent `Ci::DeleteObjectsWorker` jobs from running.

### Steps to reproduce

1. Configure object storage.
2. Create a pipeline schedule that runs frequently and creates job artifacts.
3. Wait until there are at least 100 job artifacts.
4. Manually expire the first 100 artifacts via the Rails console (see the sketch after this list).
5. Simulate an outage to object storage. For example, for a GCP bucket I created an `iptables` firewall rule to drop traffic to `storage.googleapis.com`.
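For step 4, a minimal Rails console sketch. It assumes expiry is driven by the `expire_at` column on `Ci::JobArtifact`; the ordering and scope here are illustrative, so narrow them to a test project on anything other than a disposable instance:

```ruby
# sudo gitlab-rails console
# Mark the oldest 100 job artifacts as already expired so the artifact
# cleanup machinery picks them up for deletion on its next run.
Ci::JobArtifact.order(id: :asc).limit(100).each do |artifact|
  artifact.update!(expire_at: 1.hour.ago)
end
```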
### What is the current _bug_ behavior?

Multiple concurrent `Ci::DeleteObjectsWorker` jobs are busy for many hours.

### What is the expected _correct_ behavior?

Unsure. Per the summary: either the worker fails quickly when object storage is unreachable, or the number of concurrent `Ci::DeleteObjectsWorker` jobs is bounded.

### Relevant logs and/or screenshots

![Screenshot_2024-04-25_at_6.49.30_AM](/uploads/6b87e5c6dd8cb8f9654c94f6a4f02dd4/Screenshot_2024-04-25_at_6.49.30_AM.png)

![Screenshot_2024-04-25_at_6.49.42_AM](/uploads/8ea0561192f6d45e0858740f39718d36/Screenshot_2024-04-25_at_6.49.42_AM.png)

Durations (in seconds) of `Ci::DeleteObjectsWorker` jobs that eventually completed during the simulated outage:

```
root@tmike-docker:/var/log/gitlab/sidekiq# grep "Ci::DeleteObjectsWorker" current | grep '"job_status":"done"' | jq -rc '.duration_s' | sort -rn | head -n15
21291.187087
17716.152873
15065.002304
12403.932634
9530.290858
7853.82631
6933.649235
5932.758758
4258.806258
3287.853918
2352.764378
1615.532193
693.260357
10.959507
0.3395
```

### Workaround

Add a dedicated Sidekiq queue for `Ci::DeleteObjectsWorker` to avoid Sidekiq queue saturation (see the sketch below).
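A minimal `gitlab.rb` sketch of the workaround for an Omnibus install. This is an assumption-laden illustration: it assumes Sidekiq routing rules are supported on your version, that `ci_delete_objects` is the worker's queue-form name, and `delete_objects` is just an arbitrary queue name chosen here; verify against the Sidekiq routing documentation for your release:

```ruby
# /etc/gitlab/gitlab.rb -- apply with `sudo gitlab-ctl reconfigure`

# Send Ci::DeleteObjectsWorker to its own queue; everything else stays on 'default'.
sidekiq['routing_rules'] = [
  ['name=ci_delete_objects', 'delete_objects'],
  ['*', 'default']
]

# Run a dedicated Sidekiq process for that queue so stuck delete jobs
# cannot occupy the threads serving the shared queues.
sidekiq['queue_groups'] = [
  'delete_objects',
  'default,mailers'
]
```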
### Possible fixes

As noted in the summary: make the worker fail quickly when the underlying storage is unavailable, and/or prevent multiple concurrent `Ci::DeleteObjectsWorker` jobs from running. A rough sketch of the fail-fast direction follows.
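This is purely illustrative, not GitLab's actual implementation: `destroy_batch_with_deadline`, `delete_remote_file`, and the 30-second budget are hypothetical, and production code would more likely set open/read timeouts on the object storage client than use `Timeout.timeout`:

```ruby
require 'timeout'

BATCH_TIME_BUDGET = 30 # seconds; hypothetical per-batch deadline

def destroy_batch_with_deadline(batch)
  Timeout.timeout(BATCH_TIME_BUDGET) do
    batch.each { |object| object.delete_remote_file } # hypothetical helper
  end
rescue Timeout::Error
  # Fail the job quickly so Sidekiq retries with backoff, instead of a
  # busy thread hanging for hours while storage is unreachable.
  raise
end
```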