Expired build artifacts are not deleted fast enough
Summary
We run a fairly large self-hosted GitLab instance and noticed a substantial increase in storage usage while the "Keep latest artifacts" setting was enabled (the relevant projects have a workflow that produces lots of branches and tags).
We then noticed that, even after disabling this setting and unlocking artifacts manually as suggested in #241026 (comment 465410491), storage usage did not decrease.
After some analysis we found that the logic in ExpireBuildArtifactsWorker cannot delete expired artifacts quickly enough.
Quoting a colleague:
It runs a query with 7 million results, which takes a long time, then filters for unlocked artifacts (a count that is actually 0 in most cases and close to 0 in all cases), deletes the filtered ones, and loops for at most 5 minutes. Then it starts over, so items at the end of the list (and your artifacts expired a long time ago, so they are at the end of the list) never get deleted...
A quick check shows that we delete fewer than 500 artifacts per run and have millions to go, so I'm afraid the backlog is probably growing rather than shrinking:
    Ci::JobArtifact.expired_before(start_at).each_batch(of: 100, column: :expire_at, order: :desc) do |relation, index|
      artifacts = relation.unlocked
      sum += artifacts.count()
      break if Time.current > start_at + 5.minutes
    end
    => nil
    sum
    => 301
    Ci::JobArtifact.expired_before(Time.now).count()
    => 6841908
    Ci::JobArtifact.expired_before(Time.now).unlocked.count()
    => 6305128
I understand the idea behind this optimization, but in our setup this will probably never be able to clear the queue, causing unnecessarily high storage demand and cost.
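To illustrate why the order of filtering and batching matters, here is a toy model in plain Ruby. It is not GitLab code: `Artifact`, `sample_artifacts`, and both `expire_*` methods are hypothetical names, `max_batches` stands in for the 5-minute limit, and the assumption that locked artifacts cluster at the head of the list is ours, inferred from the console output above.

```ruby
require "set"

# Toy stand-in for a job artifact; not the real Ci::JobArtifact model.
Artifact = Struct.new(:id, :locked)

# Removes the given ids from the list and returns how many were removed.
def delete_ids!(artifacts, ids)
  artifacts.reject! { |artifact| ids.include?(artifact.id) }
  ids.size
end

# Walks batches from the head of the list and filters inside each batch.
# max_batches is a stand-in for the time limit of one worker run.
def expire_filtering_after_batching!(artifacts, batch_size:, max_batches:)
  ids = Set.new
  artifacts.each_slice(batch_size).first(max_batches).each do |batch|
    batch.each { |artifact| ids << artifact.id unless artifact.locked }
  end
  delete_ids!(artifacts, ids)
end

# Filters first, so every batch consists only of deletable artifacts.
def expire_filtering_before_batching!(artifacts, batch_size:, max_batches:)
  ids = Set.new
  deletable = artifacts.reject(&:locked)
  deletable.each_slice(batch_size).first(max_batches).each do |batch|
    batch.each { |artifact| ids << artifact.id }
  end
  delete_ids!(artifacts, ids)
end

# Assumed layout: 600 locked artifacts at the head of the list
# (most recently expired), 400 unlocked ones behind them.
def sample_artifacts
  (1..1000).map { |id| Artifact.new(id, id <= 600) }
end
```

With `batch_size: 100` and `max_batches: 5`, the first strategy never gets past the locked head of the list and deletes nothing, no matter how often it runs, while the second one removes all 400 unlocked artifacts in a single run. Whether the real query can be reordered this cheaply depends on how the lock state is stored, which this sketch does not model.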
Steps to reproduce
Example Project
What is the current bug behavior?
Expired artifacts do not get deleted quickly enough.
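As a rough back-of-the-envelope estimate of what "not quickly enough" means here, the following sketch uses the numbers from the console output in the summary. The hourly schedule is an assumption made only for illustration (the issue does not state how often the worker runs), and the estimate ignores artifacts that expire in the meantime, so it is optimistic.

```ruby
# Number of worker runs needed to clear a backlog at a fixed deletion rate.
def runs_to_drain(backlog:, deleted_per_run:)
  (backlog + deleted_per_run - 1) / deleted_per_run # integer ceiling division
end

BACKLOG         = 6_305_128 # expired and unlocked artifacts, from the console output
DELETED_PER_RUN = 500       # upper bound from the quick check ("<500 per run")
INTERVAL_HOURS  = 1         # ASSUMPTION: one worker run per hour

RUNS = runs_to_drain(backlog: BACKLOG, deleted_per_run: DELETED_PER_RUN)
DAYS = RUNS * INTERVAL_HOURS / 24.0

puts "runs needed: #{RUNS}"
puts "days needed at one run per #{INTERVAL_HOURS}h: #{DAYS.round}"
```

Under these assumptions the backlog needs 12611 runs, which is well over a year of hourly runs even if nothing new expires.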
What is the expected correct behavior?
Expired artifacts are deleted within at most a day.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)
Possible fixes
One option would be to give the deletion service more time and adapt the worker frequency accordingly. This could be a configuration option.
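A minimal sketch of what a configurable time budget could look like, in plain Ruby rather than actual GitLab code. The method name `expire_in_batches` and the `timeout_seconds` and `batch_size` parameters are hypothetical, and the queue is just an array standing in for the expired artifacts.

```ruby
# Default clock: monotonic time in seconds, so wall-clock changes do not matter.
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

# Deletes items from the front of the queue in batches until the queue is
# empty or the time budget is used up. Returns the number of deleted items.
# timeout_seconds would come from a (hypothetical) admin setting instead of
# a hard-coded 5 minutes.
def expire_in_batches(queue, timeout_seconds:, batch_size:, clock: MONOTONIC_CLOCK)
  deadline = clock.call + timeout_seconds
  deleted = 0

  until queue.empty?
    break if clock.call >= deadline

    deleted += queue.shift(batch_size).size # stand-in for the real deletion
  end

  deleted
end
```

The worker's schedule would need to be adjusted together with `timeout_seconds` so that one run finishes before the next one starts. The `clock` parameter is injectable only to make the loop easy to test with a fake clock.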