Offline garbage collection brings container registry down for some users
Summary
There have been a few reports where offline garbage collection is both throttled significantly (by AWS) and brings the container registry down.
In both instances, `maxrequestspersecond` had to be lowered to around 20, which suggests that the current implementation of the exponential backoff / `maxrequestspersecond` limit isn't working as expected.
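For context, this is roughly the behavior one would expect from a working rate limiter: each throttled request backs off exponentially (with jitter) up to a cap, so the effective request rate drops instead of hammering the backend. The sketch below is illustrative only, not the registry's actual implementation; all names and values are hypothetical.

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Exponential backoff with full jitter: the delay before retry n is
    drawn uniformly from [0, min(cap, base * 2**n)].

    Hypothetical parameters; the registry's real values may differ."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

# The ceiling doubles each attempt (0.5s, 1s, 2s, 4s, ...) until it
# reaches the 30s cap, so sustained throttling stretches retries out
# rather than keeping the request rate constant.
```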
```
{"content_type":"application/json","correlation_id":"01GJ6EZBYXK4HE66096BGW1NHZ","duration_ms":0,"host":"127.0.0.1:5000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:35636","remote_ip":"127.0.0.1","status":200,"system":"http","time":"2022-11-18T23:50:29.213+01:00","ttfb_ms":0,"uri":"/gitlab/v1/","user_agent":"GitLab/15.5.2-ee","written_bytes":2}
2022-11-18_22:50:29.22136 time="2022-11-18T23:50:29.221+01:00" level=info msg="authorized request" auth_user_name= correlation_id=01GJ6EZBZ54NAJF67NQR9W8J7R go_version=go1.17.13 version=v3.57.0-gitlab
2022-11-18_22:50:29.22147 time="2022-11-18T23:50:29.221+01:00" level=panic msg="runtime error: invalid memory address or nil pointer dereference"
2022-11-18_22:50:29.22152 {"content_type":"","correlation_id":"01GJ6EZBZ54NAJF67NQR9W8J7R","duration_ms":0,"host":"127.0.0.1:5000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:35638","remote_ip":"127.0.0.1","status":0,"system":"http","time":"2022-11-18T23:50:29.221+01:00","uri":"/gitlab/v1/repositories/<repo>/tags/list/?n=1000","user_agent":"GitLab/15.5.2-ee","written_bytes":0}
2022-11-18_22:50:29.22170 2022/11/18 23:50:29 http: panic serving 127.0.0.1:35638: &{0xc00003e0e0 map[] 2022-11-18 23:50:29.221461217 +0100 CET m=+4.031714822 panic <nil> runtime error: invalid memory address or nil pointer dereference <nil> <nil> }
```
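For reference, the workaround applied in both reports can be sketched as the following registry `config.yml` excerpt. This assumes the `maxrequestspersecond` parameter of the registry's S3 storage driver; the bucket and region values are placeholders, not from the affected installations.

```yaml
# Registry config.yml excerpt (sketch): throttle outbound S3 calls.
# maxrequestspersecond: 20 is the value that reportedly kept the
# registry responsive; bucket/region below are placeholders.
storage:
  s3:
    bucket: my-registry-bucket
    region: us-east-1
    maxrequestspersecond: 20
```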
Steps to reproduce
- A rather large container registry is needed (above 20 TB in both cases so far). It needs to use S3 as a storage backend (maybe? this may be a separate issue altogether).
- During offline garbage collection, the container registry becomes unavailable.
What is the current bug behavior?
`gitlab-ctl status` reports the registry as down.
What is the expected correct behavior?
`gitlab-ctl status` should report the service as up.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Results of GitLab application Check
- GitLab 15.5.2-ee
- Registry version v3.57.0-gitlab
- S3 backend storage, 20+ TB bucket size