Offline garbage collection brings container registry down for some users
Summary
There have been a few reports where offline garbage collection is both throttled significantly (by AWS) and brings the container registry down.
In both instances, `maxrequestspersecond` had to be lowered to around 20, which suggests that the current implementation of the exponential backoff / `maxrequestspersecond` limit isn't working as expected.
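For context, this is roughly the behavior one would expect from a working rate limiter: each throttled request backs off exponentially (with jitter) up to a cap, so the effective request rate drops instead of hammering the backend. The sketch below is illustrative only, not the registry's actual implementation; all names and values are hypothetical.

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6):
    """Exponential backoff with full jitter: the delay before retry n is
    drawn uniformly from [0, min(cap, base * 2**n)].

    Hypothetical parameters; the registry's real values may differ."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, ceiling))
    return delays

# The ceiling doubles each attempt (0.5s, 1s, 2s, 4s, ...) until it
# reaches the 30s cap, so sustained throttling stretches retries out
# rather than keeping the request rate constant.
```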
```
{"content_type":"application/json","correlation_id":"01GJ6EZBYXK4HE66096BGW1NHZ","duration_ms":0,"host":"127.0.0.1:5000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:35636","remote_ip":"127.0.0.1","status":200,"system":"http","time":"2022-11-18T23:50:29.213+01:00","ttfb_ms":0,"uri":"/gitlab/v1/","user_agent":"GitLab/15.5.2-ee","written_bytes":2}
2022-11-18_22:50:29.22136 time="2022-11-18T23:50:29.221+01:00" level=info msg="authorized request" auth_user_name= correlation_id=01GJ6EZBZ54NAJF67NQR9W8J7R go_version=go1.17.13 version=v3.57.0-gitlab
2022-11-18_22:50:29.22147 time="2022-11-18T23:50:29.221+01:00" level=panic msg="runtime error: invalid memory address or nil pointer dereference"
2022-11-18_22:50:29.22152 {"content_type":"","correlation_id":"01GJ6EZBZ54NAJF67NQR9W8J7R","duration_ms":0,"host":"127.0.0.1:5000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:35638","remote_ip":"127.0.0.1","status":0,"system":"http","time":"2022-11-18T23:50:29.221+01:00","uri":"/gitlab/v1/repositories/<repo>/tags/list/?n=1000","user_agent":"GitLab/15.5.2-ee","written_bytes":0}
2022-11-18_22:50:29.22170 2022/11/18 23:50:29 http: panic serving 127.0.0.1:35638: &{0xc00003e0e0 map[] 2022-11-18 23:50:29.221461217 +0100 CET m=+4.031714822 panic <nil> runtime error: invalid memory address or nil pointer dereference <nil> <nil> }
```
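For reference, the workaround applied in both reports can be sketched as the following registry `config.yml` excerpt. This assumes the `maxrequestspersecond` parameter of the registry's S3 storage driver; the bucket and region values are placeholders, not from the affected installations.

```yaml
# Registry config.yml excerpt (sketch): throttle outbound S3 calls.
# maxrequestspersecond: 20 is the value that reportedly kept the
# registry responsive; bucket/region below are placeholders.
storage:
  s3:
    bucket: my-registry-bucket
    region: us-east-1
    maxrequestspersecond: 20
```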
Steps to reproduce
- A rather large container registry is needed (above 20 TB in both cases so far). It needs to use S3 as a storage backend (maybe? this may be a separate issue altogether).
- During offline garbage collection, the container registry becomes unavailable.
What is the current bug behavior?
`gitlab-ctl status` reports the registry as down.
What is the expected correct behavior?
`gitlab-ctl status` should report the service as up.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Results of GitLab application Check
- GitLab 15.5.2-ee
- Registry version v3.57.0-gitlab
- S3 backend storage, 20+ TB bucket size