Container repository cleanup causes heavy CPU usage by registry process for multi-arch manifests without actually deleting anything
Summary
When container images exist that are not compatible with ContainerExpirationPolicies::CleanupContainerRepositoryWorker, the cleanup worker sends hundreds of requests per second to the registry process, causing it to consume an inordinate amount of CPU time for hours or days without performing any useful work.
Steps to reproduce
- Create a project that uses a multi-arch container image. Example files for this build are in the repository linked below.
- Turn on weekly cleanup of the registry.
- Create hundreds of builds (over time; or all at once for testing).
- Wait for them to be old enough to be cleaned up and for the cleanup process to run.
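To avoid waiting for the weekly policy while reproducing, a similar bulk tag deletion can also be triggered on demand through the documented container registry API. This is a hedged sketch only (the host, token, and IDs are placeholders, and it may not exercise exactly the same code path as the policy worker):

# Bulk-delete old tags for one container repository; replace the placeholders before running.
curl --request DELETE \
  --header "PRIVATE-TOKEN: <your_access_token>" \
  --data 'name_regex_delete=.*' --data 'keep_n=1' --data 'older_than=1d' \
  "https://gitlab.example.com/api/v4/projects/<project_id>/registry/repositories/<repository_id>/tags"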
Example Project
https://gitlab.com/kohenkatz/test-bug-cleanup-multi-arch
I don't have any runners configured for this project, because the runners on my self-hosted GitLab instance are "privileged" and I have not found a good way to reproduce this with non-privileged runners. (I have not gotten Buildah to work on non-privileged runners, and I have not yet found a good way to build multi-arch images in GitLab CI with any other tool.)
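For reference, here is a minimal sketch of the kind of Buildah invocation that produces a multi-arch manifest (an OCI image index), which is what the cleanup worker then trips over. This is not the exact job from the example project; the image name, platforms, and registry host are placeholders:

# Build once per listed platform and collect the per-arch images into a manifest list / OCI index.
buildah build --platform linux/amd64,linux/arm64 --manifest mytest-index .
# Push the index plus all referenced per-arch images to the project's container registry.
buildah manifest push --all mytest-index \
  docker://registry.example.com/examplegroup/exampleproject/snapshots/mytest:latest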
What is the current bug behavior?
The /opt/gitlab/embedded/bin/registry serve ./config.yml command runs for many days, consuming many hours of CPU time and a consistent 20% of my six-core VM (AMD EPYC 7713), with frequent spikes above that. (Even 20% is enough to cause significant slowdowns when other requests come in.) Sidekiq CPU usage is also much higher than normal, adding another 8-20% on top of that.
The registry logs (logs/registry/current) are filled with lines like this, coming in at a rate of around 50 times per second:
2024-01-24_02:34:59.95411 time="2024-01-24T02:34:59.954Z" level=info msg="router info" config_http_addr="127.0.0.1:5000" config_http_host= config_http_net= config_http_prefix= config_http_relative_urls=false correlation_id=01HMWN47ZHT83ZJN6YWAXDKN5K go_version=go1.20.12 method=GET path=/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a root_repo=examplegroup router=gorilla/mux vars_name=examplegroup/exampleproject/snapshots/mytest vars_reference=8167230a version=v3.88.0-gitlab
2024-01-24_02:34:59.95535 time="2024-01-24T02:34:59.955Z" level=info msg="authorized request" auth_project_paths="[examplegroup/exampleproject]" auth_user_name= auth_user_type= correlation_id=01HMWN47ZHT83ZJN6YWAXDKN5K go_version=go1.20.12 root_repo=examplegroup vars_name=examplegroup/exampleproject/snapshots/mytest vars_reference=8167230a version=v3.88.0-gitlab
2024-01-24_02:35:00.01949 time="2024-01-24T02:35:00.012Z" level=error msg="OCI index found, but accept header does not support OCI indexes" auth_project_paths="[examplegroup/exampleproject]" auth_user_name= auth_user_type= code=MANIFEST_UNKNOWN correlation_id=01HMWN47ZHT83ZJN6YWAXDKN5K detail="<nil>" error="manifest unknown: OCI index found, but accept header does not support OCI indexes" go_version=go1.20.12 host="127.0.0.1:5000" method=GET remote_addr="127.0.0.1:34798" root_repo=examplegroup uri=/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a user_agent=GitLab/16.8.0 vars_name=examplegroup/exampleproject/snapshots/mytest vars_reference=8167230a version=v3.88.0-gitlab
2024-01-24_02:35:00.01951 {"content_type":"application/json","correlation_id":"01HMWN47ZHT83ZJN6YWAXDKN5K","duration_ms":59,"host":"127.0.0.1:5000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:34798","remote_ip":"127.0.0.1","status":404,"system":"http","time":"2024-01-24T02:35:00.013Z","ttfb_ms":58,"uri":"/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a","user_agent":"GitLab/16.8.0","written_bytes":117}
At the same time, the GitLab log (logs/gitlab-rails/application_json.log) contains hundreds of lines like this:
{"severity":"INFO","time":"2024-01-24T02:41:56.149Z","meta.caller_id":"ContainerExpirationPolicies::CleanupContainerRepositoryWorker","correlation_id":"357b0a53ace7557dc18ab80c29acd50d","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","container_repository_id":95,"container_repository_path":"examplegroup/exampleproject/snapshots/mytest","project_id":79,"third_party_cleanup_tags_service":true}
(Yes, I know these two log snippets are not from the same timestamp; they are examples pulled from many thousands of lines of logs.)
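The registry error above is the key symptom: the manifest GET coming from Rails does not advertise any OCI image index media type in its Accept header, so the registry answers 404 with code MANIFEST_UNKNOWN and the worker keeps retrying. For comparison, here is a hedged sketch of a request against the same endpoint that does advertise index support (the Bearer token is a placeholder, and this is not the exact request the worker sends):

# $TOKEN must be a valid registry bearer token for this repository; host, path, and tag are taken from the log above.
curl -sS -o /dev/null -w '%{http_code}\n' \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Accept: application/vnd.oci.image.index.v1+json, application/vnd.docker.distribution.manifest.list.v2+json' \
  'http://127.0.0.1:5000/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a'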
The cleanup never actually happens for these images, so they just keep accumulating. With only around 450 tags that should be cleaned up, the cleanup process has been running for an entire day and has not deleted anything yet, and the registry log file for just the last 3 hours already contains over 400,000 instances of the message "OCI index found, but accept header does not support OCI indexes". Searching for a single specific tag shows that it has been requested over 1,700 times in the same 3-hour period. (I used grep 'but accept header does not support OCI indexes' gitlab/logs/registry/current | grep '88f4223d' | grep '2024-01-24' | wc -l; I'm sure there is a cleaner way than piping through multiple grep commands, but I built this up bit by bit while troubleshooting. A single-pass version is sketched below.)
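A hedged single-pass equivalent of that count, assuming the tag appears later in the same log line (as it does in the error line quoted above):

# Count how often tag 88f4223d hit the OCI-index error on 2024-01-24.
grep -c '2024-01-24.*but accept header does not support OCI indexes.*88f4223d' gitlab/logs/registry/current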
What is the expected correct behavior?
The cleanup job should either delete these images or skip them (if it is not yet compatible with them). It should not keep trying hundreds of times to retrieve the same information.
Relevant logs and/or screenshots
Included in the "bug behavior" section above.
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
System information
System:
Current User:   git
Using RVM:      no
Ruby Version:   3.1.4p223
Gem Version:    3.4.22
Bundler Version:2.5.4
Rake Version:   13.0.6
Redis Version:  7.0.14
Sidekiq Version:7.1.6
Go Version:     unknown

GitLab information
Version:        16.8.0
Revision:       16896fceb7a
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     PostgreSQL
DB Version:     14.9
URL:            https://REDACTED
HTTP Clone URL: https://REDACTED/some-group/some-project.git
SSH Clone URL:  git@REDACTED:some-group/some-project.git
Using LDAP:     no
Using Omniauth: yes
Omniauth Providers:

GitLab Shell
Version:        14.32.0
Repository storages:
- default:      unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path:      /opt/gitlab/embedded/service/gitlab-shell

Gitaly
- default Address:      unix:/var/opt/gitlab/gitaly/gitaly.socket
- default Version:      16.8.0
- default Git Version:  2.42.0
Results of GitLab application Check
Expand for output related to the GitLab application check
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ... GitLab Shell version >= 14.32.0 ? ... OK (14.32.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/1
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ... LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab App ...
Database config exists? ... yes
Tables are truncated? ... skipped
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Cable config exists? ... yes
Resque config exists? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... yes
Systemd unit files or init script exist? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Systemd unit files or init script up-to-date? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Projects have namespace: ... 6/1 ... yes 6/2 ... yes 6/3 ... yes 6/4 ... yes 6/5 ... yes 6/6 ... yes 6/7 ... yes 15/8 ... yes 15/9 ... yes 6/10 ... yes 6/12 ... yes 6/14 ... yes 6/18 ... yes 6/19 ... yes 51/23 ... yes 6/24 ... yes 15/25 ... yes 15/26 ... yes 6/27 ... yes 27/28 ... yes 28/29 ... yes 27/31 ... yes 27/32 ... yes 27/33 ... yes 34/34 ... yes 34/36 ... yes 6/37 ... yes 6/38 ... yes 36/39 ... yes 34/40 ... yes 28/41 ... yes 34/43 ... yes 27/44 ... yes 53/45 ... yes 27/46 ... yes 27/47 ... yes 6/48 ... yes 34/49 ... yes 27/50 ... yes 27/51 ... yes 34/52 ... yes 51/55 ... yes 51/56 ... yes 51/57 ... yes 13/58 ... yes 27/59 ... yes 51/60 ... yes 54/61 ... yes 27/62 ... yes 27/63 ... yes 34/64 ... yes 34/66 ... yes 54/67 ... yes 34/68 ... yes 27/69 ... yes 28/70 ... yes 27/71 ... yes 27/72 ... yes 54/73 ... yes 54/74 ... yes 54/75 ... yes 27/76 ... yes 28/77 ... yes 70/79 ... yes 70/80 ... yes 27/81 ... yes 51/82 ... yes 51/83 ... yes 34/85 ... yes 51/87 ... yes 27/88 ... yes 54/89 ... yes 70/90 ... yes 54/91 ... yes 51/92 ... yes 92/93 ... yes 20/94 ... yes 70/95 ... yes 70/96 ... yes 70/97 ... yes 70/98 ... yes 20/99 ... yes
Redis version >= 6.0.0? ... yes
Ruby version >= 3.0.6 ? ... yes (3.1.4)
Git user has default SSH configuration? ... yes
Active users: ... 26
Is authorized keys file accessible? ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Checking GitLab App ... Finished
Checking GitLab subtasks ... Finished
Possible fixes
Likely related to #362914, though the symptoms are different