Container repository cleanup causes heavy CPU usage by registry process for multi-arch manifests without actually deleting anything


Summary

When a repository contains container images that ContainerExpirationPolicies::CleanupContainerRepositoryWorker cannot handle (here, multi-arch images stored as OCI indexes), the cleanup worker sends hundreds of requests per second to the registry process, which then consumes an inordinate amount of CPU for many hours or days without performing any useful work.

Steps to reproduce

  1. Create a project that uses a multi-arch container image. Example files for this build are in the repository linked below; a rough sketch of equivalent buildah commands appears after this list.
  2. Turn on weekly cleanup of the registry.
  3. Create hundreds of builds (over time; or all at once for testing).
  4. Wait for them to be old enough to be cleaned up and for the cleanup process to run.
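
For reference, the build in step 1 boils down to roughly the following. This is a minimal sketch using plain buildah commands with placeholder image and registry names, not the project's actual CI definition (that is in the repository linked below); it assumes a privileged environment and qemu user-mode emulation for the non-native architecture.

# Build one image per architecture and combine them into a single OCI index
# (placeholder names; cross-arch builds need qemu and a privileged runner):
buildah manifest create mytest-index
buildah build --platform linux/amd64,linux/arm64 --manifest mytest-index .
# Push the index together with the per-arch manifests it references:
buildah manifest push --all mytest-index \
    docker://registry.example.com/examplegroup/exampleproject/snapshots/mytest:latest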

Example Project

https://gitlab.com/kohenkatz/test-bug-cleanup-multi-arch

I don't have any runners configured for this project, because the runners on my self-hosted GitLab instance are "privileged" and I have not figured out a good way to reproduce this with non-privileged runners: I have not gotten Buildah to work unprivileged, and I have not yet found another tool that builds multi-arch images in GitLab CI.

What is the current bug behavior?

The /opt/gitlab/embedded/bin/registry serve ./config.yml process runs like this for many days, accumulating many hours of CPU time. It holds a consistent 20% of my six-core VM (AMD EPYC 7713) and often spikes higher; even 20% is enough to cause significant slowdowns when other requests come in. Sidekiq CPU usage, much higher than normal, adds another 8-20% on top of that.
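
For anyone reproducing this, the figures above come from ordinary process monitoring; a rough sketch (the pgrep pattern is an assumption about how the registry process appears in the process list):

# Show the registry process's lifetime-average %CPU, elapsed time, and total CPU time:
ps -o pid,%cpu,etime,cputime,args -p "$(pgrep -f 'registry serve' | head -n 1)"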

The registry logs (logs/registry/current) are filled with lines like this, coming in at a rate of around 50 times per second:

2024-01-24_02:34:59.95411 time="2024-01-24T02:34:59.954Z" level=info msg="router info" config_http_addr="127.0.0.1:5000" config_http_host= config_http_net= config_http_prefix= config_http_relative_urls=false correlation_id=01HMWN47ZHT83ZJN6YWAXDKN5K go_version=go1.20.12 method=GET path=/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a root_repo=examplegroup router=gorilla/mux vars_name=examplegroup/exampleproject/snapshots/mytest vars_reference=8167230a version=v3.88.0-gitlab
2024-01-24_02:34:59.95535 time="2024-01-24T02:34:59.955Z" level=info msg="authorized request" auth_project_paths="[examplegroup/exampleproject]" auth_user_name= auth_user_type= correlation_id=01HMWN47ZHT83ZJN6YWAXDKN5K go_version=go1.20.12 root_repo=examplegroup vars_name=examplegroup/exampleproject/snapshots/mytest vars_reference=8167230a version=v3.88.0-gitlab
2024-01-24_02:35:00.01949 time="2024-01-24T02:35:00.012Z" level=error msg="OCI index found, but accept header does not support OCI indexes" auth_project_paths="[examplegroup/exampleproject]" auth_user_name= auth_user_type= code=MANIFEST_UNKNOWN correlation_id=01HMWN47ZHT83ZJN6YWAXDKN5K detail="<nil>" error="manifest unknown: OCI index found, but accept header does not support OCI indexes" go_version=go1.20.12 host="127.0.0.1:5000" method=GET remote_addr="127.0.0.1:34798" root_repo=examplegroup uri=/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a user_agent=GitLab/16.8.0 vars_name=examplegroup/exampleproject/snapshots/mytest vars_reference=8167230a version=v3.88.0-gitlab
2024-01-24_02:35:00.01951 {"content_type":"application/json","correlation_id":"01HMWN47ZHT83ZJN6YWAXDKN5K","duration_ms":59,"host":"127.0.0.1:5000","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:34798","remote_ip":"127.0.0.1","status":404,"system":"http","time":"2024-01-24T02:35:00.013Z","ttfb_ms":58,"uri":"/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a","user_agent":"GitLab/16.8.0","written_bytes":117}
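
The 404 above is a content-negotiation failure: the tag resolves to an OCI image index, and (per the error message) the Accept header on the request from GitLab does not include the OCI index media type. A minimal sketch with curl, assuming a valid registry bearer token for the repository ($REGISTRY_TOKEN is a placeholder), shows the same endpoint answering differently depending on that header:

# Without an index media type in Accept, the registry answers 404 MANIFEST_UNKNOWN
# (the exact Accept value GitLab sends may differ; the point is that no index type is listed):
curl -sS -H "Authorization: Bearer $REGISTRY_TOKEN" \
     -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
     http://127.0.0.1:5000/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a

# Adding the OCI index media type makes the same request succeed:
curl -sS -H "Authorization: Bearer $REGISTRY_TOKEN" \
     -H "Accept: application/vnd.oci.image.index.v1+json" \
     http://127.0.0.1:5000/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a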

At the same time, the GitLab log (logs/gitlab-rails/application_json.log) contains hundreds of lines like this:

{"severity":"INFO","time":"2024-01-24T02:41:56.149Z","meta.caller_id":"ContainerExpirationPolicies::CleanupContainerRepositoryWorker","correlation_id":"357b0a53ace7557dc18ab80c29acd50d","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","container_repository_id":95,"container_repository_path":"examplegroup/exampleproject/snapshots/mytest","project_id":79,"third_party_cleanup_tags_service":true}

(Yes, I know the two log snippets above are not from the same timestamp; they are examples pulled from many thousands of lines of logs.)

The cleanup never actually happens for these images, so they just keep accumulating. With only around 450 tags that should be cleaned up, the cleanup process has been running for an entire day and has not deleted anything yet. Despite there being only around 450 tags, the registry log for just the last 3 hours already contains over 400,000 instances of the message "OCI index found, but accept header does not support OCI indexes". Searching for one specific tag shows it has been requested over 1,700 times in that same 3-hour period. (I used grep 'but accept header does not support OCI indexes' gitlab/logs/registry/current | grep '88f4223d' | grep '2024-01-24' | wc -l; I'm sure there is a tidier way than piping through multiple grep commands, but I built this up bit by bit while troubleshooting. A single-pass version is sketched after this paragraph.)
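
For the record, one pass over the log gives the same count; this is just a sketch with the same hard-coded date, tag, and message (adjust as needed):

# Count how often one tag hit the "accept header" error on one day, in a single scan:
awk '/2024-01-24/ && /88f4223d/ && /but accept header does not support OCI indexes/' \
    gitlab/logs/registry/current | wc -l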

What is the expected correct behavior?

The cleanup job should either delete these images or, if they are not yet supported, skip them. It should not keep requesting the same information hundreds of times.
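
For comparison, the standard distribution API already exposes enough information to make either choice. The following is a minimal sketch (placeholder token, and assuming manifest deletion is enabled on the registry): a HEAD request with the index media type identifies the tag as an OCI index and returns its digest, after which a compatible client could either record "skip" or delete the manifest by digest.

URL=http://127.0.0.1:5000/v2/examplegroup/exampleproject/snapshots/mytest/manifests/8167230a

# HEAD with the OCI index media type returns Content-Type and Docker-Content-Digest,
# which is enough to classify the tag as an index and to address it by digest:
DIGEST=$(curl -sSI -H "Authorization: Bearer $REGISTRY_TOKEN" \
              -H "Accept: application/vnd.oci.image.index.v1+json" "$URL" \
         | awk 'tolower($1) == "docker-content-digest:" {print $2}' | tr -d '\r')

# Deleting by digest removes the index and its tag; reclaiming the per-arch manifests
# it referenced is then up to the registry's garbage collection.
curl -sS -X DELETE -H "Authorization: Bearer $REGISTRY_TOKEN" \
     "http://127.0.0.1:5000/v2/examplegroup/exampleproject/snapshots/mytest/manifests/$DIGEST"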

Relevant logs and/or screenshots

Included in the "bug behavior" section above.

Output of checks

Results of GitLab environment info

System information
System:
Current User:   git
Using RVM:      no
Ruby Version:   3.1.4p223
Gem Version:    3.4.22
Bundler Version:2.5.4
Rake Version:   13.0.6
Redis Version:  7.0.14
Sidekiq Version:7.1.6
Go Version:     unknown

GitLab information
Version:        16.8.0
Revision:       16896fceb7a
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     PostgreSQL
DB Version:     14.9
URL:            https://REDACTED
HTTP Clone URL: https://REDACTED/some-group/some-project.git
SSH Clone URL:  git@REDACTED:some-group/some-project.git
Using LDAP:     no
Using Omniauth: yes
Omniauth Providers:

GitLab Shell
Version:        14.32.0
Repository storages:
- default:      unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path:              /opt/gitlab/embedded/service/gitlab-shell

Gitaly
- default Address:      unix:/var/opt/gitlab/gitaly/gitaly.socket
- default Version:      16.8.0
- default Git Version:  2.42.0


Results of GitLab application Check

Checking GitLab subtasks ...

Checking GitLab Shell ...

GitLab Shell: ...
GitLab Shell version >= 14.32.0 ? ... OK (14.32.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Gitaly ...

Gitaly: ... default ... OK

Checking Gitaly ... Finished

Checking Sidekiq ...

Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/1

Checking Sidekiq ... Finished

Checking Incoming Email ...

Incoming Email: ... Reply by email is disabled in config/gitlab.yml

Checking Incoming Email ... Finished

Checking LDAP ...

LDAP: ... LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab App ...

Database config exists? ... yes
Tables are truncated? ... skipped
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Cable config exists? ... yes
Resque config exists? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... yes
Systemd unit files or init script exist? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Systemd unit files or init script up-to-date? ... skipped (omnibus-gitlab has neither init script nor systemd units)
Projects have namespace:
6/1 ... yes 6/2 ... yes 6/3 ... yes 6/4 ... yes 6/5 ... yes 6/6 ... yes 6/7 ... yes 15/8 ... yes 15/9 ... yes 6/10 ... yes 6/12 ... yes 6/14 ... yes 6/18 ... yes 6/19 ... yes 51/23 ... yes 6/24 ... yes 15/25 ... yes 15/26 ... yes 6/27 ... yes 27/28 ... yes 28/29 ... yes 27/31 ... yes 27/32 ... yes 27/33 ... yes 34/34 ... yes 34/36 ... yes 6/37 ... yes 6/38 ... yes 36/39 ... yes 34/40 ... yes 28/41 ... yes 34/43 ... yes 27/44 ... yes 53/45 ... yes 27/46 ... yes 27/47 ... yes 6/48 ... yes 34/49 ... yes 27/50 ... yes 27/51 ... yes 34/52 ... yes 51/55 ... yes 51/56 ... yes 51/57 ... yes 13/58 ... yes 27/59 ... yes 51/60 ... yes 54/61 ... yes 27/62 ... yes 27/63 ... yes 34/64 ... yes 34/66 ... yes 54/67 ... yes 34/68 ... yes 27/69 ... yes 28/70 ... yes 27/71 ... yes 27/72 ... yes 54/73 ... yes 54/74 ... yes 54/75 ... yes 27/76 ... yes 28/77 ... yes 70/79 ... yes 70/80 ... yes 27/81 ... yes 51/82 ... yes 51/83 ... yes 34/85 ... yes 51/87 ... yes 27/88 ... yes 54/89 ... yes 70/90 ... yes 54/91 ... yes 51/92 ... yes 92/93 ... yes 20/94 ... yes 70/95 ... yes 70/96 ... yes 70/97 ... yes 70/98 ... yes 20/99 ... yes
Redis version >= 6.0.0? ... yes
Ruby version >= 3.0.6 ? ... yes (3.1.4)
Git user has default SSH configuration? ... yes
Active users: ... 26
Is authorized keys file accessible? ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes

Checking GitLab App ... Finished

Checking GitLab subtasks ... Finished

Possible fixes

Likely related to #362914, though the symptoms are different
