Offline Garbage Collection exhausts all memory and gets killed
With an Omnibus installation and an S3 backend on a machine with 32 GB RAM and 8 cores, a customer is trying to clean up ~20 TB of unreferenced registry image layers, but they are getting `fatal error: runtime: out of memory` and the process gets killed.
- They commented the `cache:` and `blobdescriptor:` lines out of their config.yml and restarted the registry, but it didn't make a difference.
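For context, a minimal sketch of the relevant section of the registry's config.yml after commenting those lines out; the bucket, region, and `delete` stanza here are illustrative placeholders, not taken from the customer's config:

```yaml
# Sketch of config.yml storage section (values are placeholders)
storage:
  s3:
    region: us-east-1
    bucket: registry-bucket
  # cache:
  #   blobdescriptor: inmemory
  delete:
    enabled: true
```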
- Running garbage collection as a dry run, it failed after some time with the error `SerializationError: failed to unmarshal error message\n\tstatus code: 504`, reporting `msg="blobs partially deleted" count=486`.
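The dry run was presumably invoked along these lines, using the registry binary path from the Omnibus layout shown later in this report; `--dry-run` is the upstream registry garbage-collect switch, and the exact invocation here is an assumption:

```sh
# Presumed dry-run invocation (sketch)
sudo /opt/gitlab/embedded/bin/registry garbage-collect --dry-run \
  /var/opt/gitlab/registry/config.yml
```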
- The customer had `'parallelwalk' => true` set, which is a performance improvement for garbage collection but also leads to higher resource usage (CPU and memory). Setting it to false (see the configuration sketch after the log excerpt below), they were able to run garbage collection successfully in dry-run mode. However, it took 2.5 days, and there are still more unreferenced registry image layers to clean up:
time="2022-03-13T04:30:59.694Z" level=debug msg="preparing to delete blob" digest="sha256:495ae7ff4eb0e762e58d84576f178a60291f4a187f9949975d1a2fb6c9c5c4ef" environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 path=/docker/registry/v2/blobs/sha256/49/495ae7ff4eb0e762e58d84576f178a60291f4a187f9949975d1a2fb6c9c5c4ef/data service=registry time="2022-03-13T04:30:59.694Z" level=info msg="deleting blobs" count=9045 environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 service=registry time="2022-03-13T04:40:16.967Z" level=info msg="blobs deleted" count=9045 duration_s=557.536166507 environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 service=registry time="2022-03-13T04:40:16.967Z" level=info msg="sweep stage complete" duration_s=557.536605113 environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 service=registry
- Setting `parallelwalk` back to true after the bulk of the data had been deleted, and re-running the garbage collector with `-m`, they get out of memory again and the GC process is killed. No error is reported this time, as GC is killed before it gets a chance to log anything.
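When GC dies silently like this, the kernel log is usually the only evidence; a standard way to confirm the OOM killer was responsible (generic Linux tooling, not from the original report):

```sh
# Look for OOM-killer entries around the time GC died
sudo dmesg -T | grep -iE 'out of memory|oom|killed process'
```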
- The registry data is down to ~16 GB, and they tried to run the registry garbage-collect command after increasing the server memory to 48 GB, but the GC execution is still getting killed due to memory issues:

```
[root@pnlv6232 pxxb3p]# time sudo /opt/gitlab/embedded/bin/registry garbage-collect -m /var/opt/gitlab/registry/config.yml > registry-tags-cleanup-m2.log
Killed

real    2378m45.282s
user    122m8.817s
sys     6m10.490s
```

Memory stats (apparently `free -m` snapshots taken during and after the run):

```
              total   used   free  shared  buff/cache  available
Mem:          48138  47452    293      42         392        214
Swap:          2047   2047      0

              total   used   free  shared  buff/cache  available
Mem:          48138  47335    287      42         515        329
Swap:          2047   2047      0

              total   used   free  shared  buff/cache  available
Mem:          48138  47439    281      42         418        224
Swap:          2047   2047      0

              total   used   free  shared  buff/cache  available
Mem:          48138  36982  10565      41         590      10692
Swap:          2047   1062    985

              total   used   free  shared  buff/cache  available
Mem:          48138   6070  41122      42         945      41595
```
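For future runs, a sketch of how the GC command could be wrapped to capture such memory samples automatically; the log file names and the 60-second interval are illustrative choices, not from the original report:

```sh
# Sample memory every 60 s in the background while GC runs
( while true; do date; free -m; sleep 60; done > free-m-samples.log ) &
sampler_pid=$!

time sudo /opt/gitlab/embedded/bin/registry garbage-collect -m \
  /var/opt/gitlab/registry/config.yml > registry-gc.log

kill "$sampler_pid"
```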
Additional notes:
- The runtime output files can be found in this snippet.
- It could also be related to this issue: Upload Purging Algorithm Could Result in Unacceptably High Memory Usage for Very Large Registries.