Offline Garbage Collection exhausts all memory and gets killed
With an Omnibus installation and an S3 backend on a machine with 32 GB RAM and 8 cores, a customer is trying to clean up ~20 TB of unreferenced registry image layers, but they are getting `fatal error: runtime: out of memory` and the process gets killed.
- They commented the `cache:` and `blobdescriptor:` lines out of their config.yml and restarted the registry, but it didn't make a difference.
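For context, a minimal sketch of the relevant section of the registry's config.yml after commenting those lines out; the bucket, region, and `delete` stanza here are illustrative placeholders, not taken from the customer's config:

```yaml
# Sketch of config.yml storage section (values are placeholders)
storage:
  s3:
    region: us-east-1
    bucket: registry-bucket
  # cache:
  #   blobdescriptor: inmemory
  delete:
    enabled: true
```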
- Running garbage collection as a dry run, it failed after some time with the error `SerializationError: failed to unmarshal error message\n\tstatus code: 504`, reporting `msg="blobs partially deleted" count=486`.
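The dry run was presumably invoked along these lines, using the registry binary path from the Omnibus layout shown later in this report; `--dry-run` is the upstream registry garbage-collect switch, and the exact invocation here is an assumption:

```sh
# Presumed dry-run invocation (sketch)
sudo /opt/gitlab/embedded/bin/registry garbage-collect --dry-run \
  /var/opt/gitlab/registry/config.yml
```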
- The customer had `'parallelwalk' => true` set, which is a performance improvement for garbage collection but also leads to higher resource usage (CPU and memory). Setting it to false (see the configuration sketch after the log excerpt below), they were able to run garbage collection successfully in dry-run mode. However, it took 2.5 days, and there are still more unreferenced registry image layers to clean up:
time="2022-03-13T04:30:59.694Z" level=debug msg="preparing to delete blob" digest="sha256:495ae7ff4eb0e762e58d84576f178a60291f4a187f9949975d1a2fb6c9c5c4ef" environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 path=/docker/registry/v2/blobs/sha256/49/495ae7ff4eb0e762e58d84576f178a60291f4a187f9949975d1a2fb6c9c5c4ef/data service=registry time="2022-03-13T04:30:59.694Z" level=info msg="deleting blobs" count=9045 environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 service=registry time="2022-03-13T04:40:16.967Z" level=info msg="blobs deleted" count=9045 duration_s=557.536166507 environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 service=registry time="2022-03-13T04:40:16.967Z" level=info msg="sweep stage complete" duration_s=557.536605113 environment=production go_version=go1.16.12 instance_id=c3bedc6d-d4dc-4a0b-9a53-ca21e92c70c7 service=registry
- Setting `parallelwalk` back to true after the bulk of the data had been deleted, and re-running the garbage collector with `-m`, they get out of memory again and the GC process is killed. No error is reported this time, as GC is killed before it gets a chance to log anything.
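When GC dies silently like this, the kernel log is usually the only evidence; a standard way to confirm the OOM killer was responsible (generic Linux tooling, not from the original report):

```sh
# Look for OOM-killer entries around the time GC died
sudo dmesg -T | grep -iE 'out of memory|oom|killed process'
```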
- The registry data is down to ~16 GB, and they tried to run the registry garbage-collect command after increasing the server memory to 48 GB, but the GC execution is still getting killed due to memory issues:

```
[root@pnlv6232 pxxb3p]# time sudo /opt/gitlab/embedded/bin/registry garbage-collect -m /var/opt/gitlab/registry/config.yml > registry-tags-cleanup-m2.log
Killed

real    2378m45.282s
user    122m8.817s
sys     6m10.490s
```

Memory stats (apparently `free -m` snapshots taken during and after the run):

```
              total   used   free  shared  buff/cache  available
Mem:          48138  47452    293      42         392        214
Swap:          2047   2047      0

              total   used   free  shared  buff/cache  available
Mem:          48138  47335    287      42         515        329
Swap:          2047   2047      0

              total   used   free  shared  buff/cache  available
Mem:          48138  47439    281      42         418        224
Swap:          2047   2047      0

              total   used   free  shared  buff/cache  available
Mem:          48138  36982  10565      41         590      10692
Swap:          2047   1062    985

              total   used   free  shared  buff/cache  available
Mem:          48138   6070  41122      42         945      41595
```
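For future runs, a sketch of how the GC command could be wrapped to capture such memory samples automatically; the log file names and the 60-second interval are illustrative choices, not from the original report:

```sh
# Sample memory every 60 s in the background while GC runs
( while true; do date; free -m; sleep 60; done > free-m-samples.log ) &
sampler_pid=$!

time sudo /opt/gitlab/embedded/bin/registry garbage-collect -m \
  /var/opt/gitlab/registry/config.yml > registry-gc.log

kill "$sampler_pid"
```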
Additional notes:
- The runtime output files can be found in this snippet.
- It could also be related to this issue: Upload Purging Algorithm Could Result in Unacceptably High Memory Usage for Very Large Registries.