Disable container registry blob descriptor cache
## Context

The container registry supports caching a list of known blobs (their digest and media type) using one of two backends: in-memory and Redis. This cache alleviates the load on the storage backend (a GCS bucket in our case) when determining whether a blob exists.

## Problem

For GitLab.com, we have always used the in-memory backend. This backend is not production-grade and comes with a few caveats:

- Being a _per-instance_ in-memory cache, there is a lot of duplicated (and missing) data across instances, which means wasted resources and a low cache hit rate.
- There is no TTL/eviction, so memory usage grows unbounded. This has caused production incidents (https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1022), as registry pods consistently hit 100% memory usage and are restarted often (https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/7). Efficiency concerns aside, while these constant restarts haven't caused any application/data issues so far, the risk will increase as we prepare to add another dependency (the database) and rely on composite operations (transactions).
- Being a per-instance in-memory cache, it is not compatible with garbage collection, either offline or online, as we can't remove cached entries for garbage-collected blobs from each instance's memory. So this will never be a long-term option for us, regardless of whether we can or want to fix it.

We have considered using Redis as the cache backend in the past (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10868), an idea that resurfaced in https://gitlab.com/gitlab-org/container-registry/-/issues/396. While this would theoretically solve the problems above, there is a concern about the extra risk of switching and adding yet another dependency for the registry near/during the GitLab.com registry upgrade and migration (https://gitlab.com/gitlab-org/container-registry/-/issues/374), when we're not certain we _absolutely_ need it.
The ongoing (major) memory leak will also make it harder to spot others that may appear; as it's so pronounced, it would likely shadow them. This is another concern as we prepare to release two major changes (metadata database and online GC).

## Proposal

We have to 1) stop the ongoing problem and 2) determine whether we can live without caching now and during the migration. If we find out that we can't, we have to address the switch to Redis ASAP. Otherwise, it can wait until necessary (a reactive change, once stress on storage/database approaches concerning levels) and/or until we deem the risk of introducing it low enough to warrant the benefits of caching (a proactive change).

For this purpose, in https://gitlab.com/gitlab-org/container-registry/-/issues/396 and yesterday's registry sync meeting ([agenda](https://docs.google.com/document/d/1cVXt8i7N0B1uyOM5A_5V99eKn1GpQl2MHa2XxVFg7jY/edit)) we discussed trying to turn off caching (in all environments, progressively) to answer these questions. Ruling out caching during the migration will also make our lives easier: the API <> storage/database interactions become more predictable, and bugs are easier to debug with one less place to check for data existence and consistency.

To disable caching we have to remove the `cache` portion of the registry configuration under `storage` ([here](https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/81d752bc057f469b0a4d454f41093a1bad496576/releases/gitlab-secrets/helmfile.yaml#L162)).
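For illustration, a minimal sketch of the relevant part of the registry configuration (standard distribution config layout; the bucket name and surrounding keys here are placeholders, not our actual helmfile values):

```yaml
storage:
  gcs:
    bucket: example-registry-bucket   # illustrative name only
  cache:                              # <- removing this whole block disables
    blobdescriptor: inmemory          #    the blob descriptor cache
```

With the `cache` block removed, every blob existence check goes straight to the storage backend; switching `blobdescriptor` to `redis` would be the alternative discussed above.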
### Metrics

The cache hit rate has been consistently floating around 20% ([source](https://dashboards.gitlab.net/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Global%22,%7B%22expr%22:%22sum(rate(registry_storage_cache_total%7Bcluster%3D%5C%22gprd-us-east1-b%5C%22,%20environment%3D%5C%22gprd%5C%22,%20namespace%3D%5C%22gitlab%5C%22,exported_type%3D%5C%22Hit%5C%22%7D%5B$__interval%5D))%20%2F%20sum(rate(registry_storage_cache_total%7Benvironment%3D%5C%22gprd%5C%22,exported_type%3D%5C%22Request%5C%22%7D%5B$__interval%5D))%22,%22format%22:%22time_series%22,%22interval%22:%221m%22,%22intervalFactor%22:3,%22datasource%22:%22Global%22,%22requestId%22:%22Q-56c0ffd6-b3e7-4bf7-aadf-a373f5acfa2b-0A%22,%22instant%22:false%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D)).

In terms of storage performance and IO capacity, we're hovering around [1000 read req/s](https://thanos-query.ops.gitlab.net/new/graph?g0.expr=sum(rate(registry_storage_action_seconds_count%7Benv%3D%22gprd%22%2C%20action%3D\~%22List%7CStat%7CGetContent%22%7D%5B3d%5D))&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D) against the GCS bucket. Even if we increase this by 20%, it still sits comfortably below the *initial* IO capacity advertised by Google, which is [5000 read req/s](https://cloud.google.com/storage/docs/request-rate).

### Expectations

Considering the metrics above, the additional read load on the bucket from not using the cache shouldn't be a problem IMO. Regarding the SLIs, at least 80% of the samples (cache misses) already reflect lookups against the bucket, so the current thresholds may be loose enough to accommodate the ~20% increase. (We've been working on breaking down the SLIs per API route in https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/476, but so far this was done for the manifests route only; the rest remain under a common threshold.)
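To make the headroom explicit, a back-of-the-envelope check using the figures cited above (illustrative arithmetic only):

```python
# Projected read load on the GCS bucket if the blob descriptor cache is
# disabled: the ~20% of lookups currently served from cache would fall
# through to the bucket, on top of the ~1000 read req/s we see today.
current_read_rps = 1000        # observed List/Stat/GetContent req/s
cache_hit_rate = 0.20          # ~20% of lookups currently hit the cache
gcs_initial_read_limit = 5000  # initial read req/s advertised by Google

projected_rps = current_read_rps * (1 + cache_hit_rate)
print(projected_rps)                           # 1200.0
print(projected_rps < gcs_initial_read_limit)  # True
```

Even with the full cache-hit traffic redirected to the bucket, we sit at roughly a quarter of the advertised initial limit.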
Additionally, as the migration progresses, the database will take over some of the read load from the bucket, so the impact on the storage backend should decrease over time (while increasing on the database side, so we should remain vigilant there).

### Solution

The cache was disabled on production with https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/1059.

Besides not causing any issues so far (win), memory usage did not decrease significantly, and @hswimelar ended up finding that upload purging is enabled in production when it should not be (another win). This means we have a second memory leak, and this one is likely the biggest culprit. We'll address it in https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1926.

- comment from https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1878#note_641698001