Cache top-level namespace storage usage

Context

When storage usage calculations were introduced, we decided not to implement caching on the registry side, for simplicity (and because we didn't have easy access to Redis). Instead, we relied solely on caching and deduplication on the Rails side.

Related to #1233 (closed).

This is how it works currently:

Whenever a tag is created or deleted, or when a manifest is deleted, the registry emits a webhook notification;
Rails subscribes to the registry notifications and triggers a background job to refresh storage usage for the corresponding project and root namespace (source);
On the Rails side, the refresh jobs are deduplicated. The one for root namespaces has a lease of 5 minutes (source).

Problem

While investigating #1233 (closed), I realized that almost half of the invocations of the registry API were duplicates (source). We also realized that this is not a bug but rather a trait of the deduplication mechanism (docs), which is intended to avoid out-of-sync issues:

Also, you can pass if_deduplicated: :reschedule_once option to re-run a job once after the currently running job finished and deduplication happened at least once. This ensures that the latest result is always produced even if a race condition happened.

Solution

The storage usage calculation queries are heavy, especially the one for root namespaces. Therefore, we should try to avoid unnecessary executions of these queries. Given the identified problem with deduplication on the Rails side, we should rely on caching on the registry side as a second line of defense and optimization. To do this:

1. Cache the calculated storage size of root namespaces in Redis with a TTL of 5m;
2. Invalidate the cache entry for a top-level namespace whenever a tag is created or deleted OR a manifest is deleted under it.

The TTL in (1) exists so that we recover automatically from a situation where the invalidation in (2) fails.
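The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not the registry's implementation: an in-memory dict stands in for Redis, and the names `NamespaceSizeCache`, `get_size`, `invalidate` and `compute_size` are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

TTL_SECONDS = 5 * 60  # the 5m TTL proposed above


class NamespaceSizeCache:
    """In-memory stand-in for Redis (hypothetical, for illustration only)."""

    def __init__(
        self,
        compute_size: Callable[[str], int],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute_size = compute_size  # stands in for the heavy database query
        self._clock = clock                # injectable so expiry can be tested
        # namespace -> (cached size in bytes, expiry timestamp)
        self._entries: Dict[str, Tuple[int, float]] = {}

    def get_size(self, namespace: str) -> int:
        """Return the cached size, running the heavy query only on a miss."""
        now = self._clock()
        entry = self._entries.get(namespace)
        if entry is not None and now < entry[1]:
            return entry[0]
        size = self._compute_size(namespace)
        self._entries[namespace] = (size, now + TTL_SECONDS)
        return size

    def invalidate(self, namespace: str) -> None:
        """Call on tag create/delete or manifest delete under the namespace."""
        self._entries.pop(namespace, None)
```

If a call to `invalidate` is lost, the stale entry is served for at most `TTL_SECONDS`, which is the recovery path that the TTL provides.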

Note: We've discussed in the past (link) that for top-level namespaces with high write activity, we could see their cached size be invalidated more than once per second, thus negating the expected benefit of this cache. Despite that, we should deliver this change as described and then evaluate its effectiveness instance-wide, instead of focusing on a few outliers. If necessary (i.e., the database load reduction is deemed insufficient, or we want to optimize even further), we can revisit and consider using a short TTL which is not invalidated upon writes. We could do this for all top-level namespaces or just for the biggest ones (where invalidation occurs e.g. more than once per 5 minutes).
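To make the trade-off in this note concrete, here is a small simulation that counts how many heavy queries run under each strategy. It is a sketch only: the function name `count_queries`, the synthetic workload, and the 10-second "short TTL" are assumptions chosen for illustration, since no value has been proposed.

```python
from typing import Iterable, Optional, Tuple


def count_queries(
    events: Iterable[Tuple[float, str]],
    ttl: float,
    invalidate_on_write: bool,
) -> int:
    """Count cache misses (heavy queries) for one namespace.

    events: (timestamp in seconds, "read" or "write"), in time order.
    """
    queries = 0
    expires_at: Optional[float] = None  # None means nothing is cached
    for t, kind in events:
        if kind == "write":
            if invalidate_on_write:
                expires_at = None  # drop the cached size
        elif kind == "read":
            if expires_at is None or t >= expires_at:
                queries += 1  # miss: run the heavy query and cache the result
                expires_at = t + ttl
    return queries


# Synthetic busy namespace: one write and one read every second for a minute.
events = [
    (second + offset, kind)
    for second in range(60)
    for offset, kind in ((0.0, "write"), (0.5, "read"))
]

print(count_queries(events, ttl=300, invalidate_on_write=True))   # 60
print(count_queries(events, ttl=10, invalidate_on_write=False))   # 6
```

With invalidation on every write, each read in this workload misses the cache. With a short TTL that ignores writes, the query count is bounded by the elapsed time divided by the TTL, at the cost of serving a size that may be up to one TTL out of date.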

Edited Apr 29, 2024 by João Pereira