Use blob ID when deduplicating layers during usage calculations
Context
In Container Registry: Scaling limitations on top-... (#779 - closed) we have identified a performance/scaling issue that leads to failures when calculating the deduplicated registry usage with maximum precision for very large namespaces (~1%).
Among the several mitigation strategies identified in #779 (comment 1179923688), this epic focuses on delivering the third one, Option C.
Option C consists of assigning an internal ID to blobs and using that ID, instead of the digest, for the deduplication portion of the query. Based on the conversations in the linked issue, this could substantially cut the query cost, because the current digest-based deduplication requires fitting a large number of bytes in memory.
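To illustrate the idea, here is a minimal sketch that deduplicates layers first by digest and then by an internal blob ID, and checks that both give the same usage. It uses an in-memory SQLite database as a stand-in for the real PostgreSQL schema. The column names (id, blob_id, size, repository), the sample data, and the query shape are assumptions made for illustration, not the registry's actual schema or usage query.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blobs (
        id     INTEGER PRIMARY KEY,  -- assumed internal ID (bigint in the proposal)
        digest TEXT NOT NULL UNIQUE,
        size   INTEGER NOT NULL
    );
    CREATE TABLE layers (
        repository TEXT NOT NULL,
        digest     TEXT NOT NULL,
        blob_id    INTEGER NOT NULL
    );
""")

# Three hypothetical blobs; "base" is shared by both repositories.
blobs = {}
for name, size in [("base", 100), ("app-v1", 20), ("app-v2", 30)]:
    digest = "sha256:" + hashlib.sha256(name.encode()).hexdigest()
    cur = conn.execute(
        "INSERT INTO blobs (digest, size) VALUES (?, ?)", (digest, size)
    )
    blobs[name] = (cur.lastrowid, digest)

for repo, name in [("ns/a", "base"), ("ns/a", "app-v1"),
                   ("ns/b", "base"), ("ns/b", "app-v2")]:
    blob_id, digest = blobs[name]
    conn.execute(
        "INSERT INTO layers (repository, digest, blob_id) VALUES (?, ?, ?)",
        (repo, digest, blob_id),
    )


def usage(layers_key, blobs_key):
    """Sum blob sizes once per distinct key referenced by the namespace."""
    # The column names are fixed strings passed below, never user input.
    query = f"""
        SELECT COALESCE(SUM(size), 0) FROM blobs
        WHERE {blobs_key} IN (
            SELECT DISTINCT {layers_key} FROM layers
            WHERE repository LIKE 'ns/%'
        )
    """
    return conn.execute(query).fetchone()[0]


usage_by_digest = usage("digest", "digest")  # current approach
usage_by_id = usage("blob_id", "id")         # Option C
naive_usage = conn.execute(
    "SELECT SUM(b.size) FROM layers l JOIN blobs b ON b.id = l.blob_id"
).fetchone()[0]
digest_key_len = len(blobs["base"][1])

print(usage_by_digest, usage_by_id)  # 150 150
print(naive_usage)                   # 250 (shared blob counted twice)
print(digest_key_len)                # 71 characters per textual digest key
```

Both queries return the same deduplicated total. The difference is the width of the key being deduplicated: a textual sha256 digest is 71 characters, while a bigint is 8 bytes. How digests are actually stored in the registry database may differ from this sketch.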
Task
The layers table is already related to blobs, but blobs are uniquely identified by digest; there is no ID for them. We'd have to:
- Add ID (bigint) to blobs table;
- Add blob ID column to layers table;
- Start filling blob ID for all new layers;
- Backfill blob ID for all existing layers.
- Backfill blob ID for all existing layers.
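The four steps above could be sketched as follows, again using in-memory SQLite as a stand-in for PostgreSQL. The column names, the helper functions insert_layer and backfill_blob_ids, the batch size, and the sample rows are all hypothetical. A real migration would use the registry's own migration tooling and a PostgreSQL sequence or identity column rather than SQLite's rowid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Starting point: blobs are identified by digest only.
conn.executescript("""
    CREATE TABLE blobs (digest TEXT PRIMARY KEY, size INTEGER NOT NULL);
    CREATE TABLE layers (repository TEXT NOT NULL, digest TEXT NOT NULL);
    INSERT INTO blobs VALUES
        ('sha256:aaa', 100), ('sha256:bbb', 20), ('sha256:ccc', 30);
    INSERT INTO layers VALUES
        ('ns/a', 'sha256:aaa'), ('ns/a', 'sha256:bbb'),
        ('ns/b', 'sha256:aaa'), ('ns/b', 'sha256:ccc');
""")

# Step 1: add an ID to blobs (SQLite's rowid stands in for a bigint sequence).
# Step 2: add a nullable blob ID column to layers.
conn.executescript("""
    ALTER TABLE blobs ADD COLUMN id INTEGER;
    UPDATE blobs SET id = rowid;
    CREATE UNIQUE INDEX blobs_id_idx ON blobs (id);
    ALTER TABLE layers ADD COLUMN blob_id INTEGER;
""")


def insert_layer(conn, repository, digest):
    """Step 3: new layers resolve the blob ID at write time.

    Inserts nothing if no blob with that digest exists.
    """
    conn.execute(
        "INSERT INTO layers (repository, digest, blob_id) "
        "SELECT ?, digest, id FROM blobs WHERE digest = ?",
        (repository, digest),
    )


def backfill_blob_ids(conn, batch_size):
    """Step 4: backfill existing layers in small batches, one commit each."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE layers
               SET blob_id = (SELECT id FROM blobs
                              WHERE blobs.digest = layers.digest)
             WHERE rowid IN (
                   SELECT l.rowid FROM layers AS l
                   JOIN blobs AS b ON b.digest = l.digest
                   WHERE l.blob_id IS NULL
                   LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


insert_layer(conn, "ns/c", "sha256:bbb")         # already has a blob ID
updated = backfill_blob_ids(conn, batch_size=2)  # the 4 pre-existing rows
missing = conn.execute(
    "SELECT COUNT(*) FROM layers WHERE blob_id IS NULL"
).fetchone()[0]
mismatched = conn.execute(
    "SELECT COUNT(*) FROM layers l JOIN blobs b ON b.digest = l.digest "
    "WHERE l.blob_id <> b.id"
).fetchone()[0]
print(updated, missing, mismatched)  # 4 0 0
```

Batching keeps each transaction short, which is the property a backfill at scale would need. The digest column is left untouched throughout, so existing queries that rely on it keep working.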
We should not replace the blob digest with the blob ID: we rely on the digest pretty much everywhere (API and GC queries, indexes, GC triggers), so replacing it would be a huge amount of work. Instead, the blob ID would be used for usage calculation purposes only.
The lack of support for background migrations is the main blocker here: without it, we can't backfill at scale. This is an existing problem rather than a new one, and we need to address it regardless.