Restore missing container repositories under existing projects (part 1/2)

Context

This is related to Restore missing container repositories under ex... (&9619). The intent is to perform a data repair to restore missing container repositories under existing projects.

The high-level strategy for the data repair is described here. The actual implementation plan was detailed here and split into two parts. This issue is for the implementation of part 1/2.

Implementation

Requirements

To make this happen we'll need a few assets:

  1. A new temporary table with columns project_id (FK to projects), missing_count (int), status (text), and updated_at. For brevity, we'll refer to this table as t.

  2. A limited-capacity worker to perform the data repair analysis.

  3. An application setting to control the max concurrency for the worker (defaults to 2).

  4. A feature flag to enable/disable the worker execution.

Logic

The background job should do the following work:

  1. "Loop over" (cron scheduling) all projects that do not appear in t (i.e. skip those that were already analyzed);

  2. For each project P:

    1. Query the container registry for the list of non-empty (at least one tag) repositories under P's full path. This should be done by calling the new List Sub Repositories API.

    2. For each repository R in the returned list (paginated response):

      1. Check if R exists on the Rails side (container_repositories table);

      2. If it is missing, increment a counter of "missing repositories" for P.

    3. Once done iterating over repositories under P, insert a row in t for P. t.missing_count should be set to the value of the above counter.

      Note: Because we'll loop over all projects (millions of rows) and insert one record per project into t, it may be advisable to use bulk inserts: stash the inserts for up to N Ps and only then flush them to the database. However, since we'll make 1+N network requests to the registry for each P, we must flush any stashed inserts if an exception occurs (e.g. a network timeout). Otherwise, when the worker resumes it will pick up Ps that were already analyzed but never recorded because of the earlier failure.
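The loop and the flush-on-failure requirement above can be sketched roughly as follows. This is only an illustration in Python, not the actual worker: `count_missing`, `analyze`, `list_sub_repositories`, `known_paths`, `flush` and `batch_size` are all hypothetical names, the registry client is assumed to yield one page of repository paths per request, and the lookup against container_repositories is simplified to a set membership check.

```python
from typing import Callable, Iterable, Iterator, List, Set, Tuple

Row = Tuple[int, int]  # (project_id, missing_count)


def count_missing(
    project_path: str,
    list_sub_repositories: Callable[[str], Iterator[List[str]]],
    known_paths: Set[str],
) -> int:
    """Count registry repositories under project_path that are unknown on the Rails side."""
    missing = 0
    # Hypothetical client: yields one page (list of repository paths) per request.
    for page in list_sub_repositories(project_path):
        for repo_path in page:
            if repo_path not in known_paths:
                missing += 1
    return missing


def analyze(
    projects: Iterable[Tuple[int, str]],
    list_sub_repositories: Callable[[str], Iterator[List[str]]],
    known_paths: Set[str],
    flush: Callable[[List[Row]], None],
    batch_size: int = 100,
) -> None:
    """Analyze projects, stashing rows for t and flushing them in batches."""
    pending: List[Row] = []
    try:
        for project_id, path in projects:
            missing = count_missing(path, list_sub_repositories, known_paths)
            pending.append((project_id, missing))
            if len(pending) >= batch_size:
                flush(pending)
                pending = []
    finally:
        # Runs on normal completion and when an exception (e.g. a timeout)
        # propagates, so already-analyzed projects are still recorded.
        if pending:
            flush(pending)
```

The `finally` block is what keeps a registry failure on one project from discarding the stashed rows of the projects analyzed before it; the exception itself still propagates to the caller.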

t.missing_count will allow us to:

  • Identify how many missing repositories were found per project and in total. This will be used to assess the scale of the problem and fine-tune the approach for part 2/2 (the actual data repair);

  • Act as the filter for projects so that we can narrow down the data repair loop in part 2/2 to projects whose t.missing_count > 0.
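As a rough illustration of these two uses, the sketch below builds a stand-in for t in an in-memory SQLite database and runs both queries. It is an assumption-laden example: the real table would live in the application's main database, the column types are guesses based on the description above, and the sample rows are made up purely for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for t; column types are assumptions, not the actual schema.
conn.execute(
    """
    CREATE TABLE t (
        project_id    INTEGER PRIMARY KEY,
        missing_count INTEGER NOT NULL,
        status        TEXT,
        updated_at    TEXT
    )
    """
)

# Made-up sample rows: (project_id, missing_count).
conn.executemany(
    "INSERT INTO t (project_id, missing_count) VALUES (?, ?)",
    [(1, 0), (2, 3), (3, 0), (4, 1)],
)

# Scale of the problem: total missing repositories across all analyzed projects.
total_missing = conn.execute(
    "SELECT COALESCE(SUM(missing_count), 0) FROM t"
).fetchone()[0]

# Filter for part 2/2: only projects with at least one missing repository.
affected_projects = [
    row[0]
    for row in conn.execute(
        "SELECT project_id FROM t WHERE missing_count > 0 ORDER BY project_id"
    )
]

print(total_missing, affected_projects)
```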

Edited by João Pereira