Use primary for internal registry migration API

Steve Abrams requested to merge 359882-registry-migration-api-primary into master

🏛 Context

We are in the process of migrating all existing container repositories to the new container registry. Rails drives the migration by making requests to the registry and then acting on the notifications the registry sends back. The request sequence looks like:

sequenceDiagram
    Rails->>+Registry: Start the pre-import
    Registry-->>-Rails: Pre-import is complete
    Rails->>+Registry: Start the import
    Registry-->>-Rails: Import is complete

When Rails receives either of the two notifications from the registry, it first checks the container repository in question to make sure that its migration_state is in fact either pre_importing or importing. If it is not, Rails responds with a 400 error.

Here is the bug 🐛: The worker that starts the pre-import always uses the primary database because it has a data_consistency: :always setting. The API, however, may use a replica when it first checks the migration_state. If the sequence of events above happens fast enough, there is a chance the API will check a replica that has not yet been updated to pre_importing or importing and return an error.
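To make the race concrete, here is a minimal in-memory sketch of the scenario. Everything in it (`FakeDatabase`, `LaggingCluster`, `handle_notification`) is a hypothetical stand-in invented for illustration, not GitLab code. It only assumes what is described above: writes go to the primary, the replica catches up later, and the notification handler rejects any state other than pre_importing or importing.

```ruby
# Hypothetical in-memory stand-ins; none of these classes exist in GitLab.
class FakeDatabase
  def initialize
    @rows = {}
  end

  def write(id, state)
    @rows[id] = state
  end

  # A repository that has never been written reads as 'default'.
  def read(id)
    @rows.fetch(id, 'default')
  end
end

class LaggingCluster
  attr_reader :primary, :replica

  def initialize
    @primary = FakeDatabase.new
    @replica = FakeDatabase.new
    @pending = []
  end

  # Writes land on the primary immediately and are queued for the replica.
  def write(id, state)
    @primary.write(id, state)
    @pending << [id, state]
  end

  # Simulates the replica catching up with the primary.
  def replicate!
    @pending.each { |id, state| @replica.write(id, state) }
    @pending.clear
  end
end

VALID_STATES = %w[pre_importing importing].freeze

# Mirrors the check described above: any state other than an in-flight
# migration state is rejected with a 400.
def handle_notification(database, repository_id)
  state = database.read(repository_id)
  return [200, 'ok'] if VALID_STATES.include?(state)

  [400, "400 Bad request - Wrong migration state (#{state})"]
end

cluster = LaggingCluster.new
cluster.write(1, 'pre_importing') # the worker, always on the primary

stale = handle_notification(cluster.replica, 1) # replica has not caught up
fresh = handle_notification(cluster.primary, 1) # primary is up to date

cluster.replicate!
caught_up = handle_notification(cluster.replica, 1)
```

In this sketch the same notification fails when read from the lagging replica, and succeeds both on the primary and on the replica once replication has caught up.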

We saw exactly this happen today:

(Screenshot: Screen_Shot_2022-04-19_at_9.28.56_AM)

The 400 error was:

{"message":"400 Bad request - Wrong migration state (default)"}

There is no way in the code for a container repository to move back to the default migration state after it has changed, so this means we must be experiencing a replication lag race condition!

We can fix this by simply telling the API to use the primary for these queries.

🔎 What does this MR do and why?

We wrap the internal registry API notification endpoint in the Gitlab::Database::LoadBalancing::Session.current.use_primary helper to ensure the initial queries use the primary and avoid a race condition with the associated background workers.
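As a rough illustration of the idea, the sketch below imitates a block-style "use the primary" helper with a hypothetical `FakeSession` class and a hypothetical `read_migration_state` router. It is not GitLab's load balancer and does not reproduce its behavior in detail. It only shows the assumed shape: reads inside the block go to the primary, and the session returns to its previous mode afterwards.

```ruby
# Hypothetical stand-in for a load-balancing session. It only imitates the
# block form of a "use primary" helper and is not GitLab's implementation.
class FakeSession
  def initialize
    @use_primary = false
  end

  def use_primary?
    @use_primary
  end

  # Forces primary reads for the duration of the block, then restores
  # whatever mode the session was in before.
  def use_primary
    previous = @use_primary
    @use_primary = true
    yield
  ensure
    @use_primary = previous
  end
end

# Hypothetical read router: picks the primary only when the session asks.
def read_migration_state(session, primary:, replica:, id:)
  source = session.use_primary? ? primary : replica
  source.fetch(id, 'default')
end

primary = { 1 => 'pre_importing' }
replica = {} # replication has not caught up yet
session = FakeSession.new

# Outside the block the lagging replica is consulted.
outside = read_migration_state(session, primary: primary, replica: replica, id: 1)

# Inside the block the read is pinned to the primary.
inside = session.use_primary do
  read_migration_state(session, primary: primary, replica: replica, id: 1)
end

restored = session.use_primary?
```

Scoping the override to a block keeps the extra primary load limited to the endpoint that needs read-your-writes consistency, while other reads in the same session can still go to replicas.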

Screenshots or screen recordings

n/a

How to set up and validate locally

This is not easily validated locally due to the need for multiple databases and load balancing.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related: #359882 (closed)
