Drive the GitLab.com container registry migration phase 1 rollout with Rails FFs

Context

To control and gradually increase the number of new container repositories that will use the upcoming container registry metadata database on GitLab.com, we have to implement a series of feature flags on the Rails side.

These feature flags will then be used to modify the JWT tokens served (by Rails) to clients whenever they want to interact with the registry. The registry will then read the information within those tokens and decide if a new repository should use the database or not.

⚠ This requires changes in both Rails and Container Registry. This issue will group both.

Technical Background

Auth Tokens

We'll only list the basics here. Please see the auth spec and flow if looking for additional details.

Every request against the GitLab Container Registry needs to have a valid JWT token. This is true for both read and write requests, against public or private repositories.
JWT tokens are emitted and signed by Rails, obtained by clients, and embedded in the request header. The registry decodes and validates the token to ensure its authenticity and that the user has the required permissions to perform the request.

A token can be obtained as follows:

http -a $USERNAME:$PASSWORD https://$GITLAB_ADDRESS/jwt/auth \
    client_id=="docker" \
    service=="container_registry" \
    scope=="repository:$REPO_PATH:pull,push,delete"

Format

Here's a sample JWT token (decoded payload):

{
  "access": [
    {
      "actions": [
        "pull",
        "push"
      ],
      "name": "gitlab-org/gitlab-test",
      "type": "repository"
    }
  ],
  "aud": "container_registry",
  "exp": 1623679323,
  "iat": 1623675723,
  "iss": "gitlab-issuer",
  "jti": "71fb41a4-6e42-4967-b584-914ea969b225",
  "nbf": 1623675718,
  "sub": "root"
}
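
The payload shown above is the middle segment of the token, base64url-encoded. As a minimal sketch for inspecting a token while debugging, the Ruby snippet below decodes that segment without verifying the signature, so it must never be used for authorization. The helper names `decode_jwt_payload` and `b64url`, and the locally built unsigned sample token, are illustrative assumptions and not part of Rails or the registry.

```ruby
require "json"

# Encode a string as unpadded base64url, as used for JWT segments.
def b64url(data)
  [data].pack("m0").tr("+/", "-_").delete("=")
end

# Decode the payload (second segment) of a JWT WITHOUT verifying its signature.
def decode_jwt_payload(token)
  payload = token.split(".")[1]
  # Restore the padding that JWTs strip, then map base64url back to base64.
  padded = payload + "=" * ((4 - payload.length % 4) % 4)
  JSON.parse(padded.tr("-_", "+/").unpack1("m"))
end

claims = {
  "access" => [
    { "actions" => %w[pull push], "name" => "gitlab-org/gitlab-test", "type" => "repository" }
  ],
  "aud" => "container_registry",
  "sub" => "root"
}

# Build an unsigned stand-in token locally; a real one comes from /jwt/auth.
sample_token = [b64url('{"alg":"none"}'), b64url(JSON.generate(claims)), "signature"].join(".")

decoded = decode_jwt_payload(sample_token)
puts decoded["access"].first["name"]
```

With a real token, the same helper can be applied to the `token` field of the `/jwt/auth` response.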

In most cases, access will only have one element, as most requests target a single repository. Cross-repository blob mount requests are the exception, as they always involve two repositories: a source (for which a user needs pull permissions) and a target (for which a user needs pull and push permissions). In these cases, a token will look as follows:

{
  "access": [
    {
      "actions": [
        "pull"
      ],
      "name": "source-repo",
      "type": "repository"
    },
    {
      "actions": [
        "pull",
        "push"
      ],
      "name": "target-repo",
      "type": "repository"
    }
  ],
  // ...
}
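
To make the permission requirement for cross-repository blob mounts concrete, here is a small sketch that checks a decoded `access` list for pull on the source and pull plus push on the target. The helper names `grants?` and `mount_allowed?` are hypothetical and only illustrate the rule described above. They do not reflect how the registry implements it.

```ruby
# Return true if the access list grants every listed action on the repository.
def grants?(access, repo, actions)
  entry = access.find { |e| e["type"] == "repository" && e["name"] == repo }
  !entry.nil? && (actions - entry["actions"]).empty?
end

# A cross-repository blob mount needs pull on the source
# and pull + push on the target.
def mount_allowed?(access, source, target)
  grants?(access, source, %w[pull]) && grants?(access, target, %w[pull push])
end

access = [
  { "actions" => %w[pull], "name" => "source-repo", "type" => "repository" },
  { "actions" => %w[pull push], "name" => "target-repo", "type" => "repository" }
]

puts mount_allowed?(access, "source-repo", "target-repo") # => true
puts mount_allowed?(access, "target-repo", "source-repo") # => false
```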

Rails Feature Flags (FFs)

See the documentation.

Requirements

We need a way to limit the number of new repositories that will be handled through the new code path (and thus registered in the metadata DB and have their blobs stored in the new bucket prefix) to support Phase 1 of the gradual migration plan, making sure that:

We can gradually increase the number of repositories that may be eligible (if they are considered new on the registry side) to follow the new code path;
We can pick a few specific repositories to start with, only then expanding to a percentage-based rollout;
We can exclude all repositories from specific top-level groups until a later date, such as those from VIP customers, making sure they cannot be affected by early bugs. These will then be allowed gradually, one by one (or in small batches).

Solution

sequenceDiagram
  autonumber
  participant C as Client
  participant G as GitLab Rails
  participant R as GitLab Container Registry
  C->>G: GET /jwt/auth?scope=repository:<repo>:<actions>&...
  G->>G: If <actions> includes `push`, does <repo><br/>belong to an eligible top-level Group and Project?
  opt Yes
  	G->>G: Set custom `access[name=<repo>].migration_eligible = true`<br/>flag on the token's body
  end
  G->>C: 200 OK<br/>{"token": "w2a4..."}
  C->>R: POST/PUT/PATCH /v2/<repo>/...<br/>Authorization: Bearer w2a4...
  R->>R: Is <repo> new<br/>AND<br/>does `migration_eligible` equal `true`?
  alt Yes
    R->>R: Process through new code path
  else No
    R->>R: Process through old code path
  end
  R->>C: 200 OK

Step (2) in the diagram above, more precisely the Group/Project validations, is where the Rails FFs come into play. This replaces the registry-side validations initially planned and described in the migration plan.
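
The registry-side decision in the diagram can be sketched as follows. The registry itself is not written in Ruby, so this is only an illustration of the logic, kept in Ruby for consistency with the Rails snippets in this issue. The function names and the `repo_is_new` argument are hypothetical: how the registry determines that a repository is new is outside the scope of this sketch.

```ruby
# Look up the custom flag on the access entry for the repository.
# A missing entry or a missing flag means "not eligible".
def token_marks_eligible?(access, repo)
  entry = access.find { |e| e["type"] == "repository" && e["name"] == repo }
  !entry.nil? && entry["migration_eligible"] == true
end

# Pick the code path: new only if the repository is new AND the token marks it eligible.
def code_path(access, repo, repo_is_new)
  repo_is_new && token_marks_eligible?(access, repo) ? :new : :old
end

eligible_access = [
  { "actions" => %w[pull push], "name" => "gitlab-org/gitlab-test",
    "type" => "repository", "migration_eligible" => true }
]
plain_access = [
  { "actions" => %w[pull push], "name" => "gitlab-org/gitlab-test", "type" => "repository" }
]

puts code_path(eligible_access, "gitlab-org/gitlab-test", true)  # => new
puts code_path(eligible_access, "gitlab-org/gitlab-test", false) # => old
puts code_path(plain_access, "gitlab-org/gitlab-test", true)     # => old
```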

Feature Flags

graph TD
    A[Validate eligibility] --> B{Is the corresponding<br/>top-level Group denied?};
    B -- "Yes" --> C[Emit unmodified token]
    B -- "No" --> D{Is the corresponding<br/>Project allowed?}
    D -- "Yes" --> E[Emit modified token]
    D -- "No" --> C

Below is a description of each FF.

`container_registry_migration_phase1`

This is a simple boolean FF that acts as a global gate. When it's off (the default), Rails behaves as it does today: it neither validates the repository scope nor modifies the JWT token. When it's on, Rails validates the eligibility of the repository based on the following FFs and modifies the token accordingly.

We'll use this FF to start the Phase 1 migration without changing any source code or registry configurations.

`container_registry_migration_phase1_allow`

This FF acts as an allow list for Rails projects. When validating <repo>, we check Feature.enabled?(:container_registry_migration_phase1_allow, repo.project).

We'll start with some specific repositories created for testing, and then we'll allow gitlab-org/* repositories, gradually. Once we're done with the initial phase, we can start a "global" percentage-based rollout using this same FF.

It's important to note that this FF applies to GitLab projects, not container repositories. Given that a project might have multiple container repositories, a 1% here does not equal 1% of the container repositories. This is not necessarily a problem, just something to keep in mind as we increase the scope.
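
A quick worked example of why the two percentages differ, using made-up project names and repository counts (none of these numbers come from GitLab.com data):

```ruby
# Hypothetical projects mapped to the number of container repositories each one has.
repos_per_project = { "group/a" => 1, "group/b" => 1, "group/c" => 1, "group/d" => 7 }

# Fraction of all container repositories covered when the FF is enabled
# for the given projects.
def share_of_repos(repos_per_project, enabled_projects)
  enabled = repos_per_project.slice(*enabled_projects).values.sum
  enabled.to_f / repos_per_project.values.sum
end

# Enabling one project out of four is 25% of projects in both cases,
# but the share of repositories depends on which project is picked.
puts share_of_repos(repos_per_project, ["group/d"]) # 7 of 10 repositories
puts share_of_repos(repos_per_project, ["group/a"]) # 1 of 10 repositories
```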

`container_registry_migration_phase1_deny`

This FF acts as a deny list for Rails groups/namespaces. When validating <repo>, we check Feature.disabled?(:container_registry_migration_phase1_deny, repo.project.root_ancestor). Note that here we do Feature.disabled? and not Feature.enabled?. All groups are eligible by default. We can exclude a specific group by enabling this FF for it.

This allows us to exclude specific top-level groups/namespaces from the process so that they are never marked as eligible because of the _allow FF. We'll use this to exclude the groups of VIP customers from the first iteration, making sure they can't be impacted by early bugs.
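
Putting the three FFs together, the eligibility check could look roughly like the sketch below. `FakeFeature` is a hypothetical in-memory stand-in for the Rails `Feature` class so that the snippet runs outside Rails; it only models boolean and per-actor gates, not percentage rollouts. Projects and root groups are plain strings here, whereas in Rails they would be `repo.project` and `repo.project.root_ancestor`. The method name `migration_eligible?` is also an assumption.

```ruby
require "set"

# Hypothetical stand-in for GitLab's Feature class (boolean and actor gates only).
class FakeFeature
  def initialize
    @global = Set.new
    @actors = Hash.new { |hash, flag| hash[flag] = Set.new }
  end

  # Enable a flag globally, or only for one actor when an actor is given.
  def enable(flag, actor = nil)
    actor.nil? ? @global.add(flag) : @actors[flag].add(actor)
  end

  def enabled?(flag, actor = nil)
    @global.include?(flag) || (!actor.nil? && @actors[flag].include?(actor))
  end

  def disabled?(flag, actor = nil)
    !enabled?(flag, actor)
  end
end

# Global gate first, then the deny list on the top-level group,
# then the allow list on the project. Only push requests are considered.
def migration_eligible?(feature, project, root_group, actions)
  return false unless actions.include?("push")
  return false unless feature.enabled?(:container_registry_migration_phase1)
  return false unless feature.disabled?(:container_registry_migration_phase1_deny, root_group)

  feature.enabled?(:container_registry_migration_phase1_allow, project)
end

feature = FakeFeature.new
feature.enable(:container_registry_migration_phase1)
feature.enable(:container_registry_migration_phase1_allow, "gitlab-org/gitlab-test")
feature.enable(:container_registry_migration_phase1_allow, "vip/app")
feature.enable(:container_registry_migration_phase1_deny, "vip")

puts migration_eligible?(feature, "gitlab-org/gitlab-test", "gitlab-org", %w[pull push]) # => true
puts migration_eligible?(feature, "vip/app", "vip", %w[pull push])                       # => false
puts migration_eligible?(feature, "gitlab-org/gitlab-test", "gitlab-org", %w[pull])      # => false
```

Note that the denied group wins even though `vip/app` is on the allow list, which matches the flowchart above: the deny check runs before the allow check.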

Edited Aug 10, 2021 by João Pereira