Migrate `gitlab-org` to the new registry
Context
This is part of the work to upgrade and migrate the GitLab.com container registry to a new version backed by a metadata database and online garbage collection (&5523 (closed)). This will be achieved following the gradual migration plan detailed in container-registry#374 (closed).
The migration is now more than 90% complete, with just a few customers and `gitlab-org` left. This issue is to plan, coordinate, and execute the migration of all `gitlab-org` container repositories that exist on the old Container Registry platform to the new one.
Data
Available in this spreadsheet, including:
- List of all repositories;
- Current tag count for each repository;
- Estimated* pre-import (read/write allowed) duration;
- Estimated* final import (read-only) duration;
- Top 10 repositories in overall activity during the last 7 days and the last weekend;
- Same as above but for write activity only.
* Estimate based on the observed duration for a sample of recently migrated VIP repositories. The actual duration for each `gitlab-org` repository may vary.
Highlights
- 3345 repositories in total;
- Only 227 have more than 100 tags;
- Only 84 have more than 1k tags;
- Only 13 have more than 10k tags;
- The largest is `gitlab-org/gitlab/gitlab-ee-qa`, with 34k tags;
- The majority of large (20k+ tags) repositories are under `gitlab-org/build/cng-mirror`;
- Only 11 repositories are expected to require more than 5 minutes for the final import step (where we block writes). None is expected to take more than 10 minutes;
- 3282 out of 3345 (~98%) are expected to require less than 30 seconds for the final import step. 30 seconds is the time that the Docker client will keep retrying an image push in case the request is rejected. So, in theory, CI pipelines trying to push images to these repositories while they are being migrated would eventually succeed without needing to be manually retried;
- All the 13 largest (10k+ tags) repositories have identical usage patterns with low to no write activity during weekends.
Largest repositories (10k+ tags)
Path | Tags | Estimated pre-import duration (read/write) | Estimated final import duration (read-only) |
---|---|---|---|
`gitlab-org/gitlab/gitlab-ee-qa` | 34323 | 23:50:08 | 00:08:35 |
`gitlab-org/gitlab/gitlab-assets-ee` | 32047 | 22:15:18 | 00:08:01 |
`gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee` | 30159 | 20:56:38 | 00:07:32 |
`gitlab-org/build/cng-mirror/gitlab-geo-logcursor` | 29315 | 20:21:28 | 00:07:20 |
`gitlab-org/build/cng-mirror/gitlab-rails-ee` | 29310 | 20:21:15 | 00:07:20 |
`gitlab-org/build/cng-mirror/gitlab-sidekiq-ee` | 29299 | 20:20:48 | 00:07:19 |
`gitlab-org/build/cng-mirror/gitlab-workhorse-ee` | 29291 | 20:20:28 | 00:07:19 |
`gitlab-org/build/cng-mirror/gitlab-webservice-ee` | 28762 | 19:58:25 | 00:07:11 |
`gitlab-org/build/cng-mirror/gitlab-shell` | 28627 | 19:52:48 | 00:07:09 |
`gitlab-org/build/cng-mirror/gitlab-toolbox-ee` | 28219 | 19:35:48 | 00:07:03 |
`gitlab-org/build/cng-mirror/gitlab-pages` | 24493 | 17:00:33 | 00:06:07 |
`gitlab-org/gitlab-runner/gitlab-runner-helper` | 13473 | 09:21:23 | 00:03:22 |
`gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee-qa` | 12631 | 08:46:18 | 00:03:09 |
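The durations in this table appear to scale linearly with the tag count: every row is consistent with roughly 2.5 s per tag for the pre-import and 0.015 s per tag for the final import. Those two rates are inferred from the table and are not stated anywhere in this issue, so the sketch below is only a way to sanity-check the numbers or to approximate other repositories. The function names are illustrative, and the spreadsheet remains the source of truth.

```python
# Rates inferred from the table (an assumption, not an official figure).
PRE_IMPORT_MS_PER_TAG = 2500     # ~2.5 s per tag, read/write phase
FINAL_IMPORT_MS_PER_TAG = 15     # ~0.015 s per tag, read-only phase
DOCKER_PUSH_RETRY_WINDOW_S = 30  # Docker client retry window from the Highlights


def estimate_seconds(tags: int, ms_per_tag: int) -> int:
    """Estimated duration in whole seconds, rounding half up."""
    return (tags * ms_per_tag + 500) // 1000


def hms(seconds: int) -> str:
    """Format a duration in seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def estimate(tags: int) -> tuple[str, str, bool]:
    """Return (pre-import, final import, fits within the Docker retry window)."""
    pre = estimate_seconds(tags, PRE_IMPORT_MS_PER_TAG)
    final = estimate_seconds(tags, FINAL_IMPORT_MS_PER_TAG)
    return hms(pre), hms(final), final < DOCKER_PUSH_RETRY_WINDOW_S


if __name__ == "__main__":
    for tags in (34323, 13473, 100):
        print(tags, *estimate(tags))
```

Under these assumed rates, a repository needs roughly 2,000 tags or fewer for its read-only step to fit inside the 30-second retry window.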
Plan
- Migrate all 13 repositories above 10k tags across two weekends. None is expected to take more than 24h to fully migrate (pre-import plus final import), and the system can migrate up to 40 in parallel. So we could do it all in a single weekend. However:
  - These are likely the most critical ones for internal operations;
  - Mainly: some of these repositories are responsible for a non-negligible share of the overall registry API traffic. I don't like the idea of shifting that load from the old to the new registry in one go.
- Migrate the remaining 3332 in no specific order (randomly picked) from Tuesday to Thursday of the week between the two weekends above. This gives us a "silent" period of 24h on Monday and Friday (while everyone is around) to detect and react to any problems. Technically, at the maximum capacity (40), we could migrate all of them in a couple of hours. Still, we can reduce capacity to slow things down for additional peace of mind (being overcautious).
Communication
- Reach out to the team(s) that own the largest repositories to let them know about the migration and see if there are any concerns about doing it on a specific weekend;
- Post on #whats-happening-at-gitlab before starting the migration of the remaining 3332 during the week. This is for awareness and to give people a point of contact (#f_container-registry) in case they see something odd. The migration is at 94%, with over 1.5M repositories done by now, so I think we don't need something special here.
FAQ
Can we anticipate when the read-only period will occur for a specific repository?
No. It can happen at any time during the defined migration window.
What happens if we have pipelines attempting to push images to a repository while it's being migrated?
Pushes will succeed as usual during the vast majority of the migration time (read/write period). They are only blocked for the duration of the last portion (read-only period).
If using Docker (likely), the client will keep retrying for up to 30s. If the repository remains locked against writes for longer, you'll need to retry the failed pipelines.
See the Data section for the estimated time to migrate each of the largest repositories. See the spreadsheet for all others.
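For pipelines that push to one of the few repositories whose read-only period may outlast the Docker client's 30 s retry window, wrapping the push in a slower retry loop can avoid a manual pipeline retry. The following is a minimal sketch, assuming the `docker` CLI is available on the runner. The function names, the number of attempts, and the wait time are illustrative choices, not part of the migration tooling.

```python
import subprocess
import time
from typing import Callable


def docker_push(image: str) -> bool:
    """Run `docker push IMAGE`; return True when the command exits with 0."""
    return subprocess.run(["docker", "push", image]).returncode == 0


def push_with_retry(
    image: str,
    push: Callable[[str], bool] = docker_push,
    attempts: int = 5,
    wait_seconds: float = 150.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a failed push, pausing between attempts.

    The defaults (5 attempts, 150 s apart) are illustrative: the four waits
    add up to the 10-minute upper bound estimated for the read-only period.
    """
    for attempt in range(1, attempts + 1):
        if push(image):
            return True
        if attempt < attempts:
            # Wait before the next attempt; skip the wait after the last one.
            sleep(wait_seconds)
    return False
```

The `push` and `sleep` parameters exist so the loop can be exercised without Docker or real waiting. In a CI job, the call would simply be `push_with_retry("<image>:<tag>")`.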
Can we pull images from repositories while they are being migrated?
Yes. Pulls should continue to succeed, even during the last portion of the migration (where we block writes). Once the migration is complete, reads are redirected to the new registry. This is fully transparent, so you won't notice anything different (except that some requests will be faster).
Schedule
Largest
Path | Owner | Date | Done |
---|---|---|---|
`gitlab-org/gitlab/gitlab-ee-qa` | ~"team::Quality Engineering" | | |
`gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee` | ~"Distribution:Build" | August 20-21 | |
`gitlab-org/build/cng-mirror/gitlab-geo-logcursor` | ~"Distribution:Build" | August 20-21 | |
`gitlab-org/build/cng-mirror/gitlab-sidekiq-ee` | ~"Distribution:Build" | | |
`gitlab-org/build/cng-mirror/gitlab-toolbox-ee` | ~"Distribution:Build" | August 20-21 | |
`gitlab-org/gitlab-runner/gitlab-runner-helper` | ~"group::runner" | August 20-21 | |
`gitlab-org/gitlab/gitlab-assets-ee` | Engineering Productivity | August 27-28 | |
`gitlab-org/build/cng-mirror/gitlab-rails-ee` | ~"Distribution:Build" | August 27-28 | |
`gitlab-org/build/cng-mirror/gitlab-workhorse-ee` | ~"Distribution:Build" | August 27-28 | |
`gitlab-org/build/cng-mirror/gitlab-webservice-ee` | ~"Distribution:Build" | | |
`gitlab-org/build/cng-mirror/gitlab-shell` | ~"Distribution:Build" | August 27-28 | |
`gitlab-org/build/cng-mirror/gitlab-pages` | ~"Distribution:Build" | August 27-28 | |
`gitlab-org/build/omnibus-gitlab-mirror/gitlab-ee-qa` | ~"Distribution:Build" | | |
Remaining
August 23-25.
Execution Plan
August 20
- ~"group::package" increases the GC review delay from 24h to 48h (gitlab-com/gl-infra/k8s-workloads/gitlab-com!2055 (merged)).
- ~"group::package" enables the migration (if not already enabled) and adds a comment to the rollout issue: `/chatops run feature set container_registry_migration_phase2_enabled true`
  Note that there is no need to adjust the tag count limit or the deny list, as we're bypassing the Rails enqueuer worker where those are validated.
- ~"team::Delivery" executes the change request to kick off the migration of repositories scheduled for this weekend.
- ~"group::package" occasionally checks the migration logs (diving into the correlation ID for any specific pre/final import if needed) and metrics to monitor progress. As an alternative or a complement, query the registry database directly to list the repositories that were (pre-)imported, or are being (pre-)imported, during the day, including the number of (pre-)imported tags for each:

      SELECT
          r.top_level_namespace_id,
          r.id,
          r.path,
          r.migration_status,
          r.created_at,
          r.updated_at,
          (
              SELECT COUNT(*)
              FROM tags AS t
              WHERE t.top_level_namespace_id = r.top_level_namespace_id
                  AND t.repository_id = r.id
          ) AS tag_count
      FROM repositories AS r
      WHERE (date_trunc('day', r.updated_at) = date_trunc('day', now())
              OR date_trunc('day', r.created_at) = date_trunc('day', now()))
          AND r.path LIKE 'gitlab-org/%'
          AND r.migration_status <> 'native'
      ORDER BY r.updated_at ASC;
August 21
- ~"group::package" checks the migration logs and metrics. All repositories should have been imported by now.
August 22
- ~"group::package" reverts the GC review delay increase, from 48h back to 24h (gitlab-com/gl-infra/k8s-workloads/gitlab-com!2055 (merged)).
- ~"group::package" stays alert for any reported issues with the migrated repositories. No additional `gitlab-org` repositories should be migrated during this day.
- ~"group::package" posts a message on #whats-happening-at-gitlab, #development and Eng Week in Review to let people know about the upcoming bulk migration (August 23-25). Link this issue for additional information and direct people to #f_container_registry for reporting any potentially related issue.

  We will be migrating most of the container registry repositories under `gitlab-org` throughout Aug 23-25.
  The whole process should be seamless; however, there is a slight chance that pushing new images to some of those repositories will fail, as there is a short read-only period while a given repository is being migrated.
  If a push fails, it is advised to wait a few minutes before retrying. Once the migration is done, you shouldn't have any issues, but if you do, please reach out to #f_container_registry. See https://gitlab.com/gitlab-org/gitlab/-/issues/369101 for more details.
- Edit: This was done ahead of time (#369101 (comment 1068291345)), so the plan has changed, with an additional step on August 23. ~"team::Delivery" executes the change request to manually skip the repositories that should only be migrated on the 2nd weekend (August 27-28). This is needed to ensure these repositories won't be picked up automatically during the weekend migration.
August 23
- ~"group::package" drops the concurrent capacity from 40 to 5 (to slow down the migration pace and make it last a couple of days) and adds a comment to the rollout issue:

      /chatops run feature set container_registry_migration_phase2_capacity_40 false
      /chatops run feature set container_registry_migration_phase2_capacity_25 false
      /chatops run feature set container_registry_migration_phase2_capacity_10 false
      /chatops run feature set container_registry_migration_phase2_capacity_5 true
- ~"group::package" confirms that the previously skipped repositories (those reserved for the 2nd weekend) in gitlab-com/gl-infra/production#7605 (closed) are still marked as skipped. This can be done by looking at the count of repositories skipped due to `Not in plan` in this graph. The value must be 7. Otherwise, gitlab-com/gl-infra/production#7605 (closed) needs to be re-executed before proceeding.
- ~"group::package" removes `gitlab-org` from the deny list (to make the remaining repositories eligible for migration) and adds a comment to the rollout issue: `/chatops run feature set --group=gitlab-org container_registry_phase_2_deny_list false`
- ~"group::package" monitors the migration progress.
- ~"group::package" adjusts concurrent capacity based on the observed hourly import rate and the number of remaining `gitlab-org` repositories to be migrated. The intent is to spread the migration across August 23, 24, and 25, if possible.
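The capacity adjustment in the last step can be reasoned about with a small calculation. The sketch below is illustrative only: it assumes import throughput scales roughly linearly with the concurrent capacity, which this issue does not confirm, and the helper names are hypothetical. The capacity tiers and flag names are the ones used in the ChatOps commands in this plan, and the command order mirrors them (disable the other tiers, then enable the target).

```python
# Capacity tiers that have a feature flag in the ChatOps commands of this plan.
CAPACITY_TIERS = (5, 10, 25, 40)
FLAG_PREFIX = "container_registry_migration_phase2_capacity_"


def pick_capacity(remaining, hours_left, observed_rate_per_hour, current_capacity):
    """Pick the smallest tier expected to finish `remaining` imports in time.

    Assumes throughput scales linearly with capacity, which is a
    simplification rather than a measured property of the registry.
    """
    if remaining <= 0:
        return CAPACITY_TIERS[0]
    if hours_left <= 0 or observed_rate_per_hour <= 0:
        # No usable signal: fall back to the maximum capacity.
        return CAPACITY_TIERS[-1]
    rate_per_slot = observed_rate_per_hour / current_capacity
    needed = remaining / hours_left / rate_per_slot
    for tier in CAPACITY_TIERS:
        if tier >= needed:
            return tier
    return CAPACITY_TIERS[-1]


def chatops_commands(target):
    """Build the commands that leave only the `target` tier enabled."""
    if target not in CAPACITY_TIERS:
        raise ValueError(f"unsupported capacity: {target}")
    # Disable the other tiers first (highest to lowest), then enable the target.
    ordered = sorted(CAPACITY_TIERS, key=lambda tier: (tier == target, -tier))
    return [
        f"/chatops run feature set {FLAG_PREFIX}{tier} "
        f"{'true' if tier == target else 'false'}"
        for tier in ordered
    ]


if __name__ == "__main__":
    # Example inputs are made up: 3332 repositories, 72 hours, 30 imports/hour at capacity 5.
    target = pick_capacity(3332, 72, 30, 5)
    print("\n".join(chatops_commands(target)))
```

With the made-up inputs in the example, the required rate is about 46 imports per hour, so the sketch selects capacity 10.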
August 24
- ~"group::package" monitors the migration progress;
- ~"group::package" asks ~"team::Delivery" to review and schedule the change request for the second batch (for the weekend of August 27-28).
August 25
- ~"group::package" monitors the migration progress.
- ~"group::package" creates a change request to retry any imports that failed due to transient errors (if any) and asks ~"team::Delivery" to review and execute it (sample). -> Reset Skipped Container Repository Imports for ... (gitlab-com/gl-infra/production#7334 - closed)
August 26
- ~"group::package" checks the migration status. All repositories should have been imported by now. No additional `gitlab-org` repositories should be migrated (besides failures from previous days, if any) during this day;
- ~"group::package" increases the GC review delay from 24h to 48h (identical to gitlab-com/gl-infra/k8s-workloads/gitlab-com!2055 (merged)) late in the day. -> gitlab-com/gl-infra/k8s-workloads/gitlab-com!2075 (merged)
August 27
- ~"group::package" enables the migration (if not already enabled) and adds a comment to the rollout issue: `/chatops run feature set container_registry_migration_phase2_enabled true`
- ~"team::Delivery" executes the change request to kick off the migration of repositories scheduled for this weekend.
- ~"group::package" occasionally checks the migration logs and metrics to monitor progress.
August 28
- ~"group::package" checks the migration logs and metrics. All repositories should have been imported by now.
August 29
- ~"group::package" reverts the GC review delay increase, from 48h back to 24h.
- ~"group::package" stays alert for any reported issues with the migrated repositories and addresses any issues or failures from the weekend.