Importer With Blob Transfer Test Findings

Introduction

We've recently completed work to allow the transfer of blobs from one GCS bucket to another in #246 (closed), which is required for the gradual migration of the GitLab.com container registry proposed in #191 (closed).

Goals

While we've tested imports before, this is the first large-scale test of the new blob transfer feature. These tests allow us to see this feature working in more realistic conditions.

Additionally, we can start to gauge the impact of blob transfer on overall import time.

Finally, we are expanding testing to setups that should match production environments more closely, namely larger hosts, hosted databases, and multi-region buckets.

Test Setup

Basic single-region and multi-region bucket tests were done with a c2-standard-4 (4 vCPUs, 16 GB memory) VM in US East 1, first with two US East 1 buckets and again with two US multi-region buckets. The database was a Postgres 12 instance running in a container directly on the VM.

Realistic tests were done with an e2-standard-8 (8 vCPUs, 32 GB memory) VM and a 16 vCPU, 104 GB memory Cloud SQL instance, both with SSD disks running in US East 1, and two US multi-region buckets.

The bucket from which blobs were imported is a copy of the registry used by dev.gitlab.org, containing 10 TiB of non-garbage-collected data; the bucket to which they were imported was empty at the time the import began.

The following config was passed to the importer:

Registry Config
version: 0.1
log:
  fields:
    service: registry
  level: info
  formatter: text
  accesslog:
    disabled: false
    formatter: text
storage:
  delete:
    enabled: true
  gcs:
    bucket: <bucket name>
    keyfile: <key file>
  maintenance:
      uploadpurging:
          enabled: false
database:
  enabled:  true
  host:     <host>
  port:     <port>
  user:     "postgres"
  password: <pass>
  dbname:   "registry_test"
  sslmode:  "disable"
migration:
  disablemirrorfs: true
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]

Tests

Tagged Manifest Tests

The following command was run to initiate an import of the repository data associated with tagged manifests:

./registry database import --blob-transfer-destination "target-bucket"  config.yml

Given this command, only tagged images and their associated blobs will be imported. The imported blobs will be copied to the target bucket indicated via the flag.

Max 100

This test was run using the same command as above with a modified importer which would not import more than 100 tagged manifests. This test simulates an ideal cleanup policy with 100% adoption, with each repository preserving only the 100 most recent images.

Semver

This test was run using the same command as above with a modified importer which would import only images tagged with semver tags, master, or latest. If this list resulted in fewer than 10 images (the default number of latest images to keep), additional tags would be backfilled into the list, up to the total number of tagged images in the repository.

For reference, here is the regex used to match semver tags. It is modified from the suggested semver regex to optionally allow a single leading v character and to also match the latest and master tags.

^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$|^latest$|^master$
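
Below is a minimal Go sketch of the selection rule described above, not the importer's actual code: keep tags matching the regex (semver, latest, or master) and backfill non-matching tags until at least 10 are selected. The sample tags, the minTags constant, and the function names are invented for illustration.

package main

import (
    "fmt"
    "regexp"
)

// keep matches semver tags (with an optional leading v) plus latest and master,
// using the regex shown above.
var keep = regexp.MustCompile(`^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$|^latest$|^master$`)

// minTags is the default number of latest images to keep.
const minTags = 10

// selectTags returns the matching tags, backfilled with non-matching tags
// until minTags is reached or the repository's tags are exhausted.
func selectTags(tags []string) []string {
    selected := make([]string, 0, len(tags))
    var rest []string
    for _, t := range tags {
        if keep.MatchString(t) {
            selected = append(selected, t)
        } else {
            rest = append(rest, t)
        }
    }
    for _, t := range rest {
        if len(selected) >= minTags {
            break
        }
        selected = append(selected, t)
    }
    return selected
}

func main() {
    tags := []string{"v13.10.0", "13.9.1-rc1", "latest", "master", "feature-a", "feature-b"}
    fmt.Println(selectTags(tags)) // all six are kept: four matches, backfilled while below 10
}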

In Place

This test serves as a comparison point for a single-bucket migration, which must import dangling blobs as well as tagged manifests and their blobs.

Results

No blob transfer errors were encountered across all runs, resulting in tens of terabytes of blobs copied without incident. At least using the GCS backend, this process seems to be quite robust.

Summary

Here is the summary of the test runs described above:

Test Run | Completion Time | Tags | Manifests | Blobs | Repositories
--- | --- | --- | --- | --- | ---
basic tagged import single-region | 9.8 hours | 36729 | 28016 | 135143 | 258
basic tagged import multi-region | 13.2 hours | 36729 | 28016 | 135143 | 258
realistic tagged import | 13.6 hours | 36729 | 28016 | 135143 | 258
realistic max 100 | 1.9 hours | 4228 | 3511 | 21087 | 258
realistic semver | 6.9 hours | 11605 | 11141 | 78999 | 258
realistic in place | 13 hours | 36728 | 28015 | 302132 | 258

Switching from single-region to multi-region buckets with no other changes resulted in a 34.69% increase in total runtime ((13.2 − 9.8) / 9.8 ≈ 0.347).

All the counters matched between full runs, with the exception of the in place run, which received one unexpected 503 from GCS, resulting in one fewer tagged manifest being imported:

time="2021-03-30T11:26:18Z" level=error msg="importing manifest" count=3713 error="error obtaining configuration payload: gcs: googleapi: got HTTP response code 503 with body: Service Unavailable" name=12.2.0-rfbranch.123335.a716fb1e-0 target="sha256:f9e698b5b577cc9b755218f17987363e054174a6558f732310ef68a7d5b3c4c9" total=4985

Repository Import Times (Seconds)

Test Run | Min | Max | Average
--- | --- | --- | ---
basic tagged import single-region | 0.301 | 7841.407 | 169.667
basic tagged import multi-region | 0.308 | 11391.370 | 228.119
realistic tagged import | 0.385 | 10981.155 | 234.49
realistic max 100 | 0.256 | 351.193 | 33.1276
realistic semver | 0.255 | 10945.871 | 118.182
realistic in place | 0.341 | 3711.3 | 66.253

Blob Transfer Times (Seconds)

Test Run | Min | Max | Average
--- | --- | --- | ---
basic tagged import single-region | 0.007 | 10.343 | 0.088
basic tagged import multi-region | 0.007 | 22.831 | 0.134
realistic tagged import | 0.007 | 2.181 | 0.136
realistic max 100 | 0.008 | 2.181 | 0.123
realistic semver | 0.007 | 19.038 | 0.129

Discussion

Import Speed

These tests indicate that blob transfer is quick and robust, though multi-region buckets show increased variance in blob transfer times. Additionally, the same blobs can have non-proportionate import times between runs, meaning that we don't see a flat percentage increase per blob from the single-region transfer to the multi-region transfer. The logs include the files single-import-times-by-digest.txt and mutli-import-times-by-digest.txt to ease comparison.

Even with the fast speeds we observed in this test, when scaling to the production registry, which is roughly 1,000 times the size of the test data, imports will take anywhere from 81 days in the ideal cleanup policy adoption scenario to 580 days in the zero cleanup policy adoption scenario. These numbers assume a 1:1 scaling of import times, a constantly running importer, and imports which always succeed on the first try.
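
As a rough sanity check of those figures, scaling the runs above linearly gives 1.9 hours × 1,000 ≈ 1,900 hours ≈ 79 days for the max 100 scenario and 13.6 hours × 1,000 ≈ 13,600 hours ≈ 567 days for the full tagged import, which is in the same ballpark as the 81 to 580 day range.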

I believe that we should investigate a "fan out fan in" approach to the data migration step mentioned in the gradual migration proposal: #191 (closed).

In this approach, we would send the inventory of repositories to a number of concurrent workers, which would report the success or failure of each import. Each repository import relies on many sequential filesystem reads compared to database writes, and repository import time can vary to a large degree, as we've seen above. Given those factors, I think this is the best way to reduce the total amount of time it will take to migrate the GitLab.com registry. Additionally, we can control the rate at which the import happens by adjusting the number of workers. An issue for this service is here: #319 (closed)
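
To make the shape of this concrete, here is a minimal fan out fan in sketch in Go. It is not the dispatcher proposed in #319; the repository inventory, worker count, and the importOne function are placeholders, and in practice importOne would invoke the actual repository import.

package main

import (
    "fmt"
    "sync"
    "time"
)

type result struct {
    repo string
    err  error
}

// importOne stands in for a single repository import.
func importOne(repo string) error {
    time.Sleep(10 * time.Millisecond) // simulate work
    return nil
}

func main() {
    repos := []string{"group/app", "group/web", "group/api"} // inventory of repositories
    jobs := make(chan string)
    results := make(chan result)

    const workers = 4 // adjusting this controls the overall import rate
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for repo := range jobs {
                results <- result{repo: repo, err: importOne(repo)}
            }
        }()
    }

    // Fan out: feed the inventory to the workers.
    go func() {
        for _, r := range repos {
            jobs <- r
        }
        close(jobs)
    }()

    // Fan in: close the results channel once all workers are done.
    go func() {
        wg.Wait()
        close(results)
    }()

    for r := range results {
        fmt.Printf("%s: err=%v\n", r.repo, r.err)
    }
}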

We will need to ensure the import tool can tolerate concurrent imports in order to undertake this work, and we should probably do so regardless, since under the original proposal the target registry will be in use while the migration is occurring. An issue for this work is here: #328 (closed)
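
To illustrate one way the importer could tolerate concurrent operation (an assumption on my part, not necessarily the approach #328 will take), each worker could hold a per-repository Postgres advisory lock so that no two workers ever import the same repository at once. The connection string, lockKey helper, and importWithLock function below are hypothetical.

package main

import (
    "database/sql"
    "fmt"
    "hash/fnv"

    _ "github.com/lib/pq"
)

// lockKey derives a stable 64-bit advisory lock key from the repository path.
func lockKey(repo string) int64 {
    h := fnv.New64a()
    h.Write([]byte(repo))
    return int64(h.Sum64())
}

// importWithLock serializes imports of the same repository across workers by
// taking a transaction-scoped advisory lock before doing any work.
func importWithLock(db *sql.DB, repo string) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    defer tx.Rollback()

    // Held until the transaction commits or rolls back.
    if _, err := tx.Exec("SELECT pg_advisory_xact_lock($1)", lockKey(repo)); err != nil {
        return err
    }

    // ... perform the repository import inside the locked transaction ...

    return tx.Commit()
}

func main() {
    // Placeholder DSN matching the test database name from the config above.
    db, err := sql.Open("postgres", "postgres://postgres:password@localhost:5432/registry_test?sslmode=disable")
    if err != nil {
        panic(err)
    }
    fmt.Println(importWithLock(db, "group/app"))
}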

When Blob Transfer Is Faster

Looking at the results of the in place import, we achieved roughly equivalent import times compared to importing only tagged manifests and transferring their blobs. The in place import resulted in 302,132 blobs being imported, while the tagged manifest import resulted in 135,143 blobs, a 55% decrease (1 − 135,143 / 302,132 ≈ 0.55). Given this data, I think we can assume that once a registry deployment has roughly 60% dangling blobs, it's quicker to transfer the blobs to a new bucket than to import all blobs in place.

Very Large Repositories

The largest repositories can take a significant time to import: the largest repository here took around two hours, and the GitLab.com registry contains repositories that would potentially take around one day if import speed were to scale 1:1 from this test to the real import. During the import, a repository should be read-only to prevent new writes while the import is taking place, as the import may not detect them.

Since the import is idempotent, it's possible for the largest repositories to run a pre-import that imports the majority of objects, with a second pass completing much more quickly, as already-imported objects will be read from the database and skipped over. If the first pass only includes manifests and blobs, we don't have to worry about writes or deletes: we only preserve tagged data, so anything untagged would be garbage collected anyway, and anything written would be caught on the second pass once write locking is in place.

I think we should determine a limit for how much downtime a single user should experience during the data migration, as well as a strategy to help prepare users for a partial service disruption and give them a rough estimate of when their groups will be affected.

Production Registry Statistics

So far we've been using a copy of the registry from dev.gitlab.org to test imports, and while these data are realistic, we are not sure how well they match the production registry. In particular, these data might skew towards having a much larger proportion of GitLab registries, which are most likely atypically large.

Despite this skew, I think that we'll see a similar distribution in the production data, with most registries being relatively small and a few being much larger. It's possible that we'll need to work directly with the largest registries in order to facilitate their migration. Given this, we should determine an upper bound of tags for likely import success and determine how many repositories exceed this limit, to ensure that there are few enough that they can be taken care of on an individual basis outside the automated process. Additionally, we should not attempt to automatically import repositories over this limit, since we can expect them to fail.

If we want to address the outlying registries, then we should be able to use simple random sampling to determine general statistics on the number of tags per repository, and then use that data to estimate the length of time that the automated import will take.

Conversely, or in addition, we could rely on statistics gathered from the imports themselves and project forward in time based on those; for example, average repository import time multiplied by the remaining number of repositories to import. These stats would grow more accurate over time, but would initially be somewhat low quality, which would negatively impact long-term planning.
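
Here is a minimal sketch of that projection; the average time, repository count, and worker counts are placeholders rather than measured values.

package main

import (
    "fmt"
    "time"
)

// projectRemaining estimates the remaining wall-clock time given the average
// repository import time observed so far, the number of repositories left,
// and the number of concurrent workers (1 for a serial importer).
func projectRemaining(avgPerRepo time.Duration, remaining, workers int) time.Duration {
    if workers < 1 {
        workers = 1
    }
    total := avgPerRepo * time.Duration(remaining)
    return total / time.Duration(workers)
}

func main() {
    avg := 170 * time.Second // e.g. a running average taken from import logs
    fmt.Println(projectRemaining(avg, 250000, 1))  // serial importer
    fmt.Println(projectRemaining(avg, 250000, 10)) // 10 concurrent workers
}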

Tag Cleanup Policies

We've previously discussed the positive effects of tag cleanup policies on the import process, but looking at the differences in import times for all tagged images vs. the simulated cleanups, I think we need to make them a more central component of the overall migration plan: both by prioritizing repositories with successfully run cleanup policies, which we can access via the information added in gitlab#276479 (closed), and by encouraging the largest repositories to adopt these policies as they become available to them. This introduces a level of sophistication in the way we choose which repositories to import first that I do not believe has been previously discussed, so we should open an issue to determine what kind of features we would like from the system that picks repositories to import.

Action Items

  • Rerun the import-with-blob-transfer test against multi-region buckets
  • Discuss a multi-worker "fan out fan in" approach to data migration
  • Open an issue for a repository import dispatcher #319 (closed)
  • Determine a way to predict and/or monitor import duration for the GitLab.com registry
  • Ensure that importer is ready for concurrent operation #328 (closed)
  • Catalog the number of registries we expect to be too large to import
    • Reach out to the owners of those registries to encourage them to adopt tag cleanup policies
Logs

Note the primary logs from both tests are too large to include here, even with compression.

  • single-import-times-by-digest.txt.tar.bz
  • mutli-import-times-by-digest.txt.tar.bz
  • fresh-import-no-transfer-gitlab-assets-ee-tagged-only-same-all-us-east-1.log.tar.bz
