Importer With Blob Transfer Test Findings
Introduction
We've recently completed work in #246 (closed) to allow the transfer of blobs from one GCS bucket to another, which is required for a gradual migration of the container registry for GitLab.com, as proposed in #191 (closed).
Goals
While we've tested imports before, this is the first large-scale test of the new blob transfer feature. These tests allow us to see this feature working in more realistic conditions.
Additionally, we can start to gauge the impact of blob transfer on overall import time.
Finally, we are expanding testing to setups which should match production environments more closely. Namely, with larger hosts, hosted databases, and multi-region buckets.
Test Setup
Basic single-region and multi-region bucket tests were run on a c2-standard-4 (4 vCPUs, 16 GB memory) VM in US East 1, first with two US East 1 buckets and again with two US multi-region buckets. The database was a Postgres 12 instance running in a container directly on the VM.
Realistic tests were run on an e2-standard-8 (8 vCPUs, 32 GB memory) VM and a 16 vCPU, 104 GB memory Cloud SQL instance, both with SSD disks running in us-east-1, with two US multi-region buckets.
The bucket from which blobs were imported is a copy of the registry used by dev.gitlab.org, containing 10 TiB of non-garbage-collected data; the bucket to which they were imported was empty at the time the import began.
The following config was passed to the importer:
Registry Config
```yaml
version: 0.1
log:
  fields:
    service: registry
  level: info
  formatter: text
  accesslog:
    disabled: false
    formatter: text
storage:
  delete:
    enabled: true
  gcs:
    bucket: <bucket name>
    keyfile: <key file>
  maintenance:
    uploadpurging:
      enabled: false
database:
  enabled: true
  host: <host>
  port: <port>
  user: "postgres"
  password: <pass>
  dbname: "registry_test"
  sslmode: "disable"
migration:
  disablemirrorfs: true
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]
```
Tests
Tagged Manifest Tests
The following command was run to initiate an import of the repository data associated with tagged manifests:
```shell
./registry database import --blob-transfer-destination "target-bucket" config.yml
```
Given this command, only tagged images and their associated blobs will be imported. The imported blobs will be copied to the target bucket indicated via the flag.
Max 100
This test was run using the same command as above with a modified importer which would not import more than 100 tagged manifests. It simulates an ideal cleanup policy with 100% adoption, with each repository preserving only the 100 most recent images.
Semver
This test was run using the same command as above with a modified importer which would import only images tagged with semver tags, master, or latest. If this list resulted in fewer than 10 images (the default number of latest images to keep), additional tags would be backfilled into the list, up to the total number of tagged images in the repository.
For reference, here is the regex used to match semver tags. It is modified from the suggested semver regex to optionally allow a single `v` character at the beginning, and to also match the tags `latest` and `master`.

```
^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$|^latest$|^master$
```
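As a quick sanity check, the pattern can be exercised directly, here in Python, against a few representative tags (the tag list below is illustrative, not taken from the test data):

```python
import re

# Semver pattern with an optional leading "v", plus the literal
# tags "latest" and "master" as alternations.
SEMVER_TAG = re.compile(
    r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
    r"|^latest$|^master$"
)

for tag in ["v1.2.3", "1.0.0-rc.1", "latest", "master", "v1.2", "feature-branch"]:
    print(tag, bool(SEMVER_TAG.match(tag)))
```

Tags like `v1.2` (only two version components) and arbitrary branch names fall through to the backfill behaviour described above.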
In Place
This test serves as a comparison point for a single-bucket migration, which must import dangling blobs as well as tagged manifests and their blobs.
Results
No blob transfer errors were encountered across all runs, resulting in tens of terabytes of blobs copied without incident. At least using the GCS backend, this process seems to be quite robust.
Summary
Here is the summary of the test runs described above:
Test | Completion Time | Tags | Manifests | Blobs | Repositories |
---|---|---|---|---|---|
basic tagged import single-region | 9.8 hours | 36729 | 28016 | 135143 | 258 |
basic tagged import multi-region | 13.2 hours | 36729 | 28016 | 135143 | 258 |
realistic tagged import | 13.6 hours | 36729 | 28016 | 135143 | 258 |
realistic max 100 | 1.9 hours | 4228 | 3511 | 21087 | 258 |
realistic semver | 6.9 hours | 11605 | 11141 | 78999 | 258 |
realistic in place | 13 hours | 36728 | 28015 | 302132 | 258 |
Switching from single-region to multi-region buckets with no other changes resulted in a 34.69% increase in total runtime.
All the counters matched between full runs, with the exception of the in-place run, which received one unexpected 503 from GCS, resulting in one fewer tagged manifest being imported:
```
time="2021-03-30T11:26:18Z" level=error msg="importing manifest" count=3713 error="error obtaining configuration payload: gcs: googleapi: got HTTP response code 503 with body: Service Unavailable" name=12.2.0-rfbranch.123335.a716fb1e-0 target="sha256:f9e698b5b577cc9b755218f17987363e054174a6558f732310ef68a7d5b3c4c9" total=4985
```
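Since a single transient 503 cost the run one manifest, per-object operations could be wrapped in retries with exponential backoff, whether in the importer itself or in orchestration around it. A minimal sketch in Python; the function name and retry parameters are illustrative, not the importer's actual API:

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Run operation(), retrying on exception with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # exhausted; surface the error so the run can flag it
            # Sleep base_delay * 2^(attempt-1), with jitter to spread retries out.
            time.sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

With even a handful of attempts, an isolated Service Unavailable response would no longer leave the counters off by one.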
Repository Import Times (Seconds)
Test | Min | Max | Average |
---|---|---|---|
basic tagged import single-region | 0.301 | 7841.407 | 169.667 |
basic tagged import multi-region | 0.308 | 11391.370 | 228.119 |
realistic tagged import | 0.385 | 10981.155 | 234.49 |
realistic max 100 | 0.256 | 351.193 | 33.1276 |
realistic semver | 0.255 | 10945.871 | 118.182 |
realistic in place | 0.341 | 3711.3 | 66.253 |
Blob Transfer Times (Seconds)
Test | Min | Max | Average |
---|---|---|---|
basic tagged import single-region | 0.007 | 10.343 | 0.088 |
basic tagged import multi-region | 0.007 | 22.831 | 0.134 |
realistic tagged import | 0.007 | 2.181 | 0.136 |
realistic max 100 | 0.008 | 2.181 | 0.123 |
realistic semver | 0.007 | 19.038 | 0.129 |
Discussion
Import Speed
These tests indicate that blob transfer is quick and robust, though multi-region buckets show increased variance in blob transfer times. Additionally, the same blobs can have disproportionate import times between runs, meaning we do not see a flat percentage increase per blob from the single-region transfer to the multi-region transfer. The logs include the files single-import-times-by-digest.txt and mutli-import-times-by-digest.txt to ease comparison.
Even with the fast speeds observed in this test, scaling to the production registry, which is roughly 1000 times the size of the test data, imports will take anywhere from 81 days in the ideal cleanup policy adoption scenario to 580 days in the zero cleanup policy adoption scenario. These numbers assume import times scale 1:1, a constantly running importer, and imports that always succeed on the first try.
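The arithmetic behind those projections can be made explicit. This is a sketch assuming linear 1:1 scaling from the max-100 (1.9 h) and realistic tagged import (13.6 h) runs; straight-line math lands slightly under the quoted figures, which presumably include some headroom:

```python
SCALE = 1000  # production registry is assumed to be ~1000x the test data

def projected_days(test_hours, scale=SCALE):
    """Linearly project a measured import duration to production scale."""
    return test_hours * scale / 24

best = projected_days(1.9)    # ideal cleanup adoption (realistic max 100 run)
worst = projected_days(13.6)  # zero cleanup adoption (realistic tagged run)
print(f"best case:  {best:.0f} days")
print(f"worst case: {worst:.0f} days")
```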
I believe that we should investigate a "fan out fan in" approach to the data migration step mentioned in the gradual migration proposal: #191 (closed).
In this approach, we would send the inventory of repositories to a number of concurrent workers, each of which would report the success or failure of its import. Each repository import relies on many sequential filesystem reads relative to database writes, and repository import time can vary to a large degree, as we've seen above. Given those factors, I think this is the best way to reduce the total amount of time it will take to migrate the GitLab.com registry. Additionally, we can control the rate at which the import happens by adjusting the number of workers. An issue for this service is here: #319 (closed)
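The shape of that fan out/fan in step might look like the following sketch using a Python thread pool; `import_repository` is a hypothetical stand-in for invoking the actual importer:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def import_repository(path):
    # Hypothetical stand-in: run the registry importer for one repository
    # and report (path, succeeded). The real worker would shell out or RPC.
    return path, True

def run_migration(repository_paths, workers=8):
    """Fan the repository inventory out to workers and collect the results."""
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(import_repository, p) for p in repository_paths]
        for future in as_completed(futures):
            path, ok = future.result()
            if not ok:
                failures.append(path)  # candidates for retry or manual follow-up
    return failures
```

Throughput is then tunable via `workers`, and the failure list gives the dispatcher a natural retry queue.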
We will need to ensure the import tool can tolerate concurrent imports in order to undertake this work, and we should likely undertake this regardless, since in the original proposal the target registry will be in use while the migration is occurring. An issue for this work is here: #328 (closed)
When Blob Transfer Is Faster
Looking at the results of the in-place import, we achieved roughly equivalent import times compared to importing only tagged manifests and transferring their blobs. The in-place import resulted in 302,132 blobs being imported, while the tagged manifest import resulted in 135,143 blobs, a 55% decrease. Given this data, I think we can assume that once a registry deployment has roughly 60% dangling blobs, it is quicker to transfer the tagged blobs to a new bucket than to import all blobs in place.
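The dangling fraction behind that threshold follows directly from the blob counts in the summary table:

```python
in_place_blobs = 302_132  # all blobs, including dangling ones
tagged_blobs = 135_143    # only blobs reachable from tagged manifests

dangling_fraction = 1 - tagged_blobs / in_place_blobs
print(f"{dangling_fraction:.1%} of blobs were dangling")
```

Since the two runs took roughly equal wall time with about 55% of blobs dangling, the break-even point for preferring blob transfer sits in that neighbourhood, which is where the ~60% rule of thumb comes from.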
Very Large Repositories
The largest repositories can take a significant time to import; the largest repository here took around two hours, but the registry for GitLab.com contains repositories that would potentially take around one day if import speed were to scale 1:1 from this test to the real import. During the import, a repository should be read-only to prevent new writes while the import is taking place, as the import may not detect them.
Since the import is idempotent, for the largest repositories we could run a pre-import to bring over the majority of objects, with a second pass completing much more quickly since already-imported objects are read from the database and skipped over. If the first pass only includes manifests and blobs, we don't have to worry about writes or deletes: we only preserve tagged data, so anything that becomes untagged would have been garbage collected anyway, and anything newly written would be caught on the second pass once write locking is in place.
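A sketch of that two-pass idea, assuming object digests can be checked against the database; `already_imported` and `import_object` are hypothetical helpers, not the importer's real interface:

```python
def import_pass(digests, already_imported, import_object):
    """Import every object whose digest is not yet in the database.

    A second pass over the same list is fast: already-imported digests
    are resolved with a cheap database read and skipped, so only objects
    written between the two passes cost real work.
    """
    copied = 0
    for digest in digests:
        if already_imported(digest):
            continue
        import_object(digest)
        copied += 1
    return copied
```

The first (online) pass does the bulk of the copying; the second (read-only) pass then only touches whatever landed in between, shrinking the write-locked window.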
I think we should determine a limit for how much downtime a single user should experience during the data migration, as well as a strategy to help prepare users for a partial service disruption and give them a rough estimate of when their groups will be affected.
Production Registry Statistics
So far we've been using a copy of the registry from dev.gitlab.org to test imports, and while these data are realistic, we are not sure how well they match the production registry. In particular, these data might skew towards a much larger proportion of GitLab registries, which are most likely atypically large.
Despite this skew, I think we'll see a similar distribution in the production data, with most registries being relatively small and a few being much larger. It's possible that we'll need to work directly with the largest registries in order to facilitate their migration. Given this, we should determine an upper bound of tags for likely import success and determine how many repositories exceed this limit, to ensure there are few enough that they can be handled individually outside the automated process. Additionally, we should not attempt to automatically import repositories over this limit, since we can expect them to fail.
If we want to address the outlying registries, we should be able to use simple random sampling to determine general statistics on the number of tags per repository, and then use that data to estimate the length of time the automated import will take.
Conversely, or in addition, we could rely on statistics gathered from the imports themselves and project forward in time based on those: for example, average repository import time multiplied by the remaining number of repositories to import. These stats would grow more accurate over time, but would initially be of somewhat low quality, which would negatively impact long-term planning.
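That running projection is simple to maintain as imports complete; a minimal sketch (the figures in the test are illustrative, not measured):

```python
def estimate_remaining_hours(completed_durations, remaining_count):
    """Project remaining import time from the average of completed imports.

    completed_durations is a list of per-repository import times in seconds;
    returns None until at least one import has finished, at which point the
    caller should fall back to a sampling-based estimate.
    """
    if not completed_durations:
        return None
    average = sum(completed_durations) / len(completed_durations)
    return average * remaining_count / 3600
```

As the skewed long tail of large repositories completes, the average, and therefore the projection, converges toward the true remaining time.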
Tag Cleanup Policies
We've previously discussed the positive effects of tag cleanup policies on the import process. Looking at the differences in import times for all tagged images vs. the simulated cleanup, I think we need to make them a more central component of the overall migration plan: both by prioritizing repositories with successfully run cleanup policies, which we can identify via the information added in gitlab#276479 (closed), and by encouraging the largest repositories to adopt these policies as they become available to them. This introduces a level of sophistication in how we choose which repositories to import first that I do not believe has been previously discussed, so we should open an issue to determine what features we would like in the system that picks repositories to import.
Action Items
- Rerun the import with blob transfer test against multi-region buckets
- Discuss a multiple import workers "fan out fan in" approach to data migration
- Open an issue for a repository import dispatcher #319 (closed)
- Determine a way to predict and/or monitor import duration for the GitLab.com registry
- Ensure that the importer is ready for concurrent operation #328 (closed)
- Catalog the number of registries we expect to be too large to import
- Reach out to the owners of those registries to encourage them to adopt tag cleanup policies
Logs
Note the primary logs from both tests are too large to include here, even with compression.
- single-import-times-by-digest.txt.tar.bz
- mutli-import-times-by-digest.txt.tar.bz
- fresh-import-no-transfer-gitlab-assets-ee-tagged-only-same-all-us-east-1.log.tar.bz