We're building a tool to populate a database with the metadata from a registry filesystem (#56 (closed)). Once that is done, we should use a copy of the dev.gitlab.org registry to test it and see how much space it takes in the database. Once complete, we can extrapolate the space requirements, taking into account the dev.gitlab.org registry size.
I've run the import tool against a clean (post-garbage collection) copy of the dev.gitlab.org registry in GCP, with some interesting results. This registry is 6.3 TB in size.
The registry binary was built from aa4de59d, and it took 5h40m to import all repositories (ignoring unreferenced blobs). The import ran OK, with only 22 non-fatal errors, all due to missing _manifests or _tags folders (corrupted repositories, which we skip).
In the end, the database is 218 MB in size, with the following distribution:
| Table | Rows | Size (kB) | Size (MB) | Avg Size per Row (kB) |
| --- | --- | --- | --- | --- |
| repositories | 271 | 112 | 0.11 | 0.41 |
| manifests | 27,944 | 38912 | 38.00 | 1.39 |
| layers | 107,202 | 33792 | 33.00 | 0.32 |
| tags | 36,730 | 6680 | 6.52 | 0.18 |
| manifest_configurations | 27,942 | 100352 | 98.00 | 3.59 |
| manifest_layers | 206,424 | 19456 | 19.00 | 0.09 |
| repository_manifests | 28,017 | 2720 | 2.66 | 0.10 |
| manifest_lists | 0 | 24 | 0.02 | Should be similar to manifests |
| manifest_list_items | 0 | 16 | 0.02 | Should be similar to manifest_layers |
| repository_manifest_lists | 0 | 16 | 0.02 | Should be similar to repository_manifests |
## Analysis
- There are no manifest lists in the registry.
- Among all 27,944 manifests, only one is a Docker schema v1 manifest; the rest are Docker schema v2. There are no OCI manifests in the registry.
- manifest_configurations has a fairly high average size per row; ideally it would be closer to that of manifests.
- I found it odd that two manifests share the same configuration hash (27,942 manifest configurations for 27,944 manifests).
## Next Steps

- Manifest configurations have the largest payload (json) values, so the higher storage requirements are not a surprise. We should simulate an import using a base64-encoded payload (in a text column) instead (#68 (closed)).
- See why two manifests share the same configuration.
- Right now we run the import as follows:
  - Loop over repositories
    - Import the manifests in the repository
    - Import the tags in the repository

  I'm curious to see if doing it like this instead would make a significant difference:
  - Loop over repositories
    - Loop over the tags in the repository
      - Import the manifest referenced by the tag
      - Import the tag

  In the current version we use a path walk to loop over manifests. In the second option we would instead get all tags in a repository (a list operation) and then fetch each referenced manifest directly, without needing the path walk.
- We need to run an import against a dirty copy of the dev.gitlab.org registry so that we can test a database garbage collection and compare the results. In the end, the resulting database should have the same contents as the one created by importing a clean copy of the registry.
- Extrapolate a database query rate based on common API operations and the required underlying queries.
- Later on, we need to conduct a complexity analysis of the commonly used queries to see if we can spot any obvious optimizations (probably a separate issue).
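As a rough sketch of the tag-driven import variant, the loop could look like the following. The repository/tag data and function names here are hypothetical stand-ins, not the importer's actual API; a real implementation would read tags via list operations on storage.

```go
package main

import "fmt"

// Hypothetical stand-in for the registry filesystem: repository -> tag -> manifest digest.
var tagsByRepo = map[string]map[string]string{
	"gitlab/alpine": {"latest": "sha256:aaa", "v1": "sha256:aaa", "v2": "sha256:bbb"},
	"gitlab/ubuntu": {"latest": "sha256:ccc"},
}

// importManifest imports a manifest once, skipping digests we've already seen
// (several tags commonly point at the same manifest).
func importManifest(digest string, seen map[string]bool) {
	if seen[digest] {
		return
	}
	seen[digest] = true
	fmt.Println("import manifest", digest)
}

// importByTags loops over repositories and their tags, importing only the
// manifests referenced by a tag - no path walk over _manifests is needed.
// It returns the number of unique manifests imported.
func importByTags() int {
	seen := map[string]bool{}
	for repo, tags := range tagsByRepo {
		for tag, digest := range tags {
			importManifest(digest, seen)
			fmt.Println("import tag", repo+":"+tag)
		}
	}
	return len(seen)
}

func main() {
	fmt.Println("unique manifests imported:", importByTags())
}
```

One thing to verify with this approach is that manifests not referenced by any tag would never be visited.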
Two distinct manifests shouldn't share the same configuration: they represent two distinct images, and no two images have the same configuration.
The manifest payloads are identical; the only difference is config/mediaType. According to the Docker specification, application/octet-stream is not a valid media type for an image configuration, so the second manifest is invalid and shouldn't exist. The corresponding repository (gitlab/omnibus-gitlab) should have referenced the first manifest instead.
After the fix for #71 (closed), the representation of the relationship between manifests and repositories changed, but we left a configuration_id foreign key in manifests, which left the door open for this issue. We have raised an issue to fix this (#91 (closed)).
There is a payload column in the manifests, manifest_lists and manifest_configurations tables. The type of this column is json, not jsonb, because we need to preserve whitespace and the order of keys (we serve these payloads to clients, and their checksums must be preserved).
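For illustration, jsonb normalizes the stored document while json keeps it byte for byte:

```sql
-- jsonb reorders keys and collapses insignificant whitespace;
-- json preserves the document exactly as written
SELECT '{"b": 1,  "a": 2}'::json  AS as_json,   -- {"b": 1,  "a": 2}
       '{"b": 1,  "a": 2}'::jsonb AS as_jsonb;  -- {"a": 2, "b": 1}
```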
We're going to look at manifest_configurations, as it takes up the most space, but the same conclusions should apply to the other tables as well.
The payload column of the manifest_configurations table uses 70 MB of the table's 98 MB total.
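That number can be obtained with a query along these lines (a sketch; pg_column_size measures the on-disk size of each value):

```sql
-- Size of the payload column alone vs. the whole table
SELECT pg_size_pretty(sum(pg_column_size(payload))) AS payload_size,
       pg_size_pretty(pg_total_relation_size('manifest_configurations')) AS table_size
FROM manifest_configurations;
```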
I did a quick test to compare the storage requirements when using a json type (current), a text column with the JSON encoded in base64 (as done by the Docker EE DTR) and a bytea. All of these allow us to preserve whitespace and the order of keys.
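The comparison can be sketched with a scratch table like the one below; the table and column names are illustrative, not part of the schema:

```sql
-- Illustrative scratch table holding the same payload in three encodings
CREATE TABLE payload_encoding_test (
    payload_json   json,
    payload_base64 text,
    payload_bytea  bytea
);

-- Casting json to text preserves the stored document verbatim
INSERT INTO payload_encoding_test
SELECT payload,
       encode(convert_to(payload::text, 'UTF8'), 'base64'),
       convert_to(payload::text, 'UTF8')
FROM manifest_configurations;

SELECT pg_size_pretty(sum(pg_column_size(payload_json)))   AS json_size,
       pg_size_pretty(sum(pg_column_size(payload_base64))) AS base64_size,
       pg_size_pretty(sum(pg_column_size(payload_bytea)))  AS bytea_size
FROM payload_encoding_test;
```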
It's no surprise that the base64 version takes more space, as the encoded strings are always longer. The Docker EE DTR uses this strategy because it uses a document database, which only supports JSON types, making it the only viable option for them.
Although the bytea column uses the same amount of space as the json column (the escape encoding format converts zero bytes and high-bit-set bytes to octal sequences and doubles backslashes), by using a binary type we can leverage aggressive data compression.
## Compression
JSON payloads are really good candidates for compression, as they have many repeated characters (whitespace, quotes, duplicated keys, etc.). If we use a binary type (bytea), we can compress the data on writes and decompress on reads at the application level, using an aggressive format/algorithm like GZIP.
I filled a new column with the compressed payload (using the pgsql-gzip extension for demonstration purposes, although this should actually be done at the application level).
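A sketch of that demonstration, assuming the pgsql-gzip extension is installed (the payload_gzip column name is illustrative):

```sql
-- Demonstration only: compress via pgsql-gzip in the database;
-- in production the application would compress before writing
CREATE EXTENSION IF NOT EXISTS gzip;

ALTER TABLE manifest_configurations ADD COLUMN payload_gzip bytea;

UPDATE manifest_configurations
SET payload_gzip = gzip(payload::text);

SELECT pg_size_pretty(sum(pg_column_size(payload)))      AS json_size,
       pg_size_pretty(sum(pg_column_size(payload_gzip))) AS gzip_size
FROM manifest_configurations;
```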
The GZIP compressed payload requires ~30% less space than the plain json and bytea payloads.
## Conclusions
json was the easiest option to store the payloads, due to the transparent encoding/decoding and also because it allowed us to move faster during the initial development stage (we could simply look at the database to debug an issue with the payload deconstruction).
We should now use bytea instead of json to store payloads for manifests, manifest_lists and manifest_configurations. This will still allow us to preserve whitespace and the order of keys, but will also give us the option to leverage aggressive compression algorithms at the application level if/when needed.
Using bytea will also stop us from querying json fields and force us to create new columns whenever needed (which aligns with the database development guidelines). These payloads are stored in the database only so that we can serve them to clients through the API, ensuring they match the checksum on the client side (given that whitespace and the order of keys are preserved). So there is no downside to not being able to query or view a plain JSON document in this column.
We're using the Go json.RawMessage type (which is a wrapper around a byte slice) to represent these payload columns, so the encoding/decoding would remain transparent when using bytea instead of json, without any code changes. The only change we'd have to make at the application level is compressing/decompressing the payload with compress/gzip.