We're building a tool to populate a database with the metadata from a registry filesystem (#56 (closed)). Once that is done, we should use a copy of the dev.gitlab.org registry to test it and see how much space it takes on the database. Once complete, we can extrapolate the space requirements, taking into account the dev.gitlab.org registry size.
I've run the import tool against a clean (after garbage collection) copy of the dev.gitlab.org registry in GCP, with some interesting results. This registry is 6.3TB in size.
The registry binary was built based on aa4de59d and it took 5h40 to import all repositories (ignoring unreferenced blobs). It ran OK, with only 22 non-fatal errors, all due to missing _manifests or _tags folders (corrupted repositories, which we skip).
In the end the database is 218 MB in size, with the following distribution:
| Table | Rows | Size (kB) | Size (MB) | Avg Size per Row (kB) |
|---|---|---|---|---|
| repositories | 271 | 112 | 0.11 | 0.41 |
| manifests | 27,944 | 38912 | 38.00 | 1.39 |
| layers | 107,202 | 33792 | 33.00 | 0.32 |
| tags | 36,730 | 6680 | 6.52 | 0.18 |
| manifest_configurations | 27,942 | 100352 | 98.00 | 3.59 |
| manifest_layers | 206,424 | 19456 | 19.00 | 0.09 |
| repository_manifests | 28,017 | 2720 | 2.66 | 0.10 |
| manifest_lists | 0 | 24 | 0.02 | Should be similar to manifests |
| manifest_list_items | 0 | 16 | 0.02 | Should be similar to manifest_layers |
| repository_manifest_lists | 0 | 16 | 0.02 | Should be similar to repository_manifests |
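The "Avg Size per Row" values are simply the table size divided by the row count. A minimal sketch of that arithmetic, using two of the figures listed above (the helper name `avgKBPerRow` is made up for illustration):

```go
package main

import "fmt"

// avgKBPerRow returns the average table size per row in kB.
// It returns 0 for an empty table to avoid dividing by zero.
func avgKBPerRow(sizeKB float64, rows int) float64 {
	if rows == 0 {
		return 0
	}
	return sizeKB / float64(rows)
}

func main() {
	// Sizes (kB) and row counts as reported for the first import.
	fmt.Printf("manifests: %.2f kB/row\n", avgKBPerRow(38912, 27944))
	fmt.Printf("manifest_configurations: %.2f kB/row\n", avgKBPerRow(100352, 27942))
}
```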
Analysis
There are no manifest lists in the registry. Among all 27,944 manifests, only 1 is a Docker schema v1 manifest, the rest are Docker schema v2. There are no OCI manifests in the registry.
manifest_configurations has a fairly high size per row. Ideally I would like to see it closer to manifests.
I found it odd that two manifests share the same configuration hash (27,942 manifest configurations for 27,944 manifests).
Next Steps
Manifest configurations have the largest payload (json) values, so the higher storage requirements are not a surprise. We should simulate an import using a base64 encoded payload (using a text column) instead (#68 (closed)).
See why there are two manifests sharing the same configuration.
Right now we're running the import as:
Loop over repositories
    Import manifests in repository
    Import tags in repository
I'm curious to see if doing it like this would make a significant difference:
Loop over repositories
    Loop over tags in repository
        Import manifest referenced by tag
        Import tag
In the current version we use a path walk to loop over manifests. In the second option we would instead get all tags in a repository (list operation) and then grab the referenced manifest directly, without needing the path walk.
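The two loop shapes can be sketched side by side. This is only an illustration under stated assumptions: `repo` and `database` are hypothetical in-memory stand-ins (not the registry's real types), the map iteration over `manifests` stands in for the filesystem path walk, and every manifest in the sample is tagged, so both strategies are expected to produce the same records.

```go
package main

import "fmt"

// repo is a hypothetical stand-in for one repository on the registry filesystem.
type repo struct {
	manifests map[string]string // manifest digest -> payload
	tags      map[string]string // tag name -> manifest digest
}

// database is a hypothetical stand-in for the metadata database.
type database struct {
	manifests map[string]bool
	tags      map[string]string
}

func newDatabase() *database {
	return &database{manifests: map[string]bool{}, tags: map[string]string{}}
}

// importByWalk mirrors the current order: import all manifests found by
// walking the repository, then import the tags.
func importByWalk(r repo, db *database) {
	for digest := range r.manifests { // stands in for the path walk
		db.manifests[digest] = true
	}
	for tag, digest := range r.tags {
		db.tags[tag] = digest
	}
}

// importByTags mirrors the alternative: list the tags and fetch each
// referenced manifest directly by digest, with no walk.
func importByTags(r repo, db *database) {
	for tag, digest := range r.tags { // stands in for the tag list operation
		if _, ok := r.manifests[digest]; !ok {
			continue // tag points at a missing manifest: skip it
		}
		db.manifests[digest] = true
		db.tags[tag] = digest
	}
}

func main() {
	r := repo{
		manifests: map[string]string{"sha256:aaa": "{}", "sha256:bbb": "{}"},
		tags:      map[string]string{"latest": "sha256:aaa", "v1": "sha256:aaa", "v2": "sha256:bbb"},
	}
	walk, byTag := newDatabase(), newDatabase()
	importByWalk(r, walk)
	importByTags(r, byTag)
	fmt.Println(len(walk.manifests), len(walk.tags))
	fmt.Println(len(byTag.manifests), len(byTag.tags))
}
```

Whether the second shape is actually faster depends on the cost of the list operation versus the path walk on the real storage backend, which this sketch does not model.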
We need to run an import against a dirty copy of the dev.gitlab.org registry so that we can test a database garbage collection and compare the results. In the end the result database should have the same contents as the one created with the import of a clean copy of the registry.
Extrapolate a database query rate based on common API operations and the required underlying queries.
Later on we need to conduct a complexity analysis on the commonly used queries to see if we can spot any obvious optimizations (probably a separate issue).
Two distinct manifests shouldn't share the same configuration, because they represent two distinct images, and no two images have the same configuration.
The manifest payloads are identical except for config/mediaType. According to the Docker spec, application/octet-stream is not a valid media type for an image configuration. So, the second manifest is invalid and shouldn't exist. The corresponding repository (gitlab/omnibus-gitlab) should have referenced the first manifest instead.
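To illustrate how two manifests can end up pointing at one configuration, here is a minimal sketch. The payloads are hypothetical and heavily trimmed (only the config descriptor is kept, and the config digest is a placeholder). The valid media type constant is the one defined for image configurations in the Docker image manifest v2 schema 2 spec.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// validConfigType is the media type Docker defines for an image configuration.
const validConfigType = "application/vnd.docker.container.image.v1+json"

// manifest models only the config descriptor of a schema 2 manifest.
type manifest struct {
	Config struct {
		MediaType string `json:"mediaType"`
		Digest    string `json:"digest"`
	} `json:"config"`
}

// digest returns the SHA256 digest of a payload as "sha256:<hex>".
func digest(payload []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(payload))
}

// parse decodes the config descriptor from a manifest payload.
func parse(payload []byte) (manifest, error) {
	var m manifest
	err := json.Unmarshal(payload, &m)
	return m, err
}

// samplePayload builds a hypothetical, trimmed manifest payload.
func samplePayload(configType string) []byte {
	return []byte(`{"schemaVersion":2,"config":{"mediaType":"` + configType + `","digest":"sha256:0000"}}`)
}

func main() {
	a := samplePayload(validConfigType)
	b := samplePayload("application/octet-stream")
	ma, _ := parse(a)
	mb, _ := parse(b)
	// A one-field difference is enough to give the manifests distinct digests...
	fmt.Println("same manifest digest:", digest(a) == digest(b))
	// ...while both still reference the same configuration blob.
	fmt.Println("same config digest:", ma.Config.Digest == mb.Config.Digest)
	fmt.Println("second config type valid:", mb.Config.MediaType == validConfigType)
}
```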
After the fix for #71 (closed) the representation of the relationship between manifests and repositories has changed but we have left a configuration_id foreign key in manifests that left the door open for this issue. We have raised an issue to fix this (#91 (closed)).
There is a payload column in the manifests, manifest_lists and manifest_configurations tables. The type of this column is json and not jsonb because we need to preserve whitespace and the order of keys (we need to serve these payloads to clients and their checksum must be preserved).
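The checksum constraint is easy to demonstrate: any store that parses and re-serializes the JSON changes the bytes, and therefore the digest. In the sketch below, Go's encoding/json stands in for a normalizing store. This is an assumption for illustration only, since jsonb applies its own normalization rules, which differ in detail.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// digest returns the SHA256 digest of a payload as "sha256:<hex>".
func digest(payload []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(payload))
}

// reencode parses a JSON object and serializes it again. encoding/json drops
// insignificant whitespace and writes map keys in sorted order, so the output
// bytes usually differ from the input bytes.
func reencode(payload []byte) ([]byte, error) {
	var v map[string]interface{}
	if err := json.Unmarshal(payload, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v)
}

func main() {
	original := []byte("{\n  \"schemaVersion\": 2,\n  \"mediaType\": \"application/vnd.docker.distribution.manifest.v2+json\"\n}")
	normalized, err := reencode(original)
	if err != nil {
		panic(err)
	}
	fmt.Println(digest(original))
	fmt.Println(digest(normalized))
	fmt.Println("digests match:", digest(original) == digest(normalized))
}
```

Storing the payload byte for byte (json, text or bytea) avoids this, because the bytes served to the client are the bytes that were pushed.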
We're going to look at manifest_configurations, as this is the table taking up the most space, but the same conclusions should apply to the other tables as well.
The payload column of the manifest_configurations table is using 70 MB out of 98 MB (table total).
I did a quick test to compare the storage requirements when using a json type (current), a text with the JSON encoded in base64 (as done by the Docker EE DTR) and a bytea. All of these allow us to preserve whitespaces and the order of keys.
It's not a surprise that the base64 version takes more space, as the encoded strings are always longer. The Docker EE DTR uses this strategy because it uses a document database, which only supports JSON types, thus it's the only viable option for them.
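The size penalty of base64 is fixed by the encoding itself: every 3 input bytes become 4 output characters. A short sketch using the standard library (the helper name `encodedSize` is made up for illustration):

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodedSize returns the length of the standard base64 encoding of payload.
func encodedSize(payload []byte) int {
	return len(base64.StdEncoding.EncodeToString(payload))
}

func main() {
	payload := make([]byte, 3000) // arbitrary 3000-byte payload
	size := encodedSize(payload)
	fmt.Printf("%d bytes -> %d base64 characters (x%.2f)\n",
		len(payload), size, float64(size)/float64(len(payload)))
}
```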
Although the bytea column uses the same amount of space as the json column (the encoding escape format converts zero bytes and high-bit-set bytes to octal sequences and doubles backslashes), by using a binary type we can take advantage of aggressive data compression.
Compression
JSON payloads are really good candidates for compression, as they have many repeated characters (whitespaces, quotes, duplicated keys, etc.). If we use a binary type (bytea) we can compress the data on writes and decompress on reads at the application level, using an aggressive format/algorithm, like GZIP.
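A minimal sketch of application-level compression with Go's compress/gzip, assuming the compressed bytes are what gets written to the bytea column. The function names `compress` and `decompress` are illustrative, and the sample payload is synthetic, so the ratio it prints says nothing about real manifest payloads.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips a payload before it is written to the database.
func compress(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	// Close flushes the remaining data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress restores the original payload bytes after a read.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	// Synthetic payload with repeated keys and whitespace.
	payload := bytes.Repeat([]byte("{\n  \"mediaType\": \"example\",\n  \"size\": 1\n}\n"), 50)
	packed, err := compress(payload)
	if err != nil {
		panic(err)
	}
	restored, err := decompress(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("original %d bytes, compressed %d bytes\n", len(payload), len(packed))
	fmt.Println("round trip preserved bytes:", bytes.Equal(payload, restored))
}
```

Because decompression returns the exact original bytes, whitespace and key order survive and the client-side checksum still matches.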
I filled the new column with the compressed payload (I've used the pgsql-gzip extension for demonstration purposes, but this should actually be done at the application level).
The GZIP compressed payload requires ~30% less space than the plain json and bytea payloads.
Conclusions
json was the easiest option to store the payloads due to the transparent encoding/decoding and also because it allowed us to move faster during the initial development stage (as we could simply look at the database to debug an issue with the payload deconstruction).
We should now use bytea instead of json to store payloads for manifests, manifest_lists and manifest_configurations. This will allow us to preserve whitespace and the order of keys, and will also give us the possibility to use aggressive compression algorithms at the application level if/when needed.
Using bytea will also stop us from doing queries on a json field and force us to create new columns whenever needed (which aligns with the database development guidelines). These payloads are only stored on the database so that we can serve them to clients through the API, making sure they'll match the checksum on the client side (given that whitespace and the order of keys were preserved). So there is no downside in not being able to query or look at a plain JSON document in this column.
We're using the Go json.RawMessage type (which is a byte slice under the hood) to represent these payload columns, so the encoding/decoding would be transparent when using bytea instead of json, without any code changes. The only change we would have to make at the application level is to compress/decompress the payload with compress/gzip.
We're using a copy of the dev.gitlab.org registry for this investigation, which is 10.6TB in size.
After scanning the registry (in a dirty state, before any garbage collection) I found 48,986 manifests and the same number of manifest configurations (the two blobs stored per manifest). The manifest payloads have a total size of 98MB and the configuration payloads 314MB. In total, they use 412MB, which is 0.004% of the overall registry size.
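The percentage can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 1,000,000 MB), which the thread does not state. With binary units the result still rounds to 0.004%.

```go
package main

import "fmt"

// percentOf returns part as a percentage of whole (both in the same unit).
func percentOf(part, whole float64) float64 {
	return part / whole * 100
}

func main() {
	manifestsMB := 98.0
	configsMB := 314.0
	registryMB := 10.6 * 1000 * 1000 // 10.6 TB, assuming decimal units

	totalMB := manifestsMB + configsMB
	fmt.Printf("payloads: %.0f MB, %.4f%% of the registry\n",
		totalMB, percentOf(totalMB, registryMB))
}
```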
Note: This is the size of the payloads in memory, the corresponding file size is only slightly larger (negligible). This estimate excludes the link files stored under each repository, which are also considered metadata. However, these are rather small (they only contain an SHA256 digest) when compared to the JSON payloads of manifest and manifest configuration blobs.
Notes: We've switched the digest storage type from text to a binary hex format to save 50% in storage space. I'll continue with the investigation on #61 (comment 328329354).
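The 50% figure follows from how hex works: each byte of a digest takes two hexadecimal characters when stored as text. A minimal sketch, assuming only the hex part of the digest is converted (it ignores the `sha256:` algorithm prefix, and the helper names are made up for illustration, not taken from !158):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hexDigest returns the SHA256 digest of a payload as a hex string.
func hexDigest(payload []byte) string {
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

// toBinary converts a hex digest into the raw bytes stored in the database.
func toBinary(hexStr string) ([]byte, error) {
	return hex.DecodeString(hexStr)
}

// toText converts the stored raw bytes back into the hex form served to clients.
func toText(raw []byte) string {
	return hex.EncodeToString(raw)
}

func main() {
	text := hexDigest([]byte("example blob"))
	raw, err := toBinary(text)
	if err != nil {
		panic(err)
	}
	fmt.Printf("text: %d bytes, binary: %d bytes\n", len(text), len(raw))
	fmt.Println("round trip matches:", toText(raw) == text)
}
```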
MR statuses:
!158 (merged) - Save digests in a binary hex format.
Notes: I need to collect some more metrics before finishing the size requirements extrapolation. We made some changes to improve the space efficiency of the database and therefore I need to rerun the import to see the exact effect that these changes have on the database size and its distribution. I expect to share these tomorrow.
MR statuses:
!158 (merged) - Save digests in a binary hex format.
!159 (merged) - 95% complete, 95% confident - Store payloads in binary format for increased space efficiency (WIP). Just need to fix a test case.
Now that we've made some improvements to storage efficiency on the database schema, I've rerun the import script against the copy of the dev.gitlab.org registry in GCP (10.6TB in size) and compared the results against the initial import (#61 (comment 328329354)).
The database is now 195MB in size, compared to 218MB on the first run. The tables have the following sizes:
| Table | Rows | Size (kB) | Size (MB) | Avg Size per Row (kB) |
|---|---|---|---|---|
| manifest_layers | 206,424 | 19456 | 19.00 | 0.09 |
| layers | 107,202 | 24576 | 24.00 | 0.23 |
| tags | 36,730 | 6680 | 6.52 | 0.18 |
| repository_manifests | 28,017 | 2720 | 2.66 | 0.10 |
| manifests | 27,944 | 36864 | 36.00 | 1.32 |
| manifest_configurations | 27,942 | 99328 | 97.00 | 3.55 |
| repositories | 271 | 112 | 0.11 | 0.41 |
| manifest_lists | 0 | 24 | 0.02 | Similar to manifests |
| manifest_list_items | 0 | 16 | 0.02 | Similar to manifest_layers |
| repository_manifest_lists | 0 | 16 | 0.02 | Similar to repository_manifests |
Notes
We're not using compression at this stage. As described in #61 (comment 329845180), given that we're now storing the payload of manifests, manifest_configurations and manifest_lists as bytea, if we want to compress payloads in the future we should be able to reduce the size of these tables by ~30%.
The import procedure does not import any unreferenced manifests or layers. This means that the import will only create database records for tagged manifests. For the dev.gitlab.org registry, we found that only ~60% of the manifests were referenced, which means that we could ignore ~40% of the registry.
gitlab.com
Based on the results for the dev.gitlab.org registry, we can now extrapolate the size of the database for gitlab.com. Considering the gitlab.com registry size at the time of writing, we can expect its database to be ~85GB in size in a clean state (ignoring unreferenced manifests) or ~119GB in a dirty state (importing unreferenced manifests).
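The extrapolation can be sketched as follows. Both steps are assumptions rather than the documented method: `scaleLinear` assumes metadata grows in proportion to registry size, and `dirtyFromClean` assumes a 40% uplift for unreferenced manifests, which is consistent with the ~85GB and ~119GB figures quoted above. The target used in `main` is a hypothetical registry 100 times larger than the sample, not the real gitlab.com size.

```go
package main

import "fmt"

// unreferencedUplift is the assumed extra share for unreferenced manifests.
const unreferencedUplift = 0.40

// scaleLinear extrapolates a database size from a sample registry to a target
// registry, assuming the database grows in proportion to registry size.
// The two registry sizes must use the same unit.
func scaleLinear(sampleDB, sampleRegistry, targetRegistry float64) float64 {
	return sampleDB * targetRegistry / sampleRegistry
}

// dirtyFromClean estimates the dirty-state size from the clean-state size.
func dirtyFromClean(clean float64) float64 {
	return clean * (1 + unreferencedUplift)
}

func main() {
	// Hypothetical target: a registry 100x the size of the sample,
	// with the 195 MB database measured for the sample.
	fmt.Printf("scaled database: %.0f MB\n", scaleLinear(195, 1, 100))
	// Clean estimate quoted above, converted to a dirty estimate.
	fmt.Printf("dirty estimate: %.0f GB\n", dirtyFromClean(85))
}
```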
João Pereira changed title from Extrapolate database query rate and size requirements to Extrapolate database size requirements
Notes: We have now completed the extrapolation of the database size requirements. This issue was initially intended to be used to extrapolate query rates as well, but this can only be done with a good degree of confidence once we switch the registry API write and read operations from the filesystem to the database (otherwise we can't have a realistic simulation). Therefore, I've raised a separate issue for the query rate requirements (#94 (closed)) and added it to the epic tasks (&2313 (closed)).
@trizzi is this OK for you? I tried to extrapolate the query rate now but the results would be highly speculative, I think we can get a fairly accurate rate estimate once we do &3006 (closed) at least.
We have made several changes to the DB schema since this issue was closed. Below are the updated estimated size requirements. As before, this is based on a database filled with the metadata from the dev.gitlab.org container registry (10.6TB in size).
The import procedure does not import any unreferenced manifests or blobs. This means that the import only creates database records for tagged manifests. For the dev.gitlab.org registry, we found that only ~60% of the manifests were referenced, which means that we could ignore ~40% of the registry.
SELECT
    table_name AS "Name",
    pg_size_pretty(pg_relation_size(table_name)) AS "Relation Size",
    pg_size_pretty(pg_indexes_size(table_name)) AS "Indexes Size",
    pg_size_pretty(pg_total_relation_size(table_name)) AS "Total Size",
    stat.n_live_tup AS "Row Count",
    pg_size_pretty((pg_relation_size(table_name) / NULLIF(stat.n_live_tup, 0))) AS "Approx. Relation Size per Row"
FROM
    information_schema.tables AS info
    INNER JOIN pg_stat_user_tables AS stat ON info.table_schema = stat.schemaname
        AND info.table_name = stat.relname
WHERE
    table_schema = 'public'
ORDER BY
    pg_relation_size(table_name) DESC;
| Name | Relation Size | Indexes Size | Total Size | Row Count | Approx. Relation Size per Row |
|---|---|---|---|---|---|
| manifests | 26 MB | 3352 kB | 36 MB | 27944 | 984 bytes |
| configurations | 23 MB | 1896 kB | 90 MB | 27942 | 866 bytes |
| blobs | 18 MB | 13 MB | 31 MB | 135144 | 141 bytes |
| manifest_layers | 12 MB | 20 MB | 32 MB | 206424 | 60 bytes |
| repository_blobs | 8880 kB | 15 MB | 24 MB | 150904 | 60 bytes |
| tags | 3448 kB | 6008 kB | 9488 kB | 36730 | 96 bytes |
| repository_manifests | 1656 kB | 2912 kB | 4592 kB | 28017 | 60 bytes |
| repositories | 32 kB | 64 kB | 128 kB | 271 | 120 bytes |
| manifest_references | 0 bytes | 32 kB | 32 kB | 0 | 0 bytes |
Estimate Database Size for gitlab.com
We can extrapolate the size of the database for gitlab.com based on the results for the dev.gitlab.org registry. Considering the gitlab.com registry size, we can expect its database to be ~103GB in size in a clean state (ignoring unreferenced manifests) or ~144GB in a dirty state (importing unreferenced manifests).
We have made some changes to the DB schema since this issue was closed, such as scoping manifests and layers by repository to allow for partitioning and better GC efficiency (as discussed in #104 (closed)). Below are the updated estimated size requirements. As before, this is based on a database filled with the metadata from the dev.gitlab.org container registry (10.6TB in size).
The import procedure does not import any unreferenced manifests or blobs. This means that the import only creates database records for tagged manifests. For the dev.gitlab.org registry, we found that only ~60% of the manifests were referenced, which means that we could ignore ~40% of the registry.
Please note that this estimate does not include any tables used exclusively for online GC. Those are being reviewed/discussed under !373 (merged), and their size should be very small in comparison with the ones analyzed here.
Database Size
273 MB: ~16% increase when compared to the deduplicated schema version (#61 (comment 379740915)).
Tables Size
| Name | Relation Size | Indexes Size | Total Size | Row Count | Approx. Relation Size per Row |
|---|---|---|---|---|---|
| manifests | 38 MB | 7408 kB | 118 MB | 28017 | 1439 bytes |
| blobs | 11 MB | 13 MB | 24 MB | 135144 | 84 bytes |
| layers | 22 MB | 48 MB | 69 MB | 206563 | 109 bytes |
| repository_blobs | 13 MB | 31 MB | 45 MB | 150904 | 93 bytes |
| tags | 3448 kB | 4944 kB | 8424 kB | 36730 | 96 bytes |
| repositories | 32 kB | 64 kB | 128 kB | 271 | 120 bytes |
| manifest_references | 0 bytes | 32 kB | 32 kB | 0 | 0 bytes |
Estimate Database Size for gitlab.com
We can extrapolate the size of the database for gitlab.com based on the results for the dev.gitlab.org registry. Considering the gitlab.com registry size, we can expect its database to be ~120GB in size in a clean state (ignoring unreferenced manifests) or ~168GB in a dirty state (importing unreferenced manifests).