We're building a tool to populate a database with the metadata from a registry filesystem (#56 (closed)). Once that is done, we should use a copy of the dev.gitlab.org registry to test it and see how much space it takes in the database. Once complete, we can extrapolate the space requirements, taking into account the dev.gitlab.org registry size.
I've run the import tool against a clean (post-garbage collection) copy of the dev.gitlab.org registry in GCP, with some interesting results. This registry is 6.3 TB in size.
The registry binary was built from aa4de59d, and it took 5h40m to import all repositories (ignoring unreferenced blobs). The import ran OK, with only 22 non-fatal errors, all due to missing _manifests or _tags folders (corrupted repositories, which we skip).
In the end, the database is 218 MB in size, with the following distribution:
| Table | Rows | Size (kB) | Size (MB) | Avg Size per Row (kB) |
| --- | --- | --- | --- | --- |
| repositories | 271 | 112 | 0.11 | 0.41 |
| manifests | 27,944 | 38912 | 38.00 | 1.39 |
| layers | 107,202 | 33792 | 33.00 | 0.32 |
| tags | 36,730 | 6680 | 6.52 | 0.18 |
| manifest_configurations | 27,942 | 100352 | 98.00 | 3.59 |
| manifest_layers | 206,424 | 19456 | 19.00 | 0.09 |
| repository_manifests | 28,017 | 2720 | 2.66 | 0.10 |
| manifest_lists | 0 | 24 | 0.02 | Should be similar to manifests |
| manifest_list_items | 0 | 16 | 0.02 | Should be similar to manifest_layers |
| repository_manifest_lists | 0 | 16 | 0.02 | Should be similar to repository_manifests |
## Analysis
- There are no manifest lists in the registry.
- Among all 27,944 manifests, only one is a Docker schema v1 manifest; the rest are Docker schema v2. There are no OCI manifests in the registry.
- manifest_configurations has a fairly high average size per row; ideally it would be closer to that of manifests.
- I found it odd that two manifests share the same configuration hash (27,942 manifest configurations for 27,944 manifests).
## Next Steps

- Manifest configurations have the largest payload (json) values, so the higher storage requirements are not a surprise. We should simulate an import using a base64-encoded payload (in a text column) instead (#68 (closed)).
- See why two manifests share the same configuration.
- Right now we run the import as follows:
  - Loop over repositories
    - Import the manifests in the repository
    - Import the tags in the repository

  I'm curious to see if doing it like this instead would make a significant difference:
  - Loop over repositories
    - Loop over the tags in the repository
      - Import the manifest referenced by the tag
      - Import the tag

  In the current version we use a path walk to loop over manifests. In the second option we would instead get all tags in a repository (a list operation) and then fetch each referenced manifest directly, without needing the path walk.
- We need to run an import against a dirty copy of the dev.gitlab.org registry so that we can test a database garbage collection and compare the results. In the end, the resulting database should have the same contents as the one created by importing a clean copy of the registry.
- Extrapolate a database query rate based on common API operations and the required underlying queries.
- Later on, we need to conduct a complexity analysis of the commonly used queries to see if we can spot any obvious optimizations (probably a separate issue).
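As a rough sketch of the tag-driven import variant, the loop could look like the following. The repository/tag data and function names here are hypothetical stand-ins, not the importer's actual API; a real implementation would read tags via list operations on storage.

```go
package main

import "fmt"

// Hypothetical stand-in for the registry filesystem: repository -> tag -> manifest digest.
var tagsByRepo = map[string]map[string]string{
	"gitlab/alpine": {"latest": "sha256:aaa", "v1": "sha256:aaa", "v2": "sha256:bbb"},
	"gitlab/ubuntu": {"latest": "sha256:ccc"},
}

// importManifest imports a manifest once, skipping digests we've already seen
// (several tags commonly point at the same manifest).
func importManifest(digest string, seen map[string]bool) {
	if seen[digest] {
		return
	}
	seen[digest] = true
	fmt.Println("import manifest", digest)
}

// importByTags loops over repositories and their tags, importing only the
// manifests referenced by a tag - no path walk over _manifests is needed.
// It returns the number of unique manifests imported.
func importByTags() int {
	seen := map[string]bool{}
	for repo, tags := range tagsByRepo {
		for tag, digest := range tags {
			importManifest(digest, seen)
			fmt.Println("import tag", repo+":"+tag)
		}
	}
	return len(seen)
}

func main() {
	fmt.Println("unique manifests imported:", importByTags())
}
```

One thing to verify with this approach is that manifests not referenced by any tag would never be visited.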
Two distinct manifests shouldn't share the same configuration: they represent two distinct images, and no two images have the same configuration.
The manifest payloads are identical; the only difference is config/mediaType. According to the Docker specification, application/octet-stream is not a valid media type for an image configuration, so the second manifest is invalid and shouldn't exist. The corresponding repository (gitlab/omnibus-gitlab) should have referenced the first manifest instead.
After the fix for #71 (closed), the representation of the relationship between manifests and repositories changed, but we left a configuration_id foreign key in manifests, which left the door open for this issue. We have raised an issue to fix this (#91 (closed)).
There is a payload column in the manifests, manifest_lists and manifest_configurations tables. The type of this column is json, not jsonb, because we need to preserve whitespace and the order of keys (we serve these payloads to clients, and their checksums must be preserved).
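For illustration, jsonb normalizes the stored document while json keeps it byte for byte:

```sql
-- jsonb reorders keys and collapses insignificant whitespace;
-- json preserves the document exactly as written
SELECT '{"b": 1,  "a": 2}'::json  AS as_json,   -- {"b": 1,  "a": 2}
       '{"b": 1,  "a": 2}'::jsonb AS as_jsonb;  -- {"a": 2, "b": 1}
```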
We're going to look at manifest_configurations, as it takes up the most space, but the same conclusions should apply to the other tables as well.
The payload column of the manifest_configurations table uses 70 MB of the table's 98 MB total.
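That number can be obtained with a query along these lines (a sketch; pg_column_size measures the on-disk size of each value):

```sql
-- Size of the payload column alone vs. the whole table
SELECT pg_size_pretty(sum(pg_column_size(payload))) AS payload_size,
       pg_size_pretty(pg_total_relation_size('manifest_configurations')) AS table_size
FROM manifest_configurations;
```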
I did a quick test to compare the storage requirements when using a json type (current), a text column with the JSON encoded in base64 (as done by the Docker EE DTR) and a bytea. All of these allow us to preserve whitespace and the order of keys.
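The comparison can be sketched with a scratch table like the one below; the table and column names are illustrative, not part of the schema:

```sql
-- Illustrative scratch table holding the same payload in three encodings
CREATE TABLE payload_encoding_test (
    payload_json   json,
    payload_base64 text,
    payload_bytea  bytea
);

-- Casting json to text preserves the stored document verbatim
INSERT INTO payload_encoding_test
SELECT payload,
       encode(convert_to(payload::text, 'UTF8'), 'base64'),
       convert_to(payload::text, 'UTF8')
FROM manifest_configurations;

SELECT pg_size_pretty(sum(pg_column_size(payload_json)))   AS json_size,
       pg_size_pretty(sum(pg_column_size(payload_base64))) AS base64_size,
       pg_size_pretty(sum(pg_column_size(payload_bytea)))  AS bytea_size
FROM payload_encoding_test;
```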
It's no surprise that the base64 version takes more space, as the encoded strings are always longer. The Docker EE DTR uses this strategy because it uses a document database, which only supports JSON types, making it the only viable option for them.
Although the bytea column uses the same amount of space as the json column (the escape encoding format converts zero bytes and high-bit-set bytes to octal sequences and doubles backslashes), by using a binary type we can leverage aggressive data compression.
## Compression
JSON payloads are really good candidates for compression, as they have many repeated characters (whitespace, quotes, duplicated keys, etc.). If we use a binary type (bytea), we can compress the data on writes and decompress on reads at the application level, using an aggressive format/algorithm like GZIP.
I filled a new column with the compressed payload (using the pgsql-gzip extension for demonstration purposes, although this should actually be done at the application level).
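A sketch of that demonstration, assuming the pgsql-gzip extension is installed (the payload_gzip column name is illustrative):

```sql
-- Demonstration only: compress via pgsql-gzip in the database;
-- in production the application would compress before writing
CREATE EXTENSION IF NOT EXISTS gzip;

ALTER TABLE manifest_configurations ADD COLUMN payload_gzip bytea;

UPDATE manifest_configurations
SET payload_gzip = gzip(payload::text);

SELECT pg_size_pretty(sum(pg_column_size(payload)))      AS json_size,
       pg_size_pretty(sum(pg_column_size(payload_gzip))) AS gzip_size
FROM manifest_configurations;
```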
The GZIP compressed payload requires ~30% less space than the plain json and bytea payloads.
## Conclusions
json was the easiest option to store the payloads, due to the transparent encoding/decoding and also because it allowed us to move faster during the initial development stage (we could simply look at the database to debug an issue with the payload deconstruction).
We should now use bytea instead of json to store payloads for manifests, manifest_lists and manifest_configurations. This will still allow us to preserve whitespace and the order of keys, but will also give us the option to leverage aggressive compression algorithms at the application level if/when needed.
Using bytea will also stop us from querying json fields and force us to create new columns whenever needed (which aligns with the database development guidelines). These payloads are stored in the database only so that we can serve them to clients through the API, ensuring they match the checksum on the client side (given that whitespace and the order of keys are preserved). So there is no downside to not being able to query or view a plain JSON document in this column.
We're using the Go json.RawMessage type (which is a wrapper around a byte slice) to represent these payload columns, so the encoding/decoding would remain transparent when using bytea instead of json, without any code changes. The only change we'd have to make at the application level is compressing/decompressing the payload with compress/gzip.