Grzegorz Bizon · d93d21d3 · 21480785 · a43ee1a9 · 7809619a · d23d036b
--- a/sites/handbook/source/handbook/engineering/architecture/blueprints/pipelines/storage/index.html.md

+ 89

− 37
+++ b/sites/handbook/source/handbook/engineering/architecture/blueprints/pipelines/storage/index.html.md

 @@ -5,11 +5,14 @@ title: "GitLab CI/CD data storage improvements"

 ## GitLab CI/CD pipelines data storage improvements

-GitLab CI/CD is one of GitLab's most data and compute intensive components. Since its [initial release in
-November 2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
-the CI/CD subsystem has evolved significantly. It was [integrated into GitLab in September
-2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and has become [one of the most
-beloved CI/CD solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
+GitLab CI/CD is one of GitLab's most data and compute intensive components.
+Since its [initial release in November
+2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
+the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
+in September
+2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
+has become [one of the most beloved CI/CD
+solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).

 > TODO pipelines usage growth graph here

 @@ -105,9 +108,40 @@ migration that is almost impossible right now.

 ## [What] Proposal

+Top-level goals:
+
+1. Remove columns with redundant data from `ci_builds` table
+1. Separate pipeline processing data from visualization data
+1. Define data retention policy for pipeline processing data
+1. Devise strategy for `ci_builds` partitioning
+
+### Remove columns with redundant data
+
+The `ci_builds` table has a long history and has accumulated technical debt
+over the years.
+
+Two important examples are the extraction of data describing artifacts and
+stages from this table.
+
+A few years ago we extracted the `ci_stages` table from `ci_builds`, but we
+never managed to stop using the stage information stored in `ci_builds`,
+notably in the `ci_builds.stage` and `ci_builds.stage_idx` columns. The data
+stored there is completely redundant because we also store it in the
+`ci_stages.name` and `ci_stages.position` columns.
+
+Similarly, we have a `ci_job_artifacts` table and a number of
+`ci_builds.artifacts_*` columns that are either unused or hold redundant data.
+
+Presumably there are more columns like these. We should find them and devise a
+strategy for removing them, together with the data they store, once we confirm
+that the data is entirely redundant and can be removed safely.
+
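Before dropping `ci_builds.stage` and `ci_builds.stage_idx`, the redundancy claim could be checked by comparing them with `ci_stages.name` and `ci_stages.position`. The sketch below runs such a check against an in-memory SQLite database. The `stage_id` join column, the simplified schema, and the sample rows are assumptions made for illustration, not the production schema.

```python
import sqlite3

# Hypothetical, heavily simplified schema: only the columns named in the text,
# plus an assumed `stage_id` foreign key used to join the two tables.
SCHEMA = """
CREATE TABLE ci_stages (id INTEGER PRIMARY KEY, name TEXT, position INTEGER);
CREATE TABLE ci_builds (
    id INTEGER PRIMARY KEY,
    stage_id INTEGER REFERENCES ci_stages(id),
    stage TEXT,
    stage_idx INTEGER
);
"""

# `IS NOT` is SQLite's NULL-safe inequality, so NULLs are compared correctly.
MISMATCH_QUERY = """
SELECT b.id
FROM ci_builds b
LEFT JOIN ci_stages s ON s.id = b.stage_id
WHERE s.id IS NULL
   OR b.stage IS NOT s.name
   OR b.stage_idx IS NOT s.position
ORDER BY b.id
"""


def find_non_redundant_builds(conn):
    """Return ids of builds whose stage data is NOT fully covered by ci_stages."""
    return [row[0] for row in conn.execute(MISMATCH_QUERY)]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO ci_stages VALUES (?, ?, ?)",
    [(1, "build", 0), (2, "test", 1)],
)
conn.executemany(
    "INSERT INTO ci_builds VALUES (?, ?, ?, ?)",
    [
        (10, 1, "build", 0),     # redundant: matches ci_stages row 1
        (11, 2, "test", 1),      # redundant: matches ci_stages row 2
        (12, 2, "deploy", 2),    # mismatch: must be reconciled before removal
        (13, None, "test", 1),   # no stage row: dropping the columns loses data
    ],
)
mismatched_ids = find_non_redundant_builds(conn)
```

An empty result would support treating the columns as redundant; any returned ids point at rows that need a backfill or a backup first.
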
+### Separate pipeline processing data
+
 We store a lot of data in `ci_builds` table, everything is related to CI/CD
 pipelines but some parts of the data are used for a different purpose and some
-elements have different affinity to a pipeline than other. In particular - we
+elements have different affinity to a pipeline than others. In particular - we
 store pipeline visualization data there and pipeline processing data.

 Pipeline visualization data and processing data can have different retention
 @@ -121,12 +155,12 @@ information about pipelines to users.
 > TODO calculate average ratio of visualization to processing data, like 40/60%
 > and support this with real numbers / graphs.

-### Pipeline visualization data
+#### Pipeline visualization data

 Pipeline visualization data is about all the things that we want to show to a
 user when someone visits pipelines / pipeline / build page.

-### Pipeline processing data
+#### Pipeline processing data

 Pipeline processing data is about all the things that we need to store in our
 database in order to process pipeline from start to end. This includes:
 @@ -141,15 +175,48 @@ We might not need to do these things for pipelines that are old - created
 months or years ago. In most cases it would be better create a new pipeline
 than to reprocess existing builds (usually by retrying them).

+### Devise strategy for `ci_builds` partitioning
+
+We can reduce the size of `ci_builds` significantly, but we do not plan to
+remove the information about builds that we use to display them to users. This
+means that we are still going to keep all builds in the database and, given the
+current rate of growth of this table, we might still need to explore
+partitioning.
+
+It might be possible to partition this table by build creation date, but this
+requires technical evaluation to find the best way to partition the table.
+
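As a rough illustration of partitioning by build creation date, the sketch below maps a creation timestamp to a monthly range. The monthly granularity and the `ci_builds_YYYY_MM` naming scheme are hypothetical choices for this example; the evaluation mentioned above may well arrive at a different scheme.

```python
from datetime import date, datetime
from typing import Tuple


def monthly_partition(created_at: datetime) -> Tuple[str, date, date]:
    """Map a build creation timestamp to a (name, lower, upper) monthly range.

    The range is half-open: lower <= created_at < upper.
    """
    lower = date(created_at.year, created_at.month, 1)
    if created_at.month == 12:
        # December rolls over into January of the following year.
        upper = date(created_at.year + 1, 1, 1)
    else:
        upper = date(created_at.year, created_at.month + 1, 1)
    name = f"ci_builds_{lower:%Y_%m}"  # hypothetical naming scheme
    return name, lower, upper
```

For example, a build created on the last day of December 2020 falls into the range starting 2020-12-01 and ending before 2021-01-01.
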
 ## [When] Iterations

-### Devise a metric for `ci_builds` situation
+We can iterate on this in two parallel tracks.
+
+### Metric
+
+What matters most is reducing the size of the table on the primary database. We
+should have a metric that clearly shows our progress. This can be the total
+size of the `ci_builds` table on the primary database, including indexes.
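
One way to obtain such a number, assuming the table lives in PostgreSQL, is `pg_total_relation_size()`, which counts the table together with its indexes and TOAST data. The sketch below pairs that query with a small, hypothetical helper that turns two measurements into a progress figure; how the query is executed and where the baseline comes from are left open.

```python
# PostgreSQL's pg_total_relation_size() reports the size of a table
# including its indexes and TOAST data. Run this on the primary.
METRIC_QUERY = "SELECT pg_total_relation_size('ci_builds') AS total_bytes"


def progress(baseline_bytes: int, current_bytes: int) -> float:
    """Fraction of the baseline size reclaimed so far.

    Returns a negative number if the table has grown since the baseline.
    """
    if baseline_bytes <= 0:
        raise ValueError("baseline_bytes must be positive")
    return (baseline_bytes - current_bytes) / baseline_bytes
```
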

-We should have a metric for the `ci_builds` size / size of indices or another
-metric that can help to measure current situation and the impact of future
-improvements.
+### Track A

-### Validate build "degeneration" mechanisms
+#### 1. Unblock engineers by documenting how to store data
+
+Currently engineers can no longer add new columns to the `ci_builds` table,
+because it is too large (more than 50 columns). We should document how to work
+around this limitation so that engineers are no longer blocked.
+
+#### 2. Remove legacy columns with redundant data
+
+Devise a way to remove columns with redundant data, namely the columns related
+to stages and artifacts. Find other columns that can be removed and ensure that
+their data is indeed redundant; otherwise do not remove it without a proper
+backup.
+
+#### 3. Resolve the problem of STI
+
+Replace STI mechanisms with integer enums. Before doing so, estimate the
+benefit and the storage space saved by normalizing data this way.
+
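A back-of-the-envelope estimate of the space saved could look like the sketch below. The class names, the integer values, and the per-row cost model (a short string costing its length plus a one-byte header, versus a two-byte `smallint`) are assumptions for illustration; alignment padding and indexes on the column are ignored, so real numbers should come from measurements.

```python
# Illustrative mapping from STI class names to small integers.
# The class names and integer values are assumptions for this sketch.
TYPE_ENUM = {"Ci::Build": 1, "Ci::Bridge": 2, "GenericCommitStatus": 3}

SMALLINT_BYTES = 2  # size of a PostgreSQL smallint


def estimated_bytes_saved(rows_per_type: dict) -> int:
    """Rough estimate of heap bytes saved by replacing the type string.

    Assumes each short string costs len(value) + 1 bytes (1-byte header) and
    ignores alignment padding and any index on the column.
    """
    saved = 0
    for type_name, rows in rows_per_type.items():
        if type_name not in TYPE_ENUM:
            raise KeyError(f"no enum value defined for {type_name!r}")
        string_bytes = len(type_name.encode("utf-8")) + 1
        saved += rows * (string_bytes - SMALLINT_BYTES)
    return saved
```

Feeding in real per-type row counts gives a first-order figure to compare against the cost of the migration.
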
+### Track B
+
+#### Validate build "degeneration" mechanisms

 We currently do have a bunch of mechanisms implemented that allow us to
 "degenerate" a build and to "archive" those builds that seem to be old and have
 @@ -157,7 +224,7 @@ never been tested. These mechanisms have been implemented by a few different
 teams and are disabled on gitlab.com. We should revisit them, and figure out if
 these are aligned with the initiative described in this blueprint.

-### Soft-archive legacy builds
+#### Soft-archive legacy builds

 Once we have all the information about how builds degeneration and archival
 works, we can make a well informed decision about archiving builds that are
 @@ -170,7 +237,7 @@ builds are going to be unretriable and unprocessable and in case of making a
 wrong decision about how old the builds should be to archive them, we would
 still be able to change our minds about it.
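
The reversibility described here can be pictured as a pure age check against a configurable threshold. In the sketch below nothing is deleted or rewritten, so changing the threshold changes which builds count as archived. The function and parameter names are hypothetical and do not refer to the existing archival mechanisms.

```python
from datetime import datetime, timedelta


def is_soft_archived(created_at: datetime, now: datetime,
                     archive_after: timedelta) -> bool:
    """A build counts as archived once it is older than the configured age.

    No data is removed: adjusting `archive_after` changes the answer again,
    which keeps the decision reversible.
    """
    return now - created_at > archive_after
```
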

-### Archive and remove legacy processing data
+#### Archive and remove legacy processing data

 In order to migrate data between `ci_builds` and other tables we need to remove
 old data, because there is currently too much data there to migrate it
 @@ -185,7 +252,7 @@ to revert a wrong decision might be important here.
 Removing data from the database will require a sign-off from executives and
 product people.

-### Migrate `options` column from `ci_builds` to `ci_builds_metadata`
+#### Migrate `options` column from `ci_builds` to `ci_builds_metadata`

 `ci_builds_metadata` is a table that we create to separate processing data from
 other kinds of pipeline data.  Because of entropy and time we do store
 @@ -200,34 +267,17 @@ We already do have support for writing data, that we usually write to
 has never been enabled on gitlab.com and is currently disabled under
 `ci_build_metadata_config` feature flag.

-### Ensure `ci_builds_metadata` contains only a complete set of processing data
+#### Ensure `ci_builds_metadata` contains only a complete set of processing data

 This table currently contains a bunch of columns, but we should check if data
 in all of them is safe to remove once a build gets archived.

-### Move other processing columns
-
-There are other columns that could be moved to either `ci_builds_metadata` or
-`ci_pipelines_metadata`. We should move them too.
-
 ### Rebuild archival mechanisms to remove rows from `ci_builds_metadata`

 Instead of rewriting rows in `ci_builds` it might be better to remove rows from
 `ci_builds_metadata`. We can also devise partitioning mechanisms for this table
 in the future.

-### Remove legacy columns from `ci_builds`
-
-There are bunch of deprecated columns in that table, we should remove them too.
-`stage` column is known to take a lot of space, and it might not be needed
-anymore after we enable `ci_atomic_processing` FF.
-
-### Resolve the problem of STI
-
-We can replace STI mechanisms with integer enums.
-
-> TODO, elaborate
-
 ### Move and remove indices

 Once we move processing data, we might also be able to move indexes. We should
 @@ -246,8 +296,9 @@ Proposal:
 |------------------------------|-------------------------|
 | Author                       |     Grzegorz Bizon      |
 | Architecture Evolution Coach | Gerardo Lopez-Fernandez |
-| Engineering Leader           |          TBD            |
-| Domain Expert                |     Grzegorz Bizon      |
+| Engineering Leader           |  Christopher Lefelhocz  |
+| Domain Expert                |      Fabio Pitino       |
+| Domain Expert                |     Kamil Trzciński     |

 DRIs:

 @@ -256,4 +307,5 @@ DRIs:
 | Product                      |          TBD           |
 | Leadership                   |          TBD           |
 | Engineering                  |          TBD           |
-| Domain Expert                |    Grzegorz Bizon      |
+| Domain Expert                |          TBD           |
+| Domain Expert                |          TBD           |