
WIP: GitLab CI/CD pipelines storage improvement

Closed Grzegorz Bizon requested to merge blueprint/gb/pipelines-storage-improvements into master
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of GitLab's most data and compute intensive components. Since its [initial release in
November 2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and has become [one of the most
beloved CI/CD solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
GitLab CI/CD has come a long way since the initial release, but the design of
the data storage for pipeline builds remains almost the same since 2012. We
store all the builds in PostgreSQL, in the `ci_builds` table, and because we
create more than 0.5 million builds each day on gitlab.com, we are slowly
reaching database limits.
Multiple problems with the current data storage architecture are described in
[this issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These
include:

### Size of the `ci_builds` table
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
```sql
SELECT pg_size_pretty( pg_total_relation_size('ci_builds') );
pg_size_pretty
----------------
1456 GB
(1 row)
```
Currently the `ci_builds` table represents around 20% of the total size of our
PostgreSQL database on gitlab.com.
> TODO elaborate
> TODO describe current database scaling initiative
> TODO compare ci_builds size with other tables
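The size comparison mentioned in the TODO above could be pulled with a query
along these lines (an illustrative sketch, best run against a replica; not
part of the blueprint itself):

```sql
-- Ten largest tables by total size (including indexes and TOAST data),
-- to put ci_builds in context.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;
```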
### Data migrations

We can no longer migrate within this table or from the table to a different
table. It is not possible with regular migrations, almost impossible with
background migrations.
Around GitLab 9 we started working on migrating information about build stages
from `ci_builds` to the `ci_stages` table. We hit
[multiple](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/33866)
[problems](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/47454) along the
way.
This means that we need to maintain multiple data formats and schemas, which
adds to our technical debt. We also can't easily reduce the size of this
table, because moving data between tables is difficult, even when using
background migrations.
### Adding new indices
We can no longer add new indices because we already have too many, which
makes writes expensive. [Adding a new index on gitlab.com might take
hours](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32584#note_348817243).
> TODO elaborate
> TODO indexes size graph
### Large indices
We currently index many of the columns we store in the `ci_builds` table,
which makes writes expensive. There is a lot of room for improvement here; we
could modify our indexing strategies to make writes more efficient.
> TODO show indexes sizes here TODO pipelines schedules and database
> performance issue
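The index sizes mentioned above can be inspected with standard PostgreSQL
statistics views, for example (illustrative only):

```sql
-- Per-index size and usage for ci_builds; large but rarely scanned
-- indexes are natural candidates for an indexing-strategy revision.
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS scans_since_stats_reset
FROM pg_stat_user_indexes
WHERE relid = 'ci_builds'::regclass
ORDER BY pg_relation_size(indexrelid) DESC;
```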
### Statement timeouts
We have so much data in the `ci_builds` table that we can't even easily run
simple queries, like counting rows, without hitting statement timeouts.

### Single table inheritance

Storing full class names in the text-based `ci_builds.type` column
unnecessarily consumes too much space and is not efficient enough. It might be
better to use integer enums, but this would require a data migration that is
almost impossible right now.
> TODO elaborate
> TODO mention current size of `ci_builds.type`, and compare it to how it would
> look like if we had been using tiny int enum.
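The comparison in the TODO above could be estimated directly (a sketch; a
sequential scan of `ci_builds` would likely need to run on a replica or in
batches given the statement timeouts described earlier):

```sql
-- Space used today by the textual STI column versus what a 2-byte
-- smallint enum would occupy for the same number of rows.
SELECT pg_size_pretty(sum(pg_column_size(type))) AS sti_text_size,
       pg_size_pretty(count(*) * 2)              AS smallint_equivalent
FROM ci_builds;
```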
## [What] Proposal
We store a lot of data in the `ci_builds` table. Everything there is related
to CI/CD pipelines, but some parts of the data are used for different
purposes, and some elements have a different affinity to a pipeline than
others. In particular, we store both pipeline visualization data and pipeline
processing data there.
Pipeline visualization data and processing data can have different retention
policies too. Separating these types of data can help us vastly reduce the
amount of stored data and split data amongst multiple tables.
This proposal is mostly about separating pipeline processing data from generic
pipeline data that we use for other purposes, most notably for showing
information about pipelines to users.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
### Pipeline visualization data
> TODO elaborate
Pipeline visualization data is everything that we want to show to users when
they visit the pipelines, pipeline, or build pages.
### Pipeline processing data
> TODO elaborate
Pipeline processing data is everything that we need to store in our database
in order to process a pipeline from start to finish. This includes:
* exposing builds to runner
* recalculating statuses
* determining an order of execution
* determining pipeline transitions
* retrying a pipeline and builds
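As a sketch of what this separation might look like (the table and column
names below are hypothetical, not an agreed schema):

```sql
-- Hypothetical table keeping only the data needed to process a pipeline;
-- visualization data would stay in (a slimmed-down) ci_builds.
CREATE TABLE ci_builds_processing (
  build_id bigint PRIMARY KEY REFERENCES ci_builds (id) ON DELETE CASCADE,
  status   smallint NOT NULL,  -- recalculated while the pipeline runs
  options  jsonb,              -- configuration needed to run the build
  retried  boolean NOT NULL DEFAULT FALSE
);
```

Rows in such a table could then be dropped wholesale once a pipeline is
archived, without rewriting `ci_builds` itself.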
We might not need to do these things for pipelines that are old - created
months or years ago. In most cases it would be better to create a new pipeline
than to reprocess existing builds (usually by retrying them).
## [When] Iterations
### Validate build "degeneration" mechanisms
We currently have a number of mechanisms implemented that allow us to
"degenerate" a build and to "archive" builds that seem old and irrelevant.
These mechanisms have been implemented by a few different teams, have never
been tested at scale, and are disabled on gitlab.com. We should revisit them
and figure out whether they are aligned with the initiative described in this
blueprint.
### Soft-archive legacy builds
Once we have all the information about how build degeneration and archival
work, we can make a well-informed decision about archiving builds that are
old.
In the first iteration we should archive builds in a "soft" way - without
actually removing the data from the database. Soft-archived builds are going
to be unretriable and unprocessable, but in case of making a wrong decision
about how old builds should be to get archived, we would still be able to
change our minds about it.
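A "soft" archive could, as an illustration, be as simple as stamping rows
instead of deleting them (the `archived_at` column is hypothetical and does
not exist today):

```sql
-- Illustrative only: mark builds older than a cutoff as archived,
-- without removing any data from the database.
UPDATE ci_builds
SET archived_at = now()
WHERE created_at < now() - interval '1 year'
  AND archived_at IS NULL;
```

Application code would then refuse to retry or reprocess builds with
`archived_at` set, which keeps the decision reversible.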
### Archive and remove legacy processing data
In order to migrate data between `ci_builds` and other tables we need to remove
old data, because there is currently too much data there to migrate it
somewhere without major problems.
This would be a destructive action, but perhaps we can devise a way to store
the data that we are going to remove from PostgreSQL in a different type of
storage, for example object storage. That would make it a two-way-door
decision: however difficult recreating the database state might be, being able
to revert a wrong decision might be important here.
Removing data from the database will require a sign-off from executives and
product people.
`ci_builds_metadata` is a table that we created to separate processing data
from other kinds of pipeline data. Because of entropy and the passage of time,
we now store other kinds of data there too.
### Ensure `ci_builds_metadata` contains only processing data
We might want to prepare a background migration that will move the processing
data that remains in the database after archiving old builds, especially
`ci_builds.options`, to `ci_builds_metadata.config_options`.
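Such a background migration could boil down to a batched backfill along these
lines (a sketch; it assumes compatible column types, while a real migration
would also have to translate the serialized format):

```sql
-- Hypothetical single batch of the backfill; a background migration
-- would iterate over successive id ranges to avoid statement timeouts.
UPDATE ci_builds_metadata m
SET config_options = b.options
FROM ci_builds b
WHERE m.build_id = b.id
  AND b.id BETWEEN 1 AND 10000
  AND b.options IS NOT NULL;
```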
We already have support for writing the data that usually goes to
`ci_builds.options` into `ci_builds_metadata.config_options`, but this feature
has never been enabled on gitlab.com and is currently disabled behind the
`ci_build_metadata_config` feature flag.
### Ensure `ci_builds_metadata` contains only a complete set of processing data
This table currently contains a number of columns, and we should check whether
the data in all of them is safe to remove once a build gets archived.
> TODO elaborate what `ci_builds_metadata` is and why it exists
### Move other processing columns
There are other columns that could be moved to either `ci_builds_metadata` or
`ci_pipelines_metadata`. We should move them too.
### Rebuild archival mechanisms to remove rows from `ci_builds_metadata`
Instead of rewriting rows in `ci_builds`, it might be better to remove rows
from `ci_builds_metadata`. We can also devise partitioning mechanisms for this
table.