---
layout: markdown_page
title: "GitLab CI/CD data storage improvements"
---
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of the most data- and compute-intensive parts of the
GitLab product. Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/)
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
has become [one of the most loved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
GitLab CI/CD has come a long way since being released to users, but the design
of the data storage for pipeline builds has remained almost the same since
2012. We store all builds in PostgreSQL, in the `ci_builds` table, and because
we create more than 0.5 million builds each day on gitlab.com, we are slowly
reaching the limits of the database.
> TODO ci_builds size graph / data growth graphs
## [Why] Problems
We described the most important problems in [the
issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These include:
### Database size
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
```sql
SELECT pg_size_pretty( pg_total_relation_size('ci_builds') );
pg_size_pretty
----------------
1456 GB
(1 row)
```
> TODO elaborate
> TODO describe current database scaling initiative
> TODO compare ci_builds size with other tables
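That comparison could be produced with a query along these lines; this is only
a sketch built on PostgreSQL's statistics views:
```sql
-- Sketch: list the largest tables, so ci_builds can be compared with them.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```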
### Data migrations
We can no longer migrate data within this table, or from this table to a
different one. It is not possible with regular migrations and almost
impossible with background migrations.
This means that we need to maintain multiple data formats and schemas, which
keeps adding technical debt. We also can't easily reduce the size of this
table, because moving data between tables is difficult, even when using
background migrations.
> TODO elaborate
### Adding new indices
We can no longer add new indices because we already have too many, which makes
writes expensive. [Adding a new index on gitlab.com might take
hours](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32584#note_348817243).
> TODO elaborate
> TODO indexes size graph
### Large indices
We currently index a lot of the data that we store in the `ci_builds` table,
which makes writes expensive. There is a lot of room for improvement here; we
could modify our indexing strategies to make writes more efficient.
> TODO show indexes sizes here
> TODO pipelines schedules and database performance issue
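A query sketch that could surface those index sizes:
```sql
-- Sketch: per-index on-disk size for ci_builds, largest first.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'ci_builds'
ORDER BY pg_relation_size(indexrelid) DESC;
```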
### Statement timeouts
We have so much data in the `ci_builds` table that we can't even easily count
its rows with a `COUNT` statement, even when using an index.
> TODO elaborate
> TODO more examples
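As one illustration of the problem (the timeout value is an assumption modeled
on gitlab.com's production settings):
```sql
-- Illustration only: counting the table has to visit a huge number of index
-- entries, so it can exceed the statement timeout configured in production.
SET statement_timeout = '15s';
SELECT COUNT(*) FROM ci_builds;
-- ERROR:  canceling statement due to statement timeout
```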
### Using STI
We are using [Single Table Inheritance in
Rails](https://api.rubyonrails.org/classes/ActiveRecord/Inheritance.html) in
this table. This mechanism consumes more space than necessary and is not
efficient enough. It might be better to use integer enums, but this would
require a data migration that is almost impossible right now.
> TODO mention current size of `ci_builds.type`, and compare it to how it would
> look like if we had been using tiny int enum.
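As a rough per-value illustration of the potential savings (exact on-disk
sizes vary with headers and alignment):
```sql
-- Rough comparison of one STI class name against a tiny integer enum.
SELECT pg_column_size('Ci::Build'::text) AS sti_type_bytes, -- string class name
       pg_column_size(1::smallint)       AS enum_bytes;     -- 2 bytes
```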
## [What] Proposal
We store a lot of data in the `ci_builds` table. All of it relates to CI/CD
pipelines, but some parts serve different purposes, and some elements have a
different affinity to a pipeline than others. In particular, we store both
pipeline visualization data and pipeline processing data there.
Visualization data and processing data can also have different retention
policies. Separating these two types of data can help us vastly reduce the
amount of stored data and split it amongst multiple tables.
This proposal is mostly about separating pipeline processing data from generic
pipeline data that we use for other purposes, most notably for showing
information about pipelines to users.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
### Pipeline visualization data
Pipeline visualization data is everything that we want to show to a user who
visits a pipelines, pipeline, or build page.
### Pipeline processing data
Pipeline processing data is everything that we need to store in our database
to process a pipeline from start to finish. This includes:
* exposing builds to runner
* recalculating statuses
* determining an order of execution
* determining pipeline transitions
* retrying a pipeline and builds
We might not need to do any of these things for old pipelines - those created
months or years ago. In most cases it would be better to create a new pipeline
than to reprocess existing builds (usually by retrying them).
## [When] Iterations
### Devise a metric for `ci_builds` situation
We should have a metric for the size of `ci_builds` and its indices, or
another metric that can help us measure the current situation and the impact
of future improvements.
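One candidate, sketched below, is tracking the table size and the index size
as separate measurements over time:
```sql
-- Sketch: heap size and index size measured separately.
SELECT pg_size_pretty(pg_table_size('ci_builds'))   AS table_size,
       pg_size_pretty(pg_indexes_size('ci_builds')) AS indexes_size;
```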
### Validate build "degeneration" mechanisms
We currently have a number of mechanisms that allow us to "degenerate" a build
and to "archive" builds that seem to be old. These mechanisms have been
implemented by a few different teams, have never been tested, and are disabled
on gitlab.com. We should revisit them and figure out whether they are aligned
with the initiative described in this blueprint.
### Soft-archive legacy builds
Once we have all the information about how build degeneration and archival
work, we can make a well-informed decision about archiving old builds.
In the first iteration we should archive builds in a "soft" way - without
actually removing the data from the database. This will allow us to optimize
this iteration for user feedback: in the user interface archived builds will
become unretriable and unprocessable, and if we make a wrong decision about
how old builds should be before we archive them, we will still be able to
change our minds about it.
### Archive and remove legacy processing data
In order to migrate data between `ci_builds` and other tables, we first need
to remove old data, because there is currently too much of it to migrate
anywhere without major problems.
This would be a destructive action, so perhaps we need to devise a way to
store the data that we are going to remove from PostgreSQL in a different type
of storage, for example object storage. This would make it a two-way-door
decision: however difficult recreating the database state might be, being able
to revert a wrong decision might be important here.
Removing data from the database will require a sign-off from executives and
product people.
### Migrate `options` column from `ci_builds` to `ci_builds_metadata`
`ci_builds_metadata` is a table that we created to separate processing data
from other kinds of pipeline data. Because of entropy and time, we now store
different things there too.
We might want to prepare a background migration that moves the processing data
remaining in the database after archiving old builds, especially
`ci_builds.options`, to `ci_builds_metadata.config_options`.
We already support writing the data that usually goes to `ci_builds.options`
into `ci_builds_metadata.config_options` instead, but this feature has never
been enabled on gitlab.com and is currently disabled behind the
`ci_build_metadata_config` feature flag.
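The shape of such a move could resemble the sketch below. It is illustrative
only: the real migration would run in batches from application code, and would
need to convert between serialization formats rather than copy values
directly.
```sql
-- Illustrative batching skeleton only; the real background migration would
-- convert ci_builds.options (serialized) into config_options properly.
UPDATE ci_builds_metadata AS m
SET config_options = to_jsonb(b.options) -- placeholder for the real conversion
FROM ci_builds AS b
WHERE m.build_id = b.id
  AND b.options IS NOT NULL
  AND b.id BETWEEN 1 AND 10000;          -- small id ranges keep batches cheap
```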
### Ensure `ci_builds_metadata` contains only a complete set of processing data
This table currently contains a bunch of columns, and we should check whether
the data in all of them is safe to remove once a build gets archived.
### Move other processing columns
There are other columns that could be moved to either `ci_builds_metadata` or
`ci_pipelines_metadata`. We should move them too.
### Rebuild archival mechanisms to remove rows from `ci_builds_metadata`
Instead of rewriting rows in `ci_builds`, it might be better to remove rows
from `ci_builds_metadata`. We can also devise partitioning mechanisms for this
table in the future.
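A minimal sketch of that removal, assuming archived builds can be identified
by age (the cutoff is an arbitrary example):
```sql
-- Sketch: drop processing rows for archived builds instead of rewriting
-- ci_builds itself.
DELETE FROM ci_builds_metadata
WHERE build_id IN (
  SELECT id
  FROM ci_builds
  WHERE created_at < NOW() - INTERVAL '1 year'
  LIMIT 1000 -- removed in small batches
);
```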
### Remove legacy columns from `ci_builds`
There are a bunch of deprecated columns in that table; we should remove them
too. The `stage` column is known to take a lot of space, and it might no
longer be needed after we enable the `ci_atomic_processing` feature flag.
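The end state of this step for the `stage` column could be as simple as the
statement below, once the application code no longer reads or writes the
column:
```sql
-- Sketch: the eventual post-deployment step for a deprecated column.
ALTER TABLE ci_builds DROP COLUMN stage;
```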
### Resolve the problem of STI
We can replace STI mechanisms with integer enums.
> TODO, elaborate
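One possible shape of this change, sketched with a hypothetical column name
and mapping:
```sql
-- Hypothetical sketch: add an integer column, backfill it from the STI
-- class names, then switch the application over from `type`.
ALTER TABLE ci_builds ADD COLUMN type_enum smallint;

UPDATE ci_builds
SET type_enum = CASE type
                  WHEN 'Ci::Build'  THEN 1
                  WHEN 'Ci::Bridge' THEN 2
                  ELSE 0
                END
WHERE id BETWEEN 1 AND 10000; -- backfilled in batches
```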
### Move and remove indices
Once we move processing data, we might also be able to move indexes. We should
never remove an index until a new one is set up.
> TODO elaborate
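A sketch of that ordering, with hypothetical index and column names:
```sql
-- Hypothetical names: build the replacement index first, and drop the old
-- one only after the new one is in place and used by queries.
CREATE INDEX CONCURRENTLY index_ci_builds_metadata_on_example
  ON ci_builds_metadata (example_column);

DROP INDEX CONCURRENTLY index_ci_builds_on_example;
```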
We should extend this blueprint with more ideas about how to reduce the size of
indexes.
## Who
Proposal:
| Role                          | Who                     |
|-------------------------------|-------------------------|
| Author                        | Grzegorz Bizon          |
| Architecture Evolution Coach  | Gerardo Lopez-Fernandez |
| Engineering Leader            | TBD                     |
| Domain Expert                 | Grzegorz Bizon          |
DRIs:
| Role                          | Who                     |
|-------------------------------|-------------------------|
| Product                       | TBD                     |
| Leadership                    | TBD                     |
| Engineering                   | TBD                     |
| Domain Expert                 | Grzegorz Bizon          |