Commit a284efee authored by Grzegorz Bizon

Add CI/CD pipelines storage improvements blueprint

---
layout: markdown_page
title: "GitLab CI/CD data storage improvements"
---
 
## GitLab CI/CD pipelines data storage improvements
 
GitLab CI/CD is one of the most data- and compute-intensive parts of the
GitLab product. Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/)
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
has become [one of the most loved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
 
> TODO pipelines usage growth graph here
 
GitLab CI/CD has come a long way since being released to users, but the design
of the data storage for pipeline builds has remained almost the same since
2012. We store all builds in PostgreSQL, in the `ci_builds` table, and because
we create more than 0.5 million builds each day on gitlab.com, we are slowly
approaching the database's limits.
> TODO ci_builds size graph / data growth graphs
## [Why] Problems
We described the most important problems in [the
issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These include:
### Database size
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
> TODO elaborate
> TODO describe current database scaling initiative
> TODO compare ci_builds size with other tables
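As a starting point for the comparison, one could query PostgreSQL's size
functions. A minimal sketch in Ruby that only builds the query (running it
requires a database connection, and the table list is illustrative, not a
finalized choice):

```ruby
# Builds a query comparing the on-disk footprint of several tables.
# pg_total_relation_size includes indexes and TOAST data, while
# pg_relation_size covers only the main heap.
def table_size_sql(tables)
  list = tables.map { |name| "'#{name}'" }.join(', ')
  <<~SQL
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
           pg_size_pretty(pg_relation_size(relid))       AS heap_size
    FROM pg_catalog.pg_statio_user_tables
    WHERE relname IN (#{list})
    ORDER BY pg_total_relation_size(relid) DESC;
  SQL
end

puts table_size_sql(%w[ci_builds ci_builds_metadata merge_requests])
```

Comparing `total_size` with `heap_size` also shows how much of the footprint
comes from indices alone.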
### Data migrations
We can no longer migrate data within this table, or from this table to a
different one. It is not possible with regular migrations and almost
impossible with background migrations.

This means that we need to maintain multiple data formats and schemas, which
introduces more technical debt. We also can't easily reduce the size of this
table, because moving data between tables is difficult, even when using
background migrations.
> TODO elaborate
### Adding new indices
We can no longer add new indices because we already have too many, which makes
writes expensive. [Adding a new index on gitlab.com might take
hours](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32584#note_348817243).
> TODO elaborate
> TODO indexes size graph
### Statement timeouts
We have so much data in the `ci_builds` table that we can't even easily count
rows in it using a `COUNT` statement, even when using an index.
> TODO elaborate
> TODO more examples
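When an exact `COUNT(*)` cannot finish within the statement timeout, the
planner's estimate from `pg_class.reltuples` is often a good enough substitute
for dashboards. A hedged sketch; the helper below only builds the SQL and is
not an existing GitLab API:

```ruby
# Exact row counts require scanning an index or the heap, which times
# out on a table of this size. The planner statistics in pg_class give
# an instant, approximate answer instead (refreshed by ANALYZE).
def estimated_count_sql(table)
  <<~SQL
    SELECT reltuples::bigint AS estimate
    FROM pg_class
    WHERE relname = '#{table}';
  SQL
end

puts estimated_count_sql('ci_builds')
```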
### Using STI
We are using [Single Table Inheritance in
Rails](https://api.rubyonrails.org/classes/ActiveRecord/Inheritance.html).
This mechanism unnecessarily consumes too much space and is not efficient
enough. It might be better to use integer enums, but this would require a data
migration that is almost impossible right now.
> TODO elaborate
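To illustrate the enum idea: instead of persisting the full class name in the
STI `type` string column, an integer column plus a mapping would suffice. A
minimal sketch; the integer values are hypothetical and would have to be
pinned down by a data migration:

```ruby
# Hypothetical mapping from STI class names to integers. A smallint
# costs 2 bytes per row and compares faster than a variable-length
# string such as 'Ci::Build'.
BUILD_TYPE_ENUM = {
  'Ci::Build'  => 0,
  'Ci::Bridge' => 1
}.freeze

def type_to_enum(sti_type)
  BUILD_TYPE_ENUM.fetch(sti_type)
end

def enum_to_type(value)
  BUILD_TYPE_ENUM.key(value)
end
```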
## [What] Proposal
We store a lot of data in the `ci_builds` table, but the data stored there has
different affinity to a pipeline. In particular, we store both pipeline
visualization data and pipeline processing data there.

Pipeline visualization data and processing data can also have different
retention policies. Separating these two types of data can help us vastly
reduce the amount of stored data and split it across multiple tables.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
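The split could look roughly like the sketch below. The attribute lists are
assumptions made for illustration, not a finalized schema:

```ruby
# Partition a build row into attributes needed to render a finished
# pipeline (visualization) versus attributes only needed while the
# pipeline is being processed. The column lists are illustrative.
VISUALIZATION_KEYS = %i[name status stage started_at finished_at].freeze
PROCESSING_KEYS    = %i[options yaml_variables lock_version].freeze

def split_build(row)
  {
    visualization: row.slice(*VISUALIZATION_KEYS),
    processing:    row.slice(*PROCESSING_KEYS)
  }
end

row = { name: 'rspec', status: 'success', options: { script: ['rspec'] } }
p split_build(row)
```

Processing attributes could then be dropped once a pipeline finishes, while
visualization attributes are kept for as long as we want the pipeline to
render.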
### Pipeline visualization data
> TODO elaborate
### Pipeline processing data
> TODO elaborate
## [When] Iterations
### Devise a metric for `ci_builds` situation
We should have a metric for the size of `ci_builds` and its indices, or
another metric that can help us measure the current situation and the impact
of future improvements.
### Validate build "degeneration" mechanisms
We currently have several mechanisms implemented that allow us to "degenerate"
a build and to "archive" builds that are old and no longer relevant. These
mechanisms were implemented by a few different teams and are disabled on
gitlab.com. We should revisit them and figure out whether they are aligned
with the initiative described in this blueprint.
### Soft-archive legacy builds
Once we have all the data regarding the points above, we can make a
well-informed decision about archiving old builds.

In the first iteration we should archive builds in a "soft" way, without
actually removing the data from the database. This will allow us to optimize
this iteration for user feedback: archived builds will become unretriable and
unprocessable in the user interface, but if we make a wrong decision about how
old a build needs to be to get archived, we will still be able to change our
minds about it.
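The gating logic for a soft archive could be as simple as the sketch below;
the 12-month cutoff is a placeholder, not a decided policy:

```ruby
require 'date'

# A build past the cutoff stays in the database but is treated as
# archived: still visible, no longer retryable or processable.
ARCHIVE_AFTER_MONTHS = 12 # placeholder value, to be decided

def archived?(created_at, today: Date.today)
  created_at < (today << ARCHIVE_AFTER_MONTHS)
end

def retryable?(build, today: Date.today)
  !archived?(build[:created_at], today: today)
end
```

Because the data is still present, lowering or raising the cutoff later only
changes this predicate, not the stored rows.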
### Remove archived data
In order to migrate data between `ci_builds` and `ci_builds_metadata` we first
need to remove old data, because there is currently too much to migrate it
without major problems.

This will be a destructive action, but perhaps we can devise a way to store
the data we remove from PostgreSQL in a different type of storage, making this
a two-way decision. However difficult recreating the database state would be,
being able to revert a wrong decision might be important here.
This will require a sign-off from executives and product people.
### Migrate `options` column from `ci_builds` to `ci_builds_metadata`
Prepare a background migration that will move processing data, especially
`ci_builds.options`, to `ci_builds_metadata.processing_options`.
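An in-memory sketch of the per-row step (the real version would be a batched
background migration over ActiveRecord relations; `processing_options` is the
proposed column name from this section):

```ruby
# Copy ci_builds.options into ci_builds_metadata.processing_options and
# clear the source column. Rows are plain hashes for illustration.
def migrate_options!(build, metadata)
  return if build[:options].nil?

  metadata[:processing_options] = build[:options]
  build[:options] = nil
end

build    = { id: 1, options: { script: ['make test'] } }
metadata = { build_id: 1 }
migrate_options!(build, metadata)
```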
### Ensure `ci_builds_metadata` contains only processing data
This table currently contains a bunch of columns, but we should check whether
the data in all of them is safe to remove once a build gets archived.
> TODO elaborate what `ci_builds_metadata` is and why it exists
### Move other processing columns
There are other columns that could be moved to either `ci_builds_metadata` or
`ci_pipelines_metadata`. We should move them too.
### Rebuild degeneration mechanisms to remove `ci_builds_metadata` entries
Instead of rewriting rows in `ci_builds`, it might be better to remove rows
from `ci_builds_metadata`. We can also devise partitioning mechanisms for this
table in the future.
### Remove legacy columns from `ci_builds`
There are a bunch of deprecated columns in that table; we should remove them
too. The `stage` column is known to take a lot of space, and it might not be
needed anymore after we enable the `ci_atomic_processing` feature flag.
### Resolve the problem of STI
We can replace STI mechanisms with integer enums.
> TODO, elaborate
### Move and remove indices
Once we move processing data, we might also be able to move some indices. We
should never remove an index until its replacement is in place.
> TODO elaborate
We should extend this blueprint with more ideas about how to reduce the size of
indexes.
## Who
Proposal:

| Role | Who |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | TBD |
| Domain Expert | Grzegorz Bizon |
DRIs:

| Role | Who |
|------------------------------|------------------------|
| Product | TBD |
| Leadership | TBD |
| Engineering | TBD |
| Domain Expert | Grzegorz Bizon |