
WIP: GitLab CI/CD pipelines storage improvement

Closed Grzegorz Bizon requested to merge blueprint/gb/pipelines-storage-improvements into master
---
layout: markdown_page
title: "GitLab CI/CD data storage improvements"
---
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of GitLab's most data- and compute-intensive components.
Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
has become [one of the most beloved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
GitLab CI/CD has come a long way since the initial release, but the design of
the data storage for pipeline builds has remained almost unchanged since 2012.
We store all the builds in PostgreSQL, in the `ci_builds` table, and because we
are creating more than 0.5 million builds each day on gitlab.com, we are slowly
reaching the limits of the database.
> TODO ci_builds size graph / data growth graphs
## [Why] Problems
We described the most important problems in [the
issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These include:
### Database size
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
```sql
SELECT pg_size_pretty( pg_total_relation_size('ci_builds') );
pg_size_pretty
----------------
1456 GB
(1 row)
```
    • **Grzegorz Bizon (author):** TODO: add the date when the table size was checked in production; the table grows quickly, so a few weeks from now this figure will be different.
Currently the `ci_builds` table represents around 20% of the total size of our
PostgreSQL database on gitlab.com.
> TODO elaborate
> TODO describe current database scaling initiative
> TODO compare ci_builds size with other tables
### Data migrations
We can no longer easily migrate data within this table, or from this table to a
different one. It is not possible with regular migrations and almost impossible
with background migrations.
Around GitLab 9 we started working on migrating information about build stages
from `ci_builds` to the `ci_stages` table. We hit
[multiple](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/33866)
[problems](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/47454) along the
way.
This means that we need to maintain multiple data formats and schemas, which
adds more technical debt. We also can't easily reduce the size of this table,
because moving data between tables is difficult, even when using background
migrations.
> TODO elaborate
### Adding new indices
We can no longer add new indices because we already have too many, which makes
writes expensive. [Adding a new index on gitlab.com might take
hours](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32584#note_348817243).
> TODO elaborate
> TODO indexes size graph
### Large indices
We currently index a lot of the data that we store in the `ci_builds` table,
which makes writes expensive. There is a lot of room for improvement here: we
could modify our indexing strategies to make writes more efficient.
> TODO show indexes sizes here TODO pipelines schedules and database
> performance issue
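As a starting point for the TODO above, something along these lines could list the per-index sizes for `ci_builds` (a sketch, to be run against a replica):
```sql
-- Sketch: list all indexes on ci_builds with their on-disk sizes,
-- largest first (run against a replica to avoid load on the primary).
SELECT
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'ci_builds'
ORDER BY pg_relation_size(indexrelid) DESC;
```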
### Statement timeouts
We have so much data in the `ci_builds` table that we can't even easily count
rows in that table using a `COUNT` statement, even when using an index.
> TODO elaborate
> TODO more examples
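For rough monitoring purposes an estimate based on planner statistics is usually good enough and does not hit statement timeouts; a minimal sketch:
```sql
-- Sketch: approximate row count from planner statistics instead of a
-- full COUNT(*), which times out on a table of this size.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'ci_builds';
```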
### Using STI
We are using [Single Table Inheritance in
Rails](https://api.rubyonrails.org/classes/ActiveRecord/Inheritance.html) for
builds. This mechanism stores a string class name in every row, which consumes
an unnecessary amount of space and is not efficient. It might be better to use
integer enums, but this would require a data migration that is almost
impossible right now.
> TODO mention current size of `ci_builds.type`, and compare it to how it would
> look like if we had been using tiny int enum.
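One way to approach the TODO above is to reuse the `pg_column_size` technique shown in the discussion further down this page; the 2-byte figure below assumes a `smallint` enum:
```sql
-- Sketch: compare the storage used by the STI string column
-- ci_builds.type with what a 2-byte smallint enum would need.
SELECT
  pg_size_pretty(sum(pg_column_size(type))) AS sti_type_size,
  pg_size_pretty(count(*) * 2) AS smallint_enum_size
FROM ci_builds;
```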
## [What] Proposal
Top-level goals:
1. Remove columns with redundant data from the `ci_builds` table
1. Separate pipeline processing data from visualization data
1. Define a data retention policy for pipeline processing data
1. Devise a strategy for `ci_builds` partitioning
### Remove columns with redundant data
The `ci_builds` table has a long history and has accumulated some technical
debt throughout the years.
Two important examples to mention here are the extraction of data describing
artifacts and stages from this table.
A few years ago we extracted the `ci_stages` table from `ci_builds`, but we
never managed to stop using the information about stages stored in the
`ci_builds` table, notably in the `ci_builds.stage` and `ci_builds.stage_idx`
columns. The data stored there is completely redundant, because we also store
it in the `ci_stages.name` and `ci_stages.position` columns.
Similarly, we have a `ci_job_artifacts` table and a bunch of
`ci_builds.artifacts_*` columns that are either unused or hold redundant data.
We presumably have more columns like that; we should find them and devise a
strategy for removing them and all the data stored in these columns (after we
confirm that the data is entirely redundant and can be removed safely).
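Before dropping such a column we could verify the redundancy claim directly. A minimal sketch for the stage data, assuming `ci_builds.stage_id` references `ci_stages`:
```sql
-- Sketch: count builds whose denormalized stage name diverges from the
-- canonical value in ci_stages; zero mismatches means the column can go.
SELECT count(*) AS mismatched_rows
FROM ci_builds
JOIN ci_stages ON ci_stages.id = ci_builds.stage_id
WHERE ci_builds.stage IS DISTINCT FROM ci_stages.name;
```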
### Separate pipeline processing data
We store a lot of data in the `ci_builds` table. All of it is related to CI/CD
pipelines, but different parts of the data are used for different purposes and
some elements have a different affinity to a pipeline than others. In
particular, we store both pipeline visualization data and pipeline processing
data there.
Pipeline visualization data and processing data can also have different
retention policies. Separating these types of data can help us vastly reduce
the amount of stored data and split the data amongst multiple tables.
This proposal is mostly about separating pipeline processing data from generic
pipeline data that we use for other purposes, most notably for showing
information about pipelines to users.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
    • **Grzegorz Bizon (author):** @Finotto would you be able to check the total size of `ci_builds.options` and what percentage of the total size of `ci_builds` it represents?

      • **Jose Finotto:** @grzesiek we have the following tables with 'build' in their name in the database, and I do not see any clear options table there:

        SELECT
           relname as "Table",
           pg_size_pretty(pg_total_relation_size(relid)) As "Size"
           FROM pg_catalog.pg_statio_user_tables where relname like '%build%' ORDER BY pg_total_relation_size(relid) DESC;
                        Table                |  Size
        -------------------------------------+---------
         ci_builds                           | 1539 GB
         ci_build_trace_sections             | 459 GB
         ci_builds_metadata                  | 71 GB
         ci_build_needs                      | 9362 MB
         ci_build_trace_section_names        | 1795 MB
         ci_daily_build_group_report_results | 199 MB
         ci_build_trace_chunks               | 35 MB
         packages_build_infos                | 20 MB
         ci_builds_runner_session            | 824 kB
         ci_build_report_results             | 24 kB

        `ci_builds` represents at the moment 1539 GB of a total database size of 7369 GB, which is approximately 21% of the database.

      • **Grzegorz Bizon (author):** @Finotto I mean the `options` column in the `ci_builds` table - what percentage of the total data in `ci_builds` do we store in the `options` column?

      • **Jose Finotto:** @grzesiek I added the info:

        gitlabhq_production=# select
        gitlabhq_production-#     pg_size_pretty(sum(pg_column_size(options))) as total_size,
        gitlabhq_production-#     pg_size_pretty(avg(pg_column_size(options))) as average_size,
        gitlabhq_production-#     sum(pg_column_size(options)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
        gitlabhq_production-# from ci_builds;
         total_size |        average_size        |     percentage
        ------------+----------------------------+---------------------
         354 GB     | 667.0868096973743355 bytes | 22.8841407398832713

        So this column represents 22% of the whole table. I would suggest normalizing it and creating a separate table.

        cc: @glopezfernandez @albertoramos @NikolayS

      • **Grzegorz Bizon (author):** I expect that we might be able to remove 90% of this data and reduce our database size by around 300 gigabytes. It still depends on how many pipelines in the database are older than 3 and 6 months, but I think it is presumably more than 90%.

        @Finotto can you also check the size of `ci_builds.stage`? I expect that this column will be completely redundant after we wrap up our effort on atomic processing of pipelines. I think this is a few gigabytes too, and if everything goes right, this data will not be needed at all either.

      • **Grzegorz Bizon (author):** @Finotto can you please take a look at the size of `ci_builds.stage`? This is a column that we might be able to remove completely, even without a backup (because we already have a backup in `ci_stages.name`), so this might be the low-hanging fruit we are looking for here too.

      • **Jose Finotto:** @grzesiek sorry for the late answer, sure, I will give you the info in a few minutes.

      • **Jose Finotto:** @grzesiek the info is:

        # select
         pg_size_pretty(sum(pg_column_size(stage))) as total_size,
         pg_size_pretty(avg(pg_column_size(stage))) as average_size,
         sum(pg_column_size(stage)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
        from ci_builds;
         total_size |       average_size       |       percentage
        ------------+--------------------------+------------------------
         4590 MB    | 7.5525122452349447 bytes | 0.25371424219677802001
        (1 row)

      • @grzesiek please do not forget that this column is part of an index:

        "index_ci_builds_on_project_id_for_successfull_pages_deploy" btree (project_id) WHERE type::text = 'GenericCommitStatus'::text AND stage::text = 'deploy'::text AND name::text = 'pages:deploy'::text AND status::text = 'success'::text

      • **Grzegorz Bizon (author):** Thanks!
#### Pipeline visualization data
Pipeline visualization data is about all the things that we want to show to a
user when someone visits the pipelines, pipeline, or build page.
#### Pipeline processing data
Pipeline processing data is about all the things that we need to store in our
database in order to process a pipeline from start to finish. This includes:
* exposing builds to runner
* recalculating statuses
* determining an order of execution
* determining pipeline transitions
* retrying a pipeline and builds
We might not need to do these things for pipelines that are old - created
months or years ago. In most cases it would be better to create a new pipeline
than to reprocess existing builds (usually by retrying them).
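Deciding on a threshold requires knowing how much of the table is that old. A sketch of such an estimate (it would likely have to run on a replica or an archive copy, given the `COUNT` limitations described earlier):
```sql
-- Sketch: estimate what share of builds is older than 6 months and would
-- therefore be a candidate for having its processing data archived.
SELECT
  count(*) FILTER (WHERE created_at < now() - interval '6 months') * 100.0
    / count(*) AS percent_older_than_6_months
FROM ci_builds;
```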
    • This made me think of retried builds. I'm wondering whether retried builds could be automatically archived, including associated data (processing data, artifacts, traces, etc.). Do we use retried builds for anything today? Would it be acceptable to even delete the build's intrinsic data if we don't use it anywhere? This would mean that as we retry a build we replace it with a new one rather than adding a new "instance" of the same build. Perhaps configurable via an instance-level setting.
### Devise strategy for `ci_builds` partitioning
We can reduce the size of `ci_builds` significantly, but we do not plan to
remove the information about builds that we use to display them to users. This
means that we are still going to keep all of them in the database, and given
the current rate of growth of this table, we might still need to explore
partitioning.
It might be possible to partition this table by build creation date, but this
requires a technical evaluation to find the best way to partition the table.
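One option to evaluate is PostgreSQL declarative range partitioning by creation date. A minimal sketch of what the target schema could look like (table and partition names are illustrative only, and the migration path for an existing multi-terabyte table is a separate problem):
```sql
-- Sketch: a range-partitioned variant of ci_builds keyed by creation date.
CREATE TABLE ci_builds_partitioned (
  LIKE ci_builds INCLUDING DEFAULTS
) PARTITION BY RANGE (created_at);
-- One partition per year (or month) of build creation.
CREATE TABLE ci_builds_2020 PARTITION OF ci_builds_partitioned
  FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
```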
## [When] Iterations
We can iterate on this in two parallel tracks.
### Metric
What matters most is reducing the size of the table on the primary database. We
should have a metric that clearly shows our progress. This can be the total
size of the `ci_builds` table on the primary database, including indexes.
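A simple way to track this metric, split into table data and index overhead (a sketch building on the size query shown earlier):
```sql
-- Sketch: the proposed metric, broken down into heap and index size.
SELECT
  pg_size_pretty(pg_table_size('ci_builds'))          AS table_data,
  pg_size_pretty(pg_indexes_size('ci_builds'))        AS indexes,
  pg_size_pretty(pg_total_relation_size('ci_builds')) AS total;
```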
### Track A
#### 1. Unblock engineers by documenting how to store data
Currently engineers can no longer add new columns to the `ci_builds` table,
because it is too large (more than 50 columns). We should write documentation
on how to work around this limitation so that engineers are no longer blocked.
    • We could probably create a section in https://docs.gitlab.com/ee/development/cicd/index.html. Things we could consider are:

      • How to identify whether data to be stored is processing data, intrinsic data, or (optionally) metadata.
      • Introduce a `ci_builds_processing` table if the new column is related to processing data. Add the column to `ci_builds_metadata` if the data is better identified as metadata.
      • Decide what to do if the new data to be stored is intrinsic data, and hence can't currently be added to `ci_builds` until we drop some data. Are we allowed to make exceptions while working on dropping data? Should we create a separate table for additional data to later be migrated back into `ci_builds`? Should we temporarily use `ci_builds_metadata` for it?
#### 2. Remove legacy columns with redundant data
Devise a way to remove columns with redundant data - the columns related to
stages and artifacts. Find other columns that can be removed, and ensure that
the data is indeed redundant; otherwise do not remove it without a proper
backup.
    • I think similar data that seems redundant is also `ref`, `protected` and `tag`, because they are defined on the pipeline too.

      Similarly to removing data, we could also move data to different tables. This is different from separating processing data from presentation data. For example:

      • `coverage` and `coverage_regex` could be normalized into a different table `ci_build_coverage` managed by group::testing. This data is not present for every build.
      • `target_url` and `description` seem to be used by external CI providers. We could move those to a different table, as this data is not present for every build.

      I would still expect some overall reduction of data size because we would insert a coverage data record only if the coverage feature is used. Same for the external CI data.

      Maybe this could be a point `#### 3.` of Track A, with the STI replacement moved to a section "Other ideas to explore"?
#### 3. Resolve the problem of STI
Replace the STI mechanism with integer enums. Before doing it, estimate the
benefit and the storage space saved by normalizing data this way.
### Track B
#### 1. Separate pipeline processing and presentation data
In order to separate pipeline processing and presentation data we need to move
a bunch of columns from the `ci_builds` table to `ci_builds_metadata`.
The first step could be enabling a feature flag that writes processing data to
the new location for new builds, without migrating data for old builds. This
feature flag is already implemented under `ci_build_metadata_config`, but it is
important to revisit the code to ensure that processing data and presentation
data are separated correctly. This feature flag has never been enabled in the
production environment.
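Once the flag is enabled we would want to confirm that new builds actually write their processing data to the new location. A sketch, assuming the flag stores build configuration in a `ci_builds_metadata.config_options` column (the column name is an assumption):
```sql
-- Sketch: count metadata rows that carry configuration in the new location
-- (config_options is assumed to be the target column of the feature flag).
SELECT count(*) AS builds_with_migrated_config
FROM ci_builds_metadata
WHERE config_options IS NOT NULL;
```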
#### 2. Soft-archive legacy build processing data
Before we remove data we should archive builds in a "soft" way - without
actually removing processing data from the database.
This will allow us to optimize this iteration for user feedback: in the user
interface the archived builds are going to be unretriable and unprocessable,
so if we make a wrong decision about how old builds should be before we archive
them, we will still be able to change our minds. We will also restrict access
to processing data, which will allow us to surface mistakes in the data
separation.
#### 3. Archive legacy processing data
Enable continuous processing data archival mechanisms according to the
retention policy defined for gitlab.com, and move legacy processing data from
the primary database to a different place.
This might be a destructive action, so perhaps we need to devise a way to back
up the data that we are going to remove from PostgreSQL. We can store it in a
different type of storage, for example object storage. This will make it a
two-way-door decision; however difficult recreating the database state might
be, being able to revert a wrong decision might be important here.
Moving data out of the database will require a sign-off from executives and
product people.
#### 4. Migrate processing data
Build a migration mechanism to migrate processing data out of `ci_builds` for
on-premises installations.
## Who
Proposal:
| Role | Who |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | Christopher Lefelhocz |
| Domain Expert | Fabio Pitino |
| Domain Expert | Kamil Trzciński |
DRIs:
| Role | Who |
|------------------------------|------------------------|
| Product | TBD |
| Leadership | TBD |
| Engineering | TBD |
| Domain Expert | TBD |
| Domain Expert | TBD |