---
layout: markdown_page
title: "GitLab CI/CD data storage improvements"
---
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of the most data- and compute-intensive parts of the
GitLab product. Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/)
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
has become [one of the most loved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
GitLab CI/CD has come a long way since being released to users, but the design
of the data storage for pipeline builds has remained almost the same since
2012. We store all builds in PostgreSQL, in the `ci_builds` table, and because
we create more than 0.5 million builds each day on gitlab.com, we are slowly
reaching the limits of the database.
> TODO ci_builds size graph / data growth graphs
## [Why] Problems
We described the most important problems in [the
issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These include:
### Database size
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
```sql
SELECT pg_size_pretty( pg_total_relation_size('ci_builds') );
pg_size_pretty
----------------
1456 GB
(1 row)
```
> TODO elaborate
> TODO describe current database scaling initiative
> TODO compare ci_builds size with other tables
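That comparison could be produced with a query along these lines; this is only
a sketch built on PostgreSQL's statistics views:
```sql
-- Sketch: list the largest tables, so ci_builds can be compared with them.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
```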
### Data migrations
We can no longer migrate data within this table, or from this table to a
different one. It is not possible with regular migrations and almost
impossible with background migrations.
This means that we need to maintain multiple data formats and schemas, which
keeps adding technical debt. We also can't easily reduce the size of this
table, because moving data between tables is difficult, even when using
background migrations.
> TODO elaborate
### Adding new indices
We can no longer add new indices because we already have too many, which makes
writes expensive. [Adding a new index on gitlab.com might take
hours](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32584#note_348817243).
> TODO elaborate
> TODO indexes size graph
### Large indices
We currently index a lot of the data that we store in the `ci_builds` table,
which makes writes expensive. There is a lot of room for improvement here; we
could modify our indexing strategies to make writes more efficient.
> TODO show indexes sizes here
> TODO pipelines schedules and database performance issue
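A query sketch that could surface those index sizes:
```sql
-- Sketch: per-index on-disk size for ci_builds, largest first.
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'ci_builds'
ORDER BY pg_relation_size(indexrelid) DESC;
```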
### Statement timeouts
We have so much data in the `ci_builds` table that we can't even easily count
its rows with a `COUNT` statement, even when using an index.
> TODO elaborate
> TODO more examples
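As one illustration of the problem (the timeout value is an assumption modeled
on gitlab.com's production settings):
```sql
-- Illustration only: counting the table has to visit a huge number of index
-- entries, so it can exceed the statement timeout configured in production.
SET statement_timeout = '15s';
SELECT COUNT(*) FROM ci_builds;
-- ERROR:  canceling statement due to statement timeout
```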
### Using STI
We are using [Single Table Inheritance in
Rails](https://api.rubyonrails.org/classes/ActiveRecord/Inheritance.html) in
this table. This mechanism consumes more space than necessary and is not
efficient enough. It might be better to use integer enums, but this would
require a data migration that is almost impossible right now.
> TODO mention current size of `ci_builds.type`, and compare it to how it would
> look like if we had been using tiny int enum.
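As a rough per-value illustration of the potential savings (exact on-disk
sizes vary with headers and alignment):
```sql
-- Rough comparison of one STI class name against a tiny integer enum.
SELECT pg_column_size('Ci::Build'::text) AS sti_type_bytes, -- string class name
       pg_column_size(1::smallint)       AS enum_bytes;     -- 2 bytes
```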
## [What] Proposal
We store a lot of data in the `ci_builds` table. All of it relates to CI/CD
pipelines, but some parts serve different purposes, and some elements have a
different affinity to a pipeline than others. In particular, we store both
pipeline visualization data and pipeline processing data there.
Visualization data and processing data can also have different retention
policies. Separating these two types of data can help us vastly reduce the
amount of stored data and split it amongst multiple tables.
This proposal is mostly about separating pipeline processing data from generic
pipeline data that we use for other purposes, most notably for showing
information about pipelines to users.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
### Pipeline visualization data
Pipeline visualization data is everything that we want to show to a user who
visits a pipelines, pipeline, or build page.
### Pipeline processing data
Pipeline processing data is everything that we need to store in our database
to process a pipeline from start to finish. This includes:
* exposing builds to runner
* recalculating statuses
* determining an order of execution
* determining pipeline transitions
* retrying a pipeline and builds
We might not need to do any of these things for old pipelines - those created
months or years ago. In most cases it would be better to create a new pipeline
than to reprocess existing builds (usually by retrying them).
## [When] Iterations
### Devise a metric for `ci_builds` situation
We should have a metric for the size of `ci_builds` and its indices, or
another metric that can help us measure the current situation and the impact
of future improvements.
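One candidate, sketched below, is tracking the table size and the index size
as separate measurements over time:
```sql
-- Sketch: heap size and index size measured separately.
SELECT pg_size_pretty(pg_table_size('ci_builds'))   AS table_size,
       pg_size_pretty(pg_indexes_size('ci_builds')) AS indexes_size;
```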
### Validate build "degeneration" mechanisms
We currently have a number of mechanisms that allow us to "degenerate" a build
and to "archive" builds that seem to be old. These mechanisms have been
implemented by a few different teams, have never been tested, and are disabled
on gitlab.com. We should revisit them and figure out whether they are aligned
with the initiative described in this blueprint.
### Soft-archive legacy builds
Once we have all the information about how build degeneration and archival
work, we can make a well-informed decision about archiving old builds.
In the first iteration we should archive builds in a "soft" way - without
actually removing the data from the database. This will allow us to optimize
this iteration for user feedback: in the user interface archived builds will
become unretriable and unprocessable, and if we make a wrong decision about
how old builds should be before we archive them, we will still be able to
change our minds about it.
### Archive and remove legacy processing data
In order to migrate data between `ci_builds` and other tables, we first need
to remove old data, because there is currently too much of it to migrate
anywhere without major problems.
This would be a destructive action, so perhaps we need to devise a way to
store the data that we are going to remove from PostgreSQL in a different type
of storage, for example object storage. This would make it a two-way-door
decision: however difficult recreating the database state might be, being able
to revert a wrong decision might be important here.
Removing data from the database will require a sign-off from executives and
product people.
### Migrate `options` column from `ci_builds` to `ci_builds_metadata`
`ci_builds_metadata` is a table that we created to separate processing data
from other kinds of pipeline data. Because of entropy and time, we now store
different things there too.
We might want to prepare a background migration that moves the processing data
remaining in the database after archiving old builds, especially
`ci_builds.options`, to `ci_builds_metadata.config_options`.
We already support writing the data that usually goes to `ci_builds.options`
into `ci_builds_metadata.config_options` instead, but this feature has never
been enabled on gitlab.com and is currently disabled behind the
`ci_build_metadata_config` feature flag.
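The shape of such a move could resemble the sketch below. It is illustrative
only: the real migration would run in batches from application code, and would
need to convert between serialization formats rather than copy values
directly.
```sql
-- Illustrative batching skeleton only; the real background migration would
-- convert ci_builds.options (serialized) into config_options properly.
UPDATE ci_builds_metadata AS m
SET config_options = to_jsonb(b.options) -- placeholder for the real conversion
FROM ci_builds AS b
WHERE m.build_id = b.id
  AND b.options IS NOT NULL
  AND b.id BETWEEN 1 AND 10000;          -- small id ranges keep batches cheap
```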
### Ensure `ci_builds_metadata` contains only a complete set of processing data
This table currently contains a bunch of columns, and we should check whether
the data in all of them is safe to remove once a build gets archived.
### Move other processing columns
There are other columns that could be moved to either `ci_builds_metadata` or
`ci_pipelines_metadata`. We should move them too.
### Rebuild archival mechanisms to remove rows from `ci_builds_metadata`
Instead of rewriting rows in `ci_builds`, it might be better to remove rows
from `ci_builds_metadata`. We can also devise partitioning mechanisms for this
table in the future.
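A minimal sketch of that removal, assuming archived builds can be identified
by age (the cutoff is an arbitrary example):
```sql
-- Sketch: drop processing rows for archived builds instead of rewriting
-- ci_builds itself.
DELETE FROM ci_builds_metadata
WHERE build_id IN (
  SELECT id
  FROM ci_builds
  WHERE created_at < NOW() - INTERVAL '1 year'
  LIMIT 1000 -- removed in small batches
);
```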
### Remove legacy columns from `ci_builds`
There are a bunch of deprecated columns in that table; we should remove them
too. The `stage` column is known to take a lot of space, and it might no
longer be needed after we enable the `ci_atomic_processing` feature flag.
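The end state of this step for the `stage` column could be as simple as the
statement below, once the application code no longer reads or writes the
column:
```sql
-- Sketch: the eventual post-deployment step for a deprecated column.
ALTER TABLE ci_builds DROP COLUMN stage;
```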
### Resolve the problem of STI
We can replace STI mechanisms with integer enums.
> TODO, elaborate
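One possible shape of this change, sketched with a hypothetical column name
and mapping:
```sql
-- Hypothetical sketch: add an integer column, backfill it from the STI
-- class names, then switch the application over from `type`.
ALTER TABLE ci_builds ADD COLUMN type_enum smallint;

UPDATE ci_builds
SET type_enum = CASE type
                  WHEN 'Ci::Build'  THEN 1
                  WHEN 'Ci::Bridge' THEN 2
                  ELSE 0
                END
WHERE id BETWEEN 1 AND 10000; -- backfilled in batches
```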
### Move and remove indices
Once we move processing data, we might also be able to move indexes. We should
never remove an index until a new one is set up.
> TODO elaborate
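A sketch of that ordering, with hypothetical index and column names:
```sql
-- Hypothetical names: build the replacement index first, and drop the old
-- one only after the new one is in place and used by queries.
CREATE INDEX CONCURRENTLY index_ci_builds_metadata_on_example
  ON ci_builds_metadata (example_column);

DROP INDEX CONCURRENTLY index_ci_builds_on_example;
```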
We should extend this blueprint with more ideas about how to reduce the size of
indexes.
## Who
Proposal:
| Role                          | Who                     |
|-------------------------------|-------------------------|
| Author                        | Grzegorz Bizon          |
| Architecture Evolution Coach  | Gerardo Lopez-Fernandez |
| Engineering Leader            | TBD                     |
| Domain Expert                 | Grzegorz Bizon          |
DRIs:
| Role                          | Who                     |
|-------------------------------|-------------------------|
| Product                       | TBD                     |
| Leadership                    | TBD                     |
| Engineering                   | TBD                     |
| Domain Expert                 | Grzegorz Bizon          |