
WIP: GitLab CI/CD pipelines storage improvement

Closed Grzegorz Bizon requested to merge blueprint/gb/pipelines-storage-improvements into master
11 unresolved threads
@@ -5,11 +5,14 @@ title: "GitLab CI/CD data storage improvements"
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of GitLab's most data and compute intensive components. Since its [initial release in
November 2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and has become [one of the most
beloved CI/CD solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
GitLab CI/CD is one of GitLab's most data and compute intensive components.
Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
has become [one of the most beloved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
@@ -105,9 +108,40 @@ migration that is almost impossible right now.
## [What] Proposal
Top-level goals:
1. Remove columns with redundant data from `ci_builds` table
1. Separate pipeline processing data from visualization data
1. Define data retention policy for pipeline processing data
1. Devise strategy for `ci_builds` partitioning
### Remove columns with redundant data
`ci_builds` table has a long history. This table also accumulated some
technical debt throughout the years.
Two important examples are the extraction of data describing artifacts and
stages from this table.
A few years ago we extracted the `ci_stages` table from `ci_builds`, but we
never managed to stop using the stage information stored in `ci_builds`,
notably in the `ci_builds.stage` and `ci_builds.stage_idx` columns. Data stored
there is completely redundant because we also store it in the `ci_stages.name`
and `ci_stages.position` columns.
Similarly, we have a `ci_job_artifacts` table and a number of
`ci_builds.artifacts_*` columns that are either unused or hold redundant data.
There are presumably more columns like that. We should find them and devise a
strategy for removing them, together with the data they store, once we confirm
that the data is entirely redundant and can be removed safely.
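
The redundancy check described above could be sketched as follows. This is a minimal illustration, not the actual migration tooling: it uses an in-memory SQLite database instead of PostgreSQL, the join column `ci_builds.stage_id` is an assumption, and the helper names `find_stage_mismatches` and `build_demo_db` are hypothetical.

```python
import sqlite3


def find_stage_mismatches(conn):
    """Return ids of builds whose legacy stage columns disagree with ci_stages.

    A build counts as a mismatch when it has no matching stage row, or when
    its `stage` / `stage_idx` differ from `ci_stages.name` / `position`.
    SQLite's `IS NOT` is a NULL-safe comparison; PostgreSQL would use
    `IS DISTINCT FROM` instead.
    """
    rows = conn.execute(
        """
        SELECT b.id
        FROM ci_builds AS b
        LEFT JOIN ci_stages AS s ON s.id = b.stage_id  -- stage_id is assumed
        WHERE s.id IS NULL
           OR b.stage IS NOT s.name
           OR b.stage_idx IS NOT s.position
        ORDER BY b.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Create a tiny, made-up data set with two consistent and two bad rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ci_stages (id INTEGER PRIMARY KEY, name TEXT, position INTEGER);
        CREATE TABLE ci_builds (
            id INTEGER PRIMARY KEY, stage_id INTEGER, stage TEXT, stage_idx INTEGER
        );
        INSERT INTO ci_stages VALUES (1, 'build', 0), (2, 'test', 1);
        INSERT INTO ci_builds VALUES
            (10, 1, 'build', 0),    -- consistent
            (11, 2, 'test', 1),     -- consistent
            (12, 2, 'deploy', 2),   -- legacy columns disagree with ci_stages
            (13, NULL, 'test', 1);  -- no stage row to fall back on
        """
    )
    return conn
```

Only when such a check returns no rows for the real data would the legacy columns be provably redundant; any build it reports would need backfilling before the columns are dropped.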
### Separate pipeline processing data
We store a lot of data in `ci_builds` table, everything is related to CI/CD
pipelines but some parts of the data are used for a different purpose and some
elements have different affinity to a pipeline than other. In particular - we
elements have different affinity to a pipeline than others. In particular - we
store pipeline visualization data there and pipeline processing data.
Pipeline visualization data and processing data can have different retention
@@ -121,12 +155,12 @@ information about pipelines to users.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
### Pipeline visualization data
#### Pipeline visualization data
Pipeline visualization data is about all the things that we want to show to a
user when someone visits pipelines / pipeline / build page.
### Pipeline processing data
#### Pipeline processing data
Pipeline processing data is about all the things that we need to store in our
database in order to process pipeline from start to end. This includes:
@@ -141,15 +175,48 @@ We might not need to do these things for pipelines that are old - created
months or years ago. In most cases it would be better to create a new pipeline
than to reprocess existing builds (usually by retrying them).
### Devise strategy for `ci_builds` partitioning
We can reduce the size of `ci_builds` significantly, but we do not plan to
remove information about builds that we use to display them to users. It means
that we are still going to keep all of them in the database, and given current
rate of growth of this table, we might still need to explore partitioning.
It might be possible to partition this table by a build creation date, but this
requires technical evaluation to find the best way to partition the table.
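
To make the idea of partitioning by creation date more concrete, here is a small sketch of how builds could be bucketed into monthly partitions. The monthly granularity, the `ci_builds_YYYY_MM` naming scheme, and both function names are assumptions made for illustration; the actual partitioning would be done by the database and still requires the technical evaluation mentioned above.

```python
from collections import Counter
from datetime import datetime


def partition_for(created_at: datetime) -> str:
    """Map a build creation timestamp to a hypothetical monthly partition name."""
    return f"ci_builds_{created_at:%Y_%m}"


def rows_per_partition(created_at_values):
    """Count how many builds would land in each monthly partition."""
    return Counter(partition_for(ts) for ts in created_at_values)
```

Running `rows_per_partition` over a sample of `created_at` values gives a quick view of how evenly a date-based scheme would spread rows, which is one input into choosing the partition key.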
## [When] Iterations
### Devise a metric for `ci_builds` situation
We can iterate on this in two parallel tracks.
### Metric
What matters most is reducing the size of the table on the primary database. We
should have a metric that will clearly show our progress. This can be the total
size of `ci_builds` table on a primary database, including indexes.
We should have a metric for the `ci_builds` size / size of indices or another
metric that can help to measure current situation and the impact of future
improvements.
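
One possible way to collect such a metric is sketched below. It relies on PostgreSQL's `pg_total_relation_size`, which reports the on-disk size of a table including its indexes and TOAST data. The function names are hypothetical, and `cursor` is assumed to be any DB-API cursor connected to the primary database.

```python
# Query text for the metric; pg_total_relation_size includes indexes and TOAST.
SIZE_QUERY = "SELECT pg_total_relation_size('ci_builds')"


def ci_builds_total_size_bytes(cursor) -> int:
    """Run the size query on a DB-API cursor and return the size in bytes."""
    cursor.execute(SIZE_QUERY)
    (size,) = cursor.fetchone()
    return int(size)


def to_gib(size_bytes: int) -> float:
    """Convert bytes to GiB, rounded to two decimals for dashboards."""
    return round(size_bytes / 1024**3, 2)
```

Recording this value on a schedule would show whether each iteration actually shrinks the table on the primary.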
### Track A
### Validate build "degeneration" mechanisms
#### 1. Unblock engineers by documenting how to store data
Currently engineers can no longer add new columns to the `ci_builds` table,
because it is too large (more than 50 columns). We should write documentation
on how to work around this limitation so that engineers are no longer blocked.
#### 2. Remove legacy columns with redundant data
Devise a way to remove columns with redundant data - columns related to stages
and artifacts. Find other columns that can be removed and verify that their
data is indeed redundant. If it is not, do not remove it without a proper
backup.
#### 3. Resolve the problem of STI
Replace STI mechanisms with integer enums. Estimate the benefit and storage
space saved after normalizing data this way, before doing it.
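
A rough back-of-the-envelope estimate of the savings could look like the sketch below. It assumes the STI column stores a Rails class name as a short string (1-byte header plus the text in PostgreSQL) and that the replacement is a 2-byte `smallint`. It ignores alignment padding and index size, and the class names and row counts used with it are illustrative only.

```python
SMALLINT_BYTES = 2  # size of a PostgreSQL smallint


def varchar_bytes(value: str) -> int:
    """Approximate on-disk size of a short string: 1-byte header plus payload."""
    return 1 + len(value.encode("utf-8"))


def estimated_savings_bytes(row_counts: dict) -> int:
    """Estimate bytes saved by storing an integer enum instead of a class name.

    `row_counts` maps an STI class name to the number of rows using it.
    Alignment padding and index savings are deliberately ignored.
    """
    return sum(
        (varchar_bytes(type_name) - SMALLINT_BYTES) * count
        for type_name, count in row_counts.items()
    )
```

For example, with made-up counts of 1000 rows of `Ci::Build` and 10 rows of `Ci::Bridge`, the estimate is 8 bytes saved per `Ci::Build` row and 9 per `Ci::Bridge` row. Real numbers should come from production row counts before committing to the change.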
### Track B
#### Validate build "degeneration" mechanisms
We currently have a number of mechanisms implemented that allow us to
"degenerate" a build and to "archive" those builds that seem to be old and have
@@ -157,7 +224,7 @@ never been tested. These mechanisms have been implemented by a few different
teams and are disabled on gitlab.com. We should revisit them, and figure out if
these are aligned with the initiative described in this blueprint.
### Soft-archive legacy builds
#### Soft-archive legacy builds
Once we have all the information about how build degeneration and archival
work, we can make a well-informed decision about archiving builds that are
@@ -170,7 +237,7 @@ builds are going to be unretriable and unprocessable and in case of making a
wrong decision about how old the builds should be to archive them, we would
still be able to change our minds about it.
### Archive and remove legacy processing data
#### Archive and remove legacy processing data
In order to migrate data between `ci_builds` and other tables we need to remove
old data, because there is currently too much data there to migrate it
@@ -185,7 +252,7 @@ to revert a wrong decision might be important here.
Removing data from the database will require a sign-off from executives and
product people.
### Migrate `options` column from `ci_builds` to `ci_builds_metadata`
#### Migrate `options` column from `ci_builds` to `ci_builds_metadata`
`ci_builds_metadata` is a table that we created to separate processing data from
other kinds of pipeline data. Because of entropy and time we do store
@@ -200,34 +267,17 @@ We already do have support for writing data, that we usually write to
has never been enabled on gitlab.com and is currently disabled under
`ci_build_metadata_config` feature flag.
### Ensure `ci_builds_metadata` contains only a complete set of processing data
#### Ensure `ci_builds_metadata` contains only a complete set of processing data
This table currently contains a bunch of columns, but we should check whether
data in all of them is safe to remove once a build gets archived.
### Move other processing columns
There are other columns that could be moved to either `ci_builds_metadata` or
`ci_pipelines_metadata`. We should move them too.
### Rebuild archival mechanisms to remove rows from `ci_builds_metadata`
Instead of rewriting rows in `ci_builds` it might be better to remove rows from
`ci_builds_metadata`. We can also devise partitioning mechanisms for this table
in the future.
### Remove legacy columns from `ci_builds`
There are a bunch of deprecated columns in that table; we should remove them too.
`stage` column is known to take a lot of space, and it might not be needed
anymore after we enable `ci_atomic_processing` FF.
### Resolve the problem of STI
We can replace STI mechanisms with integer enums.
> TODO, elaborate
### Move and remove indices
Once we move processing data, we might also be able to move indexes. We should
@@ -246,8 +296,9 @@ Proposal:
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | TBD |
| Domain Expert | Grzegorz Bizon |
| Engineering Leader | Christopher Lefelhocz |
| Domain Expert | Fabio Pitino |
| Domain Expert | Kamil Trzciński |
DRIs:
@@ -256,4 +307,5 @@ DRIs:
| Product | TBD |
| Leadership | TBD |
| Engineering | TBD |
| Domain Expert | Grzegorz Bizon |
| Domain Expert | TBD |
| Domain Expert | TBD |