
WIP: GitLab CI/CD pipelines storage improvement

Closed Grzegorz Bizon requested to merge blueprint/gb/pipelines-storage-improvements into master
---
layout: markdown_page
title: "GitLab CI/CD data storage improvements"
---
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of GitLab's most data- and compute-intensive components.
Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/),
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and
has become [one of the most beloved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
GitLab CI/CD has come a long way since the initial release, but the design of
the data storage for pipeline builds has remained almost unchanged since 2012.
We store all the builds in PostgreSQL, in the `ci_builds` table, and because we
are creating more than 0.5 million builds each day on gitlab.com, we are slowly
reaching the limits of the database.
> TODO ci_builds size graph / data growth graphs
## [Why] Problems
We described the most important problems in [the
issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These include:
### Database size
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
```sql
SELECT pg_size_pretty( pg_total_relation_size('ci_builds') );
pg_size_pretty
----------------
1456 GB
(1 row)
```
    • **Grzegorz Bizon (author):** TODO: add the date when the table size was checked in production; the table grows quickly, so a few weeks from now this figure will be different.
Currently the `ci_builds` table represents around 20% of the total size of our
PostgreSQL database on gitlab.com.
> TODO elaborate
> TODO describe current database scaling initiative
> TODO compare ci_builds size with other tables
### Data migrations
We can no longer easily migrate data within this table, or from this table to a
different one. It is not possible with regular migrations and almost impossible
with background migrations.
Around GitLab 9 we started working on migrating information about build stages
from `ci_builds` to the `ci_stages` table. We hit
[multiple](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/33866)
[problems](https://gitlab.com/gitlab-org/gitlab-foss/-/issues/47454) along the
way.
This means that we need to maintain multiple data formats and schemas, which
adds more technical debt. We also can't easily reduce the size of this table,
because moving data between tables is difficult, even when using background
migrations.
> TODO elaborate
### Adding new indices
We can no longer add new indices because we already have too many, which makes
writes expensive. [Adding a new index on gitlab.com might take
hours](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/32584#note_348817243).
> TODO elaborate
> TODO indexes size graph
### Large indices
We currently index a lot of the data that we store in the `ci_builds` table,
which makes writes expensive. There is a lot of room for improvement here: we
could modify our indexing strategies to make writes more efficient.
> TODO show indexes sizes here TODO pipelines schedules and database
> performance issue
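As a starting point for the TODO above, something along these lines could list the per-index sizes for `ci_builds` (a sketch, to be run against a replica):
```sql
-- Sketch: list all indexes on ci_builds with their on-disk sizes,
-- largest first (run against a replica to avoid load on the primary).
SELECT
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'ci_builds'
ORDER BY pg_relation_size(indexrelid) DESC;
```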
### Statement timeouts
We have so much data in the `ci_builds` table that we can't even easily count
rows in that table using a `COUNT` statement, even when using an index.
> TODO elaborate
> TODO more examples
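For rough monitoring purposes an estimate based on planner statistics is usually good enough and does not hit statement timeouts; a minimal sketch:
```sql
-- Sketch: approximate row count from planner statistics instead of a
-- full COUNT(*), which times out on a table of this size.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'ci_builds';
```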
### Using STI
We are using [Single Table Inheritance in
Rails](https://api.rubyonrails.org/classes/ActiveRecord/Inheritance.html) for
builds. This mechanism stores a string class name in every row, which consumes
an unnecessary amount of space and is not efficient. It might be better to use
integer enums, but this would require a data migration that is almost
impossible right now.
> TODO mention current size of `ci_builds.type`, and compare it to how it would
> look like if we had been using tiny int enum.
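One way to approach the TODO above is to reuse the `pg_column_size` technique shown in the discussion further down this page; the 2-byte figure below assumes a `smallint` enum:
```sql
-- Sketch: compare the storage used by the STI string column
-- ci_builds.type with what a 2-byte smallint enum would need.
SELECT
  pg_size_pretty(sum(pg_column_size(type))) AS sti_type_size,
  pg_size_pretty(count(*) * 2) AS smallint_enum_size
FROM ci_builds;
```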
## [What] Proposal
Top-level goals:
1. Remove columns with redundant data from the `ci_builds` table
1. Separate pipeline processing data from visualization data
1. Define a data retention policy for pipeline processing data
1. Devise a strategy for `ci_builds` partitioning
### Remove columns with redundant data
The `ci_builds` table has a long history and has accumulated some technical
debt throughout the years.
Two important examples to mention here are the extraction of data describing
artifacts and stages from this table.
A few years ago we extracted the `ci_stages` table from `ci_builds`, but we
never managed to stop using the information about stages stored in the
`ci_builds` table, notably in the `ci_builds.stage` and `ci_builds.stage_idx`
columns. The data stored there is completely redundant, because we also store
it in the `ci_stages.name` and `ci_stages.position` columns.
Similarly, we have a `ci_job_artifacts` table and a bunch of
`ci_builds.artifacts_*` columns that are either unused or hold redundant data.
We presumably have more columns like that; we should find them and devise a
strategy for removing them and all the data stored in these columns (after we
confirm that the data is entirely redundant and can be removed safely).
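Before dropping such a column we could verify the redundancy claim directly. A minimal sketch for the stage data, assuming `ci_builds.stage_id` references `ci_stages`:
```sql
-- Sketch: count builds whose denormalized stage name diverges from the
-- canonical value in ci_stages; zero mismatches means the column can go.
SELECT count(*) AS mismatched_rows
FROM ci_builds
JOIN ci_stages ON ci_stages.id = ci_builds.stage_id
WHERE ci_builds.stage IS DISTINCT FROM ci_stages.name;
```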
### Separate pipeline processing data
We store a lot of data in the `ci_builds` table. All of it is related to CI/CD
pipelines, but different parts of the data are used for different purposes and
some elements have a different affinity to a pipeline than others. In
particular, we store both pipeline visualization data and pipeline processing
data there.
Pipeline visualization data and processing data can also have different
retention policies. Separating these types of data can help us vastly reduce
the amount of stored data and split the data amongst multiple tables.
This proposal is mostly about separating pipeline processing data from generic
pipeline data that we use for other purposes, most notably for showing
information about pipelines to users.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
    • **Grzegorz Bizon (author):** @Finotto would you be able to check the total size of `ci_builds.options` and what percentage of the total size of `ci_builds` it represents?

      • **Jose Finotto:** @grzesiek we have the following tables with 'build' in their name in the database, and I do not see any clear options table there:

        SELECT
           relname as "Table",
           pg_size_pretty(pg_total_relation_size(relid)) As "Size"
           FROM pg_catalog.pg_statio_user_tables where relname like '%build%' ORDER BY pg_total_relation_size(relid) DESC;
                        Table                |  Size
        -------------------------------------+---------
         ci_builds                           | 1539 GB
         ci_build_trace_sections             | 459 GB
         ci_builds_metadata                  | 71 GB
         ci_build_needs                      | 9362 MB
         ci_build_trace_section_names        | 1795 MB
         ci_daily_build_group_report_results | 199 MB
         ci_build_trace_chunks               | 35 MB
         packages_build_infos                | 20 MB
         ci_builds_runner_session            | 824 kB
         ci_build_report_results             | 24 kB

        `ci_builds` represents at the moment 1539 GB of a total database size of 7369 GB, which is approximately 21% of the database.

      • **Grzegorz Bizon (author):** @Finotto I mean the `options` column in the `ci_builds` table - what percentage of the total data in `ci_builds` do we store in the `options` column?

      • **Jose Finotto:** @grzesiek I added the info:

        gitlabhq_production=# select
        gitlabhq_production-#     pg_size_pretty(sum(pg_column_size(options))) as total_size,
        gitlabhq_production-#     pg_size_pretty(avg(pg_column_size(options))) as average_size,
        gitlabhq_production-#     sum(pg_column_size(options)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
        gitlabhq_production-# from ci_builds;
         total_size |        average_size        |     percentage
        ------------+----------------------------+---------------------
         354 GB     | 667.0868096973743355 bytes | 22.8841407398832713

        So this column represents 22% of the whole table. I would suggest normalizing it and creating a separate table.

        cc: @glopezfernandez @albertoramos @NikolayS

      • **Grzegorz Bizon (author):** I expect that we might be able to remove 90% of this data and reduce our database size by around 300 gigabytes. It still depends on how many pipelines in the database are older than 3 and 6 months, but I think it is presumably more than 90%.

        @Finotto can you also check the size of `ci_builds.stage`? I expect that this column will be completely redundant after we wrap up our effort on atomic processing of pipelines. I think this is a few gigabytes too, and if everything goes right, this data will not be needed at all either.

      • **Grzegorz Bizon (author):** @Finotto can you please take a look at the size of `ci_builds.stage`? This is a column that we might be able to remove completely, even without a backup (because we already have a backup in `ci_stages.name`), so this might be the low-hanging fruit we are looking for here too.

      • **Jose Finotto:** @grzesiek sorry for the late answer, sure, I will give you the info in a few minutes.

      • **Jose Finotto:** @grzesiek the info is:

        # select
         pg_size_pretty(sum(pg_column_size(stage))) as total_size,
         pg_size_pretty(avg(pg_column_size(stage))) as average_size,
         sum(pg_column_size(stage)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
        from ci_builds;
         total_size |       average_size       |       percentage
        ------------+--------------------------+------------------------
         4590 MB    | 7.5525122452349447 bytes | 0.25371424219677802001
        (1 row)

      • @grzesiek please do not forget that this column is part of an index:

        "index_ci_builds_on_project_id_for_successfull_pages_deploy" btree (project_id) WHERE type::text = 'GenericCommitStatus'::text AND stage::text = 'deploy'::text AND name::text = 'pages:deploy'::text AND status::text = 'success'::text

      • **Grzegorz Bizon (author):** Thanks!
#### Pipeline visualization data
Pipeline visualization data is about all the things that we want to show to a
user when someone visits the pipelines, pipeline, or build page.
#### Pipeline processing data
Pipeline processing data is about all the things that we need to store in our
database in order to process a pipeline from start to finish. This includes:
* exposing builds to runner
* recalculating statuses
* determining an order of execution
* determining pipeline transitions
* retrying a pipeline and builds
We might not need to do these things for pipelines that are old - created
months or years ago. In most cases it would be better to create a new pipeline
than to reprocess existing builds (usually by retrying them).
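Deciding on a threshold requires knowing how much of the table is that old. A sketch of such an estimate (it would likely have to run on a replica or an archive copy, given the `COUNT` limitations described earlier):
```sql
-- Sketch: estimate what share of builds is older than 6 months and would
-- therefore be a candidate for having its processing data archived.
SELECT
  count(*) FILTER (WHERE created_at < now() - interval '6 months') * 100.0
    / count(*) AS percent_older_than_6_months
FROM ci_builds;
```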
    • This made me think of retried builds. I'm wondering whether retried builds could be automatically archived, including associated data (processing data, artifacts, traces, etc.). Do we use retried builds for anything today? Would it be acceptable to even delete the build's intrinsic data if we don't use it anywhere? This would mean that as we retry a build we replace it with a new one rather than adding a new "instance" of the same build. Perhaps configurable via an instance-level setting.
### Devise strategy for `ci_builds` partitioning
We can reduce the size of `ci_builds` significantly, but we do not plan to
remove the information about builds that we use to display them to users. This
means that we are still going to keep all of them in the database, and given
the current rate of growth of this table, we might still need to explore
partitioning.
It might be possible to partition this table by build creation date, but this
requires a technical evaluation to find the best way to partition the table.
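One option to evaluate is PostgreSQL declarative range partitioning by creation date. A minimal sketch of what the target schema could look like (table and partition names are illustrative only, and the migration path for an existing multi-terabyte table is a separate problem):
```sql
-- Sketch: a range-partitioned variant of ci_builds keyed by creation date.
CREATE TABLE ci_builds_partitioned (
  LIKE ci_builds INCLUDING DEFAULTS
) PARTITION BY RANGE (created_at);
-- One partition per year (or month) of build creation.
CREATE TABLE ci_builds_2020 PARTITION OF ci_builds_partitioned
  FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');
```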
## [When] Iterations
We can iterate on this in two parallel tracks.
### Metric
What matters most is reducing the size of the table on the primary database. We
should have a metric that clearly shows our progress. This can be the total
size of the `ci_builds` table on the primary database, including indexes.
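A simple way to track this metric, split into table data and index overhead (a sketch building on the size query shown earlier):
```sql
-- Sketch: the proposed metric, broken down into heap and index size.
SELECT
  pg_size_pretty(pg_table_size('ci_builds'))          AS table_data,
  pg_size_pretty(pg_indexes_size('ci_builds'))        AS indexes,
  pg_size_pretty(pg_total_relation_size('ci_builds')) AS total;
```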
### Track A
#### 1. Unblock engineers by documenting how to store data
Currently engineers can no longer add new columns to the `ci_builds` table,
because it is too large (more than 50 columns). We should write documentation
on how to work around this limitation so that engineers are no longer blocked.
    • We could probably create a section in https://docs.gitlab.com/ee/development/cicd/index.html. Things we could consider are:

      • How to identify whether data to be stored is processing data, intrinsic data, or (optionally) metadata.
      • Introduce a `ci_builds_processing` table if the new column is related to processing data. Add the column to `ci_builds_metadata` if the data is better identified as metadata.
      • Decide what to do if the new data to be stored is intrinsic data, and hence can't currently be added to `ci_builds` until we drop some data. Are we allowed to make exceptions while working on dropping data? Should we create a separate table for additional data to later be migrated back into `ci_builds`? Should we temporarily use `ci_builds_metadata` for it?
#### 2. Remove legacy columns with redundant data
Devise a way to remove columns with redundant data - the columns related to
stages and artifacts. Find other columns that can be removed, and ensure that
the data is indeed redundant; otherwise do not remove it without a proper
backup.
    • I think similar data that seems redundant is also `ref`, `protected` and `tag`, because they are defined on the pipeline too.

      Similarly to removing data, we could also move data to different tables. This is different from separating processing data from presentation data. For example:

      • `coverage` and `coverage_regex` could be normalized into a different table `ci_build_coverage` managed by group::testing. This data is not present for every build.
      • `target_url` and `description` seem to be used by external CI providers. We could move those to a different table, as this data is not present for every build.

      I would still expect some overall reduction of data size because we would insert a coverage data record only if the coverage feature is used. Same for the external CI data.

      Maybe this could be a point `#### 3.` of Track A, with the STI replacement moved to a section "Other ideas to explore"?
#### 3. Resolve the problem of STI
Replace the STI mechanism with integer enums. Before doing it, estimate the
benefit and the storage space saved by normalizing data this way.
### Track B
#### 1. Separate pipeline processing and presentation data
In order to separate pipeline processing and presentation data we need to move
a bunch of columns from the `ci_builds` table to `ci_builds_metadata`.
The first step could be enabling a feature flag that writes processing data to
the new location for new builds, without migrating data for old builds. This
feature flag is already implemented under `ci_build_metadata_config`, but it is
important to revisit the code to ensure that processing data and presentation
data are separated correctly. This feature flag has never been enabled in the
production environment.
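Once the flag is enabled we would want to confirm that new builds actually write their processing data to the new location. A sketch, assuming the flag stores build configuration in a `ci_builds_metadata.config_options` column (the column name is an assumption):
```sql
-- Sketch: count metadata rows that carry configuration in the new location
-- (config_options is assumed to be the target column of the feature flag).
SELECT count(*) AS builds_with_migrated_config
FROM ci_builds_metadata
WHERE config_options IS NOT NULL;
```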
#### 2. Soft-archive legacy build processing data
Before we remove data we should archive builds in a "soft" way - without
actually removing processing data from the database.
This will allow us to optimize this iteration for user feedback: in the user
interface the archived builds are going to be unretriable and unprocessable,
so if we make a wrong decision about how old builds should be before we archive
them, we will still be able to change our minds. We will also restrict access
to processing data, which will allow us to surface mistakes in the data
separation.
#### 3. Archive legacy processing data
Enable continuous processing data archival mechanisms according to the
retention policy defined for gitlab.com, and move legacy processing data from
the primary database to a different place.
This might be a destructive action, so perhaps we need to devise a way to back
up the data that we are going to remove from PostgreSQL. We can store it in a
different type of storage, for example object storage. This will make it a
two-way-door decision; however difficult recreating the database state might
be, being able to revert a wrong decision might be important here.
Moving data out of the database will require a sign-off from executives and
product people.
#### 4. Migrate processing data
Build a migration mechanism to migrate processing data out of `ci_builds` for
on-premises installations.
## Who
Proposal:
| Role | Who |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | Christopher Lefelhocz |
| Domain Expert | Fabio Pitino |
| Domain Expert | Kamil Trzciński |
DRIs:
| Role | Who |
|------------------------------|------------------------|
| Product | TBD |
| Leadership | TBD |
| Engineering | TBD |
| Domain Expert | TBD |
| Domain Expert | TBD |