
WIP: GitLab CI/CD pipelines storage improvement

Closed Grzegorz Bizon requested to merge blueprint/gb/pipelines-storage-improvements into master
11 unresolved threads
---
layout: markdown_page
title: "GitLab CI/CD data storage improvements"
---
## GitLab CI/CD pipelines data storage improvements
GitLab CI/CD is one of the most data- and compute-intensive parts of the
GitLab product. Since its [initial release in November
2012](https://about.gitlab.com/blog/2012/11/13/continuous-integration-server-from-gitlab/)
the CI/CD subsystem has evolved significantly. It was [integrated into GitLab
in September
2015](https://about.gitlab.com/releases/2015/09/22/gitlab-8-0-released/) and it
has become [one of the most loved CI/CD
solutions](https://about.gitlab.com/blog/2017/09/27/gitlab-leader-continuous-integration-forrester-wave/).
> TODO pipelines usage growth graph here
GitLab CI/CD has come a long way since its release, but the design of the data
storage for pipeline builds has remained almost the same since 2012. We store
all builds in PostgreSQL, in the `ci_builds` table, and because we create more
than 0.5 million builds each day on gitlab.com, we are slowly reaching the
limits of the database.
> TODO ci_builds size graph / data growth graphs
## [Why] Problems
We described the most important problems in [the
issue](https://gitlab.com/gitlab-org/gitlab/-/issues/213103). These include:
### Data migrations
We can no longer migrate data within this table, or out to separate tables. It
is not possible with regular migrations and almost impossible with background
migrations.
> TODO elaborate
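To give a sense of scale: from the numbers in the thread below (354 GB of
`options` data at roughly 667 bytes per row), `ci_builds` holds on the order of
half a billion rows. A minimal sketch of why in-place changes are impractical,
using a hypothetical column:

```sql
-- A "regular" migration runs in a single transaction; a whole-table
-- rewrite like this would run for hours while holding locks:
UPDATE ci_builds SET some_column = NULL;  -- hypothetical column; never viable here

-- A background migration splits the work into id-ordered batches, but at
-- ~10,000 rows per batch, half a billion rows means over 50,000 batches:
UPDATE ci_builds SET some_column = NULL WHERE id BETWEEN 1 AND 10000;
```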
### Adding new indices
We can no longer add new indices, because we already have too many and adding a
new index on gitlab.com takes hours.
    • Author Maintainer

      TODO add the date when the table size was checked in prod; the table grows quickly and in a few weeks from now this will be different.

> TODO elaborate
> TODO indexes size graph
> TODO link to almost-failed-deployment issue related to adding an index
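For context, on a table of this size a new index has to be created
concurrently, so that writes to `ci_builds` are not blocked, and the concurrent
build still has to scan the entire table. A sketch with a hypothetical index:

```sql
-- CONCURRENTLY avoids locking ci_builds for writes, but the index build
-- has to scan the whole >1 TB table, which is why it takes hours.
CREATE INDEX CONCURRENTLY index_ci_builds_on_some_column
    ON ci_builds (some_column);  -- hypothetical column and index name
```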
### Database size
`ci_builds` is one of the largest tables we maintain in the PostgreSQL
database. The amount of data we are storing there is significant.
> TODO elaborate
> TODO describe current database scaling initiatives
> TODO compare ci_builds size with other tables
### Statement timeouts
We have so much data in the `ci_builds` table that we cannot easily count the
rows in it, even when using an index.
> TODO elaborate
> TODO more examples
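For example, a plain `SELECT COUNT(*) FROM ci_builds;` exceeds the statement
timeout. A common workaround, shown here as a sketch, is to read the planner's
row estimate instead of counting:

```sql
-- Approximate row count from planner statistics; instant, but only as
-- fresh as the last ANALYZE run on the table.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'ci_builds';
```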
### Using STI
We are using [Single Table Inheritance in
Rails](https://api.rubyonrails.org/classes/ActiveRecord/Inheritance.html)
This mechanism unnecessarily consumes too much space and is not efficient enough.
> TODO elaborate
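STI stores a full Ruby class name, such as `Ci::Build`, as a string in the
`type` column of every row. Its footprint could be measured the same way
`options` and `stage` are measured in the threads below; a sketch:

```sql
-- Per-row and total cost of storing class names as strings.
SELECT pg_size_pretty(sum(pg_column_size(type))) AS total_size,
       pg_size_pretty(avg(pg_column_size(type))) AS average_size
FROM ci_builds;
```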
## [What] Proposal
We store a lot of data in the `ci_builds` table, but the data stored there has
different affinity to a pipeline: in particular, we store both pipeline
visualization data and pipeline processing data in it.
Visualization data and processing data can also have different retention
policies. Separating these two types of data can help us vastly reduce the
amount of stored data and split it across multiple tables.
> TODO calculate average ratio of visualization to processing data, like 40/60%
> and support this with real numbers / graphs.
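To illustrate the direction (the table and column names below are purely
hypothetical, not a final schema): visualization data such as name, status and
timing would stay in `ci_builds`, while processing-only data would live in a
separate table with its own retention policy:

```sql
-- Hypothetical illustration of the split; not the final schema.
CREATE TABLE ci_builds_processing (
    build_id bigint PRIMARY KEY REFERENCES ci_builds (id),
    options jsonb,         -- runner/job configuration, today in ci_builds.options
    yaml_variables jsonb   -- today in ci_builds.yaml_variables
);
```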
### Pipeline visualization data
> TODO elaborate
### Pipeline processing data
> TODO elaborate
## [When] Iterations
### Devise a metric for `ci_builds` situation
We should have a metric for the size of `ci_builds` and its indices, or another
metric that can help us measure the current situation and the impact of future
improvements.
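A starting point could be as simple as sampling table and index sizes over
time; a sketch:

```sql
-- Candidate metric: track table data and index sizes separately, so that
-- index-related improvements are visible on their own.
SELECT pg_size_pretty(pg_table_size('ci_builds'))          AS table_size,
       pg_size_pretty(pg_indexes_size('ci_builds'))        AS indexes_size,
       pg_size_pretty(pg_total_relation_size('ci_builds')) AS total_size;
```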
### Validate build "degeneration" mechanisms
We currently have a number of mechanisms that allow us to "degenerate" a build
and to "archive" builds that are old and no longer relevant. These mechanisms
have been implemented by a few different teams and are disabled on gitlab.com.
We should revisit them and figure out whether they are aligned with the
initiative described in this blueprint.
### Ensure `ci_builds_metadata` contains only processing data
This table currently contains a number of columns, and we should check whether
the data in all of them is safe to remove once a build gets archived.
> TODO elaborate what `ci_builds_metadata` is and why it exists
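A possible starting point for that audit is to enumerate the columns and
classify each one as visualization or processing data; a sketch:

```sql
-- List the columns of ci_builds_metadata to audit for archivability.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'ci_builds_metadata'
ORDER BY ordinal_position;
```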
### Soft-archive legacy builds
Once we have all the data regarding the points above, we can make a
well-informed decision about archiving old builds.
In the first iteration we should archive builds in a "soft" way, without
actually removing the data from the database. This will allow us to optimize
this iteration for user feedback: archived builds will become unretriable and
unprocessable in the user interface, but if we make a wrong decision about how
old a build needs to be to get archived, we will still be able to change our
minds.
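A minimal sketch of what "soft" archiving could look like, assuming a
hypothetical `archived` flag and a 6-month cut-off (neither is decided yet):

```sql
-- Flag old builds instead of deleting them, in small id-ordered batches
-- to stay within the statement timeout. Reversible by clearing the flag.
UPDATE ci_builds
SET archived = true  -- hypothetical column
WHERE id IN (
    SELECT id
    FROM ci_builds
    WHERE created_at < now() - interval '6 months'
      AND NOT archived
    ORDER BY id
    LIMIT 10000
);
```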
### Remove archived data
In order to migrate data between `ci_builds` and `ci_builds_metadata` we first
need to remove old data, because there is currently too much of it to migrate
without major problems.
This will be a destructive action, but perhaps we can devise a way to store the
data that we are going to remove from PostgreSQL in a different type of
storage, to make this a two-way decision. However difficult recreating the
database state might be, being able to revert a wrong decision could be
important here.
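One possible shape of that escape hatch, sketched in psql with the hypothetical
`archived` flag from the previous section:

```sql
-- Export rows to external storage before deleting them, so the decision
-- stays reversible at the cost of a painful restore.
\copy (SELECT * FROM ci_builds WHERE archived) TO 'ci_builds_archive.csv' CSV HEADER
DELETE FROM ci_builds WHERE archived;  -- batched in practice
```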
This will require a sign-off from executives and product people.
### Migrate `ci_builds.options` to `ci_builds_metadata`
Prepare a background migration that will move processing data, especially
`ci_builds.options`, to the `ci_builds_metadata` table.
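A sketch of one batch of such a migration; the `config_options` target column
name is an assumption, and serialization-format conversion is glossed over:

```sql
-- Copy processing data over in id-ordered batches; a background migration
-- would iterate over ranges like this until the whole table is covered.
UPDATE ci_builds_metadata m
SET config_options = b.options  -- assumed target column
FROM ci_builds b
WHERE m.build_id = b.id
  AND b.id BETWEEN 1 AND 10000;
```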
### Move other processing columns
There are other columns that could be moved to either `ci_builds_metadata` or
`ci_pipelines_metadata`. We should move them too.
### Remove legacy columns from `ci_builds`
There are a bunch of deprecated columns in that table; we should remove them
too.
    • Author Maintainer

      @Finotto would you be able to check the total size of ci_builds.options and what percentage of the total size of ci_builds it represents?

      • @grzesiek we have the following tables with 'build' in the name in the database, and I do not see any clear options table there:

        SELECT
            relname AS "Table",
            pg_size_pretty(pg_total_relation_size(relid)) AS "Size"
        FROM pg_catalog.pg_statio_user_tables
        WHERE relname LIKE '%build%'
        ORDER BY pg_total_relation_size(relid) DESC;
                        Table                |  Size
        -------------------------------------+---------
         ci_builds                           | 1539 GB
         ci_build_trace_sections             | 459 GB
         ci_builds_metadata                  | 71 GB
         ci_build_needs                      | 9362 MB
         ci_build_trace_section_names        | 1795 MB
         ci_daily_build_group_report_results | 199 MB
         ci_build_trace_chunks               | 35 MB
         packages_build_infos                | 20 MB
         ci_builds_runner_session            | 824 kB
         ci_build_report_results             | 24 kB

        ci_builds currently represents 1539 GB out of a total database size of 7369 GB, which is approximately 21% of the database.

      • Author Maintainer

        @Finotto I mean the options column in the ci_builds table - what percentage of the total data in ci_builds do we store in the options column?

      • @grzesiek I added the info:

        gitlabhq_production=# select
        gitlabhq_production-#     pg_size_pretty(sum(pg_column_size(options))) as total_size,
        gitlabhq_production-#     pg_size_pretty(avg(pg_column_size(options))) as average_size,
        gitlabhq_production-#     sum(pg_column_size(options)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
        gitlabhq_production-# from ci_builds;
         total_size |        average_size        |     percentage
        ------------+----------------------------+---------------------
         354 GB     | 667.0868096973743355 bytes | 22.8841407398832713

        So this column represents 22% of the whole table. I would suggest normalizing it and creating a separate table.

        cc: @glopezfernandez @albertoramos @NikolayS

      • Author Maintainer

        I expect that we might be able to remove 90% of this data and reduce our database size by around 300 gigabytes. It still depends on how many pipelines in the database are older than 3 and 6 months, but I presume it is more than 90%.

        @Finotto can you also check the size of ci_builds.stage? I expect that this column will be completely redundant after we wrap up our effort around atomic processing of pipelines. I expect this to be a few gigabytes too, and if everything goes right, this data will not be needed at all.

      • Author Maintainer

        @Finotto can you please take a look at the ci_builds.stage size? This is a column that we might be able to remove completely, even without a backup (because we already have a backup in ci_stages.name), so this might be the low-hanging fruit we are looking for here too.

      • @grzesiek sorry for the late answer; sure, I will give you the info in a few minutes.

      • @grzesiek

        the info is:

        # select
            pg_size_pretty(sum(pg_column_size(stage))) as total_size,
            pg_size_pretty(avg(pg_column_size(stage))) as average_size,
            sum(pg_column_size(stage)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
        from ci_builds;
         total_size |       average_size       |       percentage
        ------------+--------------------------+------------------------
         4590 MB    | 7.5525122452349447 bytes | 0.25371424219677802001
        (1 row)
      • @grzesiek please do not forget that this column is part of an index:

        "index_ci_builds_on_project_id_for_successfull_pages_deploy" btree (project_id) WHERE type::text = 'GenericCommitStatus'::text AND stage::text = 'deploy'::text AND name::text = 'pages:deploy'::text AND status::text = 'success'::text

      • Author Maintainer

        Thanks!

The `stage` column is known to take a lot of space, and it might not be needed
anymore after we enable the `ci_atomic_processing` feature flag.
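Because the partial index quoted in the thread above references `stage`,
removing the column means dealing with that index first; a sketch:

```sql
-- The pages-deploy partial index must be redefined without the stage
-- predicate before the column can go away.
DROP INDEX CONCURRENTLY index_ci_builds_on_project_id_for_successfull_pages_deploy;
CREATE INDEX CONCURRENTLY index_ci_builds_on_project_id_for_successfull_pages_deploy
    ON ci_builds (project_id)
    WHERE type = 'GenericCommitStatus'
      AND name = 'pages:deploy'
      AND status = 'success';
ALTER TABLE ci_builds DROP COLUMN stage;  -- data recoverable from ci_stages.name
```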
### Resolve the problem of STI
We can replace the STI mechanism with integer enums.
> TODO, elaborate
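A minimal sketch of the enum approach; the column name and the numeric values
are illustrative:

```sql
-- A smallint enum takes 2 bytes per row instead of a variable-length
-- class-name string.
ALTER TABLE ci_builds ADD COLUMN type_enum smallint;  -- hypothetical name

UPDATE ci_builds
SET type_enum = CASE type
    WHEN 'Ci::Build'           THEN 1
    WHEN 'Ci::Bridge'          THEN 2
    WHEN 'GenericCommitStatus' THEN 3
END;  -- backfilled in batches in practice; type can be dropped afterwards
```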
### Move and remove indices
Once we move processing data, we might also be able to move indices. We should
never remove an index until its replacement is in place.
> TODO elaborate
We should extend this blueprint with more ideas about how to reduce the size of
indices.
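A sketch of the create-first, drop-second rule, with hypothetical index names:

```sql
-- The replacement index must exist before the old one is removed, so
-- queries never lose index coverage in between.
CREATE INDEX CONCURRENTLY index_ci_builds_metadata_on_build_id_partial
    ON ci_builds_metadata (build_id)
    WHERE interruptible;  -- hypothetical replacement

DROP INDEX CONCURRENTLY index_ci_builds_on_old_processing_column;  -- hypothetical
```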
## WHO
    • This made me think of retried builds. I'm wondering whether retried builds could be automatically archived, including associated data (processing data, artifacts, traces, etc.). Do we use retried builds for anything today? Would it be acceptable to even delete a build's intrinsic data if we don't use it anywhere? This would mean that when we retry a build we replace it with a new one rather than adding a new "instance" of the same build. Perhaps this could be configurable via an instance-level setting.

Proposal:
| Role | Who |
|------------------------------|-------------------------|
| Author | Grzegorz Bizon |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader | TBD |
| Domain Expert | Grzegorz Bizon |
DRIs:
| Role | Who |
|------------------------------|------------------------|
| Product | TBD |
| Leadership | TBD |
| Engineering | TBD |
| Domain Expert | Grzegorz Bizon |