WIP: GitLab CI/CD pipelines storage improvement
According to https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10599, this is about 20% of the database (I don't know if that percentage takes indexes into account).
@Finotto would it be possible to calculate the size of indexes too?
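For reference, a sketch of a query that could answer this, assuming the standard PostgreSQL size functions (`pg_table_size` excludes indexes, while `pg_indexes_size` covers only them):

```sql
-- Compare heap/TOAST size with total index size for ci_builds
SELECT pg_size_pretty(pg_table_size('ci_builds'))   AS table_size,
       pg_size_pretty(pg_indexes_size('ci_builds')) AS indexes_size;
```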
A more recent example: gitlab-org/gitlab#205426 (comment 332840282) - we could not migrate the data, not even the last X months of it, due to its size, and we had to introduce more technical debt to work around it.
@Finotto would you be able to check the total size of `ci_builds.options` and what percentage of the total size of `ci_builds` it represents?

@grzesiek we have the following tables with 'build' in the database, and I do not see any clear `options` table there:

```sql
SELECT relname AS "Table", pg_size_pretty(pg_total_relation_size(relid)) AS "Size"
FROM pg_catalog.pg_statio_user_tables
WHERE relname LIKE '%build%'
ORDER BY pg_total_relation_size(relid) DESC;
```

```
                Table                |  Size
-------------------------------------+---------
 ci_builds                           | 1539 GB
 ci_build_trace_sections             | 459 GB
 ci_builds_metadata                  | 71 GB
 ci_build_needs                      | 9362 MB
 ci_build_trace_section_names        | 1795 MB
 ci_daily_build_group_report_results | 199 MB
 ci_build_trace_chunks               | 35 MB
 packages_build_infos                | 20 MB
 ci_builds_runner_session            | 824 kB
 ci_build_report_results             | 24 kB
```
`ci_builds` currently represents 1539 GB of a total database size of 7369 GB, which is approximately 21% of the database.
@Finotto I mean the `options` column in the `ci_builds` table - what percentage of the total data in `ci_builds` do we store in the `options` column?

@grzesiek I added the info:

```sql
select
  pg_size_pretty(sum(pg_column_size(options))) as total_size,
  pg_size_pretty(avg(pg_column_size(options))) as average_size,
  sum(pg_column_size(options)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
from ci_builds;
```

```
 total_size |        average_size        |     percentage
------------+----------------------------+---------------------
 354 GB     | 667.0868096973743355 bytes | 22.8841407398832713
```
So this column represents about 22% of the whole table. I would suggest normalizing it into a separate table.
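For illustration, a minimal sketch of what that normalization could look like, assuming a one-to-one side table (the table name and column type are assumptions, not a proposed schema):

```sql
-- Hypothetical: move options out of ci_builds into a side table,
-- keeping a row only for builds that actually have options set.
CREATE TABLE ci_build_options (
  build_id bigint PRIMARY KEY REFERENCES ci_builds(id) ON DELETE CASCADE,
  options text -- assumed to keep the same serialized format as ci_builds.options
);
```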
I expect that we might be able to remove 90% of this data and reduce our database size by around 300 gigabytes. It still depends on how many pipelines in the database are older than 3 and 6 months, but presumably it is more than 90%.
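A quick way to check that assumption could be something like the following sketch (the 6-month cutoff and the use of `created_at` are assumptions for illustration):

```sql
-- Estimate how much options data belongs to builds older than 6 months
select pg_size_pretty(sum(pg_column_size(options))) as old_options_size,
       count(*) as old_builds
from ci_builds
where created_at < now() - interval '6 months';
```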
@Finotto can you also check the size of `ci_builds.stage`? I expect that this column will be completely redundant after we wrap up our effort with atomic processing of pipelines. I expect this to be a few gigabytes too, and if everything goes right, this data will not be needed at all.

@Finotto can you please take a look at the `ci_builds.stage` size? This is a column that we might be able to remove completely, even without a backup (because we already have a backup in `ci_stages.name`), so this might be the low-hanging fruit we are looking for here too.

@grzesiek sorry for the late answer, sure, I'll give you the info in a few minutes.
The info is:

```sql
select pg_size_pretty(sum(pg_column_size(stage))) as total_size,
       pg_size_pretty(avg(pg_column_size(stage))) as average_size,
       sum(pg_column_size(stage)) * 100.0 / pg_total_relation_size('ci_builds') as percentage
from ci_builds;
```

```
 total_size |       average_size       |       percentage
------------+--------------------------+------------------------
 4590 MB    | 7.5525122452349447 bytes | 0.25371424219677802001
(1 row)
```
@grzesiek please do not forget that this column is part of an index:

```sql
"index_ci_builds_on_project_id_for_successfull_pages_deploy" btree (project_id)
  WHERE type::text = 'GenericCommitStatus'::text
    AND stage::text = 'deploy'::text
    AND name::text = 'pages:deploy'::text
    AND status::text = 'success'::text
```
This made me think of retried builds. I'm wondering whether retried builds could be automatically archived, including associated data (processing data, artifacts, traces, etc.). Do we use retried builds for anything today? Would it be acceptable to even delete the build's intrinsic data if we don't use it anywhere? This means that as we retry a build we would replace it with a new one rather than adding a new "instance" of the same build, perhaps configurable via an instance-level setting.
We could probably create a section in https://docs.gitlab.com/ee/development/cicd/index.html. Things we could consider are:

- How to identify whether data to be stored is `processing data`, `intrinsic data`, or (optionally) `metadata`.
- Introduce a `ci_builds_processing` table if the new column is related to processing data (see the sketch after this list). Add the column to `ci_builds_metadata` if the data is better identified as `metadata`.
- Decide what to do if new data to be stored is `intrinsic data`, and hence can't currently be added to `ci_builds` until we drop some data. Are we allowed to make exceptions while working on dropping data? Should we create a separate table for additional data to later be migrated back into `ci_builds`? Should we temporarily use `ci_builds_metadata` for it?
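For illustration, a minimal sketch of what such a `ci_builds_processing` table could look like; the columns are purely illustrative assumptions about what processing data might include:

```sql
-- Hypothetical: keep short-lived processing state out of ci_builds
CREATE TABLE ci_builds_processing (
  build_id bigint PRIMARY KEY REFERENCES ci_builds(id) ON DELETE CASCADE,
  processed boolean NOT NULL DEFAULT false -- illustrative processing flag
);
```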
I think similar data that seems redundant is also `ref`, `protected`, and `tag`, because they are defined in `pipeline` too.

Similarly to removing data, we could also move data to different tables. This is different from separating processing data from presentation data. For example:
- `coverage` and `coverage_regex` could be normalized into a different table, `ci_build_coverage`, managed by the Testing group. This is not data present for every build.
- `target_url` and `description` seem to be used by external CI providers. We could move those to a different table as this data is not present for every build.
I would still expect some overall reduction in data size because we would insert a coverage record only if the coverage feature is used. The same goes for the external CI data.
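For illustration, a minimal sketch of the `ci_build_coverage` table suggested above; the column types are assumptions:

```sql
-- Hypothetical: store coverage data only for builds that use the feature
CREATE TABLE ci_build_coverage (
  build_id bigint PRIMARY KEY REFERENCES ci_builds(id) ON DELETE CASCADE,
  coverage double precision,
  coverage_regex text
);
```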
Maybe this could be a point `#### 3.` of `Track A`, and we could move the STI replacement to a section `Other ideas to explore`?
@fabiopitino @ayufan can you please review the blueprint as Domain Experts, especially the iterations suggested here? Thanks!
Thanks @grzesiek for putting this together! I've added some thoughts.
TODO: add the date when the table sizes were checked in prod; the table grows quickly, and a few weeks from now these numbers will be different.