Enable GitLab admins to define a retention period after which pipeline data is archived.
Benefits
This will allow us to reduce the number of records in the CI database, improving reliability and making migrations faster. For self-managed instances, it will also allow admins to remove unwanted old data.
@fabiopitino @mbobin @drew I created this issue for us to update/add blueprints for data retention. There has been a lot of great discussion, but I think we need clarity on what the next steps look like, and then update our existing design blueprint (CI Data Time Decay) or create a new one if applicable. WDYT?
cc: @tianwenchen - as this will be an upcoming topic for our team.
What does everyone think about packaging all the details and artifacts of the Builds in a Pipeline into a single PipelineArtifact? I think @mbobin had some idea like this.
I've proposed things like this before, where we roll up data from lower down in the ORM hierarchy (with Pipeline at the top) into the more high-level records. Needing to preserve the data itself doesn't mean we should keep it in the highly fragmented structure that we currently use.
This is something that I could see us doing with data, perhaps, greater than 1 year old? I'm totally making up that number right now, but I think we could do some log analysis to see how often old data is accessed and pick a smart age at which point we'd move a Pipeline to a "summary" view.
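A minimal sketch of what such a rollup could produce, assuming a hypothetical `Ci::RollupPipelineService` (the name and the set of summarized fields are made up; how the blob is persisted, whether as a `Ci::PipelineArtifact` or a new model, is an open question):

```ruby
# Hypothetical sketch only, not existing GitLab code: collapse the fragmented
# per-build rows of an old pipeline into one denormalized JSON document.
module Ci
  class RollupPipelineService
    def initialize(pipeline)
      @pipeline = pipeline
    end

    # Returns a JSON blob that the caller would store as a single
    # pipeline-level artifact in object storage.
    def execute
      {
        pipeline: @pipeline.slice(:id, :status, :ref, :sha, :created_at, :finished_at),
        builds: @pipeline.builds.map do |build|
          build.slice(:id, :name, :status, :started_at, :finished_at)
        end
      }.to_json
    end
  end
end
```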
> What does everyone think about packaging all the details and artifacts of the Builds in a Pipeline into a single PipelineArtifact?
This means that we must keep the ci_pipelines data as it is right now, and that's not really great because ci_pipelines takes 1.18 TiB. Out of this, 922.65 GiB is taken by indexes on this table, and we will likely add more over time. I think it would be better to have a clear separation between active data and archived data.
> 80% of accesses to pipeline records happen within 1 year of pipeline age
My assumption is that the 20% of cases that access pipelines older than 1 year could be caused by project deletions.
During a project deletion we fetch the project's pipelines one by one in order to cascade-delete all of their contained artifacts. This is equivalent to accessing all project pipelines in order to destroy them.
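For illustration, the deletion path behaves roughly like this (a simplified sketch, not the actual Projects::DestroyService code):

```ruby
# Simplified sketch of why deleting a project touches every pipeline,
# regardless of age. Not the actual Projects::DestroyService implementation.
project.all_pipelines.find_each do |pipeline|
  # destroy (rather than delete) so that callbacks cascade to builds,
  # artifacts, and other dependent records.
  pipeline.destroy!
end
```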
> This means that we must keep the ci_pipelines data as it is right now, and that's not really great because ci_pipelines takes 1.18 TiB. Out of this, 922.65 GiB is taken by indexes on this table, and we will likely add more over time. I think it would be better to have a clear separation between active data and archived data.
@mbobin I agree. I think we discussed having a ci_archived_pipelines table that contains only the bare minimum columns to fetch a pipeline by project_id + id. The downside is that fetching a list of archived pipelines would either be disallowed or very limited.
We would have to keep data in ci_archived_pipelines, but with data retention we could have a stricter policy: after X period we actually remove the data from ci_archived_pipelines and object storage. For this it would be good to also carry the original partition_id, or a new time-based partition ID, in ci_archived_pipelines so we can delete records more efficiently.
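A minimal sketch of what that table could look like, assuming we key it by (project_id, id) and keep a partition identifier for bulk removal (table name, columns, and Rails version are illustrative):

```ruby
# Illustrative migration for a hypothetical ci_archived_pipelines table.
class CreateCiArchivedPipelines < ActiveRecord::Migration[7.1]
  def change
    # Bare minimum columns: enough to fetch one pipeline by project_id + id,
    # and to drop whole partitions once the retention period expires.
    create_table :ci_archived_pipelines, primary_key: [:id, :partition_id] do |t|
      t.bigint :id, null: false
      t.bigint :project_id, null: false
      t.bigint :partition_id, null: false # original or time-based partition ID
      t.datetime :archived_at, null: false

      # The only supported lookup: a single archived pipeline.
      t.index [:project_id, :id], unique: true
    end
  end
end
```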
For example, the retention stages could be:

- Up to 6 months, pipelines are kept in the database in active partitions.
- Between 6 months and 1 year, pipelines are kept in an archived format (e.g. read-only and fetched by ID).
- After 1 year, pipeline data is destroyed.

These stages could be configured and in some cases even disabled (e.g. never destroy data), as in the sketch below.
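A minimal sketch of how those tiers could be classified, assuming hypothetical, instance-configurable thresholds (none of these names or defaults exist today):

```ruby
# Hypothetical retention policy; thresholds and names are illustrative.
module Ci
  class RetentionPolicy
    ACTIVE_PERIOD  = 6.months # assumed default, configurable per instance
    ARCHIVE_PERIOD = 1.year   # nil would mean "never destroy data"

    def self.stage_for(pipeline)
      if pipeline.created_at >= ACTIVE_PERIOD.ago
        :active    # kept in the active partitions, fully queryable
      elsif ARCHIVE_PERIOD.nil? || pipeline.created_at >= ARCHIVE_PERIOD.ago
        :archived  # read-only, fetched by project_id + id
      else
        :destroyed # eligible for removal from the DB and object storage
      end
    end
  end
end
```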
@jreporter Is there any current Product thinking about a different view/access pattern for old CI data?
We could potentially wire up the existing models to go fetch data (much more slowly) from the PipelineArtifact if it's needed, but I think it would make more sense to design a view with the explicit purpose of displaying summary data. Job Traces, if needed, could be archived together as a PipelineArtifact for audit purposes. But I think we shouldn't render a JobTrace from 5 years ago the same way we render a JobTrace from today anyway.
I think we should run some numbers and see at what point accessing job trace and pipeline artifact data really drops off. I imagine we could even tell users "for pipeline and job data older than X years we will email the data to you", or design some other experience for auditing as you suggested.
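One possible shape for that read path, sketched under the assumption that archived pipelines keep a summary blob (Ci::ArchivedPipeline and summary_blob are hypothetical names):

```ruby
# Hypothetical read-through: serve recent pipelines from the database and fall
# back to the archived summary blob for old ones. Names are illustrative.
module Ci
  class PipelineReader
    def self.find_for_display(project, pipeline_id)
      pipeline = project.all_pipelines.find_by(id: pipeline_id)
      return pipeline if pipeline # fast path: active data

      # Slow path: fetch the rolled-up summary from object storage and render
      # it with a dedicated, simplified "summary" view.
      archived = Ci::ArchivedPipeline.find_by(project_id: project.id, id: pipeline_id)
      archived && JSON.parse(archived.summary_blob.read)
    end
  end
end
```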
Low-hanging fruit: drop deprecated columns from ci_builds
In &4685 (comment 435280900) (4 years ago) there is a quick analysis of the columns in the ci_builds table. There are opportunities to drop data in terms of columns, which would help slow down data growth; some candidate columns are listed there.
@drew @mbobin can we look at which of these columns we can take action on? Should we create an issue for this?
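For reference, dropping a column would follow the usual two-step pattern; a sketch with a made-up column name (the real candidates would come from the analysis linked above, and the milestone/date values are placeholders):

```ruby
# Step 1: stop the application from reading/writing the column.
class Ci::Build < Ci::ApplicationRecord
  include IgnorableColumns

  # Made-up column name; milestone/date values are placeholders.
  ignore_column :deprecated_example_column, remove_with: '17.0', remove_after: '2024-05-22'
end

# Step 2: drop it in a post-deployment migration once it is ignored everywhere.
class RemoveDeprecatedExampleColumnFromCiBuilds < Gitlab::Database::Migration[2.2]
  def up
    remove_column :ci_builds, :deprecated_example_column
  end

  def down
    add_column :ci_builds, :deprecated_example_column, :string
  end
end
```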
Do we already detect unused database indices that we could potentially drop?
Would there be any indices that would become unused by changing the queries not to use redundant columns? For example: we change the query to use ci_stages.position instead of ci_builds.stage_idx, allowing us to drop the index on stage_idx.
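PostgreSQL already tracks index usage in pg_stat_user_indexes, so something like this (run against the CI database) would surface candidates; note that idx_scan only counts scans since the statistics were last reset, and constraint-backing indexes must stay even if unscanned:

```ruby
# List indexes that have never been scanned since the stats were last reset,
# largest first. pg_stat_user_indexes is standard PostgreSQL.
unused = Ci::ApplicationRecord.connection.select_all(<<~SQL)
  SELECT relname AS table_name,
         indexrelname AS index_name,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY pg_relation_size(indexrelid) DESC
SQL

unused.each { |row| puts "#{row['table_name']}.#{row['index_name']} (#{row['index_size']})" }
```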
Given that ci_builds_metadata is the biggest table: if we fix and enable the pipeline archival period on GitLab.com, would it be safe (from a data integrity perspective) to drop old records from ci_builds_metadata? If so, how could we implement a mechanism (e.g. limited-capacity workers) that would do that at GitLab.com scale?
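A sketch of what that could look like with GitLab's LimitedCapacity::Worker concern, assuming a hypothetical for_archived_pipelines scope (batch size and concurrency are guesses):

```ruby
# Sketch of a capped-concurrency cleanup worker. The for_archived_pipelines
# scope does not exist today; batch size and max_running_jobs are guesses.
class Ci::DestroyOldBuildsMetadataWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  BATCH_SIZE = 1_000

  def perform_work(*)
    # Only metadata whose parent pipeline is already archived should be safe
    # to drop; pluck IDs first so the DELETE stays a simple, bounded statement.
    ids = scope.limit(BATCH_SIZE).pluck(:id)
    Ci::BuildMetadata.where(id: ids).delete_all
  end

  def remaining_work_count(*)
    scope.limit(BATCH_SIZE * max_running_jobs).count
  end

  def max_running_jobs
    5 # tune to what the CI database can absorb
  end

  private

  def scope
    Ci::BuildMetadata.for_archived_pipelines # hypothetical scope
  end
end
```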
@fabiopitino @mbobin I think it would help to understand the expected savings we could get by deleting these unused columns or indexes, and potentially evaluate them in order from most savings to least. WDYT?
I think we should restructure this whole page based on the strategies outlined in the summary.
Then create a subpage for each strategy. Today we only have the partition one.
In the page for strategy (3) (archiving pipelines) we need to document a few decisions we are going to make, which will clarify our iteration plan: (1) the need to extend the archival mechanism to the entire pipeline.
@drew and I were also discussing that the sub-epics, defined by the iterations, are largely out of date and don't necessarily capture the amount of progress we've made on the partitioning effort. I was going to take some time to reorganize Partition CI/CD pipelines data (&5417) and simplify some of the organization.