Design CI data retention policies for GitLab.com
Ultimate goal
Enable GitLab admins to define a retention period after which pipeline data is archived.
Benefits
This will allow us to reduce the number of records in the CI database, which should improve reliability and speed up migrations. For self-managed instances, it will allow admins to remove unwanted old data.
Problem to Solve
It has been brought up several times that we need a plan for implementing a data retention strategy for our CI data, most recently in https://gitlab.com/gitlab-com/gl-infra/capacity-planning-trackers/gitlab-com/-/issues/1601+
We have already brainstormed ways to implement retention policies, for example:
- Discuss data retention strategy for CI data (verify-stage#440)
- Create a retention policy for job logs (#374717)
Proposal
Let's identify the actionable next steps we can take with CI data over the next:
- 0-3 months (low-hanging fruit)
- 3-6 months (might require some technical planning or waiting on bugs like Updating partitioning value deletes data from d... (#438394) to be resolved)
- 6+ months (requires UX/PM support, customer announcements, etc.)
Given the work in https://docs.gitlab.com/ee/architecture/blueprints/ci_data_decay/, we could either update that blueprint or start a new one, since its original goal was to:
Implement a new architecture of CI/CD data storage to enable scaling.