Enable GitLab admins to define a retention period after which pipeline data is archived.
Benefits
This will allow us to reduce the number of records in the CI database, improving reliability and making migrations faster. For self-managed instances, it will also allow admins to remove unwanted old data.
@fabiopitino @mbobin @drew I created this issue for us to update/add blueprints for data retention. There has been a lot of great discussion, but I think we need clarity on what the next steps look like, and then update our existing design blueprint (CI Data Time Decay) or create a new one if applicable. WDYT?
cc: @tianwenchen - as this will be an upcoming topic for our team.
What does everyone think about packaging all the details and artifacts of the Builds in a Pipeline into a single PipelineArtifact? I think @mbobin had some idea like this.
I've proposed things like this before, where we roll up data from lower down in the ORM hierarchy (with Pipeline at the top) into the more high-level records. Needing to preserve the data itself doesn't mean we should keep it in the highly fragmented structure that we currently use.
This is something that I could see us doing with data, perhaps, greater than 1 year old? I'm totally making up that number right now, but I think we could do some log analysis to see how often old data is accessed and pick a smart age at which point we'd move a Pipeline to a "summary" view.
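A minimal sketch of what such a rollup could produce, assuming a hypothetical `Ci::RollupPipelineService` (the name and the set of summarized fields are made up; how the blob is persisted, whether as a `Ci::PipelineArtifact` or a new model, is an open question):

```ruby
# Hypothetical sketch only, not existing GitLab code: collapse the fragmented
# per-build rows of an old pipeline into one denormalized JSON document.
module Ci
  class RollupPipelineService
    def initialize(pipeline)
      @pipeline = pipeline
    end

    # Returns a JSON blob that the caller would store as a single
    # pipeline-level artifact in object storage.
    def execute
      {
        pipeline: @pipeline.slice(:id, :status, :ref, :sha, :created_at, :finished_at),
        builds: @pipeline.builds.map do |build|
          build.slice(:id, :name, :status, :started_at, :finished_at)
        end
      }.to_json
    end
  end
end
```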
> What does everyone think about packaging all the details and artifacts of the Builds in a Pipeline into a single PipelineArtifact?
This means that we must keep the ci_pipelines data as it is right now, and that's not really great because ci_pipelines takes 1.18 TiB. Out of this, 922.65 GiB is taken by indexes on this table, and we will likely add more over time. I think it would be better to have a clear separation between active data and archived data.
> 80% of accesses to pipeline records happen within 1 year of pipeline age
My assumption is that the 20% of cases that access pipelines older than 1 year could be caused by project deletions.
During a project deletion we fetch the project's pipelines one by one in order to cascade-delete all of their contained artifacts. This is equivalent to accessing all project pipelines in order to destroy them.
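For illustration, the deletion path behaves roughly like this (a simplified sketch, not the actual Projects::DestroyService code):

```ruby
# Simplified sketch of why deleting a project touches every pipeline,
# regardless of age. Not the actual Projects::DestroyService implementation.
project.all_pipelines.find_each do |pipeline|
  # destroy (rather than delete) so that callbacks cascade to builds,
  # artifacts, and other dependent records.
  pipeline.destroy!
end
```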
> This means that we must keep the ci_pipelines data as it is right now, and that's not really great because ci_pipelines takes 1.18 TiB. Out of this, 922.65 GiB is taken by indexes on this table, and we will likely add more over time. I think it would be better to have a clear separation between active data and archived data.
@mbobin I agree. I think we discussed having a ci_archived_pipelines table that contains only the bare minimum columns to fetch a pipeline by project_id + id. The downside is that fetching a list of archived pipelines would either be disallowed or very limited.
We would have to keep data in ci_archived_pipelines, but with data retention we could have a stricter policy: after X period we actually remove the data from ci_archived_pipelines and object storage. For this it would be good to also carry the original partition_id, or a new time-based partition ID, in ci_archived_pipelines so we can delete records more efficiently.
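A minimal sketch of what that table could look like, assuming we key it by (project_id, id) and keep a partition identifier for bulk removal (table name, columns, and Rails version are illustrative):

```ruby
# Illustrative migration for a hypothetical ci_archived_pipelines table.
class CreateCiArchivedPipelines < ActiveRecord::Migration[7.1]
  def change
    # Bare minimum columns: enough to fetch one pipeline by project_id + id,
    # and to drop whole partitions once the retention period expires.
    create_table :ci_archived_pipelines, primary_key: [:id, :partition_id] do |t|
      t.bigint :id, null: false
      t.bigint :project_id, null: false
      t.bigint :partition_id, null: false # original or time-based partition ID
      t.datetime :archived_at, null: false

      # The only supported lookup: a single archived pipeline.
      t.index [:project_id, :id], unique: true
    end
  end
end
```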
For example, the retention stages could be:

- Up to 6 months, pipelines are kept in the database in active partitions.
- Between 6 months and 1 year, pipelines are kept in an archived format (e.g. read-only and fetched by ID).
- After 1 year, pipeline data is destroyed.

These stages could be configured and in some cases even disabled (e.g. never destroy data), as in the sketch below.
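A minimal sketch of how those tiers could be classified, assuming hypothetical, instance-configurable thresholds (none of these names or defaults exist today):

```ruby
# Hypothetical retention policy; thresholds and names are illustrative.
module Ci
  class RetentionPolicy
    ACTIVE_PERIOD  = 6.months # assumed default, configurable per instance
    ARCHIVE_PERIOD = 1.year   # nil would mean "never destroy data"

    def self.stage_for(pipeline)
      if pipeline.created_at >= ACTIVE_PERIOD.ago
        :active    # kept in the active partitions, fully queryable
      elsif ARCHIVE_PERIOD.nil? || pipeline.created_at >= ARCHIVE_PERIOD.ago
        :archived  # read-only, fetched by project_id + id
      else
        :destroyed # eligible for removal from the DB and object storage
      end
    end
  end
end
```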
@jreporter Is there any current Product thinking about a different view/access pattern for old CI data?
We could potentially wire up the existing models to go fetch data (much more slowly) from the PipelineArtifact if it's needed, but I think it would make more sense to design a view with the explicit purpose of displaying summary data. Job Traces, if needed, could be archived together as a PipelineArtifact for audit purposes. But I think we shouldn't render a JobTrace from 5 years ago the same way we render a JobTrace from today anyway.
I think we should run some numbers and see at what point accessing job trace and pipeline artifact data really drops off. I imagine we could even tell users "for pipeline and job data older than X years we will email the data to you", or design some other experience for auditing as you suggested.
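One possible shape for that read path, sketched under the assumption that archived pipelines keep a summary blob (Ci::ArchivedPipeline and summary_blob are hypothetical names):

```ruby
# Hypothetical read-through: serve recent pipelines from the database and fall
# back to the archived summary blob for old ones. Names are illustrative.
module Ci
  class PipelineReader
    def self.find_for_display(project, pipeline_id)
      pipeline = project.all_pipelines.find_by(id: pipeline_id)
      return pipeline if pipeline # fast path: active data

      # Slow path: fetch the rolled-up summary from object storage and render
      # it with a dedicated, simplified "summary" view.
      archived = Ci::ArchivedPipeline.find_by(project_id: project.id, id: pipeline_id)
      archived && JSON.parse(archived.summary_blob.read)
    end
  end
end
```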
Low-hanging fruit: drop deprecated columns from ci_builds
In &4685 (comment 435280900) (4 years ago) there is a quick analysis of the columns in the ci_builds table. There are opportunities to drop data in terms of columns, which would help slow down data growth; some candidate columns are listed there.
@drew @mbobin can we look at which of these columns we can take action on? Should we create an issue for this?
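For reference, dropping a column would follow the usual two-step pattern; a sketch with a made-up column name (the real candidates would come from the analysis linked above, and the milestone/date values are placeholders):

```ruby
# Step 1: stop the application from reading/writing the column.
class Ci::Build < Ci::ApplicationRecord
  include IgnorableColumns

  # Made-up column name; milestone/date values are placeholders.
  ignore_column :deprecated_example_column, remove_with: '17.0', remove_after: '2024-05-22'
end

# Step 2: drop it in a post-deployment migration once it is ignored everywhere.
class RemoveDeprecatedExampleColumnFromCiBuilds < Gitlab::Database::Migration[2.2]
  def up
    remove_column :ci_builds, :deprecated_example_column
  end

  def down
    add_column :ci_builds, :deprecated_example_column, :string
  end
end
```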
Do we already detect unused database indices that we could potentially drop?
Would there be any indices that would become unused by changing the queries not to use redundant columns? For example: we change the query to use ci_stages.position instead of ci_builds.stage_idx, allowing us to drop the index on stage_idx.
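PostgreSQL already tracks index usage in pg_stat_user_indexes, so something like this (run against the CI database) would surface candidates; note that idx_scan only counts scans since the statistics were last reset, and constraint-backing indexes must stay even if unscanned:

```ruby
# List indexes that have never been scanned since the stats were last reset,
# largest first. pg_stat_user_indexes is standard PostgreSQL.
unused = Ci::ApplicationRecord.connection.select_all(<<~SQL)
  SELECT relname AS table_name,
         indexrelname AS index_name,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY pg_relation_size(indexrelid) DESC
SQL

unused.each { |row| puts "#{row['table_name']}.#{row['index_name']} (#{row['index_size']})" }
```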
Given that ci_builds_metadata is the biggest table: if we fix and enable the pipeline archival period on GitLab.com, would it be safe (from a data integrity perspective) to drop old records from ci_builds_metadata? If so, how could we implement a mechanism (e.g. limited-capacity workers) that would do that at GitLab.com scale?
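A sketch of what that could look like with GitLab's LimitedCapacity::Worker concern, assuming a hypothetical for_archived_pipelines scope (batch size and concurrency are guesses):

```ruby
# Sketch of a capped-concurrency cleanup worker. The for_archived_pipelines
# scope does not exist today; batch size and max_running_jobs are guesses.
class Ci::DestroyOldBuildsMetadataWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  BATCH_SIZE = 1_000

  def perform_work(*)
    # Only metadata whose parent pipeline is already archived should be safe
    # to drop; pluck IDs first so the DELETE stays a simple, bounded statement.
    ids = scope.limit(BATCH_SIZE).pluck(:id)
    Ci::BuildMetadata.where(id: ids).delete_all
  end

  def remaining_work_count(*)
    scope.limit(BATCH_SIZE * max_running_jobs).count
  end

  def max_running_jobs
    5 # tune to what the CI database can absorb
  end

  private

  def scope
    Ci::BuildMetadata.for_archived_pipelines # hypothetical scope
  end
end
```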
@fabiopitino @mbobin I think it would help to understand the expected savings we could get by deleting these unused columns or indexes, and potentially evaluate them in order from most savings to least. WDYT?
I think we should restructure this whole page based on the strategies outlined in the summary.
Then create a subpage for each strategy. Today we only have the partition one.
In the page for strategy (3) (archiving pipelines) we need to document a few decisions we are going to make, which will clarify our iteration plan: (1) the need to extend the archival mechanism to the entire pipeline.
@drew and I were also discussing that the sub-epics, defined by the iterations, are largely out of date and don't necessarily capture the amount of progress we've made on the partitioning effort. I was going to take some time to reorganize Partition CI/CD pipelines data (&5417) and simplify some of the organization.