Partition CI/CD pipelines data
## Summary

Part of the CI/CD time-decay architecture evolution :arrow_right: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/70052

CI/CD data on GitLab.com consumes 4 terabytes of storage. We need to partition it. We want to explore time-based partitioning because the relevance of pipeline data decays with time.

Architecture evolution blueprint :arrow_right: https://docs.gitlab.com/ee/architecture/blueprints/ci_data_decay/

## List-based partitioning

There are a few approaches we can take to partition CI/CD data. A promising one is list-based partitioning, where a partition number is assigned to a pipeline and propagated to all resources related to that pipeline. We assign the partition number based on when the pipeline was created or when we last observed processing activity in it.

This is very flexible because we can extend the partitioning strategy at will; for example, we can assign an arbitrary partition number based on multiple partitioning keys, combining time-decay-based partitioning with tenant-based partitioning at the application level.

## Unknowns

- Can we partition archived data as well as active data?
- How can we avoid moving large quantities of data between the active and archived schemas each day?

## Current Issues

- CI/CD tables are large; partitioning them would make access more efficient.
- Really long job names are truncated, and the resulting error isn't surfaced.

## Acceptance Criteria

1. Be able to split the large table into multiple tables
1. Split off one table
1. Identify an ideal table size for this data after decomposition is done
1. Create issues for the remaining decomposition of the CI/CD pipelines data

## Iterations

1. Move archived CI/CD data to object storage
1. Partition CI/CD tables using the CI/CD data retention policy
1. Partition CI/CD queuing tables using list partitioning
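As a rough sketch of what the list-based partitioning above could look like in PostgreSQL: the table and column names (`ci_pipelines`, `ci_builds`, `partition_id`) and the partition values are illustrative assumptions, not the final schema.

```sql
-- Illustrative sketch only: names and values below are assumptions.
-- A partition number column becomes part of the partitioning key.
CREATE TABLE ci_pipelines (
  id bigint NOT NULL,
  partition_id bigint NOT NULL,
  created_at timestamptz NOT NULL,
  PRIMARY KEY (id, partition_id)
) PARTITION BY LIST (partition_id);

CREATE TABLE ci_pipelines_100 PARTITION OF ci_pipelines FOR VALUES IN (100);
CREATE TABLE ci_pipelines_101 PARTITION OF ci_pipelines FOR VALUES IN (101);

-- Related resources carry the pipeline's partition number, so their
-- partitions can be detached or archived together with the pipeline's.
CREATE TABLE ci_builds (
  id bigint NOT NULL,
  pipeline_id bigint NOT NULL,
  partition_id bigint NOT NULL,
  PRIMARY KEY (id, partition_id)
) PARTITION BY LIST (partition_id);

CREATE TABLE ci_builds_100 PARTITION OF ci_builds FOR VALUES IN (100);
```

Because each partition holds an explicit list of partition numbers, an entire partition (a pipeline cohort and all of its related rows) can be detached or dropped without scanning the active data.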
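The time-decay assignment described above could be sketched as follows. This is a hypothetical illustration: the one-month bucket size, the epoch, and the function name are assumptions, and the real strategy may also mix in tenant-based keys.

```python
from datetime import datetime, timezone

def partition_number(created_at: datetime) -> int:
    """Hypothetical sketch: derive a partition number from pipeline
    creation time. Here one partition covers one calendar month; the
    real bucket size is an open design decision."""
    # Months elapsed since an arbitrary epoch (January 2022).
    epoch_year, epoch_month = 2022, 1
    return (created_at.year - epoch_year) * 12 + (created_at.month - epoch_month)

# All resources of a pipeline inherit the pipeline's partition number,
# so they can be archived or detached together.
p = partition_number(datetime(2022, 4, 15, tzinfo=timezone.utc))  # → 3
```

Assigning the number at the application level, rather than deriving it in the database, is what makes the strategy extensible: the same column can later encode a combination of time bucket and tenant.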
epic