Spike: Deduplicate intrinsic immutable data from ci_builds to Ci::JobInfo
Problem
In Introduce `Ci::JobDefinition` model (#551830 - closed) we introduced a new model to store deduplicated job data. Ci::JobDefinition stores only immutable processing data so that we can:
- Easily deduplicate it.
- Easily drop it (the old partitions) after pipelines are archived.
With Ci::JobDefinition we could not deduplicate immutable intrinsic (long-term) data such as job names, needs, and sources. We can leverage the lessons learned from Ci::JobDefinition and apply the same pattern and refactoring steps to a new model that stores immutable intrinsic data.
Proposal
We can introduce a new model, called Ci::JobInfo (proposed name so far), to store immutable intrinsic data. The data must be immutable so that it can be deduplicated at creation time and never updated afterwards. Any "mutable" data stays in ci_builds.
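To make "deduplicated at creation time and never updated" concrete, here is a minimal sketch in plain Ruby. It is not the proposed implementation. `JobInfoStore`, its `rows` hash, and the choice of a SHA-256 checksum over canonicalized attributes as the lookup key are all assumptions made for illustration; this issue does not specify how Ci::JobInfo rows would be looked up or stored.

```ruby
require "digest"
require "json"

# Hypothetical in-memory stand-in for a Ci::JobInfo table (not a real GitLab class).
# Rows are keyed by a checksum of their intrinsic attributes.
class JobInfoStore
  attr_reader :rows

  def initialize
    @rows = {}
  end

  # Returns the existing row for identical intrinsic data, or creates one.
  # Rows are frozen at the top level to signal that they are never updated.
  def find_or_create(attributes)
    canonical = self.class.canonicalize(attributes)
    checksum = Digest::SHA256.hexdigest(JSON.generate(canonical))
    @rows[checksum] ||= { checksum: checksum, config: canonical.freeze }.freeze
  end

  # Stringifies and sorts hash keys recursively so that key order and
  # symbol/string keys do not affect the checksum. Array order is preserved.
  def self.canonicalize(value)
    case value
    when Hash
      value.map { |k, v| [k.to_s, canonicalize(v)] }.sort_by(&:first).to_h
    when Array
      value.map { |v| canonicalize(v) }
    else
      value
    end
  end
end

store = JobInfoStore.new
first  = store.find_or_create({ name: "rspec", needs: ["compile"] })
second = store.find_or_create({ "needs" => ["compile"], "name" => "rspec" })
other  = store.find_or_create({ name: "rubocop", needs: [] })
```

In this sketch `first` and `second` resolve to the same row because their content is identical, while `other` gets its own row. Anything that changes during a job's lifetime would have to stay out of the hashed attributes, which is why mutable data remains in ci_builds.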
Refer to Discussion: Where should all the columns of ci_... (#520538) for which data currently in ci_builds makes sense to deduplicate.
We should also look at deduplicating data from other CI tables:
- Spike: Deduplicate `ci_build_needs` (#565821)
- Spike: Deduplicate `ci_build_sources` (#565806)
- Spike: Deduplicate `ci_build_names` into one of... (#567704)
Investigation
- Assess if there are any columns we should move into a new model Ci::JobInfo. Do they have to be indexed? Are they immutable? Can they be part of a jsonb column like p_ci_job_definitions.config?
- Assess the cost of refactoring in terms of complexity and risks.
Expected outcomes
- Investigation results outlining why we cannot/should not deduplicate this data, OR
- A POC MR with the proposed implementation.