Spike: Deduplicate ci_build_sources


Problem

Another case of unnecessary data duplication: the table ci_build_sources is populated for every job. See https://gitlab.com/gitlab-org/gitlab/-/blob/29c470086f1a19450565d411792a1cf82c6f637c/lib/gitlab/ci/pipeline/chain/set_build_sources.rb#L18-22

This feature was added to distinguish jobs coming from security policies from jobs defined in the .gitlab-ci.yml. Security policy jobs are a small minority, yet we populate this data for every job.

Proposal

The best option would be to move this data into the job definition and deduplicate it, since it is immutable. Later, we may also remove it for archived jobs.

Alternatively, we should avoid populating this table in the “default scenario” and use the existing dynamic value instead: https://gitlab.com/gitlab-org/gitlab/-/blob/2c63fab5aa0b5432393a5f33d04178a5c36a9438/app/models/ci/build.rb#L1205.
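A minimal sketch of the alternative above, written as plain Ruby rather than against the real models. Every class name here (Pipeline, Job, BuildSourceStore) and the "scan_execution_policy" value are hypothetical stand-ins chosen for illustration, and the fallback to the pipeline source is an assumption about what the dynamic value does, not a description of the linked code.

```ruby
# Hypothetical, framework-free sketch; these are not GitLab's real classes.
Pipeline = Struct.new(:source)

class Job
  attr_reader :id, :pipeline, :explicit_source

  def initialize(id:, pipeline:, explicit_source: nil)
    @id = id
    @pipeline = pipeline
    @explicit_source = explicit_source
  end
end

# Stands in for the ci_build_sources table: job id => source.
class BuildSourceStore
  def initialize
    @rows = {}
  end

  # Write a row only when the job's source differs from the pipeline default.
  def persist(job)
    source = job.explicit_source
    return if source.nil? || source == job.pipeline.source

    @rows[job.id] = source
  end

  # Dynamic value (assumed behavior): the stored row if present,
  # otherwise the pipeline's own source.
  def source_for(job)
    @rows.fetch(job.id, job.pipeline.source)
  end

  def row_count
    @rows.size
  end
end
```

With this shape, a regular job costs no row at all and still resolves to a source on read, while a policy job keeps its explicit row. Whether the real readers of ci_build_sources tolerate a missing row is exactly what the spike would need to confirm.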

We need to work closely with group::security policies, or delegate the work to them, to ensure that the refactoring does not break any existing functionality or expectations.

Investigation

  1. Assess whether there are any columns we should move into a new model Ci::JobInfo. Do they have to be indexed? Are they immutable? Can they be part of a jsonb column like p_ci_job_definitions.config?

  2. Assess the cost of refactoring in terms of complexity and risks.
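To make the deduplication question in step 1 concrete, here is a rough sketch of how immutable attributes could collapse into shared rows keyed by a content checksum. It is an in-memory illustration only: DefinitionStore and its methods are hypothetical, and keying by a SHA-256 of canonical JSON is an assumption for the example, not a statement about how Ci::JobInfo or p_ci_job_definitions is implemented.

```ruby
require "digest"
require "json"

# Hypothetical in-memory stand-in for a deduplicated definitions table.
class DefinitionStore
  def initialize
    @rows = {}
  end

  # Returns the checksum identifying the row; inserts the row only once.
  def find_or_create(attributes)
    # Normalize key type and order so equal content yields an equal checksum.
    canonical = JSON.generate(attributes.transform_keys(&:to_s).sort.to_h)
    checksum = Digest::SHA256.hexdigest(canonical)
    @rows[checksum] ||= attributes.dup.freeze
    checksum
  end

  def size
    @rows.size
  end
end
```

Because the overwhelming majority of jobs would share the same source value, they would all point at the same row. This only holds if the attribute is truly immutable, which is why the immutability question in step 1 matters.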

Expected outcomes

  1. Investigation results outlining why we cannot/should not deduplicate this data, OR

  2. A POC MR with the proposed implementation. We already have a POC for Ci::JobInfo (Draft: POC Deduplicate intrinsic immutable data... (!211540)), so we could build on it for this spike issue.
