Spike: Deduplicate ci_build_sources

Problem

Another case of unnecessary data duplication: the ci_build_sources table is populated for every job, see https://gitlab.com/gitlab-org/gitlab/-/blob/29c470086f1a19450565d411792a1cf82c6f637c/lib/gitlab/ci/pipeline/chain/set_build_sources.rb#L18-22

This feature was added to distinguish jobs coming from security policies from jobs defined in the .gitlab-ci.yml. Security policy jobs are a small minority, yet we populate this data for every job.

Proposal

The best option would be to see whether we can move this data into the job definition and deduplicate it there, since it is immutable. Note that we may later remove it for archived jobs.

Alternatively, we should avoid populating this table in the “default scenario” and rely on the already existing dynamic value: https://gitlab.com/gitlab-org/gitlab/-/blob/2c63fab5aa0b5432393a5f33d04178a5c36a9438/app/models/ci/build.rb#L1205.
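
The alternative could look roughly like the sketch below. It is plain Ruby with hypothetical stand-ins (`Pipeline`, `BuildSource`, `Build`, `set_build_source`), not the real models or the pipeline chain step. It assumes the dynamic value falls back to the pipeline source when no row exists, and the source names `"push"` and `"scan_execution_policy"` are used only for illustration.

```ruby
# Hypothetical stand-ins for the real models; illustration only.
Pipeline = Struct.new(:source)
BuildSource = Struct.new(:source)

class Build
  attr_reader :pipeline
  attr_accessor :build_source

  def initialize(pipeline)
    @pipeline = pipeline
    @build_source = nil
  end

  # Assumed shape of the dynamic value: prefer the stored row,
  # otherwise fall back to the pipeline source.
  def source
    build_source&.source || pipeline.source
  end
end

# Store a row only when the job source differs from the pipeline source,
# so the "default scenario" writes nothing.
def set_build_source(build, job_source)
  return if job_source == build.pipeline.source

  build.build_source = BuildSource.new(job_source)
end

pipeline = Pipeline.new("push")
regular_job = Build.new(pipeline)
policy_job = Build.new(pipeline)

set_build_source(regular_job, "push")
set_build_source(policy_job, "scan_execution_policy")
```

With this shape, only jobs whose source differs from the pipeline source get a row, while readers of `source` see the same value as before. Whether any consumer queries the table directly instead of going through the model is something the investigation would need to confirm.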

We need to work closely with group::security policies, or delegate the work to them, to ensure that the refactoring does not break any existing functionality or expectations.

Expected outcomes

  1. Investigation results outlining why we cannot/should not deduplicate this data, OR

  2. A POC MR with the proposed implementation.

Edited by Leaminn Ma