Ensure ci_builds_metadata contains only processing data

Overview

ci_builds_metadata currently contains many columns. We should ensure that the data in every one of them is safe to delete after a build is archived.

We have work in progress to clear data from some of the columns (see the discussion in Delete Ci::BuildMetadata after Ci::... (#538031 - closed)), which will help with space savings. However, there is a decent amount of operational efficiency to be gained if we can simply delete the row instead of updating a few columns to null.

Columns

Column	Type	Nullable?	Mutable?	Removable on archive? #538031 (closed)	Where to move?
id	integer	no	N/A	🗑️ Delete whole record	N/A
build_id	integer	no	N/A (we may deduplicate immutable data across builds)	🗑️ Delete whole record	N/A
project_id	integer	no	N/A	🗑️ Delete whole record	N/A
partition_id	integer	no	N/A	🗑️ Delete whole record	N/A
timeout	integer	yes	yes (when job picked by runner)	❌ #538183 (closed) intrinsic	`ci_builds`
timeout_source	integer	yes	yes (when job picked by runner)	❌ #538183 (closed) intrinsic	`ci_builds`
interruptible	boolean	no (default `true`)	no	⚠️ While the data is not mutable and can be removed after the pipeline is archived, it needs to be indexed.	`ci_job_prototypes`, indexed there as a dedicated column.
config_options	jsonb	yes	?	⚠️ see below	`ci_job_prototypes`
config_variables	jsonb	yes	no	🗑️	`ci_job_prototypes`
has_exposed_artifacts	boolean	yes. We care if it's `true`	no	❌ Consider removing the column in favor of filtering by `artifacts:expose_as`.	✅ See also #545486 (closed) where we need to retain `artifacts:expose_as`, maybe into `p_ci_builds`. This should be sufficient.
environment_auto_stop_in	character varying(255)			⚠️ group::environments #545659 (closed) - Removable only after data is migrated to `environments`	!194402 (closed) being moved to `environments`
expanded_environment_name	character varying(255)	yes	no	❌ group::environments #545659 (closed)	`ci_builds` (but attempt a refactor to make it processing data)
secrets	jsonb	yes	no	🗑️ #538252 (closed) can be merged with processing data.	`ci_job_prototypes`
id_tokens	jsonb	yes	no	🗑️ #538251 (closed) can be merged with processing data.	`ci_job_prototypes`
debug_trace_enabled	boolean	no (default `false`)	yes	❌ intrinsic data	`ci_builds`
exit_code	smallint	yes	yes	❌ intrinsic	`ci_builds`

Top-level keys found in `config_options`

As of 2025-05-23:

[ gprd ] production> Ci::BuildMetadata.select(:config_options).last(300_000).flat_map { |md| md.config_options.keys }.uniq.sort
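
The console query above collects the distinct top-level keys across recent records. As a minimal sketch of the same aggregation, the snippet below runs on plain hashes so it can be tried without a Rails console. The sample rows and the helper name `top_level_key_counts` are hypothetical, not taken from the GitLab codebase. It also counts how many rows contain each key and tolerates a nil `config_options`, since the column is listed as nullable.

```ruby
# Hypothetical sample rows standing in for Ci::BuildMetadata records.
SAMPLE_ROWS = [
  { config_options: { script: ["make"], image: { name: "ruby:3.3" } } },
  { config_options: { script: ["rspec"], artifacts: { expose_as: "report" } } },
  { config_options: nil } # the column is nullable, so guard against nil
].freeze

# Returns a hash of top-level key => number of rows containing that key,
# ordered by key name. Hypothetical helper for illustration only.
def top_level_key_counts(rows)
  rows
    .flat_map { |row| (row[:config_options] || {}).keys }
    .tally
    .sort
    .to_h
end

KEY_COUNTS = top_level_key_counts(SAMPLE_ROWS)
```

Calling `KEY_COUNTS.keys` yields the same sorted, de-duplicated list the console query produces, while the counts give a rough sense of how widely each key is used.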

NOTE: Ideally, intrinsic data should be moved to a table that best represents the data. However, due to urgency, we could introduce a nullable, non-indexed column in p_ci_builds. For example, if artifacts:expose_as is intrinsic (non-processing) data, we could introduce p_ci_builds.artifacts_expose_as as jsonb and move the data there when the pipeline is archived or new jobs are created.

Top-level key	Nullable?	Mutable?	Removable on archive?	Where to move?
`after_script:`	yes	no	🗑️	`ci_job_prototypes`
`allow_failure_criteria:`	yes	no	🗑️	`ci_job_prototypes`
`artifacts:`	yes	no	🔒	⚠️ @fabiopitino: `artifacts:expose_as` is used when `has_exposed_artifacts: true`, so this should be considered intrinsic data. Consider creating a dedicated table given the low usage of this feature, which may help us deprecate it if needed. Alternatively, if stored in `ci_job_artifacts`
`before_script:`	yes	no	🗑️	`ci_job_prototypes`
`bridge_needs:`	yes	no	🗑️	`ci_job_prototypes`
`cache:`	yes	no	🗑️	`ci_job_prototypes`
`cross_dependencies:`	yes	no	🗑️	`ci_job_prototypes`
`dast_configuration:`	yes	no	🗑️	`ci_job_prototypes`
`dependencies:`	yes	no	🗑️	`ci_job_prototypes`
`downstream_errors:`	yes	yes	🔒	`ci_builds` - set when bridge job runs and downstream pipeline fails without being persisted
`enqueue_immediately:`	yes	yes	🗑️	Moving to Redis
`environment:`	yes		🔒 group::environments	⚠️ #545659 (comment 2555709370)
`execution_policy_job:`	yes	no	🗑️ group::security policies	`ci_job_prototypes` or moved to dedicated table
`execution_policy_variables_override:`	yes	no	🗑️ group::security policies	`ci_job_prototypes` or moved to dedicated table
`execution_policy_name`	yes	no	🗑️ group::security policies	`ci_job_prototypes` or moved to dedicated table
`hooks:`	yes		🗑️
`identity:`	yes		🗑️ group::runner	`ci_job_prototypes`
`image:`	yes	no	🗑️	`ci_job_prototypes`
`instance:`	yes	no	🗑️	`ci_job_prototypes`
`job_timeout:`	yes	no	🗑️	`ci_job_prototypes`
`manual_confirmation:`	yes	no	🗑️	`ci_job_prototypes`
`pages:`	yes	no	🗑️ group::knowledge	`ci_job_prototypes`
`parallel:`	yes	no	🗑️	`ci_job_prototypes`
`publish:`	yes	no	🗑️ group::knowledge	`ci_job_prototypes`
`release:`	yes		🗑️ group::environments	❓ Likely isn't used after the release is created. See #545486 (comment 2547683632).
`resource_group_key:`	yes	no	🗑️	`ci_job_prototypes`
`retry:`	yes	no	🗑️	`ci_job_prototypes`
`scoped_user_id:`	yes	no	🗑️ group::authorization	`ci_job_processing` - It's processing data. There might be some UX to review because a job that requires `scoped_user_id` could fail with an insufficient-permissions error. However, an archived job should not be runnable anyway, regardless of the permissions.
`script:`	yes	no	🗑️	`ci_job_prototypes`
`services:`	yes	no	🗑️	`ci_job_prototypes`
`start_in:`	yes	no	🗑️	`ci_job_prototypes`
`trigger:`	yes	no	🗑️	`ci_job_prototypes`
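
The NOTE and the table above imply that, at archive time, each `config_options` hash splits into keys that must be retained somewhere and keys that can be deleted with the row. The following is a minimal sketch of that split on plain hashes. It assumes the retained set is exactly the rows marked 🔒 (`artifacts`, `downstream_errors`, `environment`). The constant `INTRINSIC_KEYS` and the helper `split_config_options` are hypothetical names for illustration, and where the retained data would actually be written is left out.

```ruby
# Top-level keys marked 🔒 in the table above.
# Assumption: this list is complete and applies at the top level only.
INTRINSIC_KEYS = %i[artifacts downstream_errors environment].freeze

# Splits a config_options hash into [retained, deletable] hashes.
# Hypothetical helper, not part of the GitLab codebase.
def split_config_options(config_options)
  retained, deletable = (config_options || {}).partition do |key, _value|
    INTRINSIC_KEYS.include?(key.to_sym)
  end
  [retained.to_h, deletable.to_h]
end

# Made-up example job configuration.
EXAMPLE_OPTIONS = {
  script: ["bundle exec rspec"],
  artifacts: { expose_as: "coverage", paths: ["coverage/"] },
  retry: { max: 2 }
}.freeze

RETAINED, DELETABLE = split_config_options(EXAMPLE_OPTIONS)
```

Here `RETAINED` holds only the `artifacts` entry and `DELETABLE` holds `script` and `retry`. A real implementation would likely narrow `artifacts` further to `expose_as`, as the NOTE suggests, rather than keeping the whole key.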

Edited Jun 20, 2025 by Fabio Pitino
