Add validation to CI Pipeline to ensure unique iid across partitions

What does this MR do and why?

Context

The behaviour described in https://gitlab.com/gitlab-org/gitlab/-/issues/545167#note_2802709422 may sometimes cause duplicate pipeline IIDs to persist across different partitions (for the given project scope). Essentially, the InternalId record's last_value gets into the wrong state (it is reset to 0), which causes new pipelines to be created with IIDs starting from 1 again.

Previously, in !164771 (diffs), we implemented a fix to flush (delete) the InternalId record when an ActiveRecord::RecordNotUnique error occurs on iid. After it's flushed, the next pipeline creation generates a new InternalId record, which recalculates the current maximum IID for the given scope. This resolves the duplication issue because the new record's last_value is then in the correct state.
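
As a loose illustration only (not the actual code from !164771), the flush-and-regenerate approach amounts to something like this:

# Loose illustration of the !164771 approach, not the actual implementation.
# When a duplicate iid within the same partition raises
# ActiveRecord::RecordNotUnique, the stale tracker row is deleted so that the
# next pipeline creation rebuilds it from the current maximum iid for the scope.
begin
  pipeline.save!
rescue ActiveRecord::RecordNotUnique
  InternalId.find_by(project: pipeline.project, usage: 'ci_pipelines')&.destroy
  raise
end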

However, this fix doesn't work for duplicate iids across partitions: Postgres's unique index is local to each individual partition, so ActiveRecord::RecordNotUnique is not raised when the duplicate lives in a different partition.

This MR

In this MR, we add a model-level validation to Ci::Pipeline that checks whether any other partition already contains the IID. The query performance appears acceptable (see the database query plan below).

This validation only runs on after_create, both to minimize calls to the DB and to let the existing unique index on (project_id, iid, partition_id) act before we execute the query. Running it on after_create also happens at roughly the same point at which ActiveRecord::RecordNotUnique is currently raised for a duplicated iid within the same partition.
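
As a rough sketch of the shape of this change (the callback name, flag check, and error handling below are illustrative assumptions, not necessarily what the MR actually uses):

# Rough sketch only -- callback name, error handling, and flag usage are
# assumptions for illustration, not necessarily the MR's implementation.
module Ci
  class Pipeline < Ci::ApplicationRecord
    after_create :validate_iid_unique_across_partitions

    private

    def validate_iid_unique_across_partitions
      return unless Feature.enabled?(:ci_validate_uniq_pipeline_iid_across_partitions, project)

      duplicate = self.class
        .where(project_id: project_id, iid: iid)
        .where.not(partition_id: partition_id)
        .exists?

      # Surfacing the same error class would let the existing flush-and-retry
      # handling for same-partition duplicates kick in (assumption).
      raise ActiveRecord::RecordNotUnique, "duplicate iid #{iid} in another partition" if duplicate
    end
  end
end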

Feature flag

This change is made behind a feature flag: ci_validate_uniq_pipeline_iid_across_partitions

Roll-out issue: [FF] `ci_validate_uniq_pipeline_iid_across_part... (#575604 - closed)

References

Database query

Raw SQL:

SELECT
    1 AS one
FROM
    "p_ci_pipelines"
WHERE
    "p_ci_pipelines"."project_id" = 278964
    AND "p_ci_pipelines"."partition_id" != 103
    AND "p_ci_pipelines"."iid" = 4596448
LIMIT
    1;

Query plan: https://console.postgres.ai/gitlab/gitlab-production-ci/sessions/44310/commands/135777
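
For reference, a query of this shape can be produced with an ActiveRecord relation along these lines (a sketch; the exact relation used in the MR may differ):

# Sketch: .exists? generates the `SELECT 1 AS one ... LIMIT 1` form shown above.
Ci::Pipeline
  .where(project_id: 278964, iid: 4596448)
  .where.not(partition_id: 103)
  .exists?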

Observations:

  • There was originally some concern because this query must index scan every partition except the current one, but it efficiently utilizes the existing indexes on (project_id, iid, partition_id).
  • The performance looks fairly good at this point: it uses Index Only Scans and reads very few buffers.
  • The number of buffers will increase with each new partition; however, the increase should be negligible for a long time (perhaps until we get closer to 100+ partitions; we only have about 7-8 right now). Even at 100 partitions, the increase in execution time would likely not be noticeable, and there are much more significant bottlenecks elsewhere that contribute to pipeline creation slowness.

How to set up and validate locally

  1. First let's validate the current behaviour. To do this, we will set up a project that has pipelines across different partitions. Ensure the feature flag is disabled to begin with:
Feature.disable(:ci_validate_uniq_pipeline_iid_across_partitions)
  2. Create a new blank project and commit this change to .gitlab-ci.yml (via the Pipeline editor):
job:
  script: echo

This should trigger a new pipeline that completes successfully.

  3. Run a second pipeline (via Pipelines page -> New pipeline).

  4. Verify the current state of pipeline IIDs. Run the following code in the Rails console to get an easy-to-read output:

project = Project.find(<YOUR_PROJECT_ID>) # project ID of the new project created in previous step
data = {}

project.reload.all_pipelines.each do |pipeline|
  partition_name = "partition_#{pipeline.partition_id}".to_sym
  data[partition_name] ||= []
  data[partition_name] << pipeline.iid
end

data.transform_values { |value| value.sort }

The result should show that the project only has two pipelines, with iid=1 and iid=2, on partition 100.

=> {:partition_100=>[1, 2]}
  5. Now we'll create a new pipeline on a different partition. Update the codebase with the following change and then restart your GDK.

File: app/models/concerns/ci/partitionable.rb

      def set_partition_id
        return self.partition_id = 101 # <-- Add this line
        return if partition_id_changed? && partition_id.present?
        return unless partition_scope_value

        self.partition_id = partition_scope_value
      end
  6. Run a new pipeline, then re-run the commands from Step 4. This time, we should see that the third pipeline was persisted successfully in the new partition:
=> {:partition_100=>[1, 2], :partition_101=>[3]}
  7. Now let's attempt to run a new pipeline that will have the duplicate iid 1. We do this by first resetting the InternalId record's last_value to 0:
InternalId.find_by(project: project, usage: 'ci_pipelines').update!(last_value: 0)
  8. Run a new pipeline and re-run the commands from Step 4. Because there's no uniqueness validation across partitions, we now see that two different partitions have the same iid 1:
=> {:partition_100=>[1, 2], :partition_101=>[1, 3]}
  9. Now let's observe the new behaviour. Enable the feature flag:
Feature.enable(:ci_validate_uniq_pipeline_iid_across_partitions)
  10. Reset the InternalId record's last_value to 1 so that the next pipeline will attempt to use the duplicate iid 2, which already exists on partition 100.
InternalId.find_by(project: project, usage: 'ci_pipelines').update!(last_value: 1)
  11. Run a new pipeline. We see that it shows a pipeline error in the UI. Note: This is the same error that's shown when an attempt is made to insert a duplicate iid in the same partition. See !164771 (merged).

(Screenshot: pipeline error shown in the UI)

  12. Run a new pipeline again. This second attempt should succeed without error. Re-running the commands from Step 4, we see that the new pipeline is inserted with iid=4, which is not a duplicate.
=> {:partition_100=>[1, 2], :partition_101=>[1, 3, 4]}

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
