Fix long-lived CI DB transactions during MR pipeline creation (idle in transaction; bulk inserts to ci_build_needs/build_sources)

Please read down to the bottom of the description for the investigation results and implementation plan

Summary

We are observing long-lived transactions on the CI database during merge request pipeline creation. The backend session frequently shows idle in transaction with wait_event_type=Client, wait_event=ClientRead, while the last query is an INSERT into association tables such as ci_build_needs and p_ci_build_sources.

This appears to be application-side time spent between SQL statements while a transaction remains open, rather than the INSERT itself being slow.

The timing of this correlates with CPU saturation starting to tick up. Sessions in the idle in transaction state can contribute to CPU saturation in PostgreSQL.

Example signals (gprd)

From pg_stat_activity samples (same backend PID observed repeatedly; a query for spotting these sessions is sketched after this list):

  • endpoint_id: MergeRequests::CreatePipelineWorker
  • db_config_name: ci
  • state: idle in transaction
  • wait_event: ClientRead
  • long xact_age_s (example observed: ~420s)
  • queries observed within the open transaction include:
    • INSERT INTO ci_build_needs (name, artifacts, optional, build_id, partition_id, project_id) VALUES ...
    • INSERT INTO p_ci_build_sources (build_id, partition_id, project_id, source) VALUES ... RETURNING build_id
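
For anyone re-running the check, here is a rough sketch of how such sessions can be spotted. Assumptions: a Rails console with access to the ci database, and Ci::ApplicationRecord as the base class for ci-database models; the query itself only reads pg_stat_activity.

# Hedged reproduction sketch, not part of the fix: list backends that are
# sitting idle in transaction on the ci database, oldest transaction first.
rows = Ci::ApplicationRecord.connection.select_all(<<~SQL)
  SELECT pid, state, wait_event_type, wait_event,
         EXTRACT(EPOCH FROM (clock_timestamp() - xact_start)) AS xact_age_s,
         LEFT(query, 80) AS last_query
  FROM pg_stat_activity
  WHERE state = 'idle in transaction'
  ORDER BY xact_age_s DESC
SQL

rows.each { |r| puts r.values_at('pid', 'xact_age_s', 'last_query').join(' | ') }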

From Sidekiq structured log for the same job:

  • class: MergeRequests::CreatePipelineWorker
  • duration_s: ~1271s
  • CI DB transaction metrics:
    • db_ci_txn_count: 1
    • db_ci_txn_duration_s: ~440s (max matches)

A separate sample also showed a bare BEGIN followed by idle in transaction, suggesting at least some transactions are opened and then the app pauses before issuing the next statement.


Impact

Long-lived transactions can:

  • hold locks longer than expected
  • increase bloat / vacuum pressure
  • amplify contention in high-churn CI partitioned tables (e.g. ci_build_needs, build sources)

Notes / context

  • The observed long transaction is not explained by DB lock waits (client wait event is ClientRead).
  • Recent related work:
    • !219385 (merged) removed an explicit pipeline.transaction block from Gitlab::Ci::Pipeline::Chain::Create and removed tag bulk insertion in that area. The long transaction we see is dominated by inserts into ci_build_needs / build sources, so it may be unrelated.
    • Introduction MR referenced in !219385 (merged): !215235 (merged)

Hypotheses to validate - Done ✅

  • Transaction scope (implicit) around pipeline.save! includes significant Ruby-side work (building batches of associations, callbacks, etc.), causing gaps between statements.
    • 🔉 ✅ THIS IS THE WINNER (illustrated in the sketch after this list)
  • Potential edge case where a transaction is opened earlier than intended and spans non-DB work.
    • ⛔ I don't think so because:
      • Transaction opened at the right time (during pipeline.save)
      • Problem wasn't when it opened, but what was included inside it
  • Potential exception/ensure path leaving the connection in a transaction longer than expected (less likely given txn_count=1, but still worth checking).
    • ⛔ txn_count=1 confirmed single transaction
  • Could be that Ruby steps are blocked on other strained resources (Rails instance CPU, Redis, etc.).
    • ⛔ Not sure this explains why the transaction would be open, so I'm discarding this too
  • Feature issue referenced in !219385 (merged): #580301 (closed)
  • Rollout issue referenced in !219385 (merged): #583505 (closed)
    • ⛔ Don't think this is related at all. If anything, that change removed work from the transaction, so it improved things 😅
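
To make the winning hypothesis concrete, here is a minimal, generic Active Record sketch (not GitLab code; the model names and the sleep are stand-ins): save! opens one implicit transaction and keeps it open while Ruby-side callbacks and autosaved associations do their work, which is exactly when the backend shows idle in transaction between INSERTs.

# Generic illustration only - Pipeline/Build are illustrative models.
class Build < ActiveRecord::Base; end

class Pipeline < ActiveRecord::Base
  has_many :builds, autosave: true # builds are INSERTed inside the same save! transaction

  after_save :heavy_ruby_work # runs inside the implicit transaction, before COMMIT

  def heavy_ruby_work
    sleep 5 # stand-in for building attributes, tokens, instrumentation, etc.
  end
end

pipeline = Pipeline.new(builds: Array.new(500) { Build.new })
pipeline.save! # one transaction wraps the pipeline INSERT, the callback, and every build INSERT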

Implementation

Split the transaction so it's only open during INSERTs, not during Ruby processing. I'm going to use insert_all directly (which BulkInsertSafe wraps).

The MRs

  1. Move callbacks to the chain - !222881 (merged)
  2. Move these to an async worker: !224351 (merged) - in review
  3. Move webhooks to an async worker (feature flagged!) - !223916 (merged) in review
  4. Association data - in dev
  5. Bulk Insert - MEGACHONK, in dev 🐈
    1. I will need to do some UI testing here and might even require some FE assistance, but let's see how this goes first. There might be some good opportunities for improvements

Details

The end solution will be a three-phase approach to what is now done in a single transaction:

  1. Persist a pipeline to obtain the ID for foreign key references.
  2. Prepare association data - this will be done OUTSIDE of the transaction and will include all the heavy lifting:
    • Building all stage/build attributes in memory
    • Running validations
    • Generating tokens
  3. THEN we bulk insert, maintaining referential integrity. We only open the transaction for the insert and close it immediately after:

# Phase 1: persist the pipeline on its own, just long enough to obtain an ID.
ActiveRecord::Base.transaction do
  pipeline.save!  # Here the pipeline will have status='created'
end

# Phase 2: the megachonk happens outside of any transaction.
builds_attrs = prepare_attributes(statuses)

# Phase 3: a short-lived transaction only around the bulk insert.
begin
  ActiveRecord::Base.transaction do
    Ci::Build.insert_all!(builds_attrs)
  end
rescue StandardError => e # or whatever, this is pseudocode
  # If the bulk insert fails, we clean up to keep current behavior
  pipeline.delete  # Remove the empty pipeline
  error("Failed to persist pipeline builds: #{e}")
end

Concerns I thought about and their solutions


Failing bulk inserts - Keep current behavior, and although a pipeline will be persisted in this case for ID purposes, we will delete it. It's an excellent trade-off IMO - having to quickly delete a pipeline vs keeping a transaction open for 10+ minutes

Callback handling - I've been auditing these and we have three:

  1. Instrumentation - this should be moved to the pipeline chain anyway so let's get rolling on that - MR 1 - ended up moving to a worker as well
  2. Webhooks - I'm in the middle of analyzing this, but I actually think they might be better suited for the pipeline chain. This change I am going to feature flag entirely for safety - MR 2 (rough sketch of the gating below)
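
A rough sketch of what the feature-flagged gating could look like. The flag name, worker class, and run_hooks_inline are assumptions for illustration, not the actual MR code; Feature.enabled? and perform_async are the existing GitLab/Sidekiq APIs.

# Hypothetical gating - names are placeholders, not what the MR ships.
if Feature.enabled?(:async_pipeline_creation_hooks, pipeline.project)
  Ci::PipelineCreationHooksWorker.perform_async(pipeline.id) # hypothetical worker
else
  run_hooks_inline(pipeline) # stand-in for the current synchronous path
end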

Nested associations - for this we can use returning: [:id, :partition_id] on the builds insert to get the IDs needed for the subsequent inserts
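
A small sketch of that, assuming builds_attrs and the (hypothetical) prepare_need_attributes helper are built outside the transaction; insert_all! with returning: works on PostgreSQL and returns an ActiveRecord::Result:

# Insert builds first, returning the keys the child rows need.
result = Ci::Build.insert_all!(builds_attrs, returning: [:id, :partition_id])

# Build the ci_build_needs rows from the returned ids, then insert them.
needs_attrs = result.rows.flat_map do |build_id, partition_id|
  prepare_need_attributes(build_id: build_id, partition_id: partition_id) # hypothetical helper
end

Ci::BuildNeed.insert_all!(needs_attrs) if needs_attrs.any?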

EE cross-database associations - for this we can add pipeline.stages.reload after the bulk insert, before this chain step runs

This is a high-traffic path, so this will be done with feature flags and iterations. The first big change will be MR 2, and that will be behind a feature flag.
