Fix long-lived CI DB transactions during MR pipeline creation (idle in transaction; bulk inserts to ci_build_needs/build_sources)
**Please read down to the bottom of the description for the investigation results and implementation plan**
## Summary
We are observing long-lived transactions on the **CI** database during merge request pipeline creation. The backend session frequently shows **`idle in transaction`** with `wait_event_type=Client`, `wait_event=ClientRead`, while the last query is an `INSERT` into association tables such as `ci_build_needs` and `p_ci_build_sources`.
This appears to be application-side time spent between SQL statements while a transaction remains open, rather than the `INSERT` itself being slow.
The timing of this correlates with CPU saturation starting to tick up. Sessions stuck in `idle in transaction` can contribute to CPU saturation in PostgreSQL.
## Example signals (gprd)
From `pg_stat_activity` samples (same backend PID observed repeatedly):
- `endpoint_id`: `MergeRequests::CreatePipelineWorker`
- `db_config_name`: `ci`
- `state`: `idle in transaction`
- `wait_event`: `ClientRead`
- long `xact_age_s` (example observed: \~420s)
- queries observed within the open transaction include:
- `INSERT INTO ci_build_needs (name, artifacts, optional, build_id, partition_id, project_id) VALUES ...`
- `INSERT INTO p_ci_build_sources (build_id, partition_id, project_id, source) VALUES ... RETURNING build_id`
From Sidekiq structured log for the same job:
- `class`: `MergeRequests::CreatePipelineWorker`
- `duration_s`: \~1271s
- CI DB transaction metrics:
- `db_ci_txn_count`: 1
- `db_ci_txn_duration_s`: \~440s (max matches)
A separate sample also showed a bare `BEGIN` followed by `idle in transaction`, suggesting at least some transactions are opened and then the app pauses before issuing the next statement.
## Impact
Long-lived transactions can:
- hold locks longer than expected
- increase bloat / vacuum pressure
- amplify contention in high-churn CI partitioned tables (e.g. `ci_build_needs`, build sources)
## Notes / context
- The observed long transaction is _not_ explained by DB lock waits (client wait event is `ClientRead`).
- Recent related work:
- [!219385 (merged)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/219385) removed an explicit `pipeline.transaction` block from `Gitlab::Ci::Pipeline::Chain::Create` and removed tag bulk insertion in that area. The long transaction we see is dominated by inserts into `ci_build_needs` / build sources, so it may be unrelated.
- Introduction MR referenced in [!219385 (merged)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/219385): [!215235 (merged)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/215235)
## Hypotheses to validate - Done :white_check_mark:
- Transaction scope (implicit) around `pipeline.save!` includes significant Ruby-side work (building batches of associations, callbacks, etc.), causing gaps between statements.
* :sound: :white_check_mark: **THIS IS THE WINNER**
- Potential edge case where a transaction is opened earlier than intended and spans non-DB work.
  * :no_entry: I don't think so, because:
- Transaction opened at the right time (during `pipeline.save`)
- Problem wasn't when it opened, but what was included inside it
- Potential exception/ensure path leaving the connection in a transaction longer than expected (less likely given txn_count=1, but still worth checking).
* :no_entry: `txn_count=1` confirmed single transaction
- Could be that the Ruby steps are blocked on other strained resources (Rails instance CPU, Redis, etc.).
  * :no_entry: Not sure this explains why the transaction would stay open, so I'm discarding this too
- Feature issue referenced in [!219385 (merged)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/219385): https://gitlab.com/gitlab-org/gitlab/-/issues/580301
- Rollout issue referenced in [!219385 (merged)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/219385): https://gitlab.com/gitlab-org/gitlab/-/issues/583505
  * :no_entry: Don't think this is related at all. If anything, this removed work from the transaction, so it improved things :sweat_smile:
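The winning hypothesis (Ruby-side work between statements keeping the transaction open) can be illustrated with a tiny no-database simulation. All names and sleep durations below are illustrative stand-ins, not real timings:

```ruby
# Simulate a transaction that stays open while Ruby-side work runs
# between two fast "statements". The `sleep`s stand in for a quick
# INSERT and for heavy attribute-building/callback work respectively.
def transaction
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

fast_insert     = -> { sleep 0.01 } # stand-in for a quick INSERT
heavy_ruby_work = -> { sleep 0.2 }  # stand-in for building attrs, callbacks, tokens

open_for = transaction do
  fast_insert.call
  heavy_ruby_work.call # connection is "idle in transaction" during this
  fast_insert.call
end
# open_for is dominated by the Ruby work, not the statements themselves
```

This matches what `pg_stat_activity` showed: fast `INSERT`s, but a backend that sits in `ClientRead` between them while the transaction age keeps climbing.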
## Implementation
Split the transaction so it's only open during INSERTs, not during Ruby processing. I'm going to use `insert_all` directly (which `BulkInsertSafe` wraps).
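Since `insert_all` takes an array of plain attribute hashes, the heavy lifting reduces to building hashes outside the transaction and then issuing one multi-row `INSERT` per batch. A sketch of the batching shape (batch size and attribute names here are assumptions, not the final values):

```ruby
# Build attribute hashes outside any transaction, then slice them into
# batches; each batch would become a single multi-row INSERT issued
# inside a short-lived transaction.
BATCH_SIZE = 500 # assumed; the real batch size is a tuning decision

build_attrs = (1..1200).map do |i|
  { name: "job-#{i}", project_id: 1, partition_id: 100 }
end

batches = build_attrs.each_slice(BATCH_SIZE).to_a
# Each batch maps to one statement, e.g.:
#   Ci::Build.transaction { Ci::Build.insert_all!(batch) }
batches.size # => 3
```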
The MRs
1. Move callbacks to the chain - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/222881
2. Move these to an async worker: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/224351 - in review
3. Move webhooks to async worker (feature flagged!) - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/223916 in review
4. Association data - in dev
5. Bulk Insert - MEGACHONK, in dev :cat2:
   1. I will need to do some UI testing here and might even require some FE assistance, but let's see how this goes first. There might be some good opportunities for improvements
## Details
The end solution will be a three-phased approach to what is now done in one single transaction:
1. Persist a pipeline to obtain the ID for foreign key references.
2. Prepare association data - this will be done OUTSIDE of the transaction and will include all the heavy lifting:
- Building all stage/build attributes in memory
- Running validations
- Generating tokens
3. THEN we bulk insert, maintaining referential integrity. We only open the transaction for the inserts and close it immediately after.
```ruby
# Pseudocode; names are illustrative, not the final implementation
def persist_pipeline_and_builds(pipeline, statuses)
  ActiveRecord::Base.transaction do
    pipeline.save! # Here the pipeline will have status='created'
  end

  builds_attrs = prepare_attributes(statuses) # the megachonk, outside the transaction

  ActiveRecord::Base.transaction do
    Ci::Build.insert_all!(builds_attrs)
  end
rescue StandardError => e # or whatever, this is pseudocode
  # If the bulk insert fails, we clean up to keep current behavior
  pipeline.delete # Remove the empty pipeline
  error("Failed to persist pipeline builds: #{e}")
end
```
### Concerns I thought about and their solutions
---
**Failing bulk inserts -** Keep current behavior, and although a pipeline will be persisted in this case for ID purposes, we will delete it. It's an excellent trade-off IMO - having to quickly delete a pipeline vs keeping a transaction open for 10+ min
**Callback handling -** I've been auditing these and we have three:
1. Instrumentation - this should be moved to the pipeline chain anyway so let's get rolling on that - **MR 1 - ended up moving to a worker as well**
2. Webhooks - I'm in the middle of analyzing this, but I actually think they might be better suited for the pipeline chain. This change I am going to feature flag entirely for safety - **MR 2**
**Nested associations -** for this we can use `returning: [:id, :partition_id]` on the builds insert to get IDs for the subsequent inserts
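A no-database sketch of that mapping step, assuming the `returning` rows come back in input order (true for a plain multi-row PostgreSQL `INSERT ... RETURNING`); all shapes and names here are illustrative:

```ruby
# Rows as insert_all! would return them with returning: [:id, :partition_id]
returned = [
  { "id" => 101, "partition_id" => 100 },
  { "id" => 102, "partition_id" => 100 }
]

# Pending child rows (e.g. ci_build_needs), keyed by the parent's input index
needs_by_index = { 0 => [{ name: "build" }], 1 => [] }

# Attach the returned foreign keys to each child row before the next bulk insert
need_rows = returned.each_with_index.flat_map do |row, i|
  needs_by_index[i].map do |need|
    need.merge(build_id: row["id"], partition_id: row["partition_id"])
  end
end
# need_rows is now ready for something like Ci::BuildNeed.insert_all!(need_rows)
```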
**EE cross-database associations -** for this we can add `pipeline.stages.reload` after bulk insert, before this chain step runs
This is a high-traffic path, so this will be done behind feature flags and in iterations. The first big change will be MR 2, and that will be feature flagged.