Race condition: Policy evaluation via PipelineFinishedEvent runs before security findings are ingested (#594864) · Issues · GitLab.org / GitLab

Race condition: Policy evaluation via PipelineFinishedEvent runs before security findings are ingested

## Summary There is a race condition between security policy evaluation (triggered by `Ci::PipelineFinishedEvent`) and security findings ingestion (triggered by `Ci::JobSecurityScanCompletedEvent`). The policy evaluation path (`ProcessPipelineCompletionWorker`) can run before `Security::Finding` rows are written by the ingestion path (`StoreScansWorker` → `StoreGroupedScansService`), causing incorrect approval rule evaluation. This results in merge requests being added to merge trains with no policy violations detected, only to be removed later when the merge train pipeline finishes and re-evaluates with the now-ingested findings. **Reported in:** https://gitlab.com/gitlab-com/request-for-help/-/work_items/4232 **Related MRs that contributed to this behavior:** - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/204396 (removed `SyncFindingsToApprovalRulesWorker` call from `StoreScansService`) - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/204358 (sync MR approval policy approvals after pipeline completes) - https://gitlab.com/gitlab-org/gitlab/-/merge_requests/215274 (changed `StoreScansWorker` to trigger on security build completion instead of pipeline completion) ## Root Cause Analysis (From [investigation by @mcavoj](https://gitlab.com/gitlab-com/request-for-help/-/work_items/4232#note_3191423728)) Two independent paths are triggered when a pipeline with security scans completes: **PATH 2 — Pipeline Finished (Event Store):** ``` Pipeline transitions to completed/manual status │ ▼ (run_after_commit) Ci::PipelineFinishedEvent published │ ▼ (EventStore subscription) ProcessPipelineCompletionWorker │ guards: licensed? && can_store_security_reports? ▼ SyncFindingsToApprovalRulesWorker.perform_async(pipeline_id) │ ▼ SyncFindingsToApprovalRulesService#execute │ guard: complete_or_manual? || has_security_findings? ▼ SyncMergeRequestApprovalsWorker.perform_async(pipeline_id, mr_id) (for each open MR with matching HEAD SHA) No dependency on ingestion. Triggered by PipelineFinishedEvent alone. ``` **PATH 3 — Security Reports Ingested:** ``` Security job completes │ ▼ (run_after_commit) Ci::JobSecurityScanCompletedEvent │ ▼ Scans::IngestReportsWorker │ guard: all_security_jobs_complete? ▼ StoreScansWorker │ ▼ StoreScansService │ StoreGroupedScansService → writes Security::Finding rows │ schedule_store_reports_worker ▼ StoreSecurityReportsByProjectWorker │ ▼ Ingestion::IngestReportsService#execute │ step: sync_findings_to_approval_rules ▼ SyncFindingsToApprovalRulesWorker.perform_async(pipeline_id) │ ▼ SyncMergeRequestApprovalsWorker.perform_async(pipeline_id, mr_id) ``` **The race condition:** PATH 2 can evaluate policy before PATH 3 has written `Security::Finding` rows. PATH 2 sees zero findings → concludes no violations → approvals set to 0 (not required). The MR appears clean and is added to the merge train. When the merge train pipeline later finishes, PATH 2 fires again. This time, `RelatedPipelinesFinder` for the `merged_result_pipeline` searches by `source_sha`, which matches the branch pipeline's `sha`. By now, PATH 3 has completed ingestion, so `Security::Finding` rows exist → violations detected → approval required → MR is removed from the merge train. **The core design gap:** `ProcessPipelineCompletionWorker` (PATH 2) and `IngestReportsService` (PATH 3) have no ordering guarantee. PATH 2 evaluates policy based on `Security::Finding` rows that are written by the ingestion chain in PATH 3. There is nothing that makes PATH 2 wait for PATH 3's data to be present. ## Reproduction @mcavoj was able to reproduce this by adding an artificial delay to `Security::StoreScansWorker` to simulate production conditions where ingestion is slower than pipeline completion event processing. The result matches the customer-reported behavior exactly. ## Proposed Implementation (From [proposal by @sashi_kumar](https://gitlab.com/gitlab-com/request-for-help/-/work_items/4232#note_3192488493), confirmed by @mcavoj) 1. **Create a new `Security::ReportsIngestedEvent`** — published by `StoreScansService` after `Security::Finding` rows have been written by `StoreGroupedScansService`. 2. **Subscribe `ProcessPipelineCompletionWorker` to `Security::ReportsIngestedEvent`** — so policy evaluation is triggered after ingestion is complete. 3. **Guard `ProcessPipelineCompletionWorker` against premature evaluation** — when triggered by `PipelineFinishedEvent`, skip if the pipeline has security scans still being ingested. The ingestion chain will publish `ReportsIngestedEvent` when complete. ### Proposed diff ```diff diff --git a/ee/app/events/security/reports_ingested_event.rb b/ee/app/events/security/reports_ingested_event.rb new file mode 100644 --- /dev/null +++ b/ee/app/events/security/reports_ingested_event.rb @@ -0,0 +1,15 @@ +# frozen_string_literal: true + +module Security + class ReportsIngestedEvent < ::Gitlab::EventStore::Event + def schema + { + 'type' => 'object', + 'required' => %w[pipeline_id], + 'properties' => { + 'pipeline_id' => { 'type' => 'integer' } + } + } + end + end +end diff --git a/ee/app/services/security/store_scans_service.rb b/ee/app/services/security/store_scans_service.rb --- a/ee/app/services/security/store_scans_service.rb +++ b/ee/app/services/security/store_scans_service.rb @@ publish after Security::Finding rows have been written + publish_reports_ingested_event + + def publish_reports_ingested_event + ::Gitlab::EventStore.publish( + Security::ReportsIngestedEvent.new(data: { pipeline_id: pipeline.id }) + ) + end diff --git a/ee/app/workers/security/scan_result_policies/process_pipeline_completion_worker.rb --- a/ee/app/workers/security/scan_result_policies/process_pipeline_completion_worker.rb +++ b/ee/app/workers/security/scan_result_policies/process_pipeline_completion_worker.rb @@ guard against premature evaluation + return if event.is_a?(::Ci::PipelineFinishedEvent) && Security::Scan.processing?(pipeline) + diff --git a/ee/lib/gitlab/event_store/subscriptions/security_subscriptions.rb --- a/ee/lib/gitlab/event_store/subscriptions/security_subscriptions.rb +++ b/ee/lib/gitlab/event_store/subscriptions/security_subscriptions.rb @@ subscribe to new event + store.subscribe ::Security::ScanResultPolicies::ProcessPipelineCompletionWorker, + to: ::Security::ReportsIngestedEvent ``` ### Open questions from @mcavoj - Do we still need to trigger `ProcessPipelineCompletionWorker` based on `PipelineFinishedEvent`? - Is there any scenario where we need to trigger `SyncFindingsToApprovalRulesWorker` when `Security::Scan.processing?(pipeline)`? - If a pipeline has scans, it may be sufficient to trigger only based on `ReportsIngestedEvent`. If it has no scans, `UnenforceablePolicyRulesPipelineNotificationWorker` (triggered [here](https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/models/ee/ci/pipeline.rb#L82)) already updates approval rules and unblocks the merge check. That part could potentially be moved to `ProcessPipelineCompletionWorker`. ## Impact - Merge requests are intermittently removed from merge trains after the merge train pipeline succeeds, with the error "Merge request is not mergeable" - There is also a window where a fast merge could bypass security policy (false negative on policy evaluation) - Affects customers using merge trains with merge request approval policies - Reported by multiple users across multiple projects

issue