Spike: Race condition in ProcessScanResultPolicyWorker
There is a race condition in the ProcessScanResultPolicyWorker. See #383235 (closed) for details on the condition.
In !110020 (merged) we attempted to mitigate the condition by locking on (project, policy_configuration) combinations. This failed for the following reason.
If a project in a group hierarchy is targeted by multiple security configurations, multiple workers can execute concurrently for the same project, e.g. (project, policy_config_1), (project, policy_config_2), etc.
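The following minimal sketch illustrates why the finer-grained lock does not serialize these workers. The helper names and key formats are hypothetical and are not taken from the GitLab codebase; the point is only that two jobs for the same project produce different lock keys when the policy configuration is part of the key.

```ruby
# Hypothetical lock-key helpers; names and key format are illustrative only.
def config_scoped_lock_key(project_id, policy_configuration_id)
  "process_scan_result_policy:#{project_id}:#{policy_configuration_id}"
end

def project_scoped_lock_key(project_id)
  "process_scan_result_policy:#{project_id}"
end

# Two jobs for the same project, triggered by different policy configurations.
jobs = [
  { project_id: 1, policy_configuration_id: 10 },
  { project_id: 1, policy_configuration_id: 20 }
]

config_keys = jobs.map { |job| config_scoped_lock_key(job[:project_id], job[:policy_configuration_id]) }.uniq
project_keys = jobs.map { |job| project_scoped_lock_key(job[:project_id]) }.uniq

# Each distinct key is a separate lock, so jobs holding different keys
# do not exclude each other.
puts "config-scoped locks:  #{config_keys.size}"
puts "project-scoped locks: #{project_keys.size}"
```

With config-scoped keys both jobs acquire their own lock and run at the same time; only a project-scoped key would make the second job wait.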
When ProcessScanResultPolicyWorker executes, the following steps run in different transactions:
1. all existing approval project/merge request rules are deleted for the project (ScanResultPolicy#delete_scan_finding_rules_for_project)
2. new project approval rules are created
3. for every open MR in the project, merge request approval rules are created from the project rules
In step (3) we use ActiveRecord's first_or_initialize to update existing merge request approval rules. In one transaction this succeeds, but a concurrent transaction does not see the already existing rule and instead tries to recreate it, which leads to ActiveRecord::RecordInvalid.
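The interleaving can be reproduced without a database. The sketch below is a simplified model, not the worker's actual code: RuleStore is a hypothetical in-memory stand-in for the rules table, its RecordInvalid class stands in for ActiveRecord::RecordInvalid, and it assumes the failure comes from a uniqueness check that runs when the record is saved. Both workers perform the lookup half of a find-or-initialize before either one saves.

```ruby
# Hypothetical in-memory stand-in for a table with a uniqueness check on the rule name.
class RuleStore
  # Stand-in for ActiveRecord::RecordInvalid.
  class RecordInvalid < StandardError; end

  def initialize
    @names = []
  end

  def exists?(name)
    @names.include?(name)
  end

  # Mimics a save that validates uniqueness at write time.
  def insert!(name)
    raise RecordInvalid, "Name has already been taken" if exists?(name)

    @names << name
  end

  def count
    @names.size
  end
end

store = RuleStore.new

# Lookup phase: both workers check before either has written anything.
worker_a_sees_rule = store.exists?("scan_finding_rule")
worker_b_sees_rule = store.exists?("scan_finding_rule")

# Save phase: each worker acts on its now stale lookup result.
errors = []
[worker_a_sees_rule, worker_b_sees_rule].each do |sees_rule|
  next if sees_rule # would update the existing rule instead of creating one

  begin
    store.insert!("scan_finding_rule")
  rescue RuleStore::RecordInvalid => e
    errors << e.message
  end
end

puts "rules stored: #{store.count}"
puts "failed saves: #{errors.size}"
```

The first save succeeds and the second fails, because the second worker decided to create the rule based on a lookup that was already out of date when it saved.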
If we locked on project instead of on (project, policy_configuration) combinations, workers for the same project would have to wait for one another, so we would need to increase the lock acquisition timeout. This would in turn increase the average worker duration. As the worker already has a rather high duration, both the timeout and the average duration would increase considerably.
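A rough back-of-the-envelope sketch of this trade-off follows. The job duration and the number of concurrent jobs are hypothetical figures chosen only to show the arithmetic; they are not measurements of the worker.

```ruby
# Hypothetical figures, chosen only to illustrate the arithmetic.
job_duration_seconds = 30
concurrent_jobs = 4 # e.g. four policy configurations targeting one project

# With a single lock per project the jobs run one after another:
# the job at queue position i waits for the i jobs ahead of it.
wait_times = (0...concurrent_jobs).map { |position| position * job_duration_seconds }

# The lock acquisition timeout must cover the longest wait.
required_timeout = wait_times.max

# Average duration per job, counting both waiting and working time.
total_time = wait_times.sum + concurrent_jobs * job_duration_seconds
average_duration = total_time.to_f / concurrent_jobs

puts "lock timeout must cover at least #{required_timeout}s"
puts "average duration including waiting: #{average_duration}s"
```

Under these assumptions the required timeout grows linearly with the number of jobs contending for the same project, and the average duration grows with it.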
The outcome of this spike should be a recommendation for removing the race condition.