Cannot obtain an exclusive lease for ci/pipeline_processing/atomic_processing_service::pipeline_id:xxxxx

Summary

GitLab generates the following error, and it should not:

Cannot obtain an exclusive lease for
ci/pipeline_processing/atomic_processing_service::pipeline_id:79707.
There must be another instance already in execution.

It has been investigated several times as the potential root cause of a problem with GitLab, and in every case this has only delayed resolution of the actual issue.

Ignore this error. This bug issue has been raised to request that the product stop generating it.
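For context: as far as I can tell, the lease is a Redis key with a TTL, taken with a set-if-absent operation, and the error is logged whenever the key already exists. Whether a given lease is currently held can be checked directly in Redis, for example (key prefix and socket path here are assumptions for an Omnibus install, not values taken from the affected instance):

# -2 means the key does not exist (no lease held); a positive number is the seconds remaining on the lease.
/opt/gitlab/embedded/bin/redis-cli -s /var/opt/gitlab/redis/redis.socket \
  TTL "gitlab:exclusive_lease:ci/pipeline_processing/atomic_processing_service::pipeline_id:79707"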

Additional details

During a customer emergency call, one of the log entries which we acted upon was:

{
  "severity": "ERROR",
  "time": "2021-05-06T08:09:46.007Z",
  "correlation_id": "01F50BKFPPTX822RCGGAB7V37B",
  "message": "Cannot obtain an exclusive lease for ci/pipeline_processing/atomic_processing_service::pipeline_id:79707. There must be another instance already in execution."
}

These entries appear in application.log / application_json.log.
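If it helps with triage, these entries can be pulled out of the structured log with jq, for example (log path assumed for an Omnibus install):

# Print only the log entries reporting a lease conflict.
jq 'select(.message != null and (.message | contains("Cannot obtain an exclusive lease")))' \
  /var/log/gitlab/gitlab-rails/application_json.log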

Merge requests were failing, and I thought this could be related, as MRs need to know pipeline status.

We shut down Rails and Sidekiq and ran:

sudo gitlab-rake gitlab:exclusive_lease:clear

This wasn't the cause of the issue, which was completely unrelated to Category:Continuous Integration. Post-emergency, the customer's instance is working OK, with no report of issues with pipelines, but the log entries continue.
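For anyone following the same path: before clearing leases wholesale, the lease keys currently held can be listed to see whether a pipeline-processing lease is actually stuck (key prefix and socket path are assumptions for an Omnibus install):

# Lists any exclusive-lease keys for pipeline processing; leases are short-lived, so an empty result is normal.
/opt/gitlab/embedded/bin/redis-cli -s /var/opt/gitlab/redis/redis.socket \
  --scan --match 'gitlab:exclusive_lease:ci/pipeline_processing/*'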

The purpose of this issue is to establish whether there is an issue to address within ~"group::continuous integration".

GitLab team members can read more in the ticket relating to the emergency (which concerned an upgrade from %13.7 to %13.11), and in some other customers' tickets and messages where we found the same errors, to try and track down when it might have started (links available internally to GitLab team members).

Related issues

This customer is running Geo, and so #212756 (closed) looks tempting; however, that is for Geo::MetricsUpdateWorker, not for ci/pipeline_processing/atomic_processing_service.

There's also a problem with similar errors on gitlab.com, raised in #326030 (closed). However, the log entry is different (it doesn't include the entity with the lease conflict), and cross-checking the correlation ID, I see AuthorizedProjectsWorker. This seems consistent with the analysis on that issue.
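For reference, the cross-check is just a correlation-ID search; on a self-managed instance the equivalent would be something along these lines (log paths assumed for an Omnibus install, and field names vary between logs):

# Pull every log line carrying the correlation ID and show a few common fields; absent fields come back as null.
grep -h '01F50BKFPPTX822RCGGAB7V37B' \
  /var/log/gitlab/gitlab-rails/application_json.log /var/log/gitlab/sidekiq/current \
  | jq '{time, severity, class, message}'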

Steps to reproduce

Unknown

Example Project

n/a - the error doesn't seem project-specific.

What is the current bug behavior?

This lease log entry doesn't appear to adversely affect the product, but, as with the similar Geo example #212756 (closed), because it's flagged ERROR both customers and support engineers have focused on it during multiple investigations.

Resolution of the outage that triggered this customer's emergency, and of a number of other investigations, has been delayed by chasing this error.

What is the expected correct behavior?

These lease log entries are not produced.

Relevant logs and/or screenshots

See above.

Output of checks

This happens on gitlab.com:

https://log.gprd.gitlab.net/goto/0e7b90ccee2a1a3ea83164a51b4c2253

Possible fixes
