
Rescue stuck resource groups

Shinya Maeda requested to merge rescue-stuck-resource-group into master

What does this MR do?

As we discovered in the bug ticket, a resource group can get stuck when a build fails with data_integrity_failure. That failure path does not run the Ci::Build state machine hooks that release the resource in normal operation.

This MR adds a check to the Sidekiq worker that releases resources still held by stale builds.
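
The check described above can be illustrated with a minimal, framework-free sketch. It is not the code in this MR: `Build`, `Resource`, `stale?`, and `release_stale_resources` are hypothetical stand-ins for the real models and worker logic, the status list is an assumed copy of `Ci::HasStatus::COMPLETED_STATUSES`, and the one-day grace period simply mirrors the `1.day.ago` cutoff used in the console queries in this description.

```ruby
# Hypothetical stand-ins for Ci::Build and Ci::Resource (not GitLab's models).
# Assumption: this list mirrors Ci::HasStatus::COMPLETED_STATUSES.
COMPLETED_STATUSES = %w[success failed canceled skipped].freeze
ONE_DAY = 24 * 60 * 60 # seconds

Build = Struct.new(:id, :status, :failure_reason, :updated_at, keyword_init: true)

Resource = Struct.new(:id, :build, keyword_init: true) do
  # A resource is retained while a build still holds it.
  def retained?
    !build.nil?
  end

  # Free the resource so the next waiting build can take it.
  def release!
    self.build = nil
  end
end

# A resource is stale when its holder has already completed
# and has not been updated within the grace period.
def stale?(resource, now:, grace_period:)
  build = resource.build
  return false if build.nil?

  COMPLETED_STATUSES.include?(build.status) && build.updated_at < now - grace_period
end

# Release every stale resource and return the ids that were released.
def release_stale_resources(resources, now: Time.now, grace_period: ONE_DAY)
  stale = resources.select { |r| stale?(r, now: now, grace_period: grace_period) }
  stale.each(&:release!)
  stale.map(&:id)
end

now = Time.now
resources = [
  # Completed two days ago without releasing: stale.
  Resource.new(id: 1, build: Build.new(id: 10, status: "failed",
                                       failure_reason: "data_integrity_failure",
                                       updated_at: now - 2 * ONE_DAY)),
  # Still running: legitimately holds the resource.
  Resource.new(id: 2, build: Build.new(id: 11, status: "running",
                                       failure_reason: nil,
                                       updated_at: now - 2 * ONE_DAY)),
  # Completed a minute ago: inside the grace period, left alone.
  Resource.new(id: 3, build: Build.new(id: 12, status: "failed",
                                       failure_reason: "script_failure",
                                       updated_at: now - 60))
]

puts release_stale_resources(resources, now: now).inspect # prints [1]
```

The grace period in this sketch keeps the periodic check from racing with a build that has only just completed and whose regular hooks may still be about to release the resource.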

Stuck resource groups on gitlab.com

There are currently 31 jobs stuck on gitlab.com:

[ gprd ] production> Ci::Resource.joins(:processable).where('ci_builds.status IN (?)', Ci::HasStatus::COMPLETED_STATUSES).where('ci_builds.updated_at < ?', 1.day.ago).count
=> 31

All of these jobs failed with data_integrity_failure:

[ gprd ] production> Ci::Resource.joins(:processable).where('ci_builds.status IN (?)', Ci::HasStatus::COMPLETED_STATUSES).where('ci_builds.updated_at < ?', 1.day.ago).distinct.pluck('ci_builds.failure_reason')
=> ["data_integrity_failure"]

Related #335537 (closed)

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by Shinya Maeda
