Properly process stale container repository cleanups (!62005) · Merge requests · GitLab.org / GitLab

David Fernandez requested to merge 330322-unstuck-stuck-cleanup-policies into master May 18, 2021

🌵 Context

Cleanup policies are executed by background workers. Their execution can be summarized as getting the tags for the linked container images, applying the parameters of the cleanup policy, and destroying all the tags that the policy selects for removal. This whole process is done using the container registry API (container image tags are not persisted in the rails backend database).

To be able to monitor what is happening with the cleanups, we introduced a cleanup status on container images (those are persisted in the rails backend database) that are processed by cleanup policies:

scheduled: cleanup will be done shortly
ongoing: cleanup is being executed
unfinished: cleanup took so long that it hit a limit and has been stopped

Since the cleanup process can run into different types of errors, such as network timeouts, we introduced a rescue block to make sure that the cleanup status is reset to something other than ongoing. This is important because ongoing cleanups are not considered by background workers when they pick up the next piece of work.

The problem is that Sidekiq can be shut down rather abruptly. The job is effectively "killed" and as such, the rescue block is not executed. This leads to stale ongoing cleanups.
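
To make the failure mode concrete, here is a minimal sketch of the rescue pattern described above. The `Repository` struct, the `run_cleanup` method, and the choice of `:unfinished` as the reset value are assumptions for illustration only; they are not the actual GitLab code.

```ruby
# Hypothetical stand-in for a container repository record (not GitLab's model).
Repository = Struct.new(:cleanup_status)

# Runs the given cleanup block and resets the status if an error is raised.
def run_cleanup(repository)
  repository.cleanup_status = :ongoing
  yield
  repository.cleanup_status = nil # assumption: a finished cleanup clears the status
rescue StandardError
  # Reached for ordinary errors such as a network timeout. If the process is
  # killed outright, this branch never runs and the status stays :ongoing.
  repository.cleanup_status = :unfinished # assumption: any non-ongoing value works
end

repo = Repository.new(nil)
run_cleanup(repo) { raise IOError, "network timeout" }
puts repo.cleanup_status
```

The rescue only helps while the Ruby process is still alive to run it, which is why an abrupt shutdown leaves the record stuck.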

This is issue #330322 (closed).

As part of the background jobs for cleanup policies, we have one that runs roughly every hour. This is the perfect place to add a change: detect stale ongoing cleanups and put them in the unfinished state.

🤔 What does this MR do?

Detect stale ongoing container repository cleanups and put them in the unfinished state.
The "stale" detection is done as follows:
- The cleanup was started more than T + 30.minutes ago. T is a limit we have for the service destroying all the tags. It is set in the application settings.
  - Let's illustrate this with gitlab.com. There we have a limit of 5.minutes for the destruction of all tags. Stale cleanups will therefore be all the cleanups that started more than 35 minutes ago. That's a period long enough to be confident that no job is actually executing the cleanup.
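
The threshold arithmetic above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `stale_cleanup?`, `STALE_BUFFER_SECONDS`, and the `max_execution_seconds` parameter are hypothetical names, and the real change is a database UPDATE rather than an in-memory check.

```ruby
# Hypothetical helper, not GitLab's actual implementation.
STALE_BUFFER_SECONDS = 30 * 60 # the fixed 30-minute margin from the description

# Returns true when a cleanup that started at `started_at` counts as stale.
# `max_execution_seconds` stands in for T, the application-setting limit.
def stale_cleanup?(started_at, max_execution_seconds, now: Time.now)
  started_at < now - (max_execution_seconds + STALE_BUFFER_SECONDS)
end

now = Time.at(1_000_000) # fixed reference time so the example is deterministic
t = 5 * 60               # the 5-minute limit used on gitlab.com

puts stale_cleanup?(now - 36 * 60, t, now: now) # started 36 minutes ago
puts stale_cleanup?(now - 34 * 60, t, now: now) # started 34 minutes ago
```

With T set to 5 minutes, the first call reports a stale cleanup (past the 35-minute threshold) and the second does not.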

🖼 Screenshots (strongly suggested)

n / a

⛓ Does this MR meet the acceptance criteria?

Conformity

I have included a changelog entry, or it's not needed. (Does this MR need a changelog?)
[-] I have added/updated documentation, or it's not needed. (Is documentation required?)
I have properly separated EE content from FOSS, or this MR is FOSS only. (Where should EE code go?)
I have added information for database reviewers in the MR description, or it's not needed. (Does this MR have database related changes?)
I have self-reviewed this MR per code review guidelines.
This MR does not harm performance, or I have asked a reviewer to help assess the performance impact. (Merge request performance guidelines)
I have followed the style guides.

Availability and Testing

[-] I have added/updated tests following the Testing Guide, or it's not needed. (Consider all test levels. See the Test Planning Process.)
[-] I have tested this MR in all supported browsers, or it's not needed.
[-] I have informed the Infrastructure department of a default or new setting change per definition of done, or it's not needed.

Security

Does this MR contain changes to processing or storing of credentials or tokens, authorization and authentication methods or other items described in the security review guidelines? If not, then delete this Security section.

[-] Label as security and @ mention @gitlab-com/gl-security/appsec
[-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
[-] Security reports checked/validated by a reviewer from the AppSec team

💽 Database review

We have a single UPDATE query: !62005 (comment 578411092)

Edited May 18, 2021 by David Fernandez
