Container image cleanup can get stuck in the "ongoing" status
🔥 Problem
Cleanup statuses are important for the background workers that process cleanups: workers use those statuses to locate the next cleanup to execute.
We've seen a situation where a cleanup was stuck in the "ongoing" status. That status tells workers that the cleanup is being handled by another worker, so cleanups in that state are never considered for pickup.
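To make the pickup logic concrete, here is a minimal sketch, assuming a Rails enum on `expiration_policy_cleanup_status` with the status names used elsewhere in this issue; this is not the actual query, just an illustration of why an "ongoing" row becomes invisible:

```ruby
# Minimal sketch (assumed scope and status names, not the actual code):
# how a worker might locate the next cleanup to run. Rows in the
# :cleanup_ongoing status never match this query, so a repository stuck
# in that status is never picked up by any worker.
ContainerRepository
  .where(expiration_policy_cleanup_status: %i[cleanup_unscheduled cleanup_unfinished])
  .order(:expiration_policy_started_at)
  .first
```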
The root cause is not clear, but from #328860 (comment 568575485):
> We do stop Sidekiq sometimes: during a deploy, or when Sidekiq exceeds a certain memory limit. We don't exactly do a super graceful shutdown of Sidekiq (gitlab-com/gl-infra/delivery#603), and jobs don't always finish, so a long-running job like this one could suffer from a hard shutdown.
We currently have a global `rescue`, so that workers put the cleanup back in a resumable state in case of any error. What is described above means that an ongoing cleanup process can be killed without any prior notice; in that case, the `rescue` is never executed.
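As a sketch of the failure mode (the method names here are assumptions, not the actual implementation): the global `rescue` only runs when Ruby raises an exception, so a hard kill of the Sidekiq process skips it entirely:

```ruby
# Sketch of the failure mode (hypothetical names). The rescue resets the
# status for ordinary errors, but a SIGKILL terminates the process
# immediately: no exception is raised, the rescue never runs, and the
# repository stays in :cleanup_ongoing forever.
def perform(container_repository_id)
  repository = ContainerRepository.find(container_repository_id)
  repository.cleanup_ongoing!   # mark as picked up
  execute_cleanup(repository)   # long-running work; may be hard-killed here
rescue StandardError
  repository.cleanup_unfinished! # resumable state; skipped on a hard kill
end
```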
🚒 Solution
Instead, we could probably clean up after ourselves on a subsequent run, perhaps in the cron job?
```ruby
ContainerRepository
  .cleanup_ongoing
  .where('expiration_policy_started_at < ?', 30.minutes.ago)
  .update_all(expiration_policy_cleanup_status: :cleanup_unfinished)
```
I think this will be okay to do in a single update statement, since the number of container repositories in this state would be at most the configured capacity.
Basically, use the cron job to run a consistency check and "fix" cleanups stuck in the wrong state.
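A minimal sketch of what that could look like, assuming the cron worker runs the query above before scheduling new cleanups (the worker name and the 30-minute threshold are illustrative, not confirmed):

```ruby
# Hypothetical cron worker sketch: reset stale "ongoing" cleanups to a
# resumable state before scheduling the next batch. Names are assumed.
class ContainerExpirationPolicyCronWorker
  include Sidekiq::Worker

  STALE_ONGOING_THRESHOLD = 30.minutes

  def perform
    reset_stale_ongoing_cleanups
    # ... existing scheduling logic ...
  end

  private

  # Any cleanup still marked ongoing after the threshold is assumed to
  # have been hard-killed, so it is moved back to :cleanup_unfinished
  # and becomes eligible for pickup again.
  def reset_stale_ongoing_cleanups
    ContainerRepository
      .cleanup_ongoing
      .where('expiration_policy_started_at < ?', STALE_ONGOING_THRESHOLD.ago)
      .update_all(expiration_policy_cleanup_status: :cleanup_unfinished)
  end
end
```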