Container repository cleanups can get stuck in the ongoing state
From #328860 (comment 568539611):
Do you have an idea on how it ended up in this state? We'll need to account for those jobs getting killed for whatever reason, wouldn't we?
@reprazent I'm currently investigating that part.
The current code flow is (a rough sketch follows the list):

1. The worker selects a container image and updates the cleanup status to `ongoing` (here).
2. The selected container image is sent to the related service for cleanup (here).
3. The cleanup service has a `rescue` that puts the container image cleanup in the `unfinished` state (here).
I think that there are only two possible explanations:

1. (3.) did not rescue the error properly. This sounds a bit far-fetched as we are rescuing `StandardError`. From my search on Sentry, I don't see any error that is not a `StandardError`, so I'm not convinced that we are in this case. In addition, I don't see any `fail` message in Kibana for the last day that the cleanup ran.
   - We could `rescue` everything here. (3.) is a `rescue` and `raise` block, so we wouldn't silence all the errors but simply detect them.
2. Something bad happened between (1.) and (3.). I don't see anything outstanding in Sentry or Kibana, but we could imagine situations where the worker crashed or was killed.
   - We could move the `begin` + `rescue` block to the worker so that it wraps (1.) - (3.) (see the sketch below).
   - I'm more confident in this solution for solving the issue.
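A sketch of that second option, again with placeholder names: the `rescue` moves up into the worker so that anything raised between (1.) and (3.) still flags the cleanup as unfinished:

```ruby
# Variant of the earlier sketch: the worker wraps (1.)-(3.) in its own rescue.
class CleanupContainerRepositoryWorker
  def perform
    repository = ContainerRepository.next_requiring_cleanup # hypothetical selection
    repository.update!(expiration_policy_cleanup_status: :cleanup_ongoing)
    CleanupService.new(repository).execute
  rescue StandardError
    # Detect any failure in (1.)-(3.) and re-raise; `repository` is nil if the
    # failure happened before a container image was selected.
    repository&.update!(expiration_policy_cleanup_status: :cleanup_unfinished)
    raise
  end
end
```

As @reprazent's answer below points out, even this would not cover a hard Sidekiq shutdown, because `Sidekiq::Shutdown` is not a `StandardError`.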
and here is @reprazent's answer:
@10io Do we have a jid for the last time this container-repository was picked up? We could see if we find the `start` and `done`/`fail` messages. My suspicion is that the job started, but never got the chance to finish cleanly. We do stop Sidekiq sometimes: during a deploy, or when Sidekiq exceeds a certain memory limit. We don't exactly do a super graceful shutdown of Sidekiq (gitlab-com/gl-infra/delivery#603), and jobs don't always finish, so a long-running job like this one could suffer from a hard shutdown.

When Sidekiq gets hard-killed, all jobs will receive `Sidekiq::Shutdown`, which is an `Interrupt`, not a `StandardError`. So the rescue block will not get executed. We should not try to rescue this. Instead, we could probably clean up after ourselves in a next run, perhaps in the cron?
`ContainerRepository.cleanup_ongoing.where('expiration_policy_started_at < ?', 30.minutes.ago).update_all(expiration_policy_cleanup_status: :cleanup_unscheduled)`
I think this will be okay to do in a single update statement, since the number of container repositories in this state would be at most the configured capacity. Could you create a separate issue for this?
We need to work on a nicer shutdown for Sidekiq jobs, but the work should also be resilient to this; who knows what can happen in the ☁ 🚎.
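For context on the `Sidekiq::Shutdown` point: it subclasses `Interrupt`, which lives outside the `StandardError` hierarchy, so a `rescue StandardError` (or a bare `rescue`) never fires for it. A minimal illustration, using `Interrupt` directly so it runs without Sidekiq installed:

```ruby
# Interrupt (the parent class of Sidekiq::Shutdown) is not a StandardError,
# so the cleanup service's rescue never runs when the job is hard-killed.
puts Interrupt.ancestors.include?(StandardError) # => false

begin
  begin
    raise Interrupt # stand-in for Sidekiq::Shutdown
  rescue StandardError
    puts 'rescued by StandardError' # never reached
  end
rescue Interrupt
  puts 'propagated past the StandardError rescue'
end
```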
In simple terms: select all the cleanups that started more than 30 minutes ago and reset their status.
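A minimal sketch of how that reset could be wired into a cron worker, reusing the scopes and column from the query above; the worker name and the 30-minute threshold are illustrative assumptions:

```ruby
# Hypothetical cron worker sketch: before scheduling new cleanups, reset any
# cleanup stuck in `cleanup_ongoing` for more than 30 minutes back to
# `cleanup_unscheduled` so it can be picked up again on a later run.
class ContainerExpirationPolicyCronWorker
  include Sidekiq::Worker

  STALE_ONGOING_THRESHOLD = 30.minutes

  def perform
    reset_stale_ongoing_cleanups
    # ... then enqueue the next batch of cleanups as usual ...
  end

  private

  def reset_stale_ongoing_cleanups
    ContainerRepository
      .cleanup_ongoing
      .where('expiration_policy_started_at < ?', STALE_ONGOING_THRESHOLD.ago)
      .update_all(expiration_policy_cleanup_status: :cleanup_unscheduled)
  end
end
```

As noted above, a single `update_all` should be acceptable here because the number of repositories stuck in this state is bounded by the configured capacity.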