Container repository cleanups can get stuck in the ongoing state
From #328860 (comment 568539611):
Do you have an idea on how it ended up in this state? We'll need to account for those jobs getting killed for whatever reason, wouldn't we?
@reprazent I'm currently investigating that part.
The current code flow is (a rough sketch follows the list):

1. The worker selects a container image and updates the cleanup status to `ongoing` (here).
2. The selected container image is sent to the related service for cleanup (here).
3. The cleanup service has a `rescue` that puts the container image cleanup in the `unfinished` state (here).
I think that there are only two possible explanations:

1. (3.) did not rescue the error properly. This sounds a bit far-fetched as we are rescuing `StandardError`. From my search on Sentry, I don't see any error that is not a `StandardError`, so I'm not convinced that we are in this case. In addition, I don't see any `fail` message in Kibana for the last day that the cleanup ran.
   - We could `rescue` everything here. (3.) is a `rescue` and `raise` block, so we wouldn't silence all the errors but simply detect them.
2. Something bad happened between (1.) and (3.). I don't see anything outstanding in Sentry or Kibana, but we could imagine situations where the worker crashed or was killed.
   - We could move the `begin` + `rescue` block to the worker so that it wraps (1.) - (3.) (see the sketch below).
   - I'm more confident in this solution for solving the issue.
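A sketch of that second option, again with placeholder names: the `rescue` moves up into the worker so that anything raised between (1.) and (3.) still flags the cleanup as unfinished:

```ruby
# Variant of the earlier sketch: the worker wraps (1.)-(3.) in its own rescue.
class CleanupContainerRepositoryWorker
  def perform
    repository = ContainerRepository.next_requiring_cleanup # hypothetical selection
    repository.update!(expiration_policy_cleanup_status: :cleanup_ongoing)
    CleanupService.new(repository).execute
  rescue StandardError
    # Detect any failure in (1.)-(3.) and re-raise; `repository` is nil if the
    # failure happened before a container image was selected.
    repository&.update!(expiration_policy_cleanup_status: :cleanup_unfinished)
    raise
  end
end
```

As @reprazent's answer below points out, even this would not cover a hard Sidekiq shutdown, because `Sidekiq::Shutdown` is not a `StandardError`.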
and here is @reprazent's answer:
@10io Do we have a jid for the last time this container-repository was picked up? We could see if we find the `start` and `done`/`fail` messages. My suspicion is that the job started, but never got the chance to finish cleanly. We do stop Sidekiq sometimes: during a deploy, or when Sidekiq exceeds a certain memory limit. We don't exactly do a super graceful shutdown of Sidekiq (gitlab-com/gl-infra/delivery#603), and jobs don't always finish, so a long-running job like this one could suffer from a hard shutdown.

When Sidekiq gets hard-killed, all jobs will receive `Sidekiq::Shutdown`, which is an `Interrupt`, not a `StandardError`. So the rescue block will not get executed. We should not try to rescue this. Instead, we could probably clean up after ourselves in a next run, perhaps in the cron?
`ContainerRepository.cleanup_ongoing.where('expiration_policy_started_at < ?', 30.minutes.ago).update_all(expiration_policy_cleanup_status: :cleanup_unscheduled)`
I think this will be okay to do in a single update statement, since the number of container repositories in this state would be at most the configured capacity. Could you create a separate issue for this?
We need to work on a nicer shutdown for Sidekiq jobs, but the work should also be resilient to this; who knows what can happen in the ☁ 🚎.
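For context on the `Sidekiq::Shutdown` point: it subclasses `Interrupt`, which lives outside the `StandardError` hierarchy, so a `rescue StandardError` (or a bare `rescue`) never fires for it. A minimal illustration, using `Interrupt` directly so it runs without Sidekiq installed:

```ruby
# Interrupt (the parent class of Sidekiq::Shutdown) is not a StandardError,
# so the cleanup service's rescue never runs when the job is hard-killed.
puts Interrupt.ancestors.include?(StandardError) # => false

begin
  begin
    raise Interrupt # stand-in for Sidekiq::Shutdown
  rescue StandardError
    puts 'rescued by StandardError' # never reached
  end
rescue Interrupt
  puts 'propagated past the StandardError rescue'
end
```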
In simple terms: select all the cleanups that started more than 30 minutes ago and reset their status.
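A minimal sketch of how that reset could be wired into a cron worker, reusing the scopes and column from the query above; the worker name and the 30-minute threshold are illustrative assumptions:

```ruby
# Hypothetical cron worker sketch: before scheduling new cleanups, reset any
# cleanup stuck in `cleanup_ongoing` for more than 30 minutes back to
# `cleanup_unscheduled` so it can be picked up again on a later run.
class ContainerExpirationPolicyCronWorker
  include Sidekiq::Worker

  STALE_ONGOING_THRESHOLD = 30.minutes

  def perform
    reset_stale_ongoing_cleanups
    # ... then enqueue the next batch of cleanups as usual ...
  end

  private

  def reset_stale_ongoing_cleanups
    ContainerRepository
      .cleanup_ongoing
      .where('expiration_policy_started_at < ?', STALE_ONGOING_THRESHOLD.ago)
      .update_all(expiration_policy_cleanup_status: :cleanup_unscheduled)
  end
end
```

As noted above, a single `update_all` should be acceptable here because the number of repositories stuck in this state is bounded by the configured capacity.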