Online GC worker failing to postpone tasks in case of DB error on delete due to aborted transaction

Context

For a detailed description of how online GC works please see https://gitlab.com/gitlab-org/container-registry/-/blob/master/docs-gitlab/db/online-garbage-collection.md.

Problem

If deleting a manifest from the database fails due to a database error, such as when there are no DELETE permissions on manifests (should never happen, but this was what we tested in gitlab#336916 (closed)) or if the GC worker considered a manifest to be dangling when it is not (should never happen, but in this case, this would lead to an integrity violation error as there was still a tag or manifest list pointing to the manifest row), the worker will attempt to postpone the next review of the corresponding task (an UPDATE on gc_manifest_review_queue).

There is no point in letting other workers to continue to process this task in quick succession because they are most likely going to run into the same problem. That's the main reason why we try to postpone the next review.

The problem is that currently both of these operations (preceding queries + the postpone query) occur within a transaction, so once the former fails the transaction is automatically aborted, and therefore trying to execute the postpone leads to a current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02) error.

Example: https://sentry.gitlab.net/gitlab/container-registry/issues/2774378/?environment=pre

Solution

If a database operation fails at T and it's not a temporary or unknown error (e.g. network timeout, which is likely to be solved on the next worker run) we must use a new transaction to postpone the task at T+N.

As a caveat, as soon as the main transaction is aborted at T, the corresponding task row will be unlocked, so it can be picked up by another worker between T and T+N, which can run into the same problem and try to postpone the task as well (unlikely).

To counter this, once at T+N, we should start a new transaction and try to lock the task row for update. Once we acquire the lock, we should then check if the review_after of the task has been updated. If so, it means that another worker has run into the same problem and managed to postpone the task, so we can skip it, otherwise, we postpone it.

Edited Sep 01, 2021 by João Pereira