Add max retries and backoff delay for delete container repository failures

What does this MR do and why?

Summary

  1. We add failed_deletion_count and next_delete_attempt_at to ContainerRepository to implement max retries and a backoff delay.
  2. We set the container repository status to delete_failed once it reaches ContainerRepository::MAX_DELETION_FAILURES, which is set to 10. When the status is delete_failed, the repository is no longer picked up for deletion.
  3. We put the change in (2) behind a feature flag so we can release it gradually and minimize risk.

Details

From #480652 (closed), this MR implements max retries and a backoff delay by introducing failed_deletion_count and next_delete_attempt_at.

There are three instances where we increase failed_deletion_count:

  • Error during tags cleanup (DeleteContainerRepositoryWorker:L31)
  • Tags left over after tags cleanup (DeleteContainerRepositoryWorker:L31)
  • Error during .destroy! (DeleteContainerRepositoryWorker:L38)

Together with increasing failed_deletion_count, we also set next_delete_attempt_at using an exponential backoff of 2**(failed_deletion_count - 1) minutes, as sketched below the table. This gives a transient error time to recover: the more failures there are, the bigger the gap between the current and the next delete attempt.

Failed deletion counts and the corresponding delay after which next_delete_attempt_at will be set:

Failed Deletion Count | Minutes | Hours
 1 |   1 |
 2 |   2 |
 3 |   4 |
 4 |   8 |
 5 |  16 |
 6 |  32 |
 7 |  64 | 1.07
 8 | 128 | 2.1
 9 | 256 | 4.2
10 | 512 | 8.5
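
For reference, a minimal sketch of the backoff calculation behind the table above. This is illustrative only and not the actual worker code; the variable names are hypothetical.

# Illustrative sketch only; names are hypothetical and the actual code may differ.
new_count = repository.failed_deletion_count + 1   # e.g. 3 after a third failure
delay = (2**(new_count - 1)).minutes               # => 4 minutes
repository.next_delete_attempt_at = delay.from_now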

ContainerRepository::MAX_DELETION_FAILURES is currently set to 10, but feel free to suggest a different value, as well as a different backoff delay.

When the maximum has been reached and the feature flag is enabled, we set the status to delete_failed. We put this change behind a feature flag because it is a significant change: setting this status excludes the repository from being deleted, and it requires manual intervention to put the repository back to delete_scheduled so it can be picked up by the DeleteContainerRepositoryWorker again.
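
A rough sketch of the marking behavior described above; the exact threshold comparison and update call are assumptions, not the actual worker code:

# Sketch only: mark as delete_failed once the maximum is reached and the flag is on.
# The exact comparison (>= vs >) and update call are assumptions.
if Feature.enabled?(:set_delete_failed_container_repository) &&
    repository.failed_deletion_count >= ContainerRepository::MAX_DELETION_FAILURES
  repository.update(status: :delete_failed)
end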

Migration Results

UP

adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:redo:ci VERSION=20240916075908
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 31383
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 31383
ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 33136
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: migrating
ci: -- add_column(:container_repositories, :failed_deletion_count, :integer, {:default=>0, :null=>false, :if_not_exists=>true})
ci:    -> 0.0430s
ci: -- add_column(:container_repositories, :next_delete_attempt_at, :datetime_with_timezone, {:if_not_exists=>true})
ci:    -> 0.0025s
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: migrated (0.0756s)

ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 33136
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:redo:ci VERSION=20240916081252
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 44135
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 44135
ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 45640
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: migrating
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- view_exists?(:postgres_partitions)
ci:    -> 0.0195s
ci: -- index_exists?(:container_repositories, :next_delete_attempt_at, {:name=>:index_container_repositories_on_next_delete_attempt_at, :where=>"status = 0", :algorithm=>:concurrently})
ci:    -> 0.0079s
ci: -- execute("SET statement_timeout TO 0")
ci:    -> 0.0006s
ci: -- add_index(:container_repositories, :next_delete_attempt_at, {:name=>:index_container_repositories_on_next_delete_attempt_at, :where=>"status = 0", :algorithm=>:concurrently})
ci:    -> 0.0045s
ci: -- execute("RESET statement_timeout")
ci:    -> 0.0012s
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: migrated (0.0625s)

ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 45640
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:redo:ci VERSION=20240924175515
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 56954
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: migrating =======
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- execute("ALTER TABLE container_repositories\nADD CONSTRAINT check_container_repositories_non_negative_failed_deletion_count\nCHECK ( failed_deletion_count >= 0 )\nNOT VALID;\n")
ci:    -> 0.0013s
ci: -- execute("SET statement_timeout TO 0")
ci:    -> 0.0004s
ci: -- execute("ALTER TABLE container_repositories VALIDATE CONSTRAINT check_container_repositories_non_negative_failed_deletion_count;")
ci:    -> 0.0004s
ci: -- execute("RESET statement_timeout")
ci:    -> 0.0005s
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: migrated (0.0455s)

ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 56954

DOWN
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:down:ci VERSION=20240924175515
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 85283
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: reverting =======
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- execute("            ALTER TABLE container_repositories\n            DROP CONSTRAINT IF EXISTS check_container_repositories_non_negative_failed_deletion_count\n")
ci:    -> 0.0009s
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: reverted (0.0257s)

ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 85283
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:down:ci VERSION=20240916081252
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 918
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: reverting
ci: -- transaction_open?(nil)
ci:    -> 0.0000s
ci: -- view_exists?(:postgres_partitions)
ci:    -> 0.0184s
ci: -- index_exists?(:container_repositories, :next_delete_attempt_at, {:name=>:index_container_repositories_on_next_delete_attempt_at, :algorithm=>:concurrently})
ci:    -> 0.0041s
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: reverted (0.0358s)

ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 918
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:down:ci VERSION=20240916075908
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 8034
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: reverting
ci: -- remove_column(:container_repositories, :next_delete_attempt_at, :datetime_with_timezone, {:if_not_exists=>true})
ci:    -> 0.0032s
ci: -- remove_column(:container_repositories, :failed_deletion_count, :integer, {:default=>0, :null=>false, :if_not_exists=>true})
ci:    -> 0.0011s
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: reverted (0.0199s)

ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 8034
Create INDEX CONCURRENTLY on Database Lab

exec CREATE INDEX CONCURRENTLY container_repositories_next_delete_attempt_at ON container_repositories USING btree (next_delete_attempt_at) WHERE (status = 0);

Session: webui-i31738

The query has been executed. Duration: 9.224 s

Completed

Query Plans

Changing the ContainerRepository.pending_destruction scope affects the ContainerRepository.next_pending_destruction query that we use to fetch the next repository for deletion.

a. Before the change: No filtering on next_delete_attempt_at

SQL Query

SELECT
    "container_repositories".*
FROM
    "container_repositories"
WHERE
    "container_repositories"."status" = 0
LIMIT 1
FOR UPDATE
    SKIP LOCKED
Execution Plan (link)
Limit  (cost=0.13..3.15 rows=1 width=101) (actual time=0.073..0.074 rows=1 loops=1)
   Buffers: shared hit=3 dirtied=1
   I/O Timings: read=0.000 write=0.000
   ->  LockRows  (cost=0.13..3.15 rows=1 width=101) (actual time=0.072..0.072 rows=1 loops=1)
         Buffers: shared hit=3 dirtied=1
         I/O Timings: read=0.000 write=0.000
         ->  Index Scan using container_repositories_next_delete_attempt_at on public.container_repositories  (cost=0.13..3.15 rows=1 width=101) (actual time=0.018..0.018 rows=1 loops=1)
               Filter: (container_repositories.status = 0)
               Rows Removed by Filter: 0
               Buffers: shared hit=2
               I/O Timings: read=0.000 write=0.000
Recommendations:
✅ Looks good

Summary:
  
Time: 1.692 ms  
  - planning: 1.551 ms  
  - execution: 0.141 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 3 (~24.00 KiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 1 (~8.00 KiB)  
  - writes: 0  

b. After the change: Additional filtering on next_delete_attempt_at

SQL Query
SELECT
    "container_repositories".*
FROM
    "container_repositories"
WHERE
    "container_repositories"."status" = 0
    AND (next_delete_attempt_at IS NULL
        OR next_delete_attempt_at < '2024-09-22 14:49:44.766069')
LIMIT 1
FOR UPDATE
    SKIP LOCKED
Execution Plan (link)
 Limit  (cost=0.13..3.16 rows=1 width=101) (actual time=0.064..0.064 rows=1 loops=1)
   Buffers: shared hit=3 dirtied=1
   I/O Timings: read=0.000 write=0.000
   ->  LockRows  (cost=0.13..3.16 rows=1 width=101) (actual time=0.062..0.063 rows=1 loops=1)
         Buffers: shared hit=3 dirtied=1
         I/O Timings: read=0.000 write=0.000
         ->  Index Scan using container_repositories_next_delete_attempt_at on public.container_repositories  (cost=0.13..3.15 rows=1 width=101) (actual time=0.016..0.017 rows=1 loops=1)
               Filter: (((container_repositories.next_delete_attempt_at IS NULL) OR (container_repositories.next_delete_attempt_at < '2024-09-22 14:49:44.766069+00'::timestamp with time zone)) AND (container_repositories.status = 0))
               Rows Removed by Filter: 0
               Buffers: shared hit=2
               I/O Timings: read=0.000 write=0.000
Recommendations:
✅ Looks good

Summary:
Time: 1.309 ms  
  - planning: 1.196 ms  
  - execution: 0.113 ms  
    - I/O read: 0.000 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 3 (~24.00 KiB) from the buffer pool  
  - reads: 0 from the OS file cache, including disk I/O  
  - dirtied: 1 (~8.00 KiB)  
  - writes: 0  
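
Based on the SQL above, the additional filter on the pending_destruction scope might look roughly like the sketch below. This is an assumption for illustration, not the actual scope definition in ContainerRepository.

# Sketch only; assumes status 0 maps to the delete_scheduled enum value,
# matching the "status = 0" filter in the plans above.
scope :pending_destruction, -> do
  delete_scheduled
    .where('next_delete_attempt_at IS NULL OR next_delete_attempt_at < ?', Time.zone.now)
end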

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

🛵 Prerequisites:

A. Container Repository to test on

  1. Have a ContainerRepository ready.
repository = ContainerRepository.last
  2. Make sure that its status is set to :delete_scheduled so it can be picked up by our worker.
repository.update(status: :delete_scheduled)
  3. Verify that your container repository can be picked up by the worker. It should be returned by:
ContainerRepository.next_pending_destruction(order_by: nil)

NOTE: Testing is easier if you only have one container repository with status: :delete_scheduled, so that ContainerRepository.next_pending_destruction(order_by: nil) returns either nil or repository.

B. An error so the delete will fail

To simulate an error so the failure count can increase, you can quit the container registry if you are running it separately. If you are running the container registry in the GDK, you can use:

gdk stop registry

🗺 Testing Points

1. Validating failed_deletion_count

a. Run the DeleteContainerRepositoryWorker

repository.failed_deletion_count # => 0
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.failed_deletion_count # => 1

Running the worker will print a [Gitlab::ErrorTracking::Logger] message when the deletion fails.

You can run the worker again and failed_deletion_count will increase again.

repository.reload.failed_deletion_count # => 1
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.failed_deletion_count # => 2

2. Validating next_delete_attempt_at and ContainerRepository.next_pending_destruction

We set next_delete_attempt_at to now + 2**(failed_deletion_count - 1) minutes, so:

  • when failed_deletion_count is 1, next_delete_attempt_at is set to now + 2**0 minutes (1 minute from now)
  • when failed_deletion_count is 2, next_delete_attempt_at is set to now + 2**1 minutes (2 minutes from now)

So, continuing from Step 1, if the time has not yet passed, calling ContainerRepository.next_pending_destruction(order_by: nil) should return nil (if repository is the only one with delete_scheduled).

Now, we can either wait for next_delete_attempt_at to pass, or update the value for testing purposes:

repository.update(next_delete_attempt_at: Time.zone.now)
ContainerRepository.next_pending_destruction(order_by: nil)

repository should then be returned.

3. Validating marking the container repository as delete_failed.

We have introduced ContainerRepository::MAX_DELETION_FAILURES to limit how many deletion failures can happen for a container repository before we mark it as delete_failed instead of rescheduling it again.
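
To reach the maximum quickly while testing locally, you can bump the counter directly instead of running the worker ten times (a local-only shortcut, not part of the feature):

repository.update(failed_deletion_count: ContainerRepository::MAX_DELETION_FAILURES)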

a. When the feature flag is disabled

Feature.disable(:set_delete_failed_container_repository)

When the feature flag is disabled, the number of failures can surpass ContainerRepository::MAX_DELETION_FAILURES, which is set to 10, so the failure count can reach 11, 12, or even 100.

repository.update(next_delete_attempt_at: Time.zone.now)
ContainerRepository.next_pending_destruction(order_by: nil)
#=> should return the repository so the worker has something to work on

repository.failed_deletion_count # => x
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.failed_deletion_count # => should be x + 1

b. When the feature flag is enabled

Feature.enable(:set_delete_failed_container_repository)

When the feature flag is enabled, if the failed_deletion_count is more than ContainerRepository::MAX_DELETION_FAILURES, then the repository is marked as delete_failed and won't be picked up again.

repository.update(next_delete_attempt_at: Time.zone.now)
ContainerRepository.next_pending_destruction(order_by: nil)
#=> should return the repository so the worker has something to work on

repository.failed_deletion_count # => x
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.status
# => delete_failed if x is above 10

And it should no longer be picked up by next_pending_destruction even if we update next_delete_attempt_at:

repository.update(next_delete_attempt_at: Time.zone.now)
repository.reload.status
# => delete_failed

ContainerRepository.next_pending_destruction(order_by: nil)
#=> should NOT return the repository
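
If you want to make the repository deletable again after this test (the manual intervention mentioned earlier), reset the status; resetting the new columns as well is a suggestion so the next attempt is not immediately marked as failed again:

repository.update(status: :delete_scheduled, failed_deletion_count: 0, next_delete_attempt_at: nil)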

Related to #480652 (closed)
