Add max retries and backoff delay for delete container repository failures
What does this MR do and why?
Summary
- We add `failed_deletion_count` and `next_delete_attempt_at` to `ContainerRepository` to implement max retries and a backoff delay.
- We set the container repository status to `delete_failed` when it has reached `ContainerRepository::MAX_DELETION_FAILURES`, which is set to 10. When the container repository status is `delete_failed`, it won't be picked up for deletion anymore.
- We put the marking to `delete_failed` (point 2) behind a feature flag, as we want to gradually release this to minimize risk.
Details
Following #480652 (closed), this MR implements max retries and a backoff delay by introducing `failed_deletion_count` and `next_delete_attempt_at`.
There are three instances where we increment `failed_deletion_count`:
- Error during tags cleanup (`DeleteContainerRepositoryWorker:L31`)
- When there are tags left after tags cleanup (`DeleteContainerRepositoryWorker:L31`)
- Error during `.destroy!` (`DeleteContainerRepositoryWorker:L38`)
Together with incrementing `failed_deletion_count`, we also set `next_delete_attempt_at` to `2**(failed_deletion_count - 1)` minutes from now. This gives a transient error time to recover: the more failures there are, the bigger the gap between the current and the next delete attempt.
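As a quick illustration, the backoff could be computed like this in Ruby (`backoff_for` is an illustrative name, not the actual code):

```ruby
# Exponential backoff: 1, 2, 4, ... minutes, based on the failure count.
def backoff_for(failed_deletion_count)
  (2**(failed_deletion_count - 1)).minutes
end

backoff_for(1).from_now # ~1 minute from now
backoff_for(4).from_now # ~8 minutes from now
```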
Failed deletion counts and the corresponding delay after which `next_delete_attempt_at` is set:
| Failed Deletion Count | Minutes | Hours |
|---|---|---|
| 1 | 1 | |
| 2 | 2 | |
| 3 | 4 | |
| 4 | 8 | |
| 5 | 16 | |
| 6 | 32 | |
| 7 | 64 | 1.07 |
| 8 | 128 | 2.1 |
| 9 | 256 | 4.2 |
| 10 | 512 | 8.5 |
`ContainerRepository::MAX_DELETION_FAILURES` is currently set to 10, but feel free to suggest a different value for it, as well as for the backoff delay.
When the max has been reached, we set the status to `delete_failed` if the feature flag is enabled. We put this change behind a feature flag because it is a big change: setting this status excludes the repository from deletion, and manual intervention is required to put it back to the `delete_scheduled` status so it can be picked up by the `DeleteContainerRepositoryWorker`.
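For illustration, the failure-handling flow looks roughly like the sketch below (the helper name `handle_failed_deletion` is hypothetical; the real logic lives in the worker):

```ruby
# Hypothetical condensation of the worker's failure handling.
def handle_failed_deletion(repository)
  repository.increment!(:failed_deletion_count)

  if Feature.enabled?(:set_delete_failed_container_repository) &&
      repository.failed_deletion_count >= ContainerRepository::MAX_DELETION_FAILURES
    # Stop retrying: the repository is excluded from future deletion runs.
    repository.update!(status: :delete_failed)
  else
    # Reschedule with an exponential backoff.
    repository.update!(
      status: :delete_scheduled,
      next_delete_attempt_at: (2**(repository.failed_deletion_count - 1)).minutes.from_now
    )
  end
end
```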
Migration Results
UP
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:redo:ci VERSION=20240916075908
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 31383
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 31383
ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 33136
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: migrating
ci: -- add_column(:container_repositories, :failed_deletion_count, :integer, {:default=>0, :null=>false, :if_not_exists=>true})
ci: -> 0.0430s
ci: -- add_column(:container_repositories, :next_delete_attempt_at, :datetime_with_timezone, {:if_not_exists=>true})
ci: -> 0.0025s
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: migrated (0.0756s)
ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 33136
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:redo:ci VERSION=20240916081252
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 44135
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 44135
ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 45640
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: migrating
ci: -- transaction_open?(nil)
ci: -> 0.0000s
ci: -- view_exists?(:postgres_partitions)
ci: -> 0.0195s
ci: -- index_exists?(:container_repositories, :next_delete_attempt_at, {:name=>:index_container_repositories_on_next_delete_attempt_at, :where=>"status = 0", :algorithm=>:concurrently})
ci: -> 0.0079s
ci: -- execute("SET statement_timeout TO 0")
ci: -> 0.0006s
ci: -- add_index(:container_repositories, :next_delete_attempt_at, {:name=>:index_container_repositories_on_next_delete_attempt_at, :where=>"status = 0", :algorithm=>:concurrently})
ci: -> 0.0045s
ci: -- execute("RESET statement_timeout")
ci: -> 0.0012s
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: migrated (0.0625s)
ci: == [advisory_lock_connection] object_id: 129260, pg_backend_pid: 45640
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:redo:ci VERSION=20240924175515
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 56954
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: migrating =======
ci: -- transaction_open?(nil)
ci: -> 0.0000s
ci: -- transaction_open?(nil)
ci: -> 0.0000s
ci: -- execute("ALTER TABLE container_repositories\nADD CONSTRAINT check_container_repositories_non_negative_failed_deletion_count\nCHECK ( failed_deletion_count >= 0 )\nNOT VALID;\n")
ci: -> 0.0013s
ci: -- execute("SET statement_timeout TO 0")
ci: -> 0.0004s
ci: -- execute("ALTER TABLE container_repositories VALIDATE CONSTRAINT check_container_repositories_non_negative_failed_deletion_count;")
ci: -> 0.0004s
ci: -- execute("RESET statement_timeout")
ci: -> 0.0005s
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: migrated (0.0455s)
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 56954
DOWN
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:down:ci VERSION=20240924175515
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 85283
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: reverting =======
ci: -- transaction_open?(nil)
ci: -> 0.0000s
ci: -- transaction_open?(nil)
ci: -> 0.0000s
ci: -- execute(" ALTER TABLE container_repositories\n DROP CONSTRAINT IF EXISTS check_container_repositories_non_negative_failed_deletion_count\n")
ci: -> 0.0009s
ci: == 20240924175515 AddPositiveConstraintOnFailedDeletionCount: reverted (0.0257s)
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 85283
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:down:ci VERSION=20240916081252
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 918
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: reverting
ci: -- transaction_open?(nil)
ci: -> 0.0000s
ci: -- view_exists?(:postgres_partitions)
ci: -> 0.0184s
ci: -- index_exists?(:container_repositories, :next_delete_attempt_at, {:name=>:index_container_repositories_on_next_delete_attempt_at, :algorithm=>:concurrently})
ci: -> 0.0041s
ci: == 20240916081252 AddIndexOnStatusAndNextDeleteAttemptAtForContainerRepositories: reverted (0.0358s)
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 918
adie@macbook~: gitlab (480652-add-max-retry-and-backoff) $ bin/rails db:migrate:down:ci VERSION=20240916075908
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 8034
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: reverting
ci: -- remove_column(:container_repositories, :next_delete_attempt_at, :datetime_with_timezone, {:if_not_exists=>true})
ci: -> 0.0032s
ci: -- remove_column(:container_repositories, :failed_deletion_count, :integer, {:default=>0, :null=>false, :if_not_exists=>true})
ci: -> 0.0011s
ci: == 20240916075908 AddFailedDeletionCountAndNextDeleteAtToContainerRepositories: reverted (0.0199s)
ci: == [advisory_lock_connection] object_id: 128680, pg_backend_pid: 8034
Create INDEX CONCURRENTLY on database lab
exec CREATE INDEX CONCURRENTLY container_repositories_next_delete_attempt_at ON container_repositories USING btree (next_delete_attempt_at) WHERE (status = 0);
Session: webui-i31738
The query has been executed. Duration: 9.224 s
Completed
Query Plans
Changing `ContainerRepository.pending_destruction` affects the query behind `ContainerRepository.next_pending_destruction`, which we use to fetch the next repository to delete.
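For context, here is a minimal sketch of what the updated scope could look like (the actual diff may differ; `delete_scheduled` is assumed to be the enum-generated scope). The SQL it produces is compared below:

```ruby
# Skip repositories whose backoff window has not elapsed yet.
scope :pending_destruction, -> {
  delete_scheduled.where(
    'next_delete_attempt_at IS NULL OR next_delete_attempt_at < ?', Time.zone.now
  )
}
```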
a. Before the change: No filtering on next_delete_attempt_at
SQL Query
SELECT
"container_repositories".*
FROM
"container_repositories"
WHERE
"container_repositories"."status" = 0
LIMIT 1
FOR UPDATE
SKIP LOCKED
Execution Plan (link)
Limit (cost=0.13..3.15 rows=1 width=101) (actual time=0.073..0.074 rows=1 loops=1)
Buffers: shared hit=3 dirtied=1
I/O Timings: read=0.000 write=0.000
-> LockRows (cost=0.13..3.15 rows=1 width=101) (actual time=0.072..0.072 rows=1 loops=1)
Buffers: shared hit=3 dirtied=1
I/O Timings: read=0.000 write=0.000
-> Index Scan using container_repositories_next_delete_attempt_at on public.container_repositories (cost=0.13..3.15 rows=1 width=101) (actual time=0.018..0.018 rows=1 loops=1)
Filter: (container_repositories.status = 0)
Rows Removed by Filter: 0
Buffers: shared hit=2
I/O Timings: read=0.000 write=0.000
Recommendations:
✅ Looks good
Summary:
Time: 1.692 ms
- planning: 1.551 ms
- execution: 0.141 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 3 (~24.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 1 (~8.00 KiB)
- writes: 0
b. After the change: Additional filtering on next_delete_attempt_at
SQL Query
SELECT
"container_repositories".*
FROM
"container_repositories"
WHERE
"container_repositories"."status" = 0
AND (next_delete_attempt_at IS NULL
OR next_delete_attempt_at < '2024-09-22 14:49:44.766069')
LIMIT 1
FOR UPDATE
SKIP LOCKED
Execution Plan (link)
Limit (cost=0.13..3.16 rows=1 width=101) (actual time=0.064..0.064 rows=1 loops=1)
Buffers: shared hit=3 dirtied=1
I/O Timings: read=0.000 write=0.000
-> LockRows (cost=0.13..3.16 rows=1 width=101) (actual time=0.062..0.063 rows=1 loops=1)
Buffers: shared hit=3 dirtied=1
I/O Timings: read=0.000 write=0.000
-> Index Scan using container_repositories_next_delete_attempt_at on public.container_repositories (cost=0.13..3.15 rows=1 width=101) (actual time=0.016..0.017 rows=1 loops=1)
Filter: (((container_repositories.next_delete_attempt_at IS NULL) OR (container_repositories.next_delete_attempt_at < '2024-09-22 14:49:44.766069+00'::timestamp with time zone)) AND (container_repositories.status = 0))
Rows Removed by Filter: 0
Buffers: shared hit=2
I/O Timings: read=0.000 write=0.000
Recommendations:
✅ Looks good
Summary:
Time: 1.309 ms
- planning: 1.196 ms
- execution: 0.113 ms
- I/O read: 0.000 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 3 (~24.00 KiB) from the buffer pool
- reads: 0 from the OS file cache, including disk I/O
- dirtied: 1 (~8.00 KiB)
- writes: 0
MR acceptance checklist
How to set up and validate locally
🛵 Prerequisites:
A. Container Repository to test on
- Have a `ContainerRepository` ready:
repository = ContainerRepository.last
- Make sure that its status is set to `:delete_scheduled` so it can be picked up by our worker:
repository.update(status: :delete_scheduled)
- Verify that your container repository can be picked up by the worker. It should be returned by:
ContainerRepository.next_pending_destruction(order_by: nil)
NOTE: Testing is easier if you only have one container repository with `status: :delete_scheduled`, so that `ContainerRepository.next_pending_destruction(order_by: nil)` returns either `nil` or `repository`.
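To check how many candidates you currently have, a query like this should work:

```ruby
# Repositories eligible to be picked up by the delete worker.
ContainerRepository.where(status: :delete_scheduled).count
```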
B. An error so the delete will fail
To simulate an error so the failure count increases, you can quit the container registry if you are running it separately. If you are running the container registry in the GDK, you can use the following:
gdk stop registry
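When you are done testing, don't forget to start it again:

```
gdk start registry
```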
🗺 Testing Points
1. Validating failed_deletion_count.
a. Run the DeleteContainerRepositoryWorker
repository.failed_deletion_count # => 0
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.failed_deletion_count # => 1
Running the worker logs an error via `Gitlab::ErrorTracking::Logger` when the deletion fails.
You can try to run the worker again and the failed_deletion_count will increase again.
repository.reload.failed_deletion_count # => 1
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.failed_deletion_count # => 2
2. Validating next_delete_attempt_at and ContainerRepository.next_pending_destruction
We set `next_delete_attempt_at` to `2**(failed_deletion_count - 1)` minutes from now, so:
- when `failed_deletion_count` is 1, we set `next_delete_attempt_at` to `now + 2**0 minutes` (1 minute from now)
- when `failed_deletion_count` is 2, we set `next_delete_attempt_at` to `now + 2**1 minutes` (2 minutes from now)
So, continuing from Step 1, if that time has not passed yet, calling `ContainerRepository.next_pending_destruction(order_by: nil)` should return `nil` (if `repository` is the only one with `delete_scheduled`).
Now, we can either wait for `next_delete_attempt_at` to pass or update the value for testing purposes:
repository.update(next_delete_attempt_at: Time.zone.now)
ContainerRepository.next_pending_destruction(order_by: nil)
`repository` should then be returned.
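You can also check how far away the next attempt is:

```ruby
repository.next_delete_attempt_at # timestamp of the next eligible attempt
# Minutes remaining until the next attempt (negative once it's eligible):
((repository.next_delete_attempt_at - Time.zone.now) / 60).round(1)
```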
3. Validating marking the container repository as delete_failed.
We introduced `ContainerRepository::MAX_DELETION_FAILURES` to limit how many times a container repository deletion can fail before we mark it as `delete_failed` instead of rescheduling it.
a. When the feature flag is disabled
Feature.disable(:set_delete_failed_container_repository)
When the feature flag is disabled, the number of failures can surpass `ContainerRepository::MAX_DELETION_FAILURES`, which is set to 10, so the failure count can reach 11, 12, and even 100.
repository.update(next_delete_attempt_at: Time.zone.now)
ContainerRepository.next_pending_destruction(order_by: nil)
#=> should return the repository so the worker has something to work on
repository.failed_deletion_count # => x
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.failed_deletion_count # => should be x + 1
b. When the feature flag is enabled
Feature.enable(:set_delete_failed_container_repository)
When the feature flag is enabled and `failed_deletion_count` exceeds `ContainerRepository::MAX_DELETION_FAILURES`, the repository is marked as `delete_failed` and won't be picked up again.
repository.update(next_delete_attempt_at: Time.zone.now)
ContainerRepository.next_pending_destruction(order_by: nil)
#=> should return the repository so the worker has something to work on
repository.failed_deletion_count # => x
ContainerRegistry::DeleteContainerRepositoryWorker.new.perform_work
repository.reload.status
# => delete_failed if x is above 10
It should no longer be picked up by `next_pending_destruction`, even if we update `next_delete_attempt_at`:
repository.update(next_delete_attempt_at: Time.zone.now)
repository.reload.status
# => delete_failed
ContainerRepository.next_pending_destruction(order_by: nil)
#=> should NOT return the repository
Related to #480652 (closed)