Fix registry client to allow for 404s and Enqueuer to not abort on all errors
Problem
The error
In testing the phase 2 container registry migration on staging, we started an import that resulted in an aborted migration:
gitlabhq_production=> select id, updated_at, migration_state, migration_aborted_in_state, migration_retries_count from container_repositories where migration_state <> 'default' order by updated_at desc;
id | updated_at | migration_state | migration_aborted_in_state | migration_retries_count
-------+----------------------------+-----------------+----------------------------+-------------------------
35796 | 2022-03-16 20:45:08.650119 | import_aborted | default | 1
35791 | 2022-03-16 14:45:02.717885 | import_aborted | default | 1
48115 | 2022-03-16 03:50:24.267328 | import_aborted | importing | 2
28273 | 2022-03-02 15:45:02.880413 | import_aborted | default | 1
42704 | 2022-02-11 10:01:06.334941 | import_aborted | importing | 1
77881 | 2022-02-11 10:01:06.321232 | import_aborted | importing | 1
77891 | 2022-02-11 10:01:06.30645 | import_aborted | pre_importing | 1
77898 | 2022-02-11 10:01:06.290414 | import_aborted | pre_import_done | 1
(8 rows)
Repository ID 35796 was the one that failed. Looking through the code, we see that the only obvious reason something would go from default
to import_aborted
state is due to an exception: https://gitlab.com/gitlab-org/gitlab/-/blob/72e85657fb606f1c53e0b357cade2211aecdad47/app/workers/container_registry/migration/enqueuer_worker.rb#L29
The error can be found in the logs by searching json.extra.next_repository_id: 35796
: https://nonprod-log.gitlab.net/goto/23b5efe0-a577-11ec-b3a6-472d0398dd6e
The error is:
NoMethodError
undefined method `[]' for nil:NilClass
lib/container_registry/gitlab_api_client.rb:57:in `block in import_status',
lib/container_registry/gitlab_api_client.rb:88:in `with_import_token_faraday',
lib/container_registry/gitlab_api_client.rb:55:in `import_status',
app/models/container_repository.rb:327:in `block in external_import_status',
lib/gitlab/utils/strong_memoize.rb:30:in `strong_memoize',
app/models/container_repository.rb:326:in `external_import_status',
app/models/container_repository.rb:279:in `retry_aborted_migration',
app/workers/container_registry/migration/enqueuer_worker.rb:35:in `handle_aborted_migration',
app/workers/container_registry/migration/enqueuer_worker.rb:21:in `perform',
...
Looking at the list of container_repositories above and at the log output, we see there is a value json.extra.next_aborted_repository_id: 28273
.
Before starting the next import, the Enqueuer first tries any previously aborted imports, so it found 28273
, and then requested the import status from the registry. The response did not include a body, so body_hash['status']
threw the error.
Looking at the registry side of things, we can see the log shows it received this request and returned a 404
: https://nonprod-log.gitlab.net/app/discover#/doc/pubsub-registry-inf-gstg/pubsub-registry-inf-gstg-000154?id=r1cwk38BIz3g9xmlwhJD
If the import has not yet started, the import status endpoint returns a 404 with no body, rails did not know how to handle this response.
The additional aborted imports
The error was thrown while handling an aborted repository, but you can see in the rescue clause, regardless of what error is thrown, we automatically abort the next qualified repository. This means that if we continued to let this run, every hour, another new repository would be aborted, further contributing to the problem.
Solutions
Proposed solution for the error:
Add a guard to GitlabApiClient#import_status
to handle the 404 error: https://gitlab.com/gitlab-org/gitlab/-/blob/6b02a49422145199ca53821acab0ffcb872b9275/lib/container_registry/gitlab_api_client.rb#L57
When a 404 is received, return 'pre_import_failed' which will trigger a pre-import retry.
This will resolve the problem, but hardcoding the response is not necessarily the best long term solution.
Alternative solution for the error:
Update the registry endpoint to respond when queried about a repository that has not yet started importing.
Proposed solution for the rescue:
Remove next_repository&.abort_import
from the rescue
in EnqueuerWorker
.
** Alternative solution for the rescue:**
Update the logic in the rescue to abort the next_repository
conditionally.