Imports stuck in the "pre_import_complete" state due to a retry race condition
Problem
Yesterday we spotted several repositories on the registry side that have a migration status of pre_import_complete. We then realized these had a migration status of import_done on the Rails side, and we found evidence that their final import did happen on the registry side.
As an example:
id | 1262995
top_level_namespace_id | 36816
parent_id |
created_at | 2022-05-10 09:10:41.184244+00
updated_at | 2022-05-10 09:38:03.39809+00
name | master
path | z...s/z...l/master
migration_status | pre_import_complete
deleted_at |
migration_error |
Here we can see that the first pre-import took more than 10 minutes, which triggered a cancelation request at 09:30:03.152. This was followed by a status check at 09:30:03.360 and then another pre-import starting at 09:30:03.477, just a few hundred milliseconds after the cancelation.
Cancelations on the registry take up to 5 seconds. When we receive a cancelation request, we immediately update the status in the database to (pre_)import_canceled. However, the ongoing (pre-)import goroutine in the background is not canceled immediately. It checks every 5 seconds whether the migration was canceled (a select to see if the migration status has changed), and if so, it stops right there. In this case, the new pre-import was accepted before the previous goroutine had time to notice the cancelation and stop, so more than one pre-import was in progress for the same repository. The final import then completed and set the migration status to import_complete in the database. One of the in-progress pre-imports completed after the final import and overwrote the migration status from import_complete to pre_import_complete.
Solution
We have two problems to address here:
- We should not accept a pre-import if the target repository status is pre_import_canceled and its updated_at attribute is set to a timestamp that's less than 5 seconds ago. This will guarantee that we always give enough time for any ongoing pre-import goroutine to detect the cancelation and stop before accepting a new one.
- At the end of (pre-)imports we should check the migration status of the repository in the database and only set it to (pre_)import_complete if it hasn't been canceled.
