Add AdvanceStageWorker unsticking logic (fail fast(er) or continue to next stage)
About
Some of our importers (Jira issues, GitHub, BitBucket Server) have an AdvanceStageWorker that polls for job completion notifications and advances to the next stage of the import when all notifications have been collected.
Problem
Sometimes AdvanceStageWorker can get stuck waiting for jobs that it knows were queued, because a bug prevents it from ever collecting the notification that the job completed (for example #422976 (closed)).
When this happens, AdvanceStageWorker continues to requeue every 30 seconds until it is eventually classed as "stuck" by something like StuckProjectImportJobsWorker and the import is failed.
While it is stuck for hours, the import appears as importing... in the UI until it eventually fails.
In #416306 we found that, at least sometimes, all jobs did complete successfully and only the notification was not collected, due to a bug. If the AdvanceStageWorker simply advanced to the next stage, the import would eventually finish successfully.
Proposal
The AdvanceStageWorker would realise it was stuck if its waiting job count hadn't decreased for 2 hours.
It would then apply one of two strategies:
- Pessimistic - fail the import now rather than wait another 22 hours for StuckProjectImportJobsWorker to do it.
- Optimistic - advance to the next stage.
The particular strategy would be passed via the API endpoints for the Jira issues, GitHub, and BitBucket Server importers. The default strategy could be optimistic because, based on tests in #416306, a stuck AdvanceStageWorker is often caused by a bug in collecting the notifications, while the stage has actually finished successfully.
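The detection rule and the two strategies could be sketched roughly as below. This is a minimal illustration, not the actual implementation: StuckDetector, action_when_stuck, the Hash used as a store, and the strategy symbols are all hypothetical names chosen for the example. Only the 2 hour threshold and the two strategy outcomes come from the proposal.

```ruby
# Hypothetical sketch; none of these names come from the GitLab codebase.
class StuckDetector
  STUCK_AFTER_SECONDS = 2 * 60 * 60 # 2 hours, per the proposal

  # store: any Hash-like object standing in for a shared cache.
  # clock: callable returning the current time in seconds (injectable for tests).
  def initialize(store, clock: -> { Time.now.to_i })
    @store = store
    @clock = clock
  end

  # Records the current waiting job count and returns true when the count
  # has not decreased for at least STUCK_AFTER_SECONDS.
  def stuck?(import_id, waiting_count)
    now = @clock.call
    previous = @store[import_id]

    if previous.nil? || waiting_count < previous[:count]
      # First observation, or progress was made: restart the timer.
      @store[import_id] = { count: waiting_count, since: now }
      return false
    end

    now - previous[:since] >= STUCK_AFTER_SECONDS
  end
end

# Maps a strategy to the action the worker would take once it is stuck.
def action_when_stuck(strategy)
  case strategy
  when :pessimistic then :fail_import
  when :optimistic then :advance_to_next_stage
  else raise ArgumentError, "unknown strategy: #{strategy.inspect}"
  end
end
```

Tracking the time of the last decrease, rather than the time of the last check, means a worker that keeps requeuing every 30 seconds without progress is still detected.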
It would write an error log when it realised it was stuck, logging the stuck job waiter cache keys, and the number of jobs it was expecting.
It would also clear the job waiter cache by calling #clear_waiter_caches. This will help mitigate the effect of a longer job waiter cache expiry in #422976 (closed).
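As a rough illustration of the logging and cache clearing step, the sketch below writes one structured error entry and then removes the waiter keys. Everything here is an assumption made for the example: report_and_clear_stuck_waiters is a hypothetical helper, the cache is a plain Hash rather than the real job waiter cache, the log field names are invented, and the real #clear_waiter_caches may behave differently.

```ruby
require 'json'
require 'logger'
require 'stringio'

# Hypothetical helper: logs the stuck state, then clears the waiter keys.
# cache:   plain Hash standing in for the real job waiter cache.
# waiters: Hash of waiter cache key => number of jobs still expected.
def report_and_clear_stuck_waiters(logger, cache, waiters)
  payload = {
    message: 'AdvanceStageWorker is stuck',
    waiter_keys: waiters.keys,
    expected_jobs: waiters.values.sum
  }
  logger.error(JSON.generate(payload))

  # Stand-in for #clear_waiter_caches: drop each waiter key from the cache.
  waiters.each_key { |key| cache.delete(key) }
end
```

Logging before clearing keeps the stuck keys and expected counts available for debugging after the cache entries are gone.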
The logic would go in Gitlab::Import::AdvanceStage.
We would feature flag the change.
Documentation
We should also update the documentation for all the importers to describe this strategy and how it works.