Add AdvanceStageWorker unsticking logic (fail fast(er) or continue to next stage)

About

Some of our importers (Jira issues, GitHub, BitBucket Server) have an AdvanceStageWorker that polls for job completion notifications and advances to the next stage of the import when all notifications have been collected.

Problem

Sometimes AdvanceStageWorker can get stuck waiting for jobs that it knows were queued, but due to some bug it never collects the notification that the job was completed (for example #422976 (closed)).

When this happens, AdvanceStageWorker keeps requeuing itself every 30 seconds until it is eventually classed as "stuck" by something like StuckProjectImportJobsWorker, and the import is marked as failed.

While it is stuck, which can last for hours, the import appears as importing... in the UI until it eventually fails.

In #416306 (closed) we found that, at least in some cases, all jobs did complete successfully and the notification was simply not collected because of a bug. If the AdvanceStageWorker had just advanced to the next stage, the import would eventually have finished successfully.

Proposal

The AdvanceStageWorker would realise it was stuck if its waiting job count hadn't decreased for 2 hours.

It would then apply one of two strategies:

  • Pessimistic - fail the import now rather than wait another 22 hours for StuckProjectImportJobsWorker to do it.
  • Optimistic - advance to the next stage.
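
The decision described above could be sketched roughly as follows. This is only an illustration, not the actual Gitlab::Import::AdvanceStage code: the StuckDetector class, its method and argument names, and the returned symbols are all hypothetical. Only the 2 hour threshold and the two strategies come from the proposal.

```ruby
# Hypothetical sketch of the stuck check. Not the real
# Gitlab::Import::AdvanceStage implementation; all names are assumptions.
class StuckDetector
  STUCK_AFTER_SECONDS = 2 * 60 * 60 # 2 hours, as in the proposal
  STRATEGIES = %i[optimistic pessimistic].freeze

  def initialize(strategy: :optimistic)
    raise ArgumentError, "unknown strategy: #{strategy}" unless STRATEGIES.include?(strategy)

    @strategy = strategy
  end

  # previous_count: waiting job count seen on the previous run (nil on the first run)
  # current_count:  waiting job count now
  # last_change_at: Time at which the count last decreased
  # Returns [action, last_change_at to remember for the next run].
  def next_action(previous_count:, current_count:, last_change_at:, now: Time.now)
    # All notifications collected: the normal path.
    return [:advance, now] if current_count.zero?

    # First run, or progress was made: reset the timer and keep polling.
    return [:requeue, now] if previous_count.nil? || current_count < previous_count

    # No progress for the whole threshold: apply the chosen strategy.
    if now - last_change_at >= STUCK_AFTER_SECONDS
      action = @strategy == :optimistic ? :advance : :fail_import
      return [action, last_change_at]
    end

    # No progress yet, but still inside the threshold.
    [:requeue, last_change_at]
  end
end
```

The worker would need to persist the previous count and the time of the last decrease between requeues; where that state lives is left open here.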

The strategy would be passed via the API endpoints for the Jira issues, GitHub, and BitBucket Server importers. The default could be optimistic: tests in #416306 (closed) suggest that a stuck AdvanceStageWorker is often caused by a bug in collecting the notifications, and that the stage has actually finished successfully.

It would write an error log when it realised it was stuck, logging the stuck job waiter cache keys, and the number of jobs it was expecting.

It would also clear the job waiter cache by calling #clear_waiter_caches. This will help mitigate the effect of a longer job waiter cache expiry in #422976 (closed).
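
As a rough illustration of the logging and cache clearing steps, assuming a Hash-like stand-in for the job waiter cache: StuckHandler, MemoryLogger, and the log field names are hypothetical, and only the clear_waiter_caches name is taken from this issue. Its real signature and behaviour may differ.

```ruby
# Hypothetical sketch of what happens once the worker decides it is stuck.
# Only the clear_waiter_caches name comes from the issue; the rest is assumed.
class StuckHandler
  def initialize(cache:, logger:)
    @cache = cache   # Hash-like stand-in for the job waiter cache
    @logger = logger # anything that responds to #error
  end

  # waiters: Hash of job waiter cache key => number of jobs still expected
  def handle(project_id:, waiters:)
    @logger.error(
      message: 'AdvanceStageWorker is stuck',
      project_id: project_id,
      job_waiter_keys: waiters.keys,
      expected_jobs: waiters.values.sum
    )
    clear_waiter_caches(waiters)
  end

  private

  # Drop every stuck waiter key so stale entries do not outlive the stage.
  def clear_waiter_caches(waiters)
    waiters.each_key { |key| @cache.delete(key) }
  end
end

# Minimal in-memory logger, used only to make the sketch runnable.
class MemoryLogger
  attr_reader :errors

  def initialize
    @errors = []
  end

  def error(payload)
    @errors << payload
  end
end

cache = { 'waiter:a' => 3, 'waiter:b' => 1, 'unrelated' => 9 }
logger = MemoryLogger.new
StuckHandler.new(cache: cache, logger: logger)
            .handle(project_id: 42, waiters: { 'waiter:a' => 3, 'waiter:b' => 1 })
```

After the call, the logger holds one error entry listing both waiter keys and an expected job count of 4, and only the unrelated key remains in the cache.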

The logic would go in Gitlab::Import::AdvanceStage.

We would put the change behind a feature flag.

Documentation

We should also update the documentation for all affected importers to describe these strategies and how they work.

Edited by Luke Duncalfe