
JobWaiter.notify key expiration time causes large-scale GitHub imports to get stuck

About

This issue was created from #416306 (comment 1517705434).

JobWaiter is a class we use to track the completion of a collection of queued workers:

How JobWaiter is used:
  1. Our importers queue a number of workers to process data fetched from the GitHub API. For each worker queued, we increment a JobWaiter counter to track how many jobs are outstanding.
  2. As a worker completes, it calls JobWaiter.notify to report that it has finished. This pushes its Sidekiq job ID to a Redis set (what's pushed isn't important; what matters is that an element is added to the set).
  3. Gitlab::GithubImport::AdvanceStageWorker is queued with the count of jobs queued. It calls JobWaiter.wait to clear the Redis set and decrement the count, and does this periodically until the count reaches 0 (i.e. until all jobs have reported completion).
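As a rough illustration of that flow, here is a minimal, self-contained sketch of the notify/wait pattern. It is not the actual Gitlab::JobWaiter implementation; the class name, key format, TTL constant, and the BLPOP-based wait loop are assumptions for illustration only.

```ruby
# Minimal sketch of the notify/wait pattern (NOT the real Gitlab::JobWaiter).
require 'redis'
require 'securerandom'

class SimpleJobWaiter
  TIMEOUT = 6 * 60 * 60 # notification key expiry in seconds (6 hours)

  attr_reader :key, :jobs_remaining

  def initialize(jobs_remaining = 0, key = "job_waiter:#{SecureRandom.uuid}")
    @jobs_remaining = jobs_remaining
    @key = key
    @redis = Redis.new
  end

  # Step 2: a worker reports completion by pushing its Sidekiq job ID and
  # resetting the key's expiry.
  def self.notify(key, jid, redis: Redis.new)
    redis.lpush(key, jid)
    redis.expire(key, TIMEOUT)
  end

  # Step 3: AdvanceStageWorker pops completions and decrements the count,
  # returning once the count reaches 0 or the timeout elapses.
  def wait(timeout = 10)
    deadline = Time.now + timeout

    while @jobs_remaining > 0 && Time.now < deadline
      popped = @redis.blpop(key, timeout: 1) # blocks up to 1s for a notification
      @jobs_remaining -= 1 if popped
    end

    @jobs_remaining
  end
end
```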

Problem

In the issue_and_diff_notes stage of the import, a large-scale GitHub import gets stuck if Importer::SingleEndpointDiffNotesImporter takes longer than 6 hours to page through the many pages of a GitHub API endpoint.

This happens because the stage uses two importers, and the second importer can take longer than 6 hours to complete:

  1. The first importer, IssuesImporter, will spread the work across many workers.
  2. Then the second importer, SingleEndpointDiffNotesImporter, will run.
  3. In the meantime, workers queued by IssuesImporter complete and each calls JobWaiter.notify as it finishes. Each notification pushes data to Redis and resets the notification key's expiry to 6 hours.
  4. SingleEndpointDiffNotesImporter completes. If it took longer than 6 hours, the job notifications from the workers queued by IssuesImporter may already have expired.
  5. We queue an AdvanceStageWorker to start receiving the notifications. It keeps checking for the now-expired notifications for a day before being killed by StuckProjectImportJobsWorker.
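To make the failure mode concrete, here is a sketch of what happens to the notification key in Redis. The key name is made up, the timings mirror the rails/rails example below, and the 6-hour TTL is the current JobWaiter expiry:

```ruby
require 'redis'

redis = Redis.new
key = 'job_waiter:issue_and_diff_notes-example' # illustrative key name

# 14:30 - the last ImportIssueWorker finishes and notifies:
redis.lpush(key, 'last-issue-worker-jid')
redis.expire(key, 6 * 60 * 60) # expiry reset to 6 hours from now, i.e. 20:30

# ... SingleEndpointDiffNotesImporter is still paging through the GitHub API,
# so nothing else touches the key ...

# 20:30 - no further notify call has reset the expiry, so Redis drops the key.

# 21:30 - AdvanceStageWorker is finally queued and starts polling:
redis.exists?(key) # => false: the completions it is waiting for are gone,
                   # so its job count never reaches 0 and it keeps polling
                   # until StuckProjectImportJobsWorker kills it.
```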

In a test import of rails/rails #416306 (comment 1516333932) we see the following timings:

  • The final Gitlab::GithubImport::ImportIssueWorker completed at 14:30 (log link).
  • An AdvanceStageWorker with the matching JobWaiter cache key wasn't queued until 21:30 (log link).
  • The AdvanceStageWorker continues to poll Redis but its job count never decreases.

Proposal

Increase the JobWaiter.notify expiry time for importers (at least the GitHub imports, but probably others).

Allow our importers to set a higher expiry time.

A suggested expiry time is 7 days. For jobs that aren't stuck, the data is popped off the cache as normal, so the increased expiry time won't affect the amount of data in Redis; only for jobs that do get stuck will the data persist for the full 7 days. In #422979 (closed) we could have AdvanceStageWorker clear the cache when it realises it's stuck.
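One way the proposal could look, building on the sketch above. The ttl: keyword argument and constant names here are assumptions, not the actual change:

```ruby
require 'redis'

class SimpleJobWaiter
  DEFAULT_TTL = 6 * 60 * 60 # current behaviour: 6 hours

  # Same notify as in the sketch above, but the expiry can now be overridden
  # per caller.
  def self.notify(key, jid, ttl: DEFAULT_TTL, redis: Redis.new)
    redis.lpush(key, jid)
    redis.expire(key, ttl)
  end
end

# Importers that page through large GitHub endpoints could then opt in to the
# longer expiry suggested above (7 days):
SimpleJobWaiter.notify('job_waiter:example-key', 'example-jid', ttl: 7 * 24 * 60 * 60)
```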
