JobWaiter.notify key expiration time causes large-scale GitHub imports to get stuck
## About
This issue was created from #416306 (comment 1517705434).
`JobWaiter` is a class we use to track the completion of a collection of queued workers:
<details>
<summary>Click to expand to see how JobWaiter is used</summary>

- Our importers queue a number of workers to process data fetched from the GitHub API. For each worker queued, we increment a `JobWaiter` counter to track the count of jobs queued.
- As a worker completes, it calls `JobWaiter.notify` to report that it has completed. This pushes its Sidekiq job ID to a Redis set (what's pushed isn't important, only that an element is pushed to the set).
- `Gitlab::GithubImport::AdvanceStageWorker` is queued with the count of jobs queued. It calls `JobWaiter.wait` to clear the Redis set and decrement the count. It does this periodically until the count is 0 (when all jobs have notified that they completed).

</details>
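The flow above can be summarised in a small sketch. This is not the real `Gitlab::JobWaiter` implementation: the class name, key name, and method signatures below are illustrative, and only the Redis-set usage and the 6-hour expiry are taken from the description above.

```ruby
require 'redis'

# Minimal sketch of the notify/wait pattern described above (illustrative
# names; not the actual Gitlab::JobWaiter code).
class SketchJobWaiter
  KEY_TTL = 6 * 60 * 60 # the 6-hour notification key expiry

  def initialize(jobs_remaining, key, redis: Redis.new)
    @jobs_remaining = jobs_remaining # number of workers queued for this stage
    @key = key
    @redis = redis
  end

  # Called by each worker as it completes. What gets pushed doesn't matter,
  # only that an element lands in the set; every call also resets the expiry.
  def self.notify(key, jid, redis: Redis.new)
    redis.sadd(key, jid)
    redis.expire(key, KEY_TTL)
  end

  # Called periodically by AdvanceStageWorker: pop notifications off the set
  # and decrement the count until it reaches zero (all workers have reported).
  def wait(timeout: 10)
    deadline = Time.now + timeout

    while @jobs_remaining > 0 && Time.now < deadline
      if @redis.spop(@key)
        @jobs_remaining -= 1
      else
        sleep 1
      end
    end

    @jobs_remaining
  end
end
```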
## Problem
In the `issue_and_diff_notes` stage of the import, a large-scale GitHub import will get stuck if it takes `Importer::SingleEndpointDiffNotesImporter` longer than 6 hours to page through the many pages of a GitHub API endpoint.
This happens because the stage uses two importers, and the second importer can take longer than 6 hours to complete:
- The first importer, `IssuesImporter`, will spread the work across many workers.
- Then the second importer, `SingleEndpointDiffNotesImporter`, will run.
- Workers queued from `IssuesImporter` will in the meantime be completing, each calling `JobWaiter.notify` when they complete their work. This pushes some data to Redis and resets the notification key expiry to 6 hours.
- `SingleEndpointDiffNotesImporter` completes. If it takes longer than 6 hours to complete, the job notifications from workers queued by the `IssuesImporter` can have expired already.
- We queue an `AdvanceStageWorker` to start receiving the notifications. It will continue to check for the expired notifications for a day before being killed by `StuckProjectImportJobsWorker`. A timeline of this failure mode is sketched below.
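Using the sketch above, the failure mode reads as a timeline. The times mirror the rails/rails example below; the key name and job count are made up:

```ruby
key = 'github-importer/issue_and_diff_notes/<project-id>' # illustrative key

# 14:30 - the last issue worker finishes and notifies. Each notify resets the
#         key's TTL, so the notification set now expires at 20:30.
SketchJobWaiter.notify(key, 'some-sidekiq-jid')

# 14:30-20:30 - SingleEndpointDiffNotesImporter is still paging the GitHub
#               API. Nothing touches the key, so at 20:30 Redis expires it and
#               the accumulated notifications are lost.

# 21:30 - AdvanceStageWorker is finally queued and starts polling:
waiter = SketchJobWaiter.new(100, key)
waiter.wait(timeout: 10) # => still 100; the set is gone, so the count never
                         #    reaches 0 and StuckProjectImportJobsWorker kills
                         #    the import roughly a day later
```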
In a test import of rails/rails (#416306 (comment 1516333932)) we see the following timings:

- The final `Gitlab::Github::ImportIssueWorker` completed at 14:30 (log link).
- The `AdvanceStageWorker` with the matching `JobWaiter` cache key wasn't queued until 21:30 (log link).
- The `AdvanceStageWorker` continues to poll Redis, but its job count never decreases.
## Proposal
Increase the `JobWaiter.notify` expiry time for importers (at least the GitHub imports, but probably others). Allow our importers to set a higher expiry time.

A suggested expiry time is 7 days. For jobs that aren't stuck, the data is popped off the cache as normal, so the increase in expiry time won't affect the size of data in Redis. Only for jobs that do get stuck will the data be persisted for the 7 days. In #422979 (closed) we could have `AdvanceStageWorker` clear the cache when it realises it's stuck.
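One possible shape for this, continuing the sketch from above: add a `ttl:` keyword to `notify` and let importer workers pass a longer expiry. The 7-day value is the suggestion above; the signature, default, and constant names are assumptions, not the actual implementation.

```ruby
require 'redis'

class SketchJobWaiter
  DEFAULT_KEY_TTL  = 6 * 60 * 60       # current behaviour: 6 hours
  IMPORTER_KEY_TTL = 7 * 24 * 60 * 60  # proposed expiry for importers: 7 days

  # Hypothetical: notify accepts a TTL so long-running stages can keep their
  # notifications around until AdvanceStageWorker gets a chance to read them.
  def self.notify(key, jid, ttl: DEFAULT_KEY_TTL, redis: Redis.new)
    redis.sadd(key, jid)
    redis.expire(key, ttl)
  end
end

# An importer worker would then report completion with the longer expiry:
SketchJobWaiter.notify('github-importer/issue_and_diff_notes/<project-id>',
                       'some-sidekiq-jid',
                       ttl: SketchJobWaiter::IMPORTER_KEY_TTL)
```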