Spike: Determine why unpausing indexing stopped halfway through and make it more robust

Problem

We observed an instance where unpausing indexing did not requeue all the indexing jobs that had queued up while it was paused. Without good monitoring we might not have noticed, and some indexing would never have run.

Details

In gitlab-com/gl-infra/production#2408 (comment 386531952) I unpaused indexing and noticed that the unpausing did not fully finish. I couldn't tell why, since I only see a start log for ElasticIndexingControlWorker.

Perhaps this process was somehow killed in the middle of resuming all the jobs. Graphs indicate it just stopped at some point:

Since there were still ~260k jobs to be requeued I went ahead and kicked it off again:

[ gprd ] production> Elastic::IndexingControlService.new(ElasticCommitIndexerWorker).queue_size
=> 259652
[ gprd ] production> ElasticIndexingControlWorker.new.perform
=> true
[ gprd ] production> Elastic::IndexingControlService.new(ElasticCommitIndexerWorker).queue_size
=> 0

Solution

Figure out why it could have failed like this and make it more robust. If we cannot figure out why, we may choose to make it more robust in a different way. For example, we could periodically check whether there are still paused jobs and kick off the worker again if necessary.
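
The periodic-check idea could look roughly like the sketch below. It is a self-contained simulation, not GitLab code: FakePausedQueue, resume_all, and ensure_resumed are hypothetical names made up for illustration, and the "killed partway through" behaviour is faked with a fail_after counter. In the real system the check would presumably run on a schedule and call the existing worker instead.

```ruby
# Hypothetical stand-in for the store of jobs queued while indexing was paused.
# This is NOT the real Elastic::IndexingControlService; it only mimics its shape.
class FakePausedQueue
  def initialize(count, fail_after: nil)
    @jobs = Array.new(count) { |i| "job-#{i}" }
    @fail_after = fail_after # simulate an interruption after this many jobs
  end

  def size
    @jobs.size
  end

  # Requeue jobs in batches. Returns false if the run was "interrupted"
  # (simulated once via fail_after), true if the queue was fully drained.
  def resume_all(batch_size: 100)
    requeued = 0
    until @jobs.empty?
      if @fail_after && requeued >= @fail_after
        @fail_after = nil # only interrupt the first run
        return false
      end
      requeued += @jobs.shift(batch_size).size
    end
    true
  end
end

# Periodic check (hypothetical): while indexing is not paused but paused jobs
# remain, kick off the resume again. Bounded by max_attempts so a persistent
# failure cannot loop forever. Returns the number of resume attempts made.
def ensure_resumed(queue, paused:, max_attempts: 5)
  attempts = 0
  while !paused && queue.size > 0 && attempts < max_attempts
    attempts += 1
    queue.resume_all
  end
  attempts
end

queue = FakePausedQueue.new(1_000, fail_after: 300)
attempts = ensure_resumed(queue, paused: false)
puts "attempts=#{attempts} remaining=#{queue.size}"
```

The check relies only on the remaining queue size, so it recovers from an interrupted resume even when the cause of the interruption is unknown. It does nothing while indexing is still paused.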

Edited Dec 09, 2020 by John McGuire