Skip to content

Draft: JobWaiter#wait to report its progress after a Redis error

What does this MR do and why?

JobWaiter#wait can encounter a Redis error while it is inside its main loop of work. This happens particularly on staging-ref, where we see errors thrown during the BLPOP operation:

Redis::CommandError
LOADING Redis is loading the dataset in memory

When this happens, the calling worker never receives an updated jobs_remaining count.

When the calling worker restarts later, it will become stuck waiting for jobs that were popped off the Redis list before the Redis error happened and were never known about by the calling worker.

This change has JobWaiter#wait recover from a Redis error in order to report the progress that was made.

#427674 (comment 1609270970)

Screenshots or screen recordings

Recording of the performing the QA steps on master and this branch. In this branch, we see "All jobs collected". On master the worker gets stuck.

Before After
qa-master qa-branch

How to set up and validate locally

In a rails console, mock a scenario where 200,000 jobs have notified their completion. We use a large number like 200,000 to give us enough time to restart Redis mid-way through JobWaiter#wait collecting them:

INITIAL_COUNT = 200_000
KEY = Gitlab::JobWaiter.generate_key

INITIAL_COUNT.times { |i| Gitlab::JobWaiter.notify(KEY, i) }

Initialize a mock worker that will collect the jobs:

worker = Proc.new do |jobs_remaining|
  while jobs_remaining > 0
    puts "Waiting for #{jobs_remaining} jobs"

    waiter = Gitlab::JobWaiter.new(jobs_remaining, KEY)
    waiter.wait

    jobs_remaining = waiter.jobs_remaining
  end

  puts "All jobs collected" if jobs_remaining == 0
# Locally I get an "Errno::EMFILE Too many open files - socket(2)" error when Redis is restarted. 
# On `master`, you'll get a Redis::CannotConnectError so we handle that too.
rescue Errno::EMFILE, Redis::CannotConnectError
  sleep 1
  retry
end

Now, be start the worker loop:

worker.call(INITIAL_COUNT)

Immediately gdk restart redis before the mock worker has collected the 200,000 job completion notifications, so Redis restarts midway through the JobWaiter#wait loop.

The worker should eventually report "All jobs collected".

If we do the same steps on master the worker will get stuck.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #427674

Edited by Luke Duncalfe

Merge request reports