Draft: JobWaiter#wait to report its progress after a Redis error
What does this MR do and why?
JobWaiter#wait
can encounter a Redis error while it is inside its main
loop of work. This happens particularly on staging-ref, where we see
errors thrown during the BLPOP
operation:
Redis::CommandError
LOADING Redis is loading the dataset in memory
When this happens, the calling worker never receives an updated
jobs_remaining
count.
When the calling worker restarts later, it will become stuck waiting for jobs that were popped off the Redis list before the Redis error happened and were never known about by the calling worker.
This change has JobWaiter#wait
recover from a Redis error in order
to report the progress that was made.
Screenshots or screen recordings
Recording of the performing the QA steps on master
and this branch. In this branch, we see "All jobs collected"
. On master
the worker gets stuck.
Before | After |
---|---|
qa-master | qa-branch |
How to set up and validate locally
In a rails console, mock a scenario where 200,000 jobs have notified their completion. We use a large number like 200,000 to give us enough time to restart Redis mid-way through JobWaiter#wait
collecting them:
INITIAL_COUNT = 200_000
KEY = Gitlab::JobWaiter.generate_key
INITIAL_COUNT.times { |i| Gitlab::JobWaiter.notify(KEY, i) }
Initialize a mock worker that will collect the jobs:
worker = Proc.new do |jobs_remaining|
while jobs_remaining > 0
puts "Waiting for #{jobs_remaining} jobs"
waiter = Gitlab::JobWaiter.new(jobs_remaining, KEY)
waiter.wait
jobs_remaining = waiter.jobs_remaining
end
puts "All jobs collected" if jobs_remaining == 0
# Locally I get an "Errno::EMFILE Too many open files - socket(2)" error when Redis is restarted.
# On `master`, you'll get a Redis::CannotConnectError so we handle that too.
rescue Errno::EMFILE, Redis::CannotConnectError
sleep 1
retry
end
Now, be start the worker loop:
worker.call(INITIAL_COUNT)
Immediately gdk restart redis
before the mock worker has collected the 200,000 job completion notifications, so Redis restarts midway through the JobWaiter#wait
loop.
The worker should eventually report "All jobs collected"
.
If we do the same steps on master
the worker will get stuck.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #427674