Draft: JobWaiter#wait to report its progress after a Redis error (!134585) · Merge requests · GitLab.org / GitLab

Luke Duncalfe requested to merge 427674-job-waiter-resilient-to-command-errors-in-loop into master Oct 18, 2023

What does this MR do and why?

JobWaiter#wait can encounter a Redis error while it is inside its main loop of work. This happens particularly on staging-ref, where we see errors thrown during the BLPOP operation:

Redis::CommandError
LOADING Redis is loading the dataset in memory

When this happens, the calling worker never receives an updated jobs_remaining count.

When the calling worker restarts later, it will become stuck waiting for jobs that were popped off the Redis list before the Redis error happened and were never known about by the calling worker.

This change has JobWaiter#wait recover from a Redis error in order to report the progress that was made.

#427674 (comment 1609270970)

Screenshots or screen recordings

Recording of the performing the QA steps on master and this branch. In this branch, we see "All jobs collected". On master the worker gets stuck.

Before	After
qa-master	qa-branch

How to set up and validate locally

In a rails console, mock a scenario where 200,000 jobs have notified their completion. We use a large number like 200,000 to give us enough time to restart Redis mid-way through JobWaiter#wait collecting them:

INITIAL_COUNT = 200_000
KEY = Gitlab::JobWaiter.generate_key

INITIAL_COUNT.times { |i| Gitlab::JobWaiter.notify(KEY, i) }

Initialize a mock worker that will collect the jobs:

worker = Proc.new do |jobs_remaining|
  while jobs_remaining > 0
    puts "Waiting for #{jobs_remaining} jobs"

    waiter = Gitlab::JobWaiter.new(jobs_remaining, KEY)
    waiter.wait

    jobs_remaining = waiter.jobs_remaining
  end

  puts "All jobs collected" if jobs_remaining == 0
# Locally I get an "Errno::EMFILE Too many open files - socket(2)" error when Redis is restarted. 
# On `master`, you'll get a Redis::CannotConnectError so we handle that too.
rescue Errno::EMFILE, Redis::CannotConnectError
  sleep 1
  retry
end

Now, be start the worker loop:

worker.call(INITIAL_COUNT)

Immediately gdk restart redis before the mock worker has collected the 200,000 job completion notifications, so Redis restarts midway through the JobWaiter#wait loop.

The worker should eventually report "All jobs collected".

If we do the same steps on master the worker will get stuck.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #427674

Edited Oct 19, 2023 by Luke Duncalfe

Draft: JobWaiter#wait to report its progress after a Redis error

What does this MR do and why?

Screenshots or screen recordings

How to set up and validate locally

MR acceptance checklist

Merge request reports