Gitlab::RepositorySetCache#fetch is not atomic
Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.
Summary
Gitlab::RepositorySetCache#fetch is currently implemented as follows:
def fetch(key, &block)
full_key = cache_key(key)
smembers, exists = with do |redis|
redis.multi do |multi|
multi.smembers(full_key)
multi.exists?(full_key) # rubocop:disable CodeReuse/ActiveRecord
end
end
return smembers if exists
write(key, yield)
end
I.e.
- Atomic read and existence check
- Early return if key existed
- A possibly slow
yieldto a block, calculating a value - A write (which is atomic)
This means that two concurrent invocations of #fetch can both read exists == false and then race to write a result.
This can result in stale reads and false negatives for checks such as branch_names_include?. Example, where X is be an arbitrary web request or sidekiq job invoking the cached #branch_names method:
-
PostReceive 1clears thebranch_namescache -
Xinvokesbranch_names, and readsexists == false, but theyieldis slow -
PostReceive 1writes thebranch_namescache -
PostReceive 2clears thebranch_namescache for the push to a new branch -
Xfinally completes theyieldand writes the branch names cache -
PostReceive 2's invocation ofbranch_namesgetssmember == falseandexists == truein its#fetch, i.e. a false negative
Steps to reproduce
Injecting the following rspec test under describe #fetch in spec/lib/gitlab/repository_set_cache_spec.rb reproduces the bug:
it 'is atomic' do
key = 'race'
fiber_1 = Fiber.new do
cache.fetch(key) do
Fiber.yield
['main']
end
end
fiber_2 = Fiber.new do
cache.fetch(key) do
%w[main new]
end
end
# Start fiber_1, which will yield before writing
fiber_1.resume
# Run fiber_2 to completion
result_2 = fiber_2.resume
# Resume fiber_1 to complete its execution
result_1 = fiber_1.resume
# Both fibers should have different results, demonstrating the race condition
expect(result_1).to eq(['main'])
expect(result_2).to eq(%w[main new])
# The cache now contains the result from fiber_1, which is stale.
expect(cache.fetch(key) { ['something'] }).to eq(['main'])
end
Warning: A passing test means the bug is present. When fixing the bug, change the description and final expectation to whatever the decided desired behaviour is.
Example Project
What is the current bug behavior?
The third cache.fetch(key) the stale value ['main']
What is the expected correct behavior?
The third cache.fetch(key) { ... } should either invoke the block or return %w[main new].
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of:
sudo gitlab-rake gitlab:check SANITIZE=true)(For installations from source run and paste the output of:
sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true)(we will only investigate if the tests are passing)
Possible fixes
Implement optimistic locking using a third value, lock_version, in addition to smember and exists. This means the multi block in #write would need to be replaced with a lua script, and possibly also the multi block in #fetch.