coordinator: Fix repo creation/removal race for up-to-date secondaries
When creating repositories, Rails will in some cases first schedule a
call to `RemoveRepository()` to make sure that there's no old repository
obstructing the path we are about to create the repository at. Given
that `RemoveRepository()` isn't transactional, this action will be
replicated to secondaries asynchronously. With our current design, these
replication jobs will not cause a bump of the repository generation
given that we may not even have an entry for the repository. Combined
with the fact that we do not take outstanding replication jobs into
account when computing whether a node is up-to-date or not, this means
that secondaries will be considered up-to-date even though they have
deletions pending. This results in a race between the replication jobs
which are about to delete the target repository and the subsequent call
to `CreateRepository()` et al. If this race is lost (that is, the
deletion wasn't scheduled before `CreateRepository()` gets executed,
which is almost always), then we will end up with the repository getting
deleted during or after its creation.
This race needs to be fixed via two changes:
1. We need to take outstanding modifying replication jobs into
account when computing whether a node is up to date or not.
Otherwise, we may be in the middle of processing a transactional
RPC when a replication job kicks in and modifies repository state
while we're operating on it.
2. With (1) implemented, we need to make `RemoveRepository()`
transactional, otherwise it would always cause secondaries to be
considered out-of-date and repository creation wouldn't use
transactions in most cases.
This MR implements (2) and enables transactional behaviour for `RemoveRepository()`.
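To illustrate change (1), here is a minimal sketch of an up-to-date check that also looks at outstanding modifying replication jobs. It is not the actual coordinator code: the types `Node` and `ReplicationJob`, the function `isUpToDate`, and all field names are hypothetical and exist only for this example.

```go
package main

// NOTE: every identifier in this file is hypothetical. The sketch only
// illustrates the idea and does not mirror the real coordinator code.

// JobState describes where a replication job is in its lifecycle.
type JobState int

const (
	JobReady JobState = iota
	JobInProgress
	JobCompleted
)

// ReplicationJob is a simplified view of a queued replication job.
type ReplicationJob struct {
	TargetNode string   // node the job will be applied to
	Repository string   // repository the job operates on
	Modifying  bool     // true if the job changes repository state, such as a deletion
	State      JobState // current lifecycle state
}

// Node is a simplified view of a storage node and the generation it
// has recorded for the repository in question.
type Node struct {
	Name       string
	Generation int
}

// isUpToDate reports whether node may be treated as up to date for repo.
// A node qualifies only if it is on the latest generation AND has no
// outstanding modifying replication job targeting that repository.
func isUpToDate(node Node, repo string, latestGeneration int, jobs []ReplicationJob) bool {
	if node.Generation < latestGeneration {
		return false
	}
	for _, job := range jobs {
		pending := job.State != JobCompleted
		if pending && job.Modifying && job.TargetNode == node.Name && job.Repository == repo {
			// A pending deletion (or other modification) could change
			// the repository while a transactional RPC operates on it.
			return false
		}
	}
	return true
}
```

Under these assumptions, a secondary with a queued deletion for the repository is reported as out of date even though its generation matches. This is also why (2) is needed: as long as `RemoveRepository()` is replicated through such jobs, every secondary would fail this check right before the repository is created.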
Part of #3669 (closed)