coordinator: Fix repo creation/removal race for up-to-date secondaries

Patrick Steinhardt requested to merge pks-tx-remove-repository into master

When creating repositories, Rails will in some cases first schedule a call to `RemoveRepository()` to make sure that no old repository obstructs the path at which we are about to create the new one. Given that `RemoveRepository()` isn't transactional, the removal is replicated to secondaries asynchronously. With our current design, these replication jobs do not bump the repository generation, given that we may not even have an entry for the repository. Combined with the fact that we do not take outstanding replication jobs into account when computing whether a node is up to date, this means that secondaries are considered up to date even though they have deletions pending.

This results in a race between the replication jobs which are about to delete the target repository and the subsequent call to `CreateRepository()` et al. If this race is lost (that is, the deletion wasn't scheduled before `CreateRepository()` gets executed, which is almost always the case), the repository ends up getting deleted during or after its creation.

This race needs to be fixed via two changes:

1. We need to take outstanding modifying replication jobs into
   account when computing whether a node is up to date or not.
   Otherwise, we may be in the middle of processing a transactional
   RPC when a replication job kicks in and modifies repository state
   while we're operating on it.

2. With (1) implemented, we need to make `RemoveRepository()`
   transactional. Otherwise, it would always cause secondaries to be
   considered out of date, and repository creation wouldn't use
   transactions in most cases.

This MR implements (2) and enables transactional behaviour for `RemoveRepository()`.

Part of #3669 (closed)