coordinator: Fix error comparison causing excessive replication jobs (!4349) · Merge requests · GitLab.org / Gitaly

When determining whether nodes need replication jobs, we also take into account the error status of each node: if a node returned an error that differs from the error returned by the primary node, we create a replication job. The underlying assumption is that if two nodes behave the same, they should also run into the same kind of error. If they returned different errors, they likely did different things and may have diverged.

This comparison is flawed, though: we typically handle gRPC-style errors in this context, and those cannot be directly compared with each other. As a result, even when two nodes returned the same error message and code, we label the errors as different and thus create replication jobs.

Fix this bug by manually comparing the error code and message in case we've got a gRPC error. Note that we do not do this for normal Go errors: it is unexpected in the first place to get anything but a gRPC error, so we treat these as a "weird" state and err on the side of caution.

Changelog: fixed

Fixes #4045 (closed)

Edited Feb 14, 2022 by Patrick Steinhardt
