Detect node failure by read/write failures

Problem to solve

Health checks can succeed even when there is a problem with the server, Gitaly, or Git. For example, the Gitaly process may be running and report itself healthy even though the persistent disk is unavailable, which would prevent any Git operation from succeeding.

Further details

Health checks are relatively slow to detect failures (on the order of seconds) and run far less frequently than Git operations. For example, a busy Gitaly node may handle hundreds of requests per second (get commit, get file, various list operations, and so on). If multiple health checks must fail before failover, thousands of operations may have failed before the system notices the failure.
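To make the scale concrete, here is a rough back-of-the-envelope sketch. The function name and the input values (300 requests per second, a 3-second check interval, 3 failed checks before failover) are assumptions chosen for illustration, not measured or configured values. It also assumes every request fails from the moment the node breaks until failover.

```go
package main

import "fmt"

// failedBeforeDetection estimates how many requests fail while waiting for
// health checks alone to trigger failover. Hypothetical helper: it assumes
// every request fails from the moment the node breaks until the required
// number of consecutive health checks have failed.
func failedBeforeDetection(requestsPerSecond, checkIntervalSeconds float64, checksRequired int) float64 {
	return requestsPerSecond * checkIntervalSeconds * float64(checksRequired)
}

func main() {
	// Illustrative values only: 300 requests/s, one check every 3 s,
	// 3 failed checks required before failover.
	fmt.Println(failedBeforeDetection(300, 3, 3))
}
```

With these example numbers the estimate is about 2,700 failed requests, which is the "thousands of operations" order of magnitude described above.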

Proposal

Praefect should monitor Gitaly RPC successes and failures and mark a node as bad if there are multiple failures in a row. For example, if 5 or 10 RPCs fail in a row, this is a strong signal that the node isn't working.

In the worst-case scenario, we fail over to an up-to-date replica, because strong consistency means we always have one: &1189 (closed).

Testing

Functional end-to-end coverage will be provided by a new failover test. Broad coverage will come from running the entire end-to-end test suite against environments that use Praefect for storage (including Staging).

Performance end-to-end testing will be implemented as part of gitlab-org/quality/performance#231 (closed) (see https://gitlab.com/gitlab-org/quality/team-tasks/-/issues/451 for a more detailed plan of performance tests under failure conditions).

Links / references

Edited May 26, 2020 by Mark Lapierre
