testserver: Fix Praefect resource exhaustion and assert Gitaly's healthiness

Patrick Steinhardt requested to merge pks-testhelper-gitaly-health-checks into master

This MR fixes two issues:

  1. When using the "test-with-praefect" target, most tests fork an extra Praefect instance which proxies connections to Gitaly. With our recent push to use t.Parallel(), there are now potentially hundreds of concurrent Praefect instances, which both consume a lot of memory and exhaust the connection pool of the Postgres database. The result is out-of-memory situations and tests failing because too many clients are connected to Postgres.

  2. When starting Gitaly with Praefect, we always wait until Gitaly becomes healthy before returning from the function so that Praefect can correctly route requests to it. Without Praefect we did not wait, though. It makes sense to wait for Gitaly to be healthy in the general case as well, so that we never run tests against a Gitaly that is not yet able to serve requests.

I think fixing 2 should resolve a number of flakes we're currently seeing, where the general failure mode is that RPCs randomly fail because Gitaly isn't yet healthy. I've seen this pattern in FetchSourceBranch, the Hook service and in some other places. I think I've also seen Praefect-based tests fail sometimes, which makes sense because we failed to check healthiness when using StartGitalyServer(), but only did so for RunGitalyServer().
