Skip to content

testserver: Fix deadlock caused by Praefect spawn tokens

In order to limit the resource usage by the Praefect proxy that we use when running test-with-praefect, we have introduced a Praefect spawn token. This token limits us to at most 16 instances of the proxy at the same time and thus keeps both the Postgres connection count as well as memory usage at bay.

Unfortunately, this protection mechanism causes deadlocks in our CI system. Nowadays, many of our tests run parallelized and thus try to spawn many Praefect proxies in parallel, too. In the context of table driven tests, we have two different ways to spawn Praefect:

- In case the server setup can be shared it will be the outer test
  that creates the Praefect proxy. This is the preferred way of
  doing things as it means less overhead caused by recreating the
  test setup.

- Some tests need to set up the Praefect server as part of the
  innermost testcase, e.g. because the different cases require
  different server options.

Even though the former way is preferred, both ways have their usecases.

The deadlock can now happen when we run both kinds of testcases at the same time via parallelized tests:

1. A testcase that sets up Praefect in the outer test runs first and
   spawns Praefect, thus acquiring a spawn token. The table-driven
   test cases do not yet start to run.

2. A testcase that sets up Praefect in the inner test starts to run
   now. It will try to acquire the spawn token, which is currently
   blocked by (1).

3. The table-driven test cases are now unblocked, but because Go
   tests only allow a maximum number of concurrent tests to run at
   the same time, they are now blocked waiting on (2).

We thus have a deadlock: we can't return the Praefect spawn token as the subtests cannot run, and we cannot unblock a thread as we're waiting for a spawn token. The root cause of this issue is thus hidden shared state between all the tests, which is extremely fragile and hard to debug.

While we could play games and try to remove parallelization in a subset of tests, this feels like the wrong way to handle this issue as it would only increase the fragility of our test suite further. Instead, we get rid of the spawn token completely. While the original intent was to save on resources like e.g. the Postgres connection count, we have long since bumped the maximum allowed number of connections to somewhat accommodate for the fact that our tests may create many connections at the same time. Tests performed both locally and in our CI systems have not shown any issues around resource starvation until now.

Despite fixing the deadlock issue and removing hidden shared state, this change also speeds up test runs significantly. One of the symptomps that were observed locally is that tests became slower and slower, which is because they couldn't be parallelized anymore and thus ran essentially sequential. By removing the token we stop serializing them, which leads to properly-saturated CPU cores as it should ideally be the case for tests. Execution times previous to this change fluctuated wildly between 80 to 260 seconds for the Operations service alone, often deadlocking completely. After the change, the tests finish consistently in about 60 seconds.

Fixes #5474 (closed).

Merge request reports