Skip to content

[MCCS-699] Stability and thread safety

Drew Devereux requested to merge mccs-699-stability into main

The MR addresses known stability issues.

Three things have been done on this branch:

  1. Several unit tests related to healthState were being skipped because they were occasionally deadlocking. These were unskipped so that the problem could be investigated, but it turned out that the problem had already been resolved by previous work. The tests are now holding up under hundreds of retries.

  2. After a full day of chasing an instability, it finally revealed itself as caused by multiple threads concurrently in a non-threadsafe method. That individual case was easily fixed, but it raised the question whether there are other places in the code where this problem could occur. To investigate this, I created a metaclass that keeps track of the threads that are currently running in each (bound) method. The metaclass raises an exception if a thread tries to enter a method that already has a thread in it. This neat little piece of tooling enabled me to find several other places in the code where thread contention was occurring.

  3. The metaclass also revealed a more significant problem where a thread was re-entering a method, and then deadlocking trying to acquire a lock that it was already holding. This was indirectly solved by making the ObjectComponentManager use the message queue, which it should really have been doing all along. This broke quite a few tests, which just needed to be tweaked a bit to get them passing again. In most cases I just dropped in a sleep, pending a more complete solution to the problem of testing in the presence of asynchrony.

A little more detail on the metaclass: once I started making methods threadsafe, I needed a way of telling the metaclass that it shouldn't get upset if it finds multiple threads in them. This was done by introducing a @threadsafe decorator.

I am uncertain whether it is appropriate to leave this tooling in place. To justify leaving it in, I have created a flag that turns the thread checking functionality off/on, and set it to False by default. Thus the thread checking will only run if a developer turns it on. The @threadsafe decorators themselves cost nothing.

Merge request reports