
CTDB NFS monitoring improvements

I made some NFS monitoring improvements in !3672 (closed). After seeing this behave strangely due to an unexpected NFS-Ganesha stall at startup, I decided to make some more changes.

However, to be able to sanely test the new changes, I need to rework some of CTDB's NFS RPC unit test code. I realised that in the previous MR I had made things worse, so I decided to fix it properly. So, that's the first 10 commits in this MR. 😃

Following that, there are 3 fairly simple changes:

  • Only consider statistics on timeout

    If an RPC service isn't registered then there is no point consulting statistics because the service can't work. The statistics were only really introduced to mitigate timeouts when a service is still making progress.

  • Make initial statistics output empty

    This makes it more likely that a failure to fetch statistics results in unchanged statistics.

  • Avoid flapping NFS services at startup

    At startup, don't count failures before becoming unhealthy or restarting services; just return unhealthy. This avoids the stupidity of a service with unhealthy_after=2 being treated as healthy on the 1st monitor event, resulting in a healthy node, and then becoming unhealthy on the 2nd monitor event. The counting really exists to mitigate failures under load, and there is no load at startup. Services should be reliable at startup.
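
To make the combined effect of these 3 changes easier to follow, here is a minimal sketch of the decision logic in Python. It is not the actual CTDB event script code (that is shell), and every name in it (`MonitorState`, `check_service`, the `"ok"`/`"timeout"`/`"unregistered"` result strings, the `seen_healthy` flag) is hypothetical and exists only for illustration. Only `unhealthy_after` comes from the description above. The sketch assumes that "statistics unchanged" means the current statistics output is identical to the previous one.

```python
from dataclasses import dataclass


@dataclass
class MonitorState:
    """Per-service state kept between monitor events (hypothetical)."""
    prev_stats: str = ""        # initial statistics output is empty
    fail_count: int = 0         # failures counted since the last success
    seen_healthy: bool = False  # False at startup, until the first success


def check_service(state, result, stats, unhealthy_after=2):
    """Return "healthy" or "unhealthy" for one monitor event.

    result: "ok", "timeout" or "unregistered" (assumed outcomes of the
            RPC check).
    stats:  current statistics output; "" if fetching statistics failed.
    """
    if result == "ok":
        state.fail_count = 0
        state.seen_healthy = True
        return "healthy"

    # Only consider statistics on timeout: an unregistered service
    # can't work, so there is no point consulting statistics for it.
    if result == "timeout":
        progressed = stats != state.prev_stats
        state.prev_stats = stats
        if progressed:
            # The service is slow but still making progress.
            return "healthy"

    # At startup, don't count failures: just report unhealthy.
    if not state.seen_healthy:
        return "unhealthy"

    # After startup, tolerate failures until the threshold is reached.
    state.fail_count += 1
    if state.fail_count >= unhealthy_after:
        return "unhealthy"
    return "healthy"
```

With this sketch, a failed statistics fetch at startup yields `""`, which matches the empty initial value, so it counts as unchanged and the timeout is reported as unhealthy straight away. Once a check has succeeded, a service with `unhealthy_after=2` tolerates a single failure before the node is marked unhealthy.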

There are more details in the commit messages of the last 3 commits.

Checklist

  • Commits have Signed-off-by: with name/author being identical to the commit author
  • (optional) This MR is just one part towards a larger feature.
  • (optional, if backport required) Bugzilla bug filed and BUG: tag added
  • Test suite updated with functionality tests
  • Test suite updated with negative tests
  • Documentation updated
  • CI timeout is 3h or higher (see Settings/CICD/General pipelines/ Timeout)

Reviewer's checklist:

  • There is a test suite reasonably covering new functionality or modifications
  • Function naming, parameters, return values, types, etc., are consistent and according to README.Coding.md
  • This feature/change has adequate documentation added
  • No obvious mistakes in the code
Edited by Amitay Isaacs
