Implement ratio-based threshold for Zoekt lost node detection

Summary

Currently, the LostNodeEvent is skipped only when all search nodes are lost. This follow-up issue proposes a configurable ratio-based threshold so that lost node handling can be triggered before the entire cluster is compromised.

Background

In MR !209617 (merged), we implemented logic to skip LostNodeEvent when all search nodes are lost to avoid unnecessary event processing. However, as suggested in this comment, instead of waiting for the entire cluster to be compromised, we could implement a ratio-based approach with a configurable threshold.

Proposal

Introduce a new application setting that allows administrators to configure a threshold ratio for lost node detection. For example:

  • If 80% of nodes are lost, trigger the lost node handling workflow
  • If 100% of nodes are lost, skip the event (current behavior from !209617 (merged))

This would provide better operational control and allow for proactive handling of cluster degradation.
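
As a rough illustration of the two example rules above, here is a minimal Ruby sketch. The method name `lost_node_action`, the `threshold_percent` parameter (standing in for the proposed application setting), and the returned symbols are all hypothetical and not part of the existing code. The issue does not say what should happen below the threshold, so the sketch returns `:none` there as a placeholder.

```ruby
# Hypothetical helper for illustration only; not part of GitLab.
# threshold_percent stands in for the proposed application setting.
def lost_node_action(lost_count, total_count, threshold_percent: 80)
  # Nothing to decide when there are no nodes or no lost nodes.
  return :none if total_count.zero? || lost_count.zero?

  # All nodes lost: skip the event (current behavior from !209617).
  return :skip if lost_count >= total_count

  # Integer comparison avoids floating-point rounding at the boundary.
  return :handle if lost_count * 100 >= threshold_percent * total_count

  # Assumption: behavior below the threshold is not specified in this issue.
  :none
end
```

With the default of 80, `lost_node_action(8, 10)` returns `:handle` and `lost_node_action(10, 10)` returns `:skip`. Taking the threshold as an integer percentage is a design choice made for the sketch so that the boundary case (exactly 80%) is compared exactly.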

Implementation considerations

  • Add a new application setting for the lost node ratio threshold
  • Update the lost_nodes_check condition in Search::Zoekt::SchedulingService to use the configurable ratio
  • Ensure backward compatibility with the current behavior
  • Add appropriate tests for different ratio scenarios

Related