
Elastic cluster health checker

Madelein van Niekerk requested to merge 409573-elastic-health-checker into master

What does this MR do and why?

Adds a health checker for the Elastic cluster that can be called as Search::ClusterHealthCheck::Elastic.healthy?. This is part 2 of the implementation plan.

The logic for the existing health check is transferred so that we can iterate on it. At the moment it only takes action when the cluster status is red.

When the feature flag log_advanced_search_cluster_health_elastic is enabled, we also log useful metrics related to the cluster health:

  1. load_average: the N highest load_average values across all nodes (os.cpu.load_average.1m)
  2. heap_usage: the N highest heap_used_percent values across all nodes (jvm.mem.heap_used_percent)
  3. utilization: a percentage calculated by adding the saturation of heap_usage and load_average. Each metric has a saturation threshold and a multiplication factor. We will monitor the metrics and decide whether the factors need to be changed.
    1. load_average reaches 0.5 saturation at 15 and is multiplied by 1.
    2. heap_usage reaches 0.5 saturation at 90 and is multiplied by 0.8, so that load carries more weight than heap.
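
The weighting described above could be sketched roughly as follows. This is an illustration only, not the code in this MR: the constant names, the helper names (saturation, top_n, utilization), the saturation curve value / (value + half_point), and the normalization by the sum of the factors are all assumptions. The curve is chosen only because it returns 0.5 exactly at the stated threshold.

```ruby
# Hypothetical sketch. The numbers come from the description above;
# the curve shape and the normalization are assumptions.
LOAD_HALF_SATURATION = 15.0 # load_average value at which saturation is 0.5
HEAP_HALF_SATURATION = 90.0 # heap_used_percent at which saturation is 0.5
LOAD_FACTOR = 1.0
HEAP_FACTOR = 0.8           # lower factor, so load carries more weight than heap

# Assumed curve: 0.0 at zero, exactly 0.5 at half_point, approaching 1.0 above it.
def saturation(value, half_point)
  value.to_f / (value.to_f + half_point)
end

# Returns the N highest readings of one metric collected from all nodes.
def top_n(values, count)
  values.max(count)
end

# Combines both weighted saturations into a percentage between 0 and 100.
def utilization(load_average, heap_usage)
  weighted = saturation(load_average, LOAD_HALF_SATURATION) * LOAD_FACTOR +
             saturation(heap_usage, HEAP_HALF_SATURATION) * HEAP_FACTOR
  (weighted / (LOAD_FACTOR + HEAP_FACTOR) * 100).round(2)
end
```

Under these assumptions, a load_average of 15 together with a heap_usage of 90 gives a utilization of 50.0, because both metrics sit at their 0.5 saturation point.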

If the cluster does not return a valid response, we log a warning with the error message, return false, and skip the metrics calculation, e.g.:

[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276]  WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false
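
A minimal sketch of this failure handling, assuming the cluster request is passed in as a block and that "healthy" means any status other than red. The method signature and the injected logger are hypothetical and do not reflect the actual class in this MR.

```ruby
require 'logger'

# Hypothetical stand-in for the health check. The block plays the role of the
# request to the cluster and is expected to return the cluster status string.
def healthy?(logger: Logger.new($stderr))
  status = yield
  # Assumption: only a red status counts as unhealthy.
  status != 'red'
rescue StandardError => e
  # Connection failures (e.g. Errno::ECONNREFUSED) end up here:
  # log a warning, skip any metrics, and report the cluster as unhealthy.
  logger.warn(e.message)
  false
end
```

For example, healthy? { raise Errno::ECONNREFUSED } logs a WARN line and returns false, while healthy? { 'green' } returns true.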

The metrics are cached for 5 minutes and, if the flag is enabled, logged every minute.
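
The two intervals can be illustrated with a small plain-Ruby helper. This is a sketch under stated assumptions, not how the MR implements it: the ThrottledMetrics class, its method names, and the injectable clock are hypothetical, and the in-memory state stands in for whatever cache the application actually uses.

```ruby
# Hypothetical helper: caches a computed value for `ttl` seconds and runs the
# logging block at most once per `log_interval` seconds.
class ThrottledMetrics
  def initialize(ttl: 300, log_interval: 60, clock: -> { Time.now.to_f })
    @ttl = ttl
    @log_interval = log_interval
    @clock = clock # injectable so the timing can be tested without sleeping
    @cached_at = nil
    @logged_at = nil
    @value = nil
  end

  # Returns the cached value, recomputing it via the block once it has expired.
  def fetch
    now = @clock.call
    if @cached_at.nil? || now - @cached_at >= @ttl
      @value = yield
      @cached_at = now
    end
    @value
  end

  # Runs the block only if the log interval has elapsed; returns whether it ran.
  def log_if_due
    now = @clock.call
    return false if @logged_at && now - @logged_at < @log_interval

    @logged_at = now
    yield
    true
  end
end
```

With the defaults, the metrics block passed to fetch runs at most once every 5 minutes, while log_if_due lets a log line through at most once per minute.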

How to set up and validate locally

  1. Change the logger in ee/lib/search/cluster_health_check/elastic.rb:L93 to ::Logger.new($stdout)
  2. Verify ::Search::ClusterHealthCheck::Elastic.healthy? => true
  3. Note the metrics that are logged
  4. Disable the feature flag: Feature.disable(:log_advanced_search_cluster_health_elastic)
  5. Verify ::Search::ClusterHealthCheck::Elastic.healthy? => true
  6. Note that nothing is logged
  7. Stop the Elasticsearch cluster: gdk stop elasticsearch
  8. Enable the feature flag: Feature.enable(:log_advanced_search_cluster_health_elastic)
  9. Note that ::Search::ClusterHealthCheck::Elastic.healthy? logs a warning with the error message, returns false, and does not log any metrics:
[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276]  WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #409573

Edited by Madelein van Niekerk
