Elastic cluster health checker
What does this MR do and why?
Adds a health checker for the Elastic cluster that can be called as Search::ClusterHealthCheck::Elastic.healthy?. This is part 2 of the implementation plan.
The logic for the existing health check is transferred so that we can iterate on it. At the moment it only takes action when the cluster status is red.
When the feature flag log_advanced_search_cluster_health_elastic is enabled, we also log useful metrics related to the cluster health:
- load_average: the N highest load_average values from all nodes (os.cpu.load_average.1m)
- heap_usage: the N highest heap_used_percent values from all nodes (jvm.mem.heap_used_percent)
- utilization: a calculation that adds the saturation for heap_usage and load_average to give a percentage. Each metric has a saturation threshold and a multiplication factor. We will monitor the metrics and decide if the factors need to be changed.
  - load_average reaches 0.5 saturation at 15 and is multiplied by 1
  - heap_usage reaches 0.5 saturation at 90 and is multiplied by 0.8 so that load carries more weight than heap.
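The metric collection and the utilization calculation described above could be sketched roughly as follows. This is only an illustration: the saturation curve (value / (value + half_point)), the normalization by the sum of the factors, and the names top_values, saturation and utilization are assumptions, not the code in this MR. Only the thresholds (15 and 90), the factors (1 and 0.8) and the stats paths come from the description.

```ruby
# Illustrative sketch only: the curve and the normalization are assumptions.
LOAD_HALF_POINT = 15.0 # load_average reaches 0.5 saturation here (from the description)
HEAP_HALF_POINT = 90.0 # heap_usage reaches 0.5 saturation here (from the description)
LOAD_FACTOR = 1.0
HEAP_FACTOR = 0.8

# Returns the n highest values found at `path` across all nodes of a
# nodes-stats style hash, in descending order.
def top_values(node_stats, path, n)
  node_stats.fetch('nodes', {}).values
    .map { |node| node.dig(*path) }
    .compact
    .max(n)
end

# Assumed saturation curve: 0 at 0, 0.5 at half_point, approaching 1 above it.
def saturation(value, half_point)
  value / (value + half_point)
end

# Weighted sum of both saturations, normalized by the total weight and
# expressed as a percentage.
def utilization(load_average, heap_usage)
  weighted = LOAD_FACTOR * saturation(load_average.to_f, LOAD_HALF_POINT) +
    HEAP_FACTOR * saturation(heap_usage.to_f, HEAP_HALF_POINT)
  (weighted / (LOAD_FACTOR + HEAP_FACTOR) * 100).round(1)
end
```

Under these assumptions, a load average of 15 together with a heap usage of 90 gives a utilization of 50.0, because both metrics sit exactly at their 0.5 saturation point.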
If the cluster does not return a valid response, we only log a warning with the error message, return false, and do not calculate the metrics. For example:
[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276] WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false
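The failure behaviour shown above can be sketched as a small Ruby class. This is a minimal, hypothetical stand-in and not the implementation in this MR: ClusterHealthSketch and the client's cluster_status method are made-up names, and treating every status other than red as healthy is an assumption based on the statement that the check only acts on a red status.

```ruby
require 'logger'

# Hypothetical stand-in for the real checker. `client` is any object whose
# `cluster_status` returns 'green', 'yellow' or 'red', or raises when the
# cluster cannot be reached.
class ClusterHealthSketch
  def initialize(client, logger: Logger.new($stdout))
    @client = client
    @logger = logger
  end

  # Unhealthy only when the status is red (assumption). Any error is logged
  # as a warning and reported as unhealthy; no metrics are calculated.
  def healthy?
    @client.cluster_status != 'red'
  rescue StandardError => e
    @logger.warn(e.message)
    false
  end
end
```

Passing the logger in as an argument also makes it easy to point it at $stdout when validating locally.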
The metrics are cached for 5 minutes and will be logged every 1 minute if the flag is enabled.
How to set up and validate locally
- Change the logger in ee/lib/search/cluster_health_check/elastic.rb:L93 to ::Logger.new($stdout)
- Verify ::Search::ClusterHealthCheck::Elastic.healthy? => true
- Note the metrics that are logged
- Disable the feature flag: Feature.disable(:log_advanced_search_cluster_health_elastic)
- Verify ::Search::ClusterHealthCheck::Elastic.healthy? => true
- Note that nothing is logged
- Stop elasticsearch cluster: gdk stop elasticsearch
- Enable the feature flag: Feature.enable(:log_advanced_search_cluster_health_elastic)
- Note that ::Search::ClusterHealthCheck::Elastic.healthy? logs a warning with the error message, returns false and does not have a log containing metrics:
[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276] WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #409573