Elastic cluster health checker
What does this MR do and why?
Adds a health checker for the Elastic cluster. It can be called as Search::ClusterHealthCheck::Elastic.healthy? and is part 2 of the implementation plan.
The logic for the existing health check is transferred so that we can iterate on it. At the moment it only takes action when the cluster status is red.
When the feature flag log_advanced_search_cluster_health_elastic is enabled, we also log useful metrics related to the cluster health:
- load_average: the N highest load_average values from all nodes (os.cpu.load_average.1m)
- heap_usage: the N highest heap_used_percent values from all nodes (jvm.mem.heap_used_percent)
- utilization: a calculation which adds the saturation for heap_usage and load_average to give a percentage. Each metric has a saturation threshold and a multiplication factor. We will monitor the metrics and decide if the factors need to be changed.
  - load_average reaches 0.5 saturation at 15 and is multiplied by 1
  - heap_usage reaches 0.5 saturation at 90 and is multiplied by 0.8 so that load carries more weight than heap.
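As a rough illustration of how a utilization figure could be derived from the two metrics, here is a minimal sketch. It is not the code from this MR. It assumes a saturation curve of the form value / (value + k), which reaches 0.5 when the value equals k, and it assumes the weighted sum is normalized by the sum of the factors to produce a percentage. The names METRICS, saturation and utilization are hypothetical.

```ruby
# Hypothetical sketch; the constants mirror the numbers in this description,
# but the curve and the normalization are assumptions.
METRICS = {
  load_average: { half_saturation_at: 15.0, factor: 1.0 },
  heap_usage: { half_saturation_at: 90.0, factor: 0.8 }
}.freeze

# Assumed curve: value / (value + k) equals 0.5 when value == k
# and approaches 1.0 as the value keeps growing.
def saturation(value, half_saturation_at)
  return 0.0 unless value.positive?

  value / (value + half_saturation_at)
end

# Adds the weighted saturations and scales the result to 0..100.
# Dividing by the sum of the factors is an assumption of this sketch.
def utilization(values)
  weighted = METRICS.sum do |name, config|
    saturation(values.fetch(name).to_f, config[:half_saturation_at]) * config[:factor]
  end
  total_factor = METRICS.sum { |_name, config| config[:factor] }

  (weighted / total_factor * 100).round(2)
end
```

With these assumptions, a cluster sitting exactly at both thresholds (load_average 15, heap_usage 90) would score 50, and the 0.8 factor makes a rise in heap move the score less than a comparable rise in load saturation.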
If the cluster does not return a valid response, we only log a warning with the error message, return false, and skip the metrics calculation, e.g.:
[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276] WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false
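The failure handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in ee/lib/search/cluster_health_check/elastic.rb: the class name ClusterHealthSketch and the fetch_status callable are hypothetical, and treating every status other than red as healthy is an assumption based on the note that only a red status triggers action.

```ruby
require 'logger'

# Hypothetical stand-in for the checker. `fetch_status` is any callable that
# returns the cluster status string or raises when the cluster is unreachable.
class ClusterHealthSketch
  def initialize(fetch_status:, logger: Logger.new($stdout))
    @fetch_status = fetch_status
    @logger = logger
  end

  def healthy?
    # Assumption: anything other than "red" counts as healthy.
    @fetch_status.call != 'red'
  rescue StandardError => e
    # No valid response: log a warning and report unhealthy.
    # Metrics would not be calculated on this path.
    @logger.warn(e.message)
    false
  end
end
```

Injecting the logger also makes it easy to point it at $stdout when validating locally.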
The metrics are cached for 5 minutes and will be logged every 1 minute if the flag is enabled.
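To make the caching behaviour concrete, here is a minimal in-memory sketch of a time-limited cache with a 5 minute lifetime. It only illustrates the idea; the MR presumably relies on the application cache, and the TtlCache class and its injectable clock are hypothetical. The once-per-minute logging is not modelled here.

```ruby
# Hypothetical in-memory cache that recomputes its value once the TTL expires.
class TtlCache
  def initialize(ttl_seconds:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl_seconds
    @clock = clock
    @value = nil
    @expires_at = nil
  end

  # Returns the cached value, calling the block only when nothing is cached
  # yet or the previous value has expired.
  def fetch
    now = @clock.call
    if @expires_at.nil? || now >= @expires_at
      @value = yield
      @expires_at = now + @ttl
    end
    @value
  end
end

# Metrics would be recomputed at most once every 5 minutes.
METRICS_CACHE = TtlCache.new(ttl_seconds: 5 * 60)
```

A caller would wrap the expensive node stats request, for example `METRICS_CACHE.fetch { collect_metrics }`, where collect_metrics is a placeholder for whatever gathers the values.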
How to set up and validate locally
- Change the logger in ee/lib/search/cluster_health_check/elastic.rb:L93 to ::Logger.new($stdout)
- Verify ::Search::ClusterHealthCheck::Elastic.healthy? => true
- Note the metrics that are logged
- Disable the feature flag: Feature.disable(:log_advanced_search_cluster_health_elastic)
- Verify ::Search::ClusterHealthCheck::Elastic.healthy? => true
- Note that nothing is logged
- Stop the Elasticsearch cluster: gdk stop elasticsearch
- Enable the feature flag: Feature.enable(:log_advanced_search_cluster_health_elastic)
- Note that ::Search::ClusterHealthCheck::Elastic.healthy? logs a warning with the error message, returns false and does not log any metrics:
[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276] WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #409573