Elastic cluster health checker (!127731) · Merge requests · GitLab.org / GitLab

Madelein van Niekerk requested to merge 409573-elastic-health-checker into master Jul 27, 2023

What does this MR do and why?

Adds a health checker for the Elastic cluster that can be called as Search::ClusterHealthCheck::Elastic.healthy?. This is part 2 of the implementation plan.

The logic of the existing health check is moved into the new checker so that we can iterate on it. For now, it only takes action when the cluster status is red.

When the feature flag log_advanced_search_cluster_health_elastic is enabled, we also log useful metrics related to the cluster health:

load_average: the N highest load_average values across all nodes (os.cpu.load_average.1m)
heap_usage: the N highest heap_used_percent values across all nodes (jvm.mem.heap_used_percent)
utilization: a percentage calculated by adding the saturation of heap_usage and load_average. Each metric has a saturation threshold and a multiplication factor. We will monitor the metrics and decide whether the factors need to be changed.
1. load_average reaches 0.5 saturation at 15 and is multiplied by 1.
2. heap_usage reaches 0.5 saturation at 90 and is multiplied by 0.8, so that load carries more weight than heap.
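
To make the weighting above concrete, here is a minimal Ruby sketch. It is not the code from this MR. The description does not state the saturation curve or how the sum becomes a percentage, so the sketch assumes a saturation of value / (value + half_point), which is 0.5 at the stated threshold, and divides the weighted sum by the sum of the factors. The names SKETCH_METRICS, top_values, saturation and utilization are hypothetical, and sample_nodes is hand-written data shaped like the per-node section of a nodes stats response.

```ruby
# Hypothetical sketch, not the implementation from this MR.
# Assumption: saturation = value / (value + half_point), which is 0.5 at half_point.
# Assumption: the weighted sum is divided by the sum of factors to give a percentage.
SKETCH_METRICS = {
  load_average: { path: %w[os cpu load_average 1m], half_point: 15.0, factor: 1.0 },
  heap_usage: { path: %w[jvm mem heap_used_percent], half_point: 90.0, factor: 0.8 }
}.freeze

# Returns the n highest values of one metric across all nodes, highest first.
def top_values(nodes, path, n)
  nodes.values.filter_map { |node| node.dig(*path) }.max(n)
end

# Maps a raw metric value to a saturation in the range 0.0...1.0.
def saturation(value, half_point)
  return 0.0 unless value.positive?

  value / (value + half_point)
end

# Combines one value per metric into a single percentage.
def utilization(values)
  weighted = SKETCH_METRICS.sum { |name, config|
    saturation(values.fetch(name).to_f, config[:half_point]) * config[:factor]
  }
  total_factor = SKETCH_METRICS.sum { |_name, config| config[:factor] }
  (weighted / total_factor * 100).round(2)
end

# Hand-written sample data, not real cluster output.
sample_nodes = {
  'node-a' => {
    'os' => { 'cpu' => { 'load_average' => { '1m' => 15.0 } } },
    'jvm' => { 'mem' => { 'heap_used_percent' => 90 } }
  },
  'node-b' => {
    'os' => { 'cpu' => { 'load_average' => { '1m' => 2.5 } } },
    'jvm' => { 'mem' => { 'heap_used_percent' => 40 } }
  }
}

highest_load = top_values(sample_nodes, SKETCH_METRICS[:load_average][:path], 1).first
highest_heap = top_values(sample_nodes, SKETCH_METRICS[:heap_usage][:path], 1).first
puts utilization(load_average: highest_load, heap_usage: highest_heap)
```

Under these assumptions, a cluster whose busiest node sits exactly at both thresholds comes out at 50 percent. Because heap_usage has the smaller factor, the same saturation in heap contributes less to the total than it would in load.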

If the cluster does not return a valid response, we only log a warning with the error message, return false, and do not calculate the metrics, e.g.:

[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276]  WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false
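
The failure path shown above can be sketched roughly as follows. This is a hypothetical stand-in, not the class added by this MR: SketchHealthCheck, its client callable and the injected logger are assumptions made for illustration, and treating every non-red status as healthy is an assumption based on the note that only a red status triggers action.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the health checker, for illustration only.
class SketchHealthCheck
  # client: any callable that returns a Hash with a 'status' key or raises.
  # logger: any object that responds to #warn.
  def initialize(client:, logger:)
    @client = client
    @logger = logger
  end

  # Assumption: only a red status counts as unhealthy.
  def healthy?
    @client.call.fetch('status') != 'red'
  rescue StandardError => e
    # No valid response: log the error message and skip the metrics entirely.
    @logger.warn(e.message)
    false
  end
end

log_output = StringIO.new
unreachable = -> { raise Errno::ECONNREFUSED, 'connect(2) for "localhost" port 9200' }
check = SketchHealthCheck.new(client: unreachable, logger: Logger.new(log_output))
puts check.healthy?
```

Injecting the client and logger keeps the sketch runnable without a cluster. The real checker may obtain both differently.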

The metrics are cached for 5 minutes and will be logged every 1 minute if the flag is enabled.
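
The two intervals can be pictured with a small time-based gate. This is a sketch only: the MR's actual caching and log throttling mechanism is not shown in this description, and TimedGate, its clock parameter and the sample metrics hash are hypothetical.

```ruby
# Hypothetical in-memory gate, for illustration only.
class TimedGate
  def initialize(interval_seconds, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval_seconds
    @clock = clock
    @last_run = nil
    @value = nil
  end

  # Runs the block at most once per interval and remembers its result;
  # in between, returns the remembered result without running the block.
  def fetch
    now = @clock.call
    if @last_run.nil? || now - @last_run >= @interval
      @value = yield
      @last_run = now
    end
    @value
  end
end

metrics_cache = TimedGate.new(5 * 60) # recompute metrics at most every 5 minutes
log_gate = TimedGate.new(60)          # write a log line at most every minute

# Placeholder values, not real metrics.
metrics = metrics_cache.fetch { { utilization: 50.0 } }
log_gate.fetch { puts metrics.inspect }
```

With the log interval shorter than the cache interval, several consecutive log lines can carry the same cached values.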

How to set up and validate locally

1. Change the logger in ee/lib/search/cluster_health_check/elastic.rb:L93 to ::Logger.new($stdout).
2. Verify that ::Search::ClusterHealthCheck::Elastic.healthy? returns true.
3. Note the metrics that are logged.
4. Disable the feature flag: Feature.disable(:log_advanced_search_cluster_health_elastic)
5. Verify that ::Search::ClusterHealthCheck::Elastic.healthy? returns true.
6. Note that nothing is logged.
7. Stop the Elasticsearch cluster: gdk stop elasticsearch
8. Enable the feature flag: Feature.enable(:log_advanced_search_cluster_health_elastic)
9. Note that ::Search::ClusterHealthCheck::Elastic.healthy? logs a warning with the error message, returns false, and does not log any metrics:

[1] pry(main)> ::Search::ClusterHealthCheck::Elastic.healthy?
W, [2023-08-04T17:04:43.520053 #95276]  WARN -- : Failed to open TCP connection to localhost:9200 (Connection refused - connect(2) for "localhost" port 9200)
=> false

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #409573

Edited Aug 07, 2023 by Madelein van Niekerk
