Health Check can fail when node_stats cannot get metrics

Summary

In a customer ticket (ZD 543884) (Internal) we saw a case where a health check was failing unexpectedly, preventing ElasticIndexInitialBulkCronWorker from executing.

We consistently saw errors indicating that the cluster was unhealthy, both before and after a reindex. The errors look like this:

elasticsearch.log.10:{"severity":"ERROR","time":"2024-06-21T00:50:05.824Z","meta.caller_id":"Elastic::MigrationWorker","correlation_id":"629ab4e40a7e03b3606d1417cc67d73e","meta.root_caller_id":"Cronjob","meta.feature_category":"global_search","meta.client_id":"ip/","class":"Elastic::MigrationWorker","message":"Advanced search cluster is unhealthy. Execution is skipped.","job_status":"running","queue":"default","jid":"82ec80a6e118ab731d68c709"}

The customer provided logs from the cluster, and we could clearly see that the cluster status was green. We originally suspected an intermittent connection issue, but the message appeared consistently every 5 minutes, and in our tests the Elastic helper could reach the cluster, so we know the cluster was connected and reachable.

We knew that processing was indeed not working because the size of the indexing queues was large:

Indexing Queues
Initial queue: 161563
Incremental queue: 26817

In the end, the only indication that the cluster was being treated as unhealthy was the repeated message:

"advanced search cluster is unhealthy. ElasticIndexInitialBulkCronWorker execution is skipped.

We confirmed that the cluster was indeed failing the health check and that non_cached_metrics was returning nil:

irb(main):002:0> Search::ClusterHealthCheck::Elastic.healthy?
=> false
irb(main):003:0> Search::ClusterHealthCheck::Elastic.non_cached_metrics
=> nil

non_cached_metrics should return the latest metrics from the cluster. We ran the code manually and found that the node_stats.map call was raising an error:

# Define the client which uses the helper to formulate how we connect to the cluster
def client
  @client ||= Gitlab::Elastic::Helper.default.client
end

# Gets the stats from the ES node
def node_stats
  @node_stats ||= client.nodes.stats(metric: %w[os jvm])['nodes']
end

# Test node_load_averages - I expect this to fail
node_stats.map { |node| node.last['os']['cpu']['load_average']['1m'] }

# Retrieve all node stats
node_stats

Results:

irb(main):007:0> node_stats.map { |node| node.last['os']['cpu']['load_average']['1m'] }
(irb):7:in `block in <top (required)>': undefined method `[]' for nil:NilClass (NoMethodError)

When looking at node_stats we could see the response was:

"os"=>
    {"timestamp"=>1721934044835,
     "cpu"=>{"percent"=>2},

In this case, the load_average hash (and therefore its 1m key) is missing from the response. We expected something like:

   "os"=>
    {"timestamp"=>1721930608943,
     "cpu"=>{"percent"=>0, "load_average"=>{"1m"=>0.0, "5m"=>0.0, "15m"=>0.0}},

According to the Elasticsearch documentation:

(field is not present if one-minute load average is not available).
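The difference between the two responses can be illustrated with a small, self-contained sketch. The hash values are copied from the output above; the node IDs are invented for illustration. It uses Hash#dig, which returns nil when an intermediate key is missing instead of raising NoMethodError, to list the nodes that do not report a one-minute load average.

```ruby
# Sample data shaped like the two responses above.
# The node IDs ('node-windows', 'node-linux') are made up for illustration.
node_stats = {
  'node-windows' => {
    'os' => { 'timestamp' => 1721934044835, 'cpu' => { 'percent' => 2 } }
  },
  'node-linux' => {
    'os' => {
      'timestamp' => 1721930608943,
      'cpu' => { 'percent' => 0, 'load_average' => { '1m' => 0.0, '5m' => 0.0, '15m' => 0.0 } }
    }
  }
}

# Hash#dig returns nil instead of raising when a key along the path is missing.
load_averages = node_stats.transform_values do |stats|
  stats.dig('os', 'cpu', 'load_average', '1m')
end

# IDs of the nodes that do not report a one-minute load average.
nodes_without_load_average = load_averages.select { |_id, load| load.nil? }.keys

puts load_averages.inspect
puts nodes_without_load_average.inspect
```

In a Rails console, the same two expressions could be run against the real node_stats defined earlier to see which nodes are affected.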

In this case, the customer is running Elasticsearch on a Windows VM. The Elasticsearch version itself is supported, so the Windows host is the primary differentiator: on that platform, load_average and possibly other metrics may not be available. As a result, the cluster cannot pass the health check, and there is no easy way to prove that it is healthy so that indexing can proceed. This wasn't a problem in 15.11 because the health check was not introduced until 16.

Steps to reproduce

Did not directly reproduce.

What is the current bug behavior?

Search::ClusterHealthCheck::Elastic.healthy? returns false if some metrics cannot be retrieved, even when the cluster itself is healthy.

What is the expected correct behavior?

The health check should not fail if the cluster is indeed healthy. We should skip a check if the metric is not available and log that to elasticsearch.log.

Relevant logs and/or screenshots

Possible fixes

We could check whether load_average and similar fields are actually present in the node stats and skip the corresponding checks when they are not.
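A minimal sketch of that idea follows. It is not the existing GitLab implementation: the method name load_average_healthy?, the LOAD_AVERAGE_LIMIT threshold, and the logger are all hypothetical and chosen only for illustration. A node that does not report a one-minute load average is logged and treated as passing, rather than causing the whole check to fail.

```ruby
require 'logger'

# Hypothetical threshold and logger; neither is taken from the GitLab source.
LOAD_AVERAGE_LIMIT = 10.0
LOGGER = Logger.new($stdout)

# Returns true when every node checked reports a one-minute load average
# below the limit. A node that does not report the metric is logged and
# treated as passing. Evaluation stops at the first node over the limit.
def load_average_healthy?(node_stats, limit: LOAD_AVERAGE_LIMIT, logger: LOGGER)
  node_stats.all? do |node_id, stats|
    load = stats.dig('os', 'cpu', 'load_average', '1m')

    if load.nil?
      logger.warn("load_average is not available for node #{node_id}; skipping this check")
      next true
    end

    load < limit
  end
end
```

With the Windows response shown in the summary, this would return true and emit one warning per node, so the missing metric stays visible in the logs without blocking indexing.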