
Do not run bulk cron indexer when cluster is unhealthy

What does this MR do and why?

Related to #415101 (closed)

In the related issue, we saw customers' jobs hang while waiting to connect to an Elasticsearch instance that was unreachable or unhealthy (some jobs took > 60 seconds). The cron worker is scheduled to run every minute and spawns 16 more jobs (one per shard), which all then hang as well.

This MR prevents the cron worker from processing a shard or enqueueing the 16 shard jobs when the search cluster is unreachable or unhealthy. The health check result is cached for 5 minutes to give the cluster time to recover.
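At a high level, the change amounts to a guard at the top of the worker's perform plus a cached health check. The sketch below illustrates that pattern only; it is not the actual GitLab code. The Gitlab::Elastic::Helper.default.healthy? call, the Sidekiq::Worker include, and the plain Sidekiq.logger call are assumptions (the real worker uses GitLab's own worker and structured logging infrastructure), while the cache key matches the one used in the validation steps below.

    # A simplified standalone sketch of the guard pattern, not the actual GitLab worker.
    class ElasticIndexBulkCronWorker
      include Sidekiq::Worker # assumption: the real worker uses GitLab's ApplicationWorker

      # Cache key taken from the validation steps in this MR description.
      HEALTHY_CACHE_KEY = ['Gitlab::Elastic::Helper', :healthy?].freeze

      def perform(shard_number = nil)
        unless cluster_healthy?
          Sidekiq.logger.warn(
            'Elasticsearch cluster is unhealthy or unreachable. ' \
            'ElasticIndexBulkCronWorker execution is skipped.'
          )
          return false
        end

        # ...existing behaviour: process the given shard, or enqueue one job per shard...
      end

      private

      # Cache the result for 5 minutes so a struggling cluster is not pinged on
      # every 1-minute cron tick while it recovers.
      def cluster_healthy?
        Rails.cache.fetch(HEALTHY_CACHE_KEY, expires_in: 5.minutes) do
          Gitlab::Elastic::Helper.default.healthy? # assumed health-check method
        end
      end
    end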

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

  1. Stop Elasticsearch: gdk stop elasticsearch
  2. Open a Rails console.
  3. Delete the cache for healthy?: Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
  4. Run the bulk cron worker: ElasticIndexBulkCronWorker.new.perform. It should return false.
  5. Verify that a warning appears in the Sidekiq logs:
2023-06-14_14:33:04.98440 rails-background-jobs                 : {"severity":"WARN","time":"2023-06-14T14:33:04.984Z","message":"Elasticsearch cluster is unhealthy or unreachable. ElasticIndexBulkCronWorker execution is skipped.","retry":0}
  6. Start Elasticsearch: gdk start elasticsearch
  7. Delete the cache for healthy? again: Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
  8. Run the bulk cron worker again: ElasticIndexBulkCronWorker.new.perform. It should return an array (see the console sketch below).
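
Putting the console steps together, the expected behaviour looks roughly like this (return values are paraphrased, not exact output):

    # With Elasticsearch stopped:
    Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
    ElasticIndexBulkCronWorker.new.perform
    # => false  (and the WARN line above appears in the Sidekiq log)

    # After `gdk start elasticsearch`, clear the cached health check and retry:
    Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
    ElasticIndexBulkCronWorker.new.perform
    # => [...]  (an array, meaning the shards were processed/enqueued as usual)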

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

