
Do not run bulk cron indexer when cluster is unhealthy

What does this MR do and why?

Related to #415101 (closed)

In the related issue, we saw customers' jobs hang while waiting to connect to an Elasticsearch instance that was unreachable or unhealthy (some jobs took > 60 seconds). The cron worker is scheduled to run every minute and spawns 16 more jobs (one per shard), which all then hang as well.

This MR prevents the cron worker from processing a shard or enqueueing the 16 shard jobs when the search cluster is unreachable or unhealthy. The health check result is cached for 5 minutes to give the cluster time to recover.
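At a high level, the change amounts to a guard at the top of the worker's perform plus a cached health check. The sketch below illustrates that pattern only; it is not the actual GitLab code. The Gitlab::Elastic::Helper.default.healthy? call, the Sidekiq::Worker include, and the plain Sidekiq.logger call are assumptions (the real worker uses GitLab's own worker and structured logging infrastructure), while the cache key matches the one used in the validation steps below.

    # A simplified standalone sketch of the guard pattern, not the actual GitLab worker.
    class ElasticIndexBulkCronWorker
      include Sidekiq::Worker # assumption: the real worker uses GitLab's ApplicationWorker

      # Cache key taken from the validation steps in this MR description.
      HEALTHY_CACHE_KEY = ['Gitlab::Elastic::Helper', :healthy?].freeze

      def perform(shard_number = nil)
        unless cluster_healthy?
          Sidekiq.logger.warn(
            'Elasticsearch cluster is unhealthy or unreachable. ' \
            'ElasticIndexBulkCronWorker execution is skipped.'
          )
          return false
        end

        # ...existing behaviour: process the given shard, or enqueue one job per shard...
      end

      private

      # Cache the result for 5 minutes so a struggling cluster is not pinged on
      # every 1-minute cron tick while it recovers.
      def cluster_healthy?
        Rails.cache.fetch(HEALTHY_CACHE_KEY, expires_in: 5.minutes) do
          Gitlab::Elastic::Helper.default.healthy? # assumed health-check method
        end
      end
    end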

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

  1. Stop Elasticsearch: gdk stop elasticsearch
  2. Open a Rails console.
  3. Delete the cache for healthy?: Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
  4. Run the bulk cron worker: ElasticIndexBulkCronWorker.new.perform. It should return false.
  5. Verify that a warning appears in the Sidekiq logs:
2023-06-14_14:33:04.98440 rails-background-jobs                 : {"severity":"WARN","time":"2023-06-14T14:33:04.984Z","message":"Elasticsearch cluster is unhealthy or unreachable. ElasticIndexBulkCronWorker execution is skipped.","retry":0}
  6. Start Elasticsearch: gdk start elasticsearch
  7. Delete the cache for healthy? again: Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
  8. Run the bulk cron worker again: ElasticIndexBulkCronWorker.new.perform. It should return an array (see the console sketch below).
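
Putting the console steps together, the expected behaviour looks roughly like this (return values are paraphrased, not exact output):

    # With Elasticsearch stopped:
    Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
    ElasticIndexBulkCronWorker.new.perform
    # => false  (and the WARN line above appears in the Sidekiq log)

    # After `gdk start elasticsearch`, clear the cached health check and retry:
    Rails.cache.delete(['Gitlab::Elastic::Helper', :healthy?])
    ElasticIndexBulkCronWorker.new.perform
    # => [...]  (an array, meaning the shards were processed/enqueued as usual)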

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

