Handle version conflict errors in ElasticDeleteProjectWorker
What does this MR do and why?
Handles Elasticsearch::Transport::Transport::Errors::Conflict errors in ElasticDeleteProjectWorker, which contributes 26% of Global Search's error budget.
The failure occurs when multiple processes try to change the same document in Elasticsearch at the same time. Elasticsearch keeps a version for each document precisely so that it can check the document hasn't changed before applying another change.
https://www.elastic.co/guide/en/elasticsearch/reference/8.6/optimistic-concurrency-control.html
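To illustrate the version check described above, here is a toy in-memory model in Ruby. It is only a sketch: VersionedStore and VersionConflict are hypothetical names made up for this example, not Elasticsearch APIs, and the real server-side check is more involved.

```ruby
# Toy model of per-document version checks (not the Elasticsearch client API).
# VersionConflict and VersionedStore are hypothetical names for illustration.
class VersionConflict < StandardError; end

class VersionedStore
  def initialize
    @docs = {} # id => { body:, version: }
  end

  # Create or update a document and bump its version.
  def put(id, body)
    current = @docs[id]
    version = current ? current[:version] + 1 : 1
    @docs[id] = { body: body, version: version }
    version
  end

  def version_of(id)
    @docs.dig(id, :version)
  end

  # Delete only if the document still has the version the caller saw.
  def delete(id, expected_version:)
    current = @docs[id]
    if current.nil? || current[:version] != expected_version
      raise VersionConflict, "document #{id} changed since version #{expected_version}"
    end

    @docs.delete(id)
    true
  end
end

store = VersionedStore.new
store.put('issue_1', 'first')

# Two processes read the same version before either of them acts.
seen_by_a = store.version_of('issue_1')
seen_by_b = store.version_of('issue_1')

store.delete('issue_1', expected_version: seen_by_a) # process A succeeds

conflict_raised =
  begin
    store.delete('issue_1', expected_version: seen_by_b) # process B is stale
    false
  rescue VersionConflict
    true
  end
```

In this model the second delete fails because the document changed after process B read it, which is the same kind of situation that produces the Conflict error in the worker.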
This happens most often in the remove_children_documents delete_by_query call (11607 out of 11675 times in the last 7 days).
The fix is to rescue the error and re-enqueue the worker with the same arguments after a delay, so that the version conflict has resolved by the time it tries again.
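The rescue-and-re-enqueue approach could look roughly like the sketch below. It is not the code from this MR: ConflictError stands in for Elasticsearch::Transport::Transport::Errors::Conflict so the example runs without the elasticsearch gem, DeleteProjectWorkerSketch and its in-memory job list are hypothetical, and the 60-second delay is an assumed value. The perform_in(interval, *args) shape mirrors Sidekiq's scheduling method.

```ruby
# Stand-in for Elasticsearch::Transport::Transport::Errors::Conflict so the
# sketch runs without the elasticsearch gem.
class ConflictError < StandardError; end

# Hypothetical worker. RETRY_DELAY_SECONDS and the recorded job list are
# assumptions for illustration, not values taken from the MR.
class DeleteProjectWorkerSketch
  RETRY_DELAY_SECONDS = 60

  class << self
    # Records scheduled jobs in memory instead of pushing them to Sidekiq.
    def scheduled_jobs
      @scheduled_jobs ||= []
    end

    # Same shape as Sidekiq's perform_in(interval, *args).
    def perform_in(delay, *args)
      scheduled_jobs << { delay: delay, args: args }
    end
  end

  # The block plays the role of the Elasticsearch delete calls.
  def initialize(&deletion)
    @deletion = deletion
  end

  def perform(project_id, es_id)
    @deletion.call(project_id, es_id)
    :done
  rescue ConflictError
    # Re-enqueue with the same arguments after a delay instead of failing.
    self.class.perform_in(RETRY_DELAY_SECONDS, project_id, es_id)
    :rescheduled
  end
end

attempts = 0
worker = DeleteProjectWorkerSketch.new do |_project_id, _es_id|
  attempts += 1
  raise ConflictError, 'version conflict' if attempts == 1
end

first_result = worker.perform(42, 'project_42')  # conflict, so it reschedules
job = DeleteProjectWorkerSketch.scheduled_jobs.first
second_result = worker.perform(*job[:args])      # the retry succeeds
```

Because the retry carries the same arguments, the second run repeats the deletion once the competing operation has finished.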
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
How to set up and validate locally
- Check out master
- Run ElasticDeleteProjectWorker twice for the same project concurrently using threads:
    project = Project.find(some_id)
    thread1 = Thread.new { ElasticDeleteProjectWorker.new.perform(project.id, project.es_id) }
    thread2 = Thread.new { ElasticDeleteProjectWorker.new.perform(project.id, project.es_id) }
    thread1.join
    thread2.join
- Note that it results in Elasticsearch::Transport::Transport::Errors::Conflict errors
- Check out this branch
- Run the threads again
- Note that it doesn't result in an error, and check that all the documents have been successfully removed from Elasticsearch
Related to #442823 (closed)