Retry failed ES deletions
Summary
We use the Delete By Query API to perform bulk-deletions in ES. The API can return failed deletions, and it looks like we need to check for these and retry, similar to the recent change to retry bulk indexing.
We should also verify that errors in the non-bulk Delete API correctly raise an error in our Ruby code.
Improvements
Similar to https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/14657, we should retry deletions 3 times and fail background jobs if they don't succeed.
ES itself will also perform the deletion query in batches, with its own retry logic:
During the
_delete_by_query
execution, multiple search requests are sequentially executed in order to find all the matching documents to delete. Every time a batch of documents is found, a corresponding bulk request is executed to delete all these documents. In case a search or bulk request got rejected,_delete_by_query
relies on a default policy to retry rejected requests (up to 10 times, with exponential back off). Reaching the maximum retries limit causes the_delete_by_query
to abort and all failures are returned in thefailures
of the response. The deletions that have been performed still stick. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures that are returned by the failing bulk request are returned in the failures element; therefore it’s possible for there to be quite a few failed entities.
It's also possible that documents are not deleted because they have been updated while the deletion query is running:
_delete_by_query
gets a snapshot of the index when it starts and deletes what it finds using internal versioning. That means that you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the delete request is processed. When the versions match the document is deleted.
These get reported in version_conflicts
in the response, and we might want to pass conflicts=proceed
to the request to avoid stopping at the first conflict.
Involved components
-
ElasticIndexerWorker#remove_documents_by_project_id
and#remove_children_documents
Elasticsearch::Git::Repository#delete_index_for_commits_and_blobs
GemExtensions::Elasticsearch::Model::Indexing::InstanceMethods#delete_document
Optional: Intended side effects
This should help to avoid the ES index getting out of sync with the DB (another approach for this is in https://gitlab.com/gitlab-org/gitlab-ee/issues/14035).
Optional: Missing test coverage
I'm not sure yet how to reproduce a document deletion failure.