Retry failed ES deletions

Summary

We use the Delete By Query API to perform bulk-deletions in ES. The API can return failed deletions, and it looks like we need to check for these and retry, similar to the recent change to retry bulk indexing.

We should also verify that errors in the non-bulk Delete API correctly raise an error in our Ruby code.

Improvements

Similar to https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/14657, we should retry deletions 3 times and fail background jobs if they don't succeed.

ES itself will also perform the deletion query in batches, with its own retry logic:

During the _delete_by_query execution, multiple search requests are sequentially executed in order to find all the matching documents to delete. Every time a batch of documents is found, a corresponding bulk request is executed to delete all these documents. In case a search or bulk request got rejected, _delete_by_query relies on a default policy to retry rejected requests (up to 10 times, with exponential back off). Reaching the maximum retries limit causes the _delete_by_query to abort and all failures are returned in the failures of the response. The deletions that have been performed still stick. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures that are returned by the failing bulk request are returned in the failures element; therefore it’s possible for there to be quite a few failed entities.

It's also possible that documents are not deleted because they have been updated while the deletion query is running:

_delete_by_query gets a snapshot of the index when it starts and deletes what it finds using internal versioning. That means that you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the delete request is processed. When the versions match the document is deleted.

These get reported in version_conflicts in the response, and we might want to pass conflicts=proceed to the request to avoid stopping at the first conflict.

Involved components

ElasticIndexerWorker#remove_documents_by_project_id and #remove_children_documents
Elasticsearch::Git::Repository#delete_index_for_commits_and_blobs
GemExtensions::Elasticsearch::Model::Indexing::InstanceMethods#delete_document

Optional: Intended side effects

This should help to avoid the ES index getting out of sync with the DB (another approach for this is in https://gitlab.com/gitlab-org/gitlab-ee/issues/14035).

Optional: Missing test coverage

I'm not sure yet how to reproduce a document deletion failure.

Admin message