elastic: Intelligently retry bulk-insert failures when indexing
Discovered while working on gitlab-com/gl-infra/production#800 (closed)
We use the elastic bulk insert API to index many documents quickly. This is true for all repository content, and also for the initial import of database content.
It is a design feature of Elasticsearch that some bulk inserts can fail: https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster - essentially, it's a backpressure mechanism: too much parallelism would cause the Elasticsearch server to drown in a queue of updates, so it rejects some requests and expects clients to back off and retry.
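One way to honour that backpressure is to retry a rejected bulk request with exponential backoff. A minimal sketch, assuming a hypothetical `BulkRejectedError` raised for HTTP 429 / `es_rejected_execution_exception` responses (neither the error class nor the parameters are existing GitLab code):

```ruby
# Hypothetical error raised when Elasticsearch rejects a bulk request (HTTP 429).
class BulkRejectedError < StandardError; end

# Retry a bulk request with exponential backoff and jitter, giving the
# cluster's bulk queue time to drain before we try again.
def with_bulk_backoff(max_attempts: 5, base_delay: 0.5)
  attempts = 0
  begin
    yield
  rescue BulkRejectedError
    attempts += 1
    raise if attempts >= max_attempts

    # Delay doubles on each attempt; jitter avoids synchronized retries
    # from many parallel workers.
    sleep(base_delay * (2**attempts) + rand * 0.1)
    retry
  end
end
```

This only covers whole-request rejections; per-document failures inside an accepted request still need to be read out of the bulk response.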
However, we don't handle this well on the GitLab side. For bulk-indexing database content, we silently ignore the failures. I've observed this causing, e.g., 10% of snippets to fail to be indexed. At a bad time, it could cause 100% of a given project's database records to fail to be indexed while we still consider the project successfully indexed; it's quite possible we've already seen this in production.
We have persistent status data for repositories, so the impact there is somewhat reduced: we can see that the repo hasn't been properly indexed, and so retry the operation at the Sidekiq level.
Steps to reproduce
Index projects with lots of parallelism and observe many mysterious failures:

- `PersonalSnippet.es_import` in a Rails console will return a number greater than 0, indicating some documents failed to index.
- `ElasticCommitIndexerWorker.new.perform` on certain projects (especially ones with a large number of documents) will cause later bulk requests to fail with output like:
```
Flushing error: Failed to perform all operations
2019/06/24 11:55:01 bulk request 4: failed to insert 2/2 documents
2019/06/24 11:55:01 bulk request 5: failed to insert 6/6 documents
2019/06/24 11:55:01 bulk request 6: failed to insert 1/1 documents
2019/06/24 11:55:01 bulk request 7: failed to insert 2/2 documents
2019/06/24 11:55:01 bulk request 8: failed to insert 1/1 documents
2019/06/24 11:55:01 bulk request 9: failed to insert 1/1 documents
2019/06/24 11:55:01 bulk request 10: failed to insert 1/1 documents
```
Since repository indexing is an all-or-nothing operation, our Sidekiq-level retries must redo the work from bulk requests 1-3 in this example.
Output of checks
This bug happens on GitLab.com
In both cases, having some documents fail is an expected part of making bulk requests. The bulk response includes status information listing exactly which documents failed, so we can retry just those documents at the operation level before failing the whole operation.
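The Elasticsearch bulk API response carries a per-item status alongside the top-level `errors` flag, so the failed documents can be picked out exactly. A minimal sketch; `failed_items` is a hypothetical helper, not existing GitLab code:

```ruby
# Given a parsed Elasticsearch bulk API response, return the _ids of the
# documents whose operations failed. The response shape is:
#   { "errors" => true/false,
#     "items"  => [ { "index" => { "_id" => ..., "status" => ..., ... } }, ... ] }
def failed_items(bulk_response)
  return [] unless bulk_response['errors']

  bulk_response['items'].filter_map do |item|
    # Each item is keyed by its operation type ("index", "update", "delete", ...)
    _op, result = item.first
    result['_id'] if result['status'] >= 300
  end
end
```

Documents whose ids appear here could then be re-enqueued in a smaller follow-up bulk request instead of being silently dropped.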
For database content, we also need to ensure that we do fail the operation. At present we don't track database indexing status at all, so initial indexing can easily leave out some documents, causing search gaps. cc @phikai
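Failing the operation could be as simple as raising once per-document retries are exhausted, so the Sidekiq job is marked failed and retried rather than reporting silent success. A sketch under that assumption; the error class and helper name are illustrative:

```ruby
# Hypothetical: raised when a bulk-indexing operation still has unindexed
# documents after retries, so Sidekiq sees the job as failed and retries it.
class ElasticBulkIndexError < StandardError; end

def verify_bulk_result!(failed_ids)
  return if failed_ids.empty?

  raise ElasticBulkIndexError,
        "#{failed_ids.size} documents failed to index: " \
        "#{failed_ids.first(5).join(', ')}"
end
```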