Allow an elasticsearch index to be recovered gracefully(ish) after gaps in indexing
As seen in https://gitlab.com/gitlab-org/gitlab-ee/issues/5282
Sometimes, customers disable the elasticsearch indexing checkbox in GitLab, for one reason or another, or sidekiq jobs relating to elasticsearch indexing end up in the dead jobs list due to an outage.
This causes the index to get out of sync with the database, leading to incorrect or invalid search results being returned. At present, the only remedy is to drop the index completely and re-create it from scratch, which is a huge burden for larger installations.
I think we can make a first pass on a graceful recovery with the following parts:
- Reindex just the database contents from scratch
- Remove any repository contents relating to a deleted project
- Run the indexing task for all repositories from the current, intermediate state
On GitLab.com, indexing the database was a 24-hour operation, whereas indexing the repositories took about a month. The database side is also where we're missing some kind of "last_indexed_at" status, so we can't tell which records are stale. Since it is comparatively cheap to rebuild, we can throw the database part of the index away and concentrate on keeping as much of the repository data as possible.
For repositories, we do store state - the commit that was last indexed. As long as that commit hasn't been force-pushed away on the default branch, we can always just schedule another reindex from that commit to current master (a no-op if nothing has changed).
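To make the repository case concrete, here is a rough sketch of the decision described above. It is not GitLab's implementation: `plan_repository_reindex` and the `is_ancestor` callback are hypothetical names. In practice the ancestry check could be something like `git merge-base --is-ancestor <old> <new>`.

```python
from typing import Callable, Optional, Tuple

# (action, from_sha, to_sha); action is "noop", "incremental" or "full".
Plan = Tuple[str, Optional[str], str]


def plan_repository_reindex(
    last_indexed_sha: Optional[str],
    head_sha: str,
    is_ancestor: Callable[[str, str], bool],
) -> Plan:
    """Decide how to bring one repository's index up to date.

    `is_ancestor(old, new)` is a hypothetical callback that returns True
    when `old` is still reachable from `new` on the default branch.
    """
    if last_indexed_sha is None:
        # Nothing was ever indexed for this repository.
        return ("full", None, head_sha)
    if last_indexed_sha == head_sha:
        # Nothing has changed since the last run.
        return ("noop", head_sha, head_sha)
    if is_ancestor(last_indexed_sha, head_sha):
        # Stored commit is still in history: index only the new range.
        return ("incremental", last_indexed_sha, head_sha)
    # Stored commit was force-pushed away: fall back to a full reindex.
    return ("full", None, head_sha)
```

Only the last branch costs as much as indexing the repository from scratch, so repositories whose history was not rewritten stay cheap to recover.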
So, the only problem remaining is commits and blobs for those repositories that have been removed in the interim. These can be detected and deleted from the elasticsearch index by adding a maintenance rake task and/or an elasticsearch management dashboard to GitLab. We're long overdue for the latter :)
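The detection step could be sketched roughly as below, assuming we can list the project IDs present in the index and the project IDs still in the database. The function names and the `project_id` field are assumptions for illustration, not the actual index mapping. The resulting body is in the shape accepted by Elasticsearch's delete-by-query API.

```python
from typing import Any, Dict, Iterable, List


def find_orphaned_project_ids(
    indexed_project_ids: Iterable[int],
    live_project_ids: Iterable[int],
) -> List[int]:
    """Return IDs that appear in the index but no longer exist in the database."""
    return sorted(set(indexed_project_ids) - set(live_project_ids))


def build_delete_query(
    orphaned_ids: Iterable[int],
    field: str = "project_id",  # assumed field name, not the real mapping
) -> Dict[str, Any]:
    """Build a `terms` query body matching every document of the orphaned projects."""
    return {"query": {"terms": {field: list(orphaned_ids)}}}


# Example: projects 2 and 5 were deleted while indexing was disabled.
orphans = find_orphaned_project_ids([1, 2, 3, 5], [1, 3, 4])
delete_body = build_delete_query(orphans)
```

A rake task would need to batch the ID list for large installations, since a single `terms` query has a limit on how many values it accepts.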