Add ES housekeeping to purge documents for deleted DB records

Problem to solve

The Elasticsearch index can get out of sync with the records in the database for a number of reasons:

Temporary disabling of ES indexing in the GitLab administration UI
Temporary unavailability of ES cluster
Indexing background jobs failing or not getting run for various reasons
Bugs in our code

As a result, users can be shown outdated search results (though in some cases these are excluded automatically because of how the queries are structured). This can lead to:

404 errors and user confusion
Unintended data disclosure
Bugs in our application code like https://gitlab.com/gitlab-org/gitlab-ee/issues/12578

Intended users

Devon (DevOps Engineer)
Sidney (Systems Administrator)

Further details

Known workarounds:

Re-run GitLab deletion jobs through the Rails console
Purge documents from ES through its API
Delete and recreate the full ES index
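For illustration, the second workaround could be done with the Elasticsearch bulk API, which accepts newline-delimited JSON with one `delete` action per line. The sketch below only builds the request body; the index name `gitlab-production` and the document IDs are assumptions for the example, and the helper `bulk_delete_body` is hypothetical, not existing GitLab code.

```ruby
require 'json'

# Hypothetical helper: builds an NDJSON body for the Elasticsearch
# `_bulk` endpoint, with one delete action per document ID.
def bulk_delete_body(index, doc_ids)
  doc_ids.map do |id|
    # Each action must be a single JSON object terminated by a newline.
    JSON.generate({ delete: { _index: index, _id: id } }) + "\n"
  end.join
end

# Assumed index name and document IDs, for illustration only.
body = bulk_delete_body('gitlab-production', ['project_3', 'project_5'])
puts body
```

The resulting body would be sent as a `POST` to the cluster's `_bulk` endpoint with the `application/x-ndjson` content type. Child documents may additionally need a routing value, depending on how the index is mapped.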

Proposal

Implement a housekeeping task that can be run manually through a Rake task and automated through a cron job.

The task should compare the contents of the ES index with the contents of the DB, and purge any documents that reference DB records which don't exist anymore.

This will be tricky to scale, so the task should perform its queries in batches and compare as little information as possible (e.g. only DB primary keys).

As a first iteration we can focus on projects, and also remove child documents in ES if their parent projects don't exist anymore in the DB.
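A minimal sketch of the batched comparison described above, using plain Ruby collections in place of the real index and database. The helper `orphaned_ids`, the batch size, and the shape of the child documents are all assumptions for illustration; in GitLab the block would presumably wrap a primary-key lookup against the DB.

```ruby
require 'set'

# Hypothetical helper: walks the indexed IDs in batches and returns those
# with no matching DB record. The block receives one batch of IDs and must
# return the subset that still exists in the DB.
def orphaned_ids(indexed_ids, batch_size: 1000)
  orphans = []
  indexed_ids.each_slice(batch_size) do |batch|
    existing = yield(batch).to_set
    orphans.concat(batch.reject { |id| existing.include?(id) })
  end
  orphans
end

# Toy data standing in for the DB and the ES index.
db_project_ids      = Set[1, 2, 4]
indexed_project_ids = [1, 2, 3, 4, 5]

orphans = orphaned_ids(indexed_project_ids, batch_size: 2) do |batch|
  # Stand-in for a DB query that fetches only primary keys.
  batch.select { |id| db_project_ids.include?(id) }
end

# Child documents (assumed shape) are purged when their parent is orphaned.
child_docs = [
  { id: 'issue_10', project_id: 1 },
  { id: 'issue_11', project_id: 3 },
  { id: 'note_12',  project_id: 5 }
]
orphan_set = orphans.to_set
orphaned_children = child_docs
  .select { |doc| orphan_set.include?(doc[:project_id]) }
  .map { |doc| doc[:id] }

puts "orphaned projects: #{orphans.inspect}"
puts "orphaned children: #{orphaned_children.inspect}"
```

Only IDs cross the boundary between the two stores, which keeps each batch cheap on both sides.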

Permissions and Security

Rake tasks are only accessible to instance administrators.

Documentation

The Rake task needs to be documented in doc/integration/elasticsearch.md.

Testing

The task will delete data (though only ephemeral data inside ES), so we need to make sure it only deletes the right data.

What does success look like, and how can we measure that?

The task should record the number of deletions per document type, and also record its runtime (though we might see that already through Sidekiq).
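One way the per-type deletion counts and the runtime could be collected is sketched below. The `PurgeStats` class and its method names are hypothetical; how the numbers would actually be reported (logs, Sidekiq, usage ping) is left open.

```ruby
# Hypothetical collector for the statistics described above.
class PurgeStats
  attr_reader :deletions, :runtime_seconds

  def initialize
    @deletions = Hash.new(0) # document type => number of deletions
    @runtime_seconds = nil
  end

  def record_deletion(doc_type, count = 1)
    @deletions[doc_type] += count
  end

  # Runs the block and records how long it took, even if it raises.
  def measure
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield self
  ensure
    @runtime_seconds = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end

stats = PurgeStats.new
stats.measure do |s|
  # Stand-ins for the real purge steps.
  s.record_deletion('project', 2)
  s.record_deletion('issue', 5)
  s.record_deletion('issue', 1)
end

puts "deletions: #{stats.deletions.inspect}"
puts "runtime:   #{stats.runtime_seconds.round(4)}s"
```

A monotonic clock is used so that the measured runtime is not affected by system clock adjustments.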

To get a better feeling for how often the index actually gets out of sync, we should put this behind a feature flag and enable it on gitlab.com first.

Once we enable the task for self-managed instances, we should also look into reporting statistics back to us through usage ping or similar.

What is the type of buyer?

This feature helps to ensure the correctness of our Elasticsearch functionality, so it should be available in GitLab Starter.