Add ES housekeeping to purge documents for deleted DB records
Problem to solve
The Elasticsearch index can get out of sync with the records in the database for any number of reasons:
- Temporary disabling of ES indexing in the GitLab administration UI
- Temporary unavailability of ES cluster
- Indexing background jobs failing or not getting run for various reasons
- Bugs in our code
As a result, outdated search results can be shown to users (though in some cases they are excluded automatically because of how the queries are structured). This can lead to:
- 404 errors and user confusion
- Unintended data disclosure
- Bugs in our application code like https://gitlab.com/gitlab-org/gitlab-ee/issues/12578
Intended users
- Devon (DevOps Engineer)
- Sidney (Systems Administrator)
Further details
Known workarounds:
- Re-run GitLab deletion jobs through the Rails console
- Purge documents from ES through its API
- Delete and recreate the full ES index
Proposal
Implement a housekeeping task which can be run manually through a Rake task, and automated through a cron job.
The task should compare the contents of the ES index with the contents of the DB, and purge any documents that reference DB records which don't exist anymore.
This will be tricky to scale, so the task should perform queries in batches and use the minimal amount of information for comparison (e.g. DB primary keys).
As a first iteration we can focus on projects, and also remove child documents in ES if their parent projects don't exist anymore in the DB.
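The batched comparison described above could look roughly like the following. This is only a sketch under assumptions, not the proposed implementation: `orphaned_ids` is a hypothetical helper, and the in-memory `es_project_ids`, `db_project_ids` and `child_docs` stand in for a scrolled ES query, a DB lookup by primary key, and ES child documents.

```ruby
require 'set'

# Hypothetical helper: walks the IDs found in the index in batches and
# returns those that no longer exist in the DB. The block receives one
# batch of IDs and must return the subset that still exists in the DB.
def orphaned_ids(indexed_ids, batch_size: 1000)
  orphans = []
  indexed_ids.each_slice(batch_size) do |batch|
    existing = Set.new(yield(batch))
    orphans.concat(batch.reject { |id| existing.include?(id) })
  end
  orphans
end

# Stand-in data (assumption): primary keys only, as suggested above.
db_project_ids = Set[1, 2, 4]
es_project_ids = [1, 2, 3, 4, 5]

orphans = orphaned_ids(es_project_ids, batch_size: 2) do |batch|
  # In a real task this would be a single DB query per batch.
  batch.select { |id| db_project_ids.include?(id) }
end

# Child documents whose parent project is gone should be purged as well.
child_docs = [
  { id: 'issue_10', project_id: 3 },
  { id: 'issue_11', project_id: 4 }
]
orphaned_children = child_docs.select { |doc| orphans.include?(doc[:project_id]) }

puts "orphaned projects: #{orphans.inspect}"
puts "orphaned children: #{orphaned_children.map { |d| d[:id] }.inspect}"
```

Comparing only IDs keeps each batch small, and driving the loop from the index side means the DB is only asked about records that ES actually references.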
Permissions and Security
Rake tasks are only accessible to instance administrators.
Documentation
The Rake task needs to be documented in doc/integration/elasticsearch.md
Testing
The task will delete data (though only ephemeral data inside ES), so we need to make sure it only deletes the right data.
What does success look like, and how can we measure that?
The task should record the number of deletions per document type, and also record its runtime (though we might see that already through Sidekiq).
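One way to collect these numbers is to wrap the purge in a small helper that counts deletions per document type and measures elapsed time. This is a minimal sketch, assuming a hypothetical `with_purge_stats` helper; the document type names and counts are illustrative only.

```ruby
# Hypothetical helper: yields a counter hash to the purge code and returns
# the per-type deletion counts together with the elapsed runtime.
def with_purge_stats
  counts = Hash.new(0)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield counts
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { deleted: counts, runtime_seconds: elapsed }
end

stats = with_purge_stats do |counts|
  # The real task would increment these as it deletes documents from ES.
  counts['project'] += 2
  counts['issue'] += 5
end

stats[:deleted].each { |type, n| puts "#{type}: #{n} deleted" }
puts format('runtime: %.3fs', stats[:runtime_seconds])
```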
To get a better feeling for how often the index actually gets out of sync, we should put this behind a feature flag and enable it on gitlab.com first.
Once we enable the task for self-managed instances, we should also look into reporting statistics back to us through usage ping or similar.
What is the type of buyer?
This feature helps to ensure the correctness of our Elasticsearch functionality, so it should be available in GitLab Starter.