Make it easy to disable elasticsearch indexing for a troublesome project, then re-add later, without harming the overall index
Problem to solve
When adding a group to the elasticsearch rollout on GitLab.com, we bumped into a problem that caused a cascading failure of all elasticsearch nodes on GitLab.com. In general, there will always be individual groups or projects that put the overall stability of indexing at risk.
The full index for every group or project on GitLab.com is only valuable if it is consistent - i.e., we haven't lost or mislaid any created/updated/deleted events for the projects. So, a single misbehaving project can endanger the whole index.
Intended users
Proposal
We should make it easy for an administrator to identify projects that may be troublesome for elasticsearch (perhaps a heat ranking based on how many updates they're enqueuing, or some other measure of cost?), and present an option to exclude specific projects from indexing. Effectively, we'd sacrifice one project to retain the integrity of the rest of the index.
This action should:
- Prevent any further jobs from being pushed to sidekiq
- Remove any existing sidekiq jobs
- Disable searching via elasticsearch for this project
It should not assume that elasticsearch is available, so it should not attempt to remove elasticsearch documents.
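The exclusion steps above can be sketched roughly as follows. This is a minimal in-memory model in Python, not GitLab's actual Ruby code: `IndexingControl`, `enqueue`, `exclude_project` and `use_elasticsearch_for_search` are hypothetical names invented for illustration, and a plain list stands in for the sidekiq queue. The point it tries to show is that the whole action touches only local state, so it works while the cluster is down.

```python
from dataclasses import dataclass, field


@dataclass
class IndexingControl:
    """In-memory stand-in for the job queue and the exclusion list (hypothetical)."""

    excluded: set = field(default_factory=set)
    queue: list = field(default_factory=list)  # pending jobs, one dict per job

    def enqueue(self, project_id, payload):
        # Refuse new indexing jobs for excluded projects.
        if project_id in self.excluded:
            return False
        self.queue.append({"project_id": project_id, "payload": payload})
        return True

    def exclude_project(self, project_id):
        # Only local state changes here; elasticsearch is never contacted.
        self.excluded.add(project_id)
        before = len(self.queue)
        # Drop jobs that are already pending for this project.
        self.queue = [j for j in self.queue if j["project_id"] != project_id]
        return before - len(self.queue)  # number of jobs removed

    def use_elasticsearch_for_search(self, project_id):
        # Searches for an excluded project should bypass elasticsearch.
        return project_id not in self.excluded
```

For example, after `ctl.exclude_project(1)`, a later `ctl.enqueue(1, ...)` returns `False` and jobs for other projects remain queued untouched.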
However, it should also be reversible - we should be able to remove the exceptional state from the project at some point. When this is done, elasticsearch is assumed to be available, so we can:
- Remove all elasticsearch documents for this project
- Reset the index_status
- Restart indexing from scratch
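A rough sketch of the reverse action, again in Python and under stated assumptions: `reinclude_project`, `ClusterUnavailable` and the layout of the `state` dict are hypothetical, and plain lists and dicts stand in for the elasticsearch documents, the index_status records and the sidekiq queue. Unlike exclusion, this path refuses to run unless the cluster is reachable.

```python
class ClusterUnavailable(Exception):
    """Raised when re-inclusion is attempted without a reachable cluster."""


def reinclude_project(project_id, state):
    """Undo an exclusion. `state` is a plain dict standing in for real services."""
    if not state["cluster_available"]:
        raise ClusterUnavailable("re-inclusion needs a reachable cluster")
    # 1. Remove every document that belongs to the project.
    state["documents"] = [
        d for d in state["documents"] if d["project_id"] != project_id
    ]
    # 2. Reset the index_status so no stale progress is trusted.
    state["index_status"].pop(project_id, None)
    # 3. Lift the exclusion and queue a full reindex from scratch.
    state["excluded"].discard(project_id)
    state["queue"].append({"project_id": project_id, "kind": "full_reindex"})
    return state
```

Deleting the documents before lifting the exclusion means the fresh reindex never builds on leftovers from the period when events for the project were being dropped.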
When using elasticsearch selective sync, we get a poor man's version of this by removing and re-adding the project from the list of selected projects. However:
- There's no way to blacklist a single project in a group
- Existing load (jobs in sidekiq) is not cleared
- The action expects to be able to remove documents from elasticsearch when the cluster is down
So, I think we can justify an additional action here.
Permissions and Security
This functionality should only be available to instance admins via the elasticsearch panel.
What does success look like, and how can we measure that?
During an elasticsearch outage caused by abnormal behaviour from a single project, we can sacrifice that project to bring elasticsearch back online, without the health of the overall index being affected.
Once the underlying problem is remedied, we can re-add the single project and it is added to the index in a consistent way.