GitLab Elasticsearch integration support for in-cluster re-indexing
Problem to solve
We are planning to implement the ability to re-index everything in GitLab to a new cluster/index in !17230 (closed), but this may not always be the most efficient option. There are often cases where we'll just want to do a straight re-index within the Elasticsearch cluster.
Re-indexing using all of GitLab's code and data might end up being considerably more costly and put more strain on other systems such as Postgres, Redis, and Sidekiq, when we could be doing the re-indexing inside Elasticsearch. Re-indexing in Elasticsearch is certainly more limited than the implementation in !17230 (closed): it only covers cases where the index options have changed, not cases where the surrounding application code has changed. Such cases may be fairly frequent, though, and for them it will likely be the most efficient option.
Intended users
Further details
Related to:
Proposal
If we have #204826 (closed) and #213628 (closed), we could add a feature to the GitLab admin UI which pauses indexing, creates a new index, triggers a re-index, and then swaps aliases when it's done. This would fully automate the processes in gitlab-com/gl-infra/production#1907 (closed).
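The Elasticsearch side of that sequence could be sketched roughly as follows. This is a minimal sketch, not the proposed implementation: the function name, the alias and index names, and the example settings are all hypothetical, and pausing and resuming GitLab indexing happens on the GitLab side and is not shown. It only builds the requests (create index, reindex, alias swap) so the order is explicit.

```python
from typing import Any

# (HTTP method, path, JSON body)
Request = tuple[str, str, dict[str, Any]]


def plan_in_cluster_reindex(
    alias: str,
    old_index: str,
    new_index: str,
    index_config: dict[str, Any],
) -> list[Request]:
    """Return the Elasticsearch requests for an in-cluster reindex, in order.

    Hypothetical helper: it only describes the requests, it does not send them.
    """
    return [
        # 1. Create the destination index with the new settings/mappings.
        ("PUT", f"/{new_index}", index_config),
        # 2. Copy documents inside the cluster. With wait_for_completion=false
        #    Elasticsearch returns a task ID instead of blocking.
        (
            "POST",
            "/_reindex?wait_for_completion=false",
            {"source": {"index": old_index}, "dest": {"index": new_index}},
        ),
        # 3. After the task has completed, repoint the alias in one request.
        (
            "POST",
            "/_aliases",
            {
                "actions": [
                    {"remove": {"index": old_index, "alias": alias}},
                    {"add": {"index": new_index, "alias": alias}},
                ]
            },
        ),
    ]
```

Both alias actions go in a single `_aliases` request so that searches never see the alias pointing at no index or at both.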
This could actually just be a single button called "reindex in cluster" which automates everything. It could show the task ID of the reindexing process and display whatever progress we can get from the task API.
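As a rough illustration of the progress we could derive from the task API, the sketch below reads the `completed` flag and the `total`, `created`, `updated` and `deleted` counters from a `GET _tasks/<task_id>` response for a reindex task. The function name is hypothetical, and treating created + updated + deleted as "documents processed" is a simplifying assumption (it ignores noops and version conflicts).

```python
from typing import Any


def reindex_progress(task_response: dict[str, Any]) -> tuple[bool, float]:
    """Return (completed, fraction_done) from a task API response.

    Hypothetical helper; missing fields are treated as zero.
    """
    status = task_response.get("task", {}).get("status", {})
    total = status.get("total", 0)
    processed = (
        status.get("created", 0)
        + status.get("updated", 0)
        + status.get("deleted", 0)
    )
    # Avoid dividing by zero before Elasticsearch has reported a total.
    fraction = processed / total if total else 0.0
    return bool(task_response.get("completed", False)), min(fraction, 1.0)
```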
This feature would need a lock so that it cannot be triggered twice. Once the button is pressed, it stays disabled and the progress indicator is displayed until the reindex completes.
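The locking behaviour might look something like the sketch below. It uses an in-process `threading.Lock` purely as a stand-in: a real GitLab deployment runs many processes, so the lock would have to live in shared storage, and that part is not shown. The class name, method names and returned messages are hypothetical.

```python
import threading
from typing import Callable, Optional


class ReindexGuard:
    """Allows at most one in-cluster reindex at a time (in-process stand-in)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.task_id: Optional[str] = None

    @property
    def in_progress(self) -> bool:
        # The UI could use this to disable the button.
        return self._lock.locked()

    def start(self, trigger: Callable[[], str]) -> str:
        """Call trigger() to start the reindex unless one is already running."""
        if not self._lock.acquire(blocking=False):
            return "reindexing in progress"
        try:
            self.task_id = trigger()
        except Exception:
            # Starting failed, so release the lock and let the caller retry.
            self._lock.release()
            raise
        return f"reindex started: {self.task_id}"

    def finish(self) -> None:
        """Release the lock once the reindex task has completed or been aborted."""
        self.task_id = None
        self._lock.release()
```

Here `trigger` is whatever actually submits the reindex and returns its task ID; calling `finish()` without a running reindex raises `RuntimeError`, since the lock is not held.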
This feature could be used by anyone that wants to roll out optional index settings changes that we release.
One thing worth noting is that we've run into several issues in the past with the reindex API not being very robust against errors, so we'd need to address this somehow. We can of course abort the operation if an error occurs, but it's likely that we'll need to provide configuration options for the reindex in order for it to work under all circumstances. Previous errors:
- search_context_missing_exception => gitlab-com/gl-infra/production#1902 (comment 318813682) => possibly the scroll was timing out and maybe needed to be configured for a longer time
- Remote responded with a chunk that was too large. Use a smaller batch size. => gitlab-com/gl-infra/production#1862 (comment 315666096) => needed a smaller batch size (source.size) configured, but it's not clear to me if these settings actually need to be tweaked unless you are reindexing to remote, and this issue will only implement indexing within the same cluster.
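If those two settings do turn out to need tweaking, they could be exposed as optional configuration along these lines. The sketch below builds a reindex request where `scroll` is passed as a query parameter and the batch size as `source.size`; the function name, parameter names and example values are hypothetical, and when an option is left unset the Elasticsearch default applies.

```python
from typing import Any, Optional
from urllib.parse import urlencode


def build_reindex_request(
    source_index: str,
    dest_index: str,
    batch_size: Optional[int] = None,
    scroll: Optional[str] = None,
) -> tuple[str, dict[str, Any]]:
    """Return (path, body) for POST _reindex with optional tuning knobs."""
    params = {"wait_for_completion": "false"}
    if scroll is not None:
        # How long each scroll search context is kept alive, e.g. "30m".
        params["scroll"] = scroll
    source: dict[str, Any] = {"index": source_index}
    if batch_size is not None:
        # Number of documents fetched per scroll batch.
        source["size"] = batch_size
    body = {"source": source, "dest": {"index": dest_index}}
    return f"/_reindex?{urlencode(params)}", body
```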
Permissions and Security
Documentation
Availability & Testing
We believe this warrants a new end-to-end UI test that:
- Starts an in-cluster reindex
- Then makes sure the index button is locked
- Waits for completion (How will we know it's done? Will the button become un-locked? Also, is there a way to tell via the API?)
- Asserts a search completes properly.
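On the question of how to tell when the reindex is done: for a reindex started with `wait_for_completion=false`, the task API response carries a `completed` flag, so a test could poll for it. The sketch below assumes a caller-supplied `fetch_task` callable that returns the parsed `GET _tasks/<task_id>` response; the function name, timeout and interval are hypothetical.

```python
import time
from typing import Any, Callable


def wait_for_task(
    fetch_task: Callable[[], dict[str, Any]],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll fetch_task() until the response reports completed, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        response = fetch_task()
        if response.get("completed"):
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError("reindex task did not complete in time")
        sleep(interval_s)
```

The `sleep` parameter exists only so a test can substitute a no-op and avoid real waiting.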
If we're able to trigger this via the API then we'll add an API test that:
- Starts an in-cluster reindex
- Verifies a second call to reindex replies with a sensible message like "reindexing in progress".
- Waits for completion
- Asserts an API search completes properly.
Since this change only affects the Elasticsearch integration page UI and perhaps the API, running the above tests should suffice for QA; there is no need to verify with a full package-and-qa run.
What does success look like, and how can we measure that?
What is the type of buyer?
Is this a cross-stage feature?
Links / references
Reindex API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html