[META] Switch to Elasticsearch on GitLab.com for search functionality
In https://gitlab.com/gitlab-org/gitlab-ce/issues/27084, https://gitlab.com/gitlab-com/infrastructure/issues/1157 and https://gitlab.com/gitlab-com/infrastructure/issues/1477, we've improved the Elasticsearch support in GitLab EE, experimented with its performance at GitLab.com scale, and built a production Elasticsearch cluster for GitLab.com.
We've demonstrated that we can throttle Elasticsearch indexing to prevent outages. Now we need to think about enabling Elasticsearch indexing on GitLab.com permanently and using it to serve search queries.
We need to set up a number of dedicated sidekiq-cluster workers for the elastic_batch_project_indexer queue and for the elasticsearch (9.0) or ["elastic_indexer", "elastic_commit_indexer"] (9.1rc1+) queues. CPU and RAM requirements are high, so perhaps these should be isolated to their own hosts.
Sidekiq job throttling does not affect these dedicated workers, so it can be used to keep the number of Elasticsearch jobs processed on the shared infrastructure low.
A concurrency of 10 should be sufficient for the elastic_indexer queue. We're more uncertain about the other two, and requirements are likely to be higher before repository backfill is completed. Perhaps start with 20 for elastic_commit_indexer and increase it if necessary.
Once indexing is enabled permanently, we need to run the indexing backfill jobs:
$ sudo gitlab-rake gitlab:elastic:index_repositories_async
This will enqueue a large number of jobs into the elastic_batch_project_indexer queue. It doesn't matter if we take a week or four to work through them.
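Since the repository backfill may take weeks, it could be handy to check how many jobs remain in the queue. The following is only a sketch: it assumes an Omnibus node where `gitlab-rails runner` is available, and the helper names `queue_size_script` and `queue_size` are made up for illustration. It relies on Sidekiq's `Sidekiq::Queue#size` API.

```shell
#!/bin/sh
# Queue name taken from the issue text.
QUEUE="elastic_batch_project_indexer"

# Build the Ruby one-liner that asks Sidekiq for the length of a queue.
# (Hypothetical helper, not part of GitLab.)
queue_size_script() {
  printf "require 'sidekiq/api'; puts Sidekiq::Queue.new('%s').size\n" "$1"
}

# Print the number of jobs still waiting in $QUEUE.
# Assumes this runs on a node with the Omnibus gitlab-rails wrapper.
queue_size() {
  sudo gitlab-rails runner "$(queue_size_script "$QUEUE")"
}
```

Calling `queue_size` every so often gives a rough view of backfill progress; it only counts enqueued jobs, not jobs currently being processed.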
$ sudo gitlab-rake gitlab:elastic:index_database
This indexes every relevant row in the database and takes about a working day to complete, so it should be run in screen.
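One possible way to run the database backfill under screen is sketched below. The session name and the function names are arbitrary examples, not something from the infrastructure runbooks; only the rake task itself comes from the text above.

```shell
#!/bin/sh
# Session name is an arbitrary example.
SESSION="es-index-database"

# The rake task quoted in the issue text.
backfill_command() {
  printf '%s\n' "sudo gitlab-rake gitlab:elastic:index_database"
}

# Start a detached screen session running the backfill. "exec sh" keeps the
# session open after the task exits so its final output can still be read.
start_backfill() {
  screen -dmS "$SESSION" sh -c "$(backfill_command); exec sh"
}

# Reattach to check progress; detach again with Ctrl-a d.
attach_backfill() {
  screen -r "$SESSION"
}
```

Run `start_backfill` once, then `attach_backfill` whenever you want to look at the output. If sudo needs a password, the prompt will be waiting inside the session the first time you attach.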