Reindex GitLab.com Global Search Elasticsearch cluster main index using Zero-Downtime reindexing

Production Change

We have a list of changes we want to apply to GitLab.com main Advanced Search index:

This can be done by reindexing the index. Only execute this change request after gitlab-org/gitlab!100424 (merged) has reached production

Services Impacted - Elasticsearch global search
Change Technician - @dgruzd (EMEA) @john-mason (AMER)
Change Reviewer - @terrichu
Time tracking - 48h
Downtime Component - No downtime, but Advanced Search indexing will be paused

Estimated Time to Complete (mins) - 60

Estimated Time to Complete (mins) - 60 to trigger reindexing

Estimated Time to Complete (mins) - 60

If the ongoing reindex is consuming too many resources it is possible to throttle the running reindex :
1. You can check the index write throughput in ES monitoring to determine a sensible throttle. Since it defaults to no throttling at all it's safe to just set some throttle and observe the impact
2. curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500
- If reindexing task fails, it will automatically revert to the original index
- Ensure any updates that only went to Destination index are replayed against Source Cluster by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using ProcessBookkeepingService#track as well as any updates that went through sidekiq workers ElasticCommitIndexerWorker, ElasticDeleteProjectWorker.
Delete incomplete index gitlab-production-20221020-2340 by running curl -XDELETE "$CLUSTER_URL/gitlab-production-20221020-2340"
Set label changeaborted /label ~change::aborted

Metric: Elasticsearch cluster health
- Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:HdF5sKvcT5WQHHyYR_EDcw)
- What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
Metric: Elasticsearch monitoring in Grafana
- Location: https://dashboards.gitlab.net/d/search-main/search-overview?orgId=1
Metric: Indexing queues
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: After unpausing the indexing is failing and the queues are constantly growing

Edited Oct 24, 2022 by Dmitry Gruzd