# Reindex GitLab.com Global Search Elasticsearch cluster to fix large segments
Production Change
## Change Summary
We will reindex our main Global Search Elasticsearch index, which we hope will resolve the performance regressions discovered in gitlab-org/gitlab#292439 (closed).
## Change Details
- Services Impacted - Elasticsearch global search
- Change Technician - @DylanGriffith
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @msmiley @cindy
- Due Date - 2020-12-14
- Time tracking - 1hr
- Downtime Component - Indexing will be paused for the duration of the reindex; based on previous attempts this could take as long as 24 hours.
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 30
- [ ] Run all the steps on staging
- [ ] Make the cluster larger if necessary. Storage should be less than 40% full (more than 60% free).
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30
- [ ] Confirm the cluster storage is less than 40% full (more than 60% free)
- [ ] Let the SRE on call know that we are triggering the re-index in #production: "@sre-oncall please note we are doing a reindex of our production Elasticsearch cluster, which will re-index all of our production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be pausing indexing to the cluster, which will cause the incremental updates queue to grow but should not cause alerts since we don't have any for that queue. This will increase load on the Elasticsearch cluster but should not impact any other systems." <LINK>
- [ ] Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- [ ] In any console set `CLUSTER_URL` and confirm that it is the expected cluster with the expected indices: `curl $CLUSTER_URL/_cat/indices`
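  A minimal sketch of setting `CLUSTER_URL` (the endpoint and credentials below are placeholders, not the real production values):

  ```shell
  # Hypothetical example values; substitute the real production endpoint and credentials.
  export CLUSTER_URL="https://user:password@example-cluster.us-central1.gcp.cloud.es.io:9243"

  # Sanity check: list indices and confirm gitlab-production-202010260000 is present.
  curl "$CLUSTER_URL/_cat/indices?v&s=index"
  ```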
- [ ] Note the total size of the source index `gitlab-production-202010260000`: 5.8 TB
- [ ] Note the size of all segments and attach to a comment on this issue: `curl "$CLUSTER_URL/_cat/segments/gitlab-production?v&s=size" > segments-before.txt`
  - #3172 (comment 466597307)
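  If it helps to eyeball the largest segments (an optional addition, not part of the original steps), the `_cat` API's standard `h` parameter narrows the output to a few columns:

  ```shell
  # Sort by size ascending and show only the most relevant columns,
  # so the largest segments appear at the bottom.
  curl "$CLUSTER_URL/_cat/segments/gitlab-production?v&s=size&h=index,shard,segment,docs.count,size" | tail -20
  ```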
- [ ] Note the total number of documents in the source index `gitlab-production-202010260000`: 706851986
  - `curl $CLUSTER_URL/gitlab-production-202010260000/_count`
- [ ] Create the new destination index `gitlab-production-202012140000` with the correct settings/mappings from the rails console: `Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202012140000' })`
- [ ] Set index settings on the destination index to optimize for writes: `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Trigger the re-index from the source index `gitlab-production-202010260000` to the destination index `gitlab-production-202012140000`: `curl -H 'Content-Type: application/json' -d '{"source":{"index":"gitlab-production-202010260000"},"dest":{"index":"gitlab-production-202012140000"}}' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"`
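  To double-check that the reindex task was accepted (an optional step not in the original list, using only the standard Tasks API):

  ```shell
  # List all currently running reindex tasks with their details;
  # the task ID for the next step appears in the output.
  curl "$CLUSTER_URL/_tasks?detailed=true&actions=*reindex"
  ```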
- [ ] Note the returned task ID from the above: r0RI54oqRmK9JZiXZ9O41Q:383322
- [ ] Note the time when the task started: 2020-12-14 23:19:22 UTC
- [ ] Track the progress of reindexing using the Tasks API: `curl $CLUSTER_URL/_tasks/$TASK_ID`
  - If failures happen only in some slices, it's possible to retry just those slices following the steps used last time
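  A small convenience for polling progress (a sketch with assumptions: `TASK_ID` is exported and `jq` is installed):

  ```shell
  # Poll the Tasks API every 5 minutes; for a reindex task, .task.status
  # reports total, created, updated and deleted document counts.
  while true; do
    curl -s "$CLUSTER_URL/_tasks/$TASK_ID" \
      | jq '{completed, status: .task.status | {total, created, updated, deleted}}'
    sleep 300
  done
  ```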
- [ ] Note the time when the task finishes: 2020-12-14 XX:XX:XX UTC
- [ ] Note the total time taken to reindex: XX hrs
- [ ] Change the `refresh_interval` setting on the destination index to `60s`: `curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Verify the number of documents in the destination index = the number of documents in the source index
  - Be aware it may take up to 60s for the destination index to refresh
  - `curl $CLUSTER_URL/gitlab-production-202010260000/_count` => XX
  - `curl $CLUSTER_URL/gitlab-production-202012140000/_count` => XX
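  To compare the two counts side by side (optional; assumes `jq` is available):

  ```shell
  # Both counts should match once the destination index has refreshed.
  for idx in gitlab-production-202010260000 gitlab-production-202012140000; do
    printf '%s: ' "$idx"
    curl -s "$CLUSTER_URL/$idx/_count" | jq '.count'
  done
  ```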
- [ ] Increase replication on the destination index to 1: `curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Increase recovery max bytes to speed up replication: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "400mb"}}' -XPUT $CLUSTER_URL/_cluster/settings`
- [ ] Wait for cluster monitoring to show the replication has completed
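  Replica recovery can also be watched from the command line via the standard `_cat/recovery` API (an optional check, not part of the original steps):

  ```shell
  # Show only in-flight shard recoveries for the destination index;
  # an empty table means replication has finished.
  curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202012140000?v&active_only=true"
  ```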
- [ ] Set recovery max bytes back to the default: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings`
- [ ] Set `translog.durability` on the destination index back to the default `request`: `curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Note the total size of the destination index `gitlab-production-202012140000`: XX TB
- [ ] Note the size of all segments and attach to a comment on this issue: `curl "$CLUSTER_URL/_cat/segments/gitlab-production-202012140000?v&s=size" > segments-after.txt`
- [ ] Update the alias `gitlab-production` to point to the new index: `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202012140000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
  - Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
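  Optionally (not in the original list), confirm the alias now resolves to the new index only:

  ```shell
  # Expect a single row mapping gitlab-production to gitlab-production-202012140000.
  curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"
  ```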
- [ ] Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- [ ] Wait until the backlog of incremental updates gets below 10,000
  - Chart: Global search incremental indexing queue depth at https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- [ ] Create a comment somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
- [ ] Delete the old index `gitlab-production-202010260000`: `curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000`
- [ ] Test again that searches work as expected
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] If you resized the cluster then scale it back down based on the new storage requirements
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- If the ongoing reindex is consuming too many resources, it is possible to throttle the running reindex:
  - You can check the index write throughput in ES monitoring to determine a sensible throttle. Since reindex defaults to no throttling at all, it's safe to set some throttle and observe the impact: `curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"`
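  If the throttle later proves unnecessary, it can be lifted through the same endpoint (this follow-up is not in the original steps):

  ```shell
  # requests_per_second=-1 disables throttling entirely.
  curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=-1"
  ```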
- If you get past the step of updating the alias, then simply switch the alias to point back to the original index:
  - `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202012140000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
  - Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
- Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using `ProcessBookkeepingService#track`, as well as any updates that went through the sidekiq workers `ElasticCommitIndexerWorker` and `ElasticDeleteProjectWorker`.
## Monitoring
### Key metrics to observe
- Metric: Elasticsearch cluster health
  - Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:HdF5sKvcT5WQHHyYR_EDcw)
  - What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
- Metric: Elasticsearch monitoring in Grafana
- Metric: Indexing queues
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - What changes to this metric should prompt a rollback: After unpausing, indexing is failing and the queues are constantly growing
## Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
Changes checklist
-
This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities. -
This issue has the change technician as the assignee. -
Pre-Change, Change, Post-Change, and Rollback steps and have been filled out and reviewed. -
Necessary approvals have been completed based on the Change Management Workflow. -
Change has been tested in staging and results noted in a comment on this issue. -
A dry-run has been conducted and results noted in a comment on this issue. -
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall
and this issue and await their acknowledgement.) -
There are currently no active incidents.