# Reindex GitLab.com Global Search Elasticsearch cluster to fix large segments
Production Change
## Change Summary
We will reindex our main Global Search Elasticsearch index, which we hope will resolve the performance regressions discovered in gitlab-org/gitlab#292439 (closed).
## Change Details
- Services Impacted - Elasticsearch global search
- Change Technician - @DylanGriffith
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @msmiley @cindy
- Due Date - 2020-12-14
- Time tracking - 1hr
- Downtime Component - Indexing will be paused for the duration of the reindex; based on previous attempts this could take as long as 24 hours.
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 30
- [ ] Run all the steps on staging
- [ ] Make the cluster larger if necessary. Storage should be less than 40% full (more than 60% free).
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30
- [ ] Confirm the cluster storage is less than 40% full (more than 60% free)
- [ ] Let the SRE on call know that we are triggering the re-index in #production: "@sre-oncall please note we are doing a reindex of our production Elasticsearch cluster, which will re-index all of our production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be pausing indexing to the cluster, which will cause the incremental updates queue to grow but should not cause alerts since we don't have any for that queue. This will increase load on the Elasticsearch cluster but should not impact any other systems." <LINK>
- [ ] Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- [ ] In any console set `CLUSTER_URL` and confirm that it is the expected cluster with the expected indices: `curl $CLUSTER_URL/_cat/indices`
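  A minimal sketch of setting `CLUSTER_URL` (the endpoint and credentials below are placeholders, not the real production values):

  ```shell
  # Hypothetical example values; substitute the real production endpoint and credentials.
  export CLUSTER_URL="https://user:password@example-cluster.us-central1.gcp.cloud.es.io:9243"

  # Sanity check: list indices and confirm gitlab-production-202010260000 is present.
  curl "$CLUSTER_URL/_cat/indices?v&s=index"
  ```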
- [ ] Note the total size of the source index `gitlab-production-202010260000`: 5.8 TB
- [ ] Note the size of all segments and attach to a comment on this issue: `curl "$CLUSTER_URL/_cat/segments/gitlab-production?v&s=size" > segments-before.txt`
  - #3172 (comment 466597307)
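  If it helps to eyeball the largest segments (an optional addition, not part of the original steps), the `_cat` API's standard `h` parameter narrows the output to a few columns:

  ```shell
  # Sort by size ascending and show only the most relevant columns,
  # so the largest segments appear at the bottom.
  curl "$CLUSTER_URL/_cat/segments/gitlab-production?v&s=size&h=index,shard,segment,docs.count,size" | tail -20
  ```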
- [ ] Note the total number of documents in the source index `gitlab-production-202010260000`: 706851986
  - `curl $CLUSTER_URL/gitlab-production-202010260000/_count`
- [ ] Create the new destination index `gitlab-production-202012140000` with the correct settings/mappings from the rails console: `Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202012140000' })`
- [ ] Set index settings on the destination index to optimize for writes: `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Trigger the re-index from the source index `gitlab-production-202010260000` to the destination index `gitlab-production-202012140000`: `curl -H 'Content-Type: application/json' -d '{"source":{"index":"gitlab-production-202010260000"},"dest":{"index":"gitlab-production-202012140000"}}' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"`
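  To double-check that the reindex task was accepted (an optional step not in the original list, using only the standard Tasks API):

  ```shell
  # List all currently running reindex tasks with their details;
  # the task ID for the next step appears in the output.
  curl "$CLUSTER_URL/_tasks?detailed=true&actions=*reindex"
  ```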
- [ ] Note the returned task ID from the above: r0RI54oqRmK9JZiXZ9O41Q:383322
- [ ] Note the time when the task started: 2020-12-14 23:19:22 UTC
- [ ] Track the progress of reindexing using the Tasks API: `curl $CLUSTER_URL/_tasks/$TASK_ID`
  - If failures happen only in some slices, it's possible to retry just those slices following the steps used last time
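  A small convenience for polling progress (a sketch with assumptions: `TASK_ID` is exported and `jq` is installed):

  ```shell
  # Poll the Tasks API every 5 minutes; for a reindex task, .task.status
  # reports total, created, updated and deleted document counts.
  while true; do
    curl -s "$CLUSTER_URL/_tasks/$TASK_ID" \
      | jq '{completed, status: .task.status | {total, created, updated, deleted}}'
    sleep 300
  done
  ```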
- [ ] Note the time when the task finishes: 2020-12-14 XX:XX:XX UTC
- [ ] Note the total time taken to reindex: XX hrs
- [ ] Change the `refresh_interval` setting on the destination index to `60s`: `curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Verify the number of documents in the destination index = the number of documents in the source index
  - Be aware it may take up to 60s for the destination index to refresh
  - `curl $CLUSTER_URL/gitlab-production-202010260000/_count` => XX
  - `curl $CLUSTER_URL/gitlab-production-202012140000/_count` => XX
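  To compare the two counts side by side (optional; assumes `jq` is available):

  ```shell
  # Both counts should match once the destination index has refreshed.
  for idx in gitlab-production-202010260000 gitlab-production-202012140000; do
    printf '%s: ' "$idx"
    curl -s "$CLUSTER_URL/$idx/_count" | jq '.count'
  done
  ```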
- [ ] Increase replication on the destination index to 1: `curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Increase recovery max bytes to speed up replication: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "400mb"}}' -XPUT $CLUSTER_URL/_cluster/settings`
- [ ] Wait for cluster monitoring to show the replication has completed
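  Replica recovery can also be watched from the command line via the standard `_cat/recovery` API (an optional check, not part of the original steps):

  ```shell
  # Show only in-flight shard recoveries for the destination index;
  # an empty table means replication has finished.
  curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202012140000?v&active_only=true"
  ```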
- [ ] Set recovery max bytes back to the default: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings`
- [ ] Set `translog.durability` on the destination index back to the default `request`: `curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202012140000/_settings"`
- [ ] Note the total size of the destination index `gitlab-production-202012140000`: XX TB
- [ ] Note the size of all segments and attach to a comment on this issue: `curl "$CLUSTER_URL/_cat/segments/gitlab-production-202012140000?v&s=size" > segments-after.txt`
- [ ] Update the alias `gitlab-production` to point to the new index: `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202012140000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
  - Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
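  Optionally (not in the original list), confirm the alias now resolves to the new index only:

  ```shell
  # Expect a single row mapping gitlab-production to gitlab-production-202012140000.
  curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"
  ```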
- [ ] Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- [ ] Wait until the backlog of incremental updates gets below 10,000
  - Chart: Global search incremental indexing queue depth at https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- [ ] Create a comment somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
- [ ] Delete the old index `gitlab-production-202010260000`: `curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000`
- [ ] Test again that searches work as expected
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] If you resized the cluster then scale it back down based on the new storage requirements
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- If the ongoing reindex is consuming too many resources, it is possible to throttle the running reindex:
  - You can check the index write throughput in ES monitoring to determine a sensible throttle. Since reindex defaults to no throttling at all, it's safe to set some throttle and observe the impact: `curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"`
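  If the throttle later proves unnecessary, it can be lifted through the same endpoint (this follow-up is not in the original steps):

  ```shell
  # requests_per_second=-1 disables throttling entirely.
  curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=-1"
  ```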
- If you get past the step of updating the alias, then simply switch the alias to point back to the original index:
  - `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202012140000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
  - Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
- Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using `ProcessBookkeepingService#track`, as well as any updates that went through the sidekiq workers `ElasticCommitIndexerWorker` and `ElasticDeleteProjectWorker`.
## Monitoring
### Key metrics to observe
- Metric: Elasticsearch cluster health
  - Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:HdF5sKvcT5WQHHyYR_EDcw)
  - What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
- Metric: Elasticsearch monitoring in Grafana
- Metric: Indexing queues
  - Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - What changes to this metric should prompt a rollback: After unpausing, indexing is failing and the queues are constantly growing
## Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
Changes checklist
-
This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities. -
This issue has the change technician as the assignee. -
Pre-Change, Change, Post-Change, and Rollback steps and have been filled out and reviewed. -
Necessary approvals have been completed based on the Change Management Workflow. -
Change has been tested in staging and results noted in a comment on this issue. -
A dry-run has been conducted and results noted in a comment on this issue. -
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall
and this issue and await their acknowledgement.) -
There are currently no active incidents.