Split shards in GitLab.com Global Search Elasticsearch cluster -> 120 (240 inc. replicas)
Production Change
Change Summary
Double the number of shards in our gitlab-production-202007270000 Elasticsearch index of the prod-gitlab-com indexing-20200330 cluster. This is to improve performance as our shards are becoming quite large.
Change Details
- Services Impacted - Elasticsearch (for GitLab global search)
- Change Technician - @DylanGriffith
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @DylanGriffith
- Due Date - 2020-10-26
- Time tracking -
- Downtime Component - Indexing will be paused for the duration; this took ~4 hrs last time. While indexing is paused, search results may be out of date, but searching will still work for anything created before the pause.
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 60 mins
- Run all the steps on staging
- Make the cluster larger if necessary. It should be less than 25% full (more than 75% free).
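  One way to check how full the cluster is (a convenience check, not part of the original steps) is the cat allocation API, which shows per-node disk usage:
  curl "$CLUSTER_URL/_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail"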
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5 hrs
- Confirm the cluster storage is less than 25% full (more than 75% free)
- Let the SRE on-call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a "split index" on our production Global Search Elasticsearch cluster to increase the number of shards. We will pause indexing during the time it takes to split the index. Read more at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2872
- Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
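  If the admin UI is unavailable, the same toggle should also be reachable through the GitLab application settings API (assuming the elasticsearch_pause_indexing attribute is exposed on this GitLab version):
  curl --request PUT --header "PRIVATE-TOKEN: $ADMIN_TOKEN" "https://gitlab.com/api/v4/application/settings?elasticsearch_pause_indexing=true"  # $ADMIN_TOKEN is a hypothetical admin personal access token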
- In any console, set CLUSTER_URL and confirm that it is the expected cluster with the expected indices:
  curl $CLUSTER_URL/_cat/indices
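  As an extra sanity check (not in the original steps), confirm which index the gitlab-production alias currently points to:
  curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"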
- Wait until we see index writes drop to 0 in Elasticsearch monitoring
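  An optional extra check is to read the index_total counter twice, a minute apart, and verify it is no longer increasing:
  curl "$CLUSTER_URL/gitlab-production-202007270000/_stats/indexing?filter_path=_all.total.indexing.index_total"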
- Block writes to the source index: curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202007270000/_settings
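  To verify the block took effect (optional), read the setting back:
  curl "$CLUSTER_URL/gitlab-production-202007270000/_settings?filter_path=*.settings.index.blocks"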
- Take a snapshot of the cluster - this will happen on the half hour, and the process will take more than half an hour
- Note the total size of the source gitlab-production-202007270000 index: 7 TB
- Note the total number of documents in the source gitlab-production-202007270000 index: 637829071
  curl $CLUSTER_URL/gitlab-production-202007270000/_count
- Add a comment to this issue with the shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202007270000?v&s=store:desc&h=shard,prirep,docs,store,node" - #2872 (comment 435703736)
- Increase recovery max bytes to speed up replication:
  curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
- Trigger the split from source index gitlab-production-202007270000 to destination index gitlab-production-202010260000:
  curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 120}}' "$CLUSTER_URL/gitlab-production-202007270000/_split/gitlab-production-202010260000?copy_settings=true"
- Note the time when the task started: 2020-10-25 23:07 UTC
- Track the progress of the split using the Recovery API: curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202010260000?v"
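  For a more compact view of progress (optional), the recovery output can be restricted to the stage and bytes recovered per shard:
  curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202010260000?v&h=index,shard,stage,bytes_percent&s=stage"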
- Note the time when the split finishes: 2020-10-26 00:20 UTC
- Note the total time taken to recover: 1 hr 13 min
- Verify that the number of documents in the destination index equals the number of documents in the source index (be aware it may take ~60s for the destination index to refresh):
  curl $CLUSTER_URL/gitlab-production-202007270000/_count => 637829071
  curl $CLUSTER_URL/gitlab-production-202010260000/_count => 637829071
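  A quick way to compare the two (optional) is to pull just the count field from each response:
  curl -s "$CLUSTER_URL/gitlab-production-202007270000/_count?filter_path=count"
  curl -s "$CLUSTER_URL/gitlab-production-202010260000/_count?filter_path=count"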
- Force merge the new index to remove all deleted docs: curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_forcemerge
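  The effect can be observed (optional) by watching docs.deleted and the store size for the new index:
  curl "$CLUSTER_URL/_cat/indices/gitlab-production-202010260000?v&h=index,docs.count,docs.deleted,store.size,pri.store.size"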
- Add a comment to this issue with the new shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node" - #2872 (comment 435735238)
- Set recovery max bytes back to the default:
  curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
- Force expunge deletes: curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_forcemerge?only_expunge_deletes=true
- Record when this expunge deletes started: 2020-10-26 02:30 UTC
- Wait for disk storage to shrink as deletes are cleared, and wait until the disk usage flatlines
- Record when this expunge deletes finished: 2020-10-26 08:10 UTC
- Record how long this expunge deletes took: 5 hr 40 min
- Add a comment to this issue with the new shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node" - #2872 (comment 435854664)
- Note the size of the destination gitlab-production-202010260000 index: 4.9 TB
- Update the alias gitlab-production to point to the new index:
  curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
- Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
- Test that searching still works.
- Unblock writes to the destination index: curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202010260000/_settings
- Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- For consistency (and in case we reindex later), update the number of shards setting in the admin UI to 120 to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
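  As with pausing indexing, this setting should also be adjustable through the application settings API (assuming the elasticsearch_shards attribute is exposed on this GitLab version):
  curl --request PUT --header "PRIVATE-TOKEN: $ADMIN_TOKEN" "https://gitlab.com/api/v4/application/settings?elasticsearch_shards=120"  # $ADMIN_TOKEN is a hypothetical admin personal access token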
- Wait until the backlog of incremental updates gets below 10,000 - see the "Global search incremental indexing queue depth" chart: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- Create a comment somewhere then search for it to ensure indexing still works (can take up to 2 minutes before it shows up in the search results)
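  One way to script this check (optional) is the search API with the notes scope, which relies on Elasticsearch for global search; $TOKEN and the search string below are hypothetical:
  curl --header "PRIVATE-TOKEN: $TOKEN" "https://gitlab.com/api/v4/search?scope=notes&search=split-test-<unique-string>"  # $TOKEN and split-test-<unique-string> are placeholders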
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) -
- Wait until you see disk usage drop quite a bit, down to somewhere near where it was before: curl $CLUSTER_URL/_cat/indices. This will likely take quite some time, but it can wait; we can re-enable indexing now, and the shards will slowly shrink in size as the deleted docs are eventually cleaned up
- Note the size of the destination gitlab-production-202010260000 index: XX TB
- Add a comment to this issue with the new shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node"
- Delete the old gitlab-production-202007270000 index:
  curl -XDELETE $CLUSTER_URL/gitlab-production-202007270000
- Test again that searches work as expected
- Scale the cluster down again based on the current size
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) -
If you've finished the whole process but want to revert for performance reasons:
- Create a new change request doing all of these steps again, but using the shrink API to shrink it back to 60 shards (see the sketch below)
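  A minimal sketch of that shrink, assuming a hypothetical target index name and that the standard prerequisites are set up first (writes blocked and a copy of every shard relocated onto a single node):
  curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.routing.allocation.require._name":"<one-node-name>","index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202010260000/_settings  # <one-node-name> is a placeholder
  curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 60}}' "$CLUSTER_URL/gitlab-production-202010260000/_shrink/gitlab-production-<newdate>0000"  # target index name is a placeholder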
If you've already updated the alias gitlab-production:
- Update the alias gitlab-production to point back to the old index:
  curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
- Delete the newly created index:
  curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000
If you have not switched indices yet:
- Delete the newly created index:
  curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000
Monitoring
Key metrics to observe
- Metric: Elasticsearch cluster health
- Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:HdF5sKvcT5WQHHyYR_EDcw)
- What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
- Metric: Elasticsearch monitoring in Grafana
- Metric: Indexing queues
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: After unpausing, indexing is failing and the queues are constantly growing
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?
- Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- [-] A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.