Split shards in GitLab.com Global Search Elasticsearch cluster -> 120 (240 inc. replicas)
Production Change
Change Summary
Double the number of shards in our gitlab-production-202007270000 Elasticsearch index of the prod-gitlab-com indexing-20200330 cluster. This is to improve performance as our shards are becoming quite large.
Change Details
- Services Impacted - Elasticsearch (for GitLab global search)
- Change Technician - @DylanGriffith
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @DylanGriffith
- Due Date - 2020-10-26
- Time tracking -
- Downtime Component - Indexing will be paused for the duration; this took ~4 hrs last time. While indexing is paused, search results may be out of date, but searching will still work for anything created before the pause.
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 60 mins
- Run all the steps on staging
- Make the cluster larger if necessary. It should be less than 25% full (more than 75% free).
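  One way to check how full the cluster is (a convenience check, not part of the original steps) is the cat allocation API, which shows per-node disk usage:
  curl "$CLUSTER_URL/_cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail"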
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5 hrs
- Confirm the cluster storage is less than 25% full (more than 75% free)
- Let the SRE on-call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a "split index" on our production Global Search Elasticsearch cluster to increase the number of shards. We will pause indexing during the time it takes to split the index. Read more at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2872
- Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
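  If the admin UI is unavailable, the same toggle should also be reachable through the GitLab application settings API (assuming the elasticsearch_pause_indexing attribute is exposed on this GitLab version):
  curl --request PUT --header "PRIVATE-TOKEN: $ADMIN_TOKEN" "https://gitlab.com/api/v4/application/settings?elasticsearch_pause_indexing=true"  # $ADMIN_TOKEN is a hypothetical admin personal access token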
- In any console, set CLUSTER_URL and confirm that it is the expected cluster with the expected indices:
  curl $CLUSTER_URL/_cat/indices
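  As an extra sanity check (not in the original steps), confirm which index the gitlab-production alias currently points to:
  curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"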
- Wait until we see index writes drop to 0 in Elasticsearch monitoring
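  An optional extra check is to read the index_total counter twice, a minute apart, and verify it is no longer increasing:
  curl "$CLUSTER_URL/gitlab-production-202007270000/_stats/indexing?filter_path=_all.total.indexing.index_total"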
- Block writes to the source index: curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202007270000/_settings
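  To verify the block took effect (optional), read the setting back:
  curl "$CLUSTER_URL/gitlab-production-202007270000/_settings?filter_path=*.settings.index.blocks"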
- Take a snapshot of the cluster - this will happen on the half hour, and the process will take more than half an hour
- Note the total size of the source gitlab-production-202007270000 index: 7 TB
- Note the total number of documents in the source gitlab-production-202007270000 index: 637829071
  curl $CLUSTER_URL/gitlab-production-202007270000/_count
- Add a comment to this issue with the shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202007270000?v&s=store:desc&h=shard,prirep,docs,store,node" - #2872 (comment 435703736)
- Increase recovery max bytes to speed up replication:
  curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
- Trigger the split from source index gitlab-production-202007270000 to destination index gitlab-production-202010260000:
  curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 120}}' "$CLUSTER_URL/gitlab-production-202007270000/_split/gitlab-production-202010260000?copy_settings=true"
- Note the time when the task started: 2020-10-25 23:07 UTC
- Track the progress of the split using the Recovery API: curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202010260000?v"
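  For a more compact view of progress (optional), the recovery output can be restricted to the stage and bytes recovered per shard:
  curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202010260000?v&h=index,shard,stage,bytes_percent&s=stage"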
- Note the time when the split finishes: 2020-10-26 00:20 UTC
- Note the total time taken to recover: 1 hr 13 min
- Verify that the number of documents in the destination index equals the number of documents in the source index (be aware it may take ~60s for the destination index to refresh):
  curl $CLUSTER_URL/gitlab-production-202007270000/_count => 637829071
  curl $CLUSTER_URL/gitlab-production-202010260000/_count => 637829071
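  A quick way to compare the two (optional) is to pull just the count field from each response:
  curl -s "$CLUSTER_URL/gitlab-production-202007270000/_count?filter_path=count"
  curl -s "$CLUSTER_URL/gitlab-production-202010260000/_count?filter_path=count"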
- Force merge the new index to remove all deleted docs: curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_forcemerge
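  The effect can be observed (optional) by watching docs.deleted and the store size for the new index:
  curl "$CLUSTER_URL/_cat/indices/gitlab-production-202010260000?v&h=index,docs.count,docs.deleted,store.size,pri.store.size"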
- Add a comment to this issue with the new shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node" - #2872 (comment 435735238)
- Set recovery max bytes back to the default:
  curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
- Force expunge deletes: curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_forcemerge?only_expunge_deletes=true
- Record when this expunge deletes started: 2020-10-26 02:30 UTC
- Wait for disk storage to shrink as deletes are cleared, and wait until the disk usage flatlines
- Record when this expunge deletes finished: 2020-10-26 08:10 UTC
- Record how long this expunge deletes took: 5 hr 40 min
- Add a comment to this issue with the new shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node" - #2872 (comment 435854664)
- Note the size of the destination gitlab-production-202010260000 index: 4.9 TB
- Update the alias gitlab-production to point to the new index:
  curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
- Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
- Test that searching still works.
- Unblock writes to the destination index: curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202010260000/_settings
- Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- For consistency (and in case we reindex later), update the number of shards setting in the admin UI to 120 to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
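  As with pausing indexing, this setting should also be adjustable through the application settings API (assuming the elasticsearch_shards attribute is exposed on this GitLab version):
  curl --request PUT --header "PRIVATE-TOKEN: $ADMIN_TOKEN" "https://gitlab.com/api/v4/application/settings?elasticsearch_shards=120"  # $ADMIN_TOKEN is a hypothetical admin personal access token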
- Wait until the backlog of incremental updates gets below 10,000 - see the "Global search incremental indexing queue depth" chart: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- Create a comment somewhere then search for it to ensure indexing still works (can take up to 2 minutes before it shows up in the search results)
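  One way to script this check (optional) is the search API with the notes scope, which relies on Elasticsearch for global search; $TOKEN and the search string below are hypothetical:
  curl --header "PRIVATE-TOKEN: $TOKEN" "https://gitlab.com/api/v4/search?scope=notes&search=split-test-<unique-string>"  # $TOKEN and split-test-<unique-string> are placeholders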
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) -
- Wait until you see disk usage drop quite a bit, down to somewhere near where it was before: curl $CLUSTER_URL/_cat/indices. This will likely take quite some time, but it can wait; we can re-enable indexing now, and the shards will slowly shrink in size as the deleted docs are eventually cleaned up
- Note the size of the destination gitlab-production-202010260000 index: XX TB
- Add a comment to this issue with the new shard sizes: curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node"
- Delete the old gitlab-production-202007270000 index:
  curl -XDELETE $CLUSTER_URL/gitlab-production-202007270000
- Test again that searches work as expected
- Scale the cluster down again based on the current size
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) -
If you've finished the whole process but want to revert for performance reasons:
- Create a new change request doing all of these steps again, but using the shrink API to shrink it back to 60 shards (see the sketch below)
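  A minimal sketch of that shrink, assuming a hypothetical target index name and that the standard prerequisites are set up first (writes blocked and a copy of every shard relocated onto a single node):
  curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.routing.allocation.require._name":"<one-node-name>","index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202010260000/_settings  # <one-node-name> is a placeholder
  curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 60}}' "$CLUSTER_URL/gitlab-production-202010260000/_shrink/gitlab-production-<newdate>0000"  # target index name is a placeholder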
If you've already updated the alias gitlab-production:
- Update the alias gitlab-production to point back to the old index:
  curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
- Delete the newly created index:
  curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000
If you have not switched indices yet:
- Delete the newly created index:
  curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000
Monitoring
Key metrics to observe
- Metric: Elasticsearch cluster health
- Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:HdF5sKvcT5WQHHyYR_EDcw)
- What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
- Metric: Elasticsearch monitoring in Grafana
- Metric: Indexing queues
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: After unpausing, indexing is failing and the queues are constantly growing
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?
- Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- [-] A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.