
Split shards in GitLab.com Global Search Elasticsearch cluster -> 120 (240 inc. replicas)

Production Change

Change Summary

Double the number of shards in our gitlab-production-202007270000 Elasticsearch index of the prod-gitlab-com indexing-20200330 cluster. This is to improve performance as our shards are becoming quite large.

Change Details

  1. Services Impacted - Elasticsearch (for GitLab global search)
  2. Change Technician - @DylanGriffith
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @DylanGriffith
  6. Due Date - 2020-10-26
  7. Time tracking -
  8. Downtime Component - Indexing will be paused for the duration; this took ~4 hrs last time. While indexing is paused, search results may be out of date, but searching should still work for anything created before the pause.

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 60 mins

  1. Run all the steps on staging
  2. Make the cluster larger if necessary so that it is less than 25% full (more than 75% free)
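    • One way to check this from the console (with CLUSTER_URL set as in the change steps below) is per-node disk usage via the allocation API:
    • curl "$CLUSTER_URL/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total"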

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5 hrs

  1. Confirm the cluster storage is less than 25% full (more than 75% free)
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a "split index" on our production Global Search Elasticsearch cluster to increase the number of shards. We will pause indexing during the time it takes to split the index. Read more at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2872
  3. Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
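    • If the admin UI is unavailable, the same setting can in principle be toggled via the application settings API (sketch only; assumes an admin personal access token and that the elasticsearch_pause_indexing setting applies to this GitLab version):
    • curl --request PUT --header "PRIVATE-TOKEN: <admin-token>" "https://gitlab.com/api/v4/application/settings?elasticsearch_pause_indexing=true"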
  4. In any console set CLUSTER_URL and confirm that it is the expected cluster with expected indices:
    1. curl $CLUSTER_URL/_cat/indices
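    2. For example (placeholder host and credentials, not the real cluster URL): export CLUSTER_URL="https://<user>:<password>@<cluster-host>:<port>"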
  5. Wait until we see index writes drop to 0 in Elasticsearch monitoring
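    • In addition to the monitoring dashboards, the index's own indexing counter can be polled from the console; it should stop increasing between calls:
    • curl "$CLUSTER_URL/gitlab-production-202007270000/_stats/indexing?filter_path=_all.total.indexing.index_total"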
  6. Block writes to the source index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202007270000/_settings
  7. Take a snapshot of the cluster
    • Snapshots run automatically on the half hour, and this process will take more than half an hour
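    • The most recent snapshot can also be confirmed from the console (the repository name below is the Elastic Cloud default and is an assumption): curl "$CLUSTER_URL/_cat/snapshots/found-snapshots?v&s=end_epoch"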
  8. Note the total size of the source gitlab-production-202007270000 index: 7 TB
  9. Note the total number of documents in the source gitlab-production-202007270000 index: 637829071
    • curl $CLUSTER_URL/gitlab-production-202007270000/_count
  10. Add a comment to this issue with the shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202007270000?v&s=store:desc&h=shard,prirep,docs,store,node"
    • #2872 (comment 435703736)
  11. Increase recovery max bytes to speed up replication:
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  12. Trigger split from source index gitlab-production-202007270000 to destination index gitlab-production-202010260000
    1. curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 120}}' "$CLUSTER_URL/gitlab-production-202007270000/_split/gitlab-production-202010260000?copy_settings=true"
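    2. Note: the split API requires the target shard count to be a multiple of the source's (60 -> 120 here). The source's current count can be double-checked first with curl "$CLUSTER_URL/gitlab-production-202007270000/_settings?filter_path=*.settings.index.number_of_shards"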
  13. Note the time when the task started: 2020-10-25 23:07 UTC
  14. Track the progress of the split using the Recovery API: curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202010260000?v"
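    • A rough way to poll this once a minute and summarise shards per recovery stage (convenience sketch only): while true; do curl -s "$CLUSTER_URL/_cat/recovery/gitlab-production-202010260000?h=stage" | sort | uniq -c; sleep 60; done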
  15. Note the time when the split finishes: 2020-10-26 00:20 UTC
  16. Note the total time taken to recover: 1 hr 13min
  17. Verify number of documents in Destination index = number of documents in Source index
    • Be aware the count may take up to 60s to refresh on the destination index
    • curl $CLUSTER_URL/gitlab-production-202007270000/_count => 637829071
    • curl $CLUSTER_URL/gitlab-production-202010260000/_count => 637829071
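    • If the destination count lags, a manual refresh can be issued instead of waiting: curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_refresh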
  18. Force merge the new index to remove all deleted docs:
    • curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_forcemerge
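    • The force merge can run for a long time; its progress can be checked from another terminal via the tasks API, e.g. curl "$CLUSTER_URL/_tasks?actions=*forcemerge*&detailed=true"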
  19. Add a comment to this issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node"
    • #2872 (comment 435735238)
  20. Set recovery max bytes back to default
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  21. Force expunge deletes
    • curl -XPOST $CLUSTER_URL/gitlab-production-202010260000/_forcemerge?only_expunge_deletes=true
  22. Record when this expunge deletes started: 2020-10-26 02:30 UTC
  23. Wait for disk storage to shrink as deletes are cleared, until disk usage flatlines
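    • The index's on-disk size can be polled from the console while waiting: curl "$CLUSTER_URL/_cat/indices/gitlab-production-202010260000?v&h=index,pri.store.size,store.size"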
  24. Record when this expunge deletes finishes: 2020-10-26 08:10 UTC
  25. Record how long this expunge deletes takes: 5 hr 40 min
  26. Add a comment to this issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node"
    • #2872 (comment 435854664)
  27. Note the size of the destination index gitlab-production-202010260000 index: 4.9 TB
  28. Update the alias gitlab-production to point to the new index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works curl $CLUSTER_URL/gitlab-production/_count
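    3. Optionally confirm the alias now resolves only to the new index: curl "$CLUSTER_URL/_alias/gitlab-production"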
  29. Test that searching still works.
  30. Unblock writes to the destination index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202010260000/_settings
  31. Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
  32. For consistency (and in case we reindex later) update the number of shards setting in the admin UI to 120 to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
  33. Wait until the backlog of incremental updates gets below 10,000
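    • One way to check the backlog (besides the monitoring dashboards) is from a Rails console; the class and method below are an assumption about the current indexing bookkeeping implementation: Elastic::ProcessBookkeepingService.queue_size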
  34. Create a comment somewhere then search for it to ensure indexing still works (can take up to 2 minutes before it shows up in the search results)

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) -

  1. Wait until disk usage drops back down to somewhere near its pre-change level
    • curl $CLUSTER_URL/_cat/indices
    • This will likely take quite some time, but it can wait: we can re-enable indexing now and the shards will slowly shrink in size as the deleted docs are eventually cleaned up
  2. Note the size of the destination index gitlab-production-202010260000 index: XX TB
  3. Add a comment to this issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202010260000?v&s=store:desc&h=shard,prirep,docs,store,node"
  4. Delete the old gitlab-production-202007270000 index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production-202007270000
  5. Test again that searches work as expected
  6. Scale the cluster down again based on the current size

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) -

If you've finished the whole process but want to revert for performance reasons

  1. Create a new change request repeating all these steps, but using the shrink API to shrink the index back to 60 shards (see the sketch below)
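    • For reference, the shrink call is roughly the mirror image of the split (sketch only; the target index name is a placeholder, and the shrink API additionally requires writes to be blocked and a copy of every shard allocated to a single node first): curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 60}}' "$CLUSTER_URL/gitlab-production-202010260000/_shrink/gitlab-production-<new-date>0000"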

If you've already updated the alias gitlab-production

  1. Update the alias gitlab-production to point to the old index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202010260000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
  2. Delete the newly created index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000

If you have not switched indices yet

  1. Delete the newly created index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production-202010260000

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • [-] A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.