Split shards in GitLab.com Global Search Elasticsearch cluster

Production Change - Criticality 3 (C3)

Change Objective: Split the gitlab-production index in the GitLab.com Global Search Elasticsearch cluster to increase the number of shards and reduce per-shard size
Change Type: ConfigurationChange
Services Impacted: GitLab.com, Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq
Change Team Members: @DylanGriffith
Change Criticality: C3
Change Reviewer or tested in staging: #2374 (comment 373938155)
Dry-run output:
Due Date: 2020-07-07 02:52 UTC
Time tracking:

Detailed steps for the change

Our cluster shards are getting too large. This can make recovering failed shards slow, and it also means searches may not be utilising our CPUs efficiently, since each search uses only one thread per shard.

We can increase the number of shards by using the Split Index API. This is expected to be very fast (a matter of seconds) and, in any case, considerably faster than a full reindex, which we have done several times before.
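
The split requires the source index to be write-blocked (done in step 6 below) and the target shard count to be a whole multiple of the source's primary shard count, so it is worth confirming the current value before committing to the target of 60. A minimal check, using the source index name from this plan:

```shell
# Show how many primary shards the source index currently has; the split
# target (60) must be an integer multiple of this number.
curl -s "$CLUSTER_URL/gitlab-production-202006290000/_settings?filter_path=*.settings.index.number_of_shards"
```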

Pre-check

  1. Run all the steps on staging
  2. Make the cluster larger if necessary. It should be less than 25% full (more than 75% free)
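
One way to check the free-space requirement above (a sketch of how it could be verified, not a step from the original plan) is the cat allocation API, which reports per-node disk usage:

```shell
# Per-node disk usage; disk.percent should be well under 25 before the split,
# since the new index takes extra space until deleted docs are merged away.
curl -s "$CLUSTER_URL/_cat/allocation?v&h=node,disk.percent,disk.used,disk.avail,disk.total"
```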

Process

  1. Confirm the cluster storage is less than 25% full (more than 75% free)
  2. Let the SRE on call know that we are triggering the index split in #production: @sre-oncall please note we are doing a "split index" on our production Global Search Elasticsearch cluster to increase the number of shards. We will pause indexing during the time it takes to split the index. Read more at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2374
  3. Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
  4. In any console, set CLUSTER_URL and confirm that it is the expected cluster with the expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  5. Wait until we see index writes drop to 0 in Elasticsearch monitoring
  6. Block writes to the source index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202006290000/_settings
  7. Take a snapshot of the cluster
  8. Note the total size of the source gitlab-production-202006290000 index: 4.1 TB
  9. Note the total number of documents in the source gitlab-production-202006290000 index: 465932650
    • curl $CLUSTER_URL/gitlab-production-202006290000/_count
  10. Add a comment to this issue with the shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202006290000?v&s=store:desc&h=shard,prirep,docs,store,node"
    • #2374 (comment 374610790)
  11. Increase recovery max bytes to speed up replication:
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  12. Trigger split from source index gitlab-production-202006290000 to destination index gitlab-production-202007070000
    1. curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 60}}' "$CLUSTER_URL/gitlab-production-202006290000/_split/gitlab-production-202007070000?copy_settings=true"
  13. Note the time when the task started: 2020-07-07 02:52 UTC
  14. Track the progress of the split using the Recovery API: curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202007070000?v" (a polling sketch is included after this list)
  15. Note the time when the split finishes: 2020-07-07 04:25 UTC
  16. Note the total time taken for the split: 1 hr 33 m
  17. Verify that the number of documents in the destination index equals the number in the source index
    • Be aware it may take up to 60s for the destination index to refresh
    • curl $CLUSTER_URL/gitlab-production-202006290000/_count => 465932650
    • curl $CLUSTER_URL/gitlab-production-202007070000/_count => 465932650
  18. Force merge the new index to remove all deleted docs:
    • curl -XPOST $CLUSTER_URL/gitlab-production-202007070000/_forcemerge
  19. Wait until you see disk usage drop significantly (it should eventually get back down to roughly the size of the original index)
    • curl $CLUSTER_URL/_cat/indices
    • This will likely take quite some time, but it does not block the rest of the process: we can re-enable indexing now and the shards will gradually shrink as the deleted docs are cleaned up
  20. Add a comment to this issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202007070000?v&s=store:desc&h=shard,prirep,docs,store,node"
  21. Set recovery max bytes back to default
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  22. Note the size of the destination gitlab-production-202007070000 index: 6.9 TB
  23. Update the alias gitlab-production to point to the new index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202006290000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  24. Test that searching still works.
  25. Unblock writes to the destination index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202007070000/_settings
  26. Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
  27. For consistency (and in case we reindex later) update the number of shards setting in the admin UI to 60 to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
  28. Wait until the backlog of incremental updates gets below 10,000
  29. Create a comment somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
  30. Delete the old gitlab-production-202006290000 index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production-202006290000
  31. Test again that searches work as expected
  32. Scale the cluster down again based on the current size
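
For step 14, instead of re-running the recovery command by hand, a small polling loop can watch the split until every shard reports done. This is a sketch and assumes CLUSTER_URL is exported in the current shell:

```shell
# Poll the Recovery API every 60s until all shards of the new index report
# the "done" stage.
while true; do
  out=$(curl -s "$CLUSTER_URL/_cat/recovery/gitlab-production-202007070000?h=shard,stage,bytes_percent")
  echo "$out"
  # Finished once the output is non-empty and every shard's stage (column 2) is "done".
  if [ -n "$out" ] && ! echo "$out" | awk '{print $2}' | grep -qv '^done$'; then
    echo "Split recovery complete"
    break
  fi
  sleep 60
done
```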

Monitoring

Key metrics to observe

Other metrics to observe

Rollback steps

  1. Unblock writes to the source index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202006290000/_settings
  2. If you got past the step of updating the alias, simply switch the alias to point back to the original index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202006290000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  3. Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using ProcessBookkeepingService#track

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out