GitLab.com main index shard resize

Background

The average shard size for the main index on GitLab.com is growing by about 0.3 GB/day.

We will likely be back above 50 GB per shard within 1.5 months. If embeddings data is added to the index, it could be sooner.
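A back-of-the-envelope check of that timeline (the 0.3 GB/day rate is from above; the current shard size used here is a hypothetical starting point, not a measured value):

```python
# Rough estimate of when the average shard crosses 50 GB.
# current_gb is an assumed starting point for illustration only;
# the 0.3 GB/day growth rate comes from the text above.
def days_until(target_gb, current_gb, growth_gb_per_day):
    """Days until a linearly growing shard reaches target_gb."""
    return (target_gb - current_gb) / growth_gb_per_day

days = days_until(target_gb=50, current_gb=37, growth_gb_per_day=0.3)
print(f"~{days:.0f} days (~{days / 30:.1f} months)")  # → ~43 days (~1.4 months)
```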

This issue will track the work to resize the number of shards for the index.

Proposal/Options

Whatever path is chosen, I recommend we schedule this over a weekend to reduce impact. All indexing must be paused during any maintenance work, and doing it over a weekend will limit the impact to customers.

Zero downtime reindexing

Recent change request issue: Reindex main index to resize shards count (gitlab-com/gl-infra/production#18158 - closed)

There is an issue template to use if we go this route.

Pros

  • everything is done automatically by the reindexing feature
  • little manual effort

Cons

  • took a long time on the previous attempt
  • indexing must be paused
  • potential for failure

Split shards API

Recent issue: gitlab-com/gl-infra/production#2872 (closed)
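One pre-flight check worth scripting: the split API only accepts target shard counts that are an integer multiple of the source index's primary shard count, so the 500 used below assumes the source count divides it evenly. A sketch (the 250-shard source is a hypothetical example, not the measured count):

```python
def valid_split_targets(source_shards, max_shards):
    """Target primary-shard counts the _split API will accept:
    integer multiples of the source index's primary shard count."""
    return list(range(source_shards * 2, max_shards + 1, source_shards))

# e.g. a hypothetical index with 250 primaries could split to:
print(valid_split_targets(250, 1000))  # → [500, 750, 1000]
```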

  1. determine cluster storage required (TBD?)
  2. confirm the cluster has enough storage
  3. scale up the cluster if needed
    • pause indexing, scale up the cluster, validate search, unpause indexing, validate indexing
  4. pause indexing (again, if a cluster scale-up occurred) and verify writes are not happening
  5. Take a snapshot of the cluster
  6. Note size of gitlab-production-OLD_INDEX
  7. Note total number of documents in gitlab-production-OLD_INDEX
  8. Block writes to gitlab-production-OLD_INDEX:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-OLD_INDEX/_settings
  9. increase recovery max bytes to speed up replication:
    • curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  10. Trigger split from source index gitlab-production-OLD_INDEX to destination index gitlab-production-NEW_INDEX
    • curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 500}}' "$CLUSTER_URL/gitlab-production-OLD_INDEX/_split/gitlab-production-NEW_INDEX?copy_settings=true"
  11. Note time when the task started
  12. Track the progress of splitting using the Recovery API
    • curl "$CLUSTER_URL/_cat/recovery/gitlab-production-NEW_INDEX?v"
  13. Note the time when the split finishes:
  14. Note total time taken
  15. Verify the number of documents in gitlab-production-NEW_INDEX equals the count recorded for gitlab-production-OLD_INDEX
  16. Force merge the new index to remove all deleted docs:
    • curl -XPOST $CLUSTER_URL/gitlab-production-NEW_INDEX/_forcemerge
  17. Add a comment to the issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-NEW_INDEX?v&s=store:desc&h=shard,prirep,docs,store,node"
  18. Set recovery max bytes back to default
    • curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  19. Force expunge deletes
    • curl -XPOST $CLUSTER_URL/gitlab-production-NEW_INDEX/_forcemerge?only_expunge_deletes=true
  20. Record when this expunge deletes started:
  21. Wait for disk storage to shrink as deletes are cleared, until disk usage flatlines
  22. Record when this expunge deletes finishes:
  23. Record how long this expunge deletes takes:
  24. Add a comment to this issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-NEW_INDEX?v&s=store:desc&h=shard,prirep,docs,store,node"
  25. Note the size of the new index gitlab-production-NEW_INDEX
  26. Update the alias gitlab-production to point to the new index
    • curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-NEW_INDEX","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-OLD_INDEX","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
  27. Confirm the alias resolves:
    • curl $CLUSTER_URL/gitlab-production/_count
  28. Test that searching still works.
  29. Unblock writes to the new index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-NEW_INDEX/_settings
  30. Unpause indexing
  31. Verify search is working
  32. For consistency (and in case we reindex later) update the number of shards setting in the admin UI to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
  33. Verify indexing is working
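The count check in step 15 can be scripted rather than eyeballed. A sketch that compares parsed `_count` responses — fetching is left out so the comparison logic is clear; `old` and `new` below are sample payloads standing in for the output of `curl $CLUSTER_URL/<index>/_count`:

```python
import json

def counts_match(old_resp: str, new_resp: str) -> bool:
    """Compare the `count` field of two _count API responses."""
    return json.loads(old_resp)["count"] == json.loads(new_resp)["count"]

# Sample payloads in the shape _count returns (values are made up):
old = '{"count": 123456, "_shards": {"total": 250, "successful": 250}}'
new = '{"count": 123456, "_shards": {"total": 500, "successful": 500}}'
print(counts_match(old, new))  # → True
```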

Pros

  • may be faster than zero downtime reindexing
  • less chance of failure

Cons

  • unknown time to complete
  • manual process required
Edited by Terri Chu