GitLab.com main index shard resize
Background
The average shard size for the main index on GitLab.com is growing by about 0.3 GB/day.
We will likely be back over 50 GB per shard within 1.5 months; if embeddings data is added to the index, it could be sooner.
This issue tracks the work to resize the number of shards for the index.
Proposal/Options
Whatever path is chosen, I recommend we schedule this over a weekend to reduce impact. All indexing must be paused for any maintenance work, and doing it over a weekend will limit the impact to customers.
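For reference, pausing and resuming indexing can be done through the GitLab application settings API. This is a sketch; it assumes an admin-scoped token in $GITLAB_TOKEN:

# pause indexing before maintenance
curl -X PUT -H "PRIVATE-TOKEN: $GITLAB_TOKEN" "https://gitlab.com/api/v4/application/settings?elasticsearch_pause_indexing=true"
# resume indexing once maintenance is complete
curl -X PUT -H "PRIVATE-TOKEN: $GITLAB_TOKEN" "https://gitlab.com/api/v4/application/settings?elasticsearch_pause_indexing=false"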
Zero downtime reindexing
Recent change request issue: Reindex main index to resize shards count (gitlab-com/gl-infra/production#18158 - closed)
There is an issue template to use if we go this route.
Pros
- everything is done automatically by the reindexing feature
- little manual effort
Cons
- the last reindex took a LONG time
- indexing must be paused
- potential for failure
Split shards API
Recent issue: gitlab-com/gl-infra/production#2872 (closed)
- Determine cluster storage required (TBD?)
- Confirm the cluster has enough storage
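One way to check headroom is the allocation cat API; note the split can temporarily require space for a full copy of the source index (a sketch):

curl "$CLUSTER_URL/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent"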
- Scale up the cluster if needed: pause indexing, scale up the cluster, validate search, unpause indexing, validate indexing
- Pause indexing (again, if a cluster scale-up occurred) and verify writes are not happening
- Take a snapshot of the cluster
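A sketch of the snapshot calls, assuming a snapshot repository is already registered (REPO_NAME and the snapshot name are placeholders):

curl -X PUT "$CLUSTER_URL/_snapshot/REPO_NAME/pre-split-snapshot?wait_for_completion=false"
curl "$CLUSTER_URL/_snapshot/REPO_NAME/pre-split-snapshot/_status"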
- Note size of gitlab-production-OLD_INDEX
- Note total number of documents in gitlab-production-OLD_INDEX
- Block writes to gitlab-production-OLD_INDEX:
curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-OLD_INDEX/_settings
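The size and document count can be captured with, for example:

curl "$CLUSTER_URL/_cat/indices/gitlab-production-OLD_INDEX?v&h=index,docs.count,store.size,pri.store.size"
curl "$CLUSTER_URL/gitlab-production-OLD_INDEX/_count"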
- Increase recovery max bytes to speed up replication:
curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
- Trigger the split from source index gitlab-production-OLD_INDEX to destination index gitlab-production-NEW_INDEX:
curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 500}}' "$CLUSTER_URL/gitlab-production-OLD_INDEX/_split/gitlab-production-NEW_INDEX?copy_settings=true"
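Note the split API requires the target shard count to be a multiple of the source index's number_of_shards, which can be confirmed up front (a sketch):

curl "$CLUSTER_URL/gitlab-production-OLD_INDEX/_settings/index.number_of_shards?flat_settings=true"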
- Note time when the task started
- Track the progress of splitting using the Recovery API
curl "$CLUSTER_URL/_cat/recovery/gitlab-production-NEW_INDEX?v"
- Note the time when the split finishes:
- Note total time taken
- Verify the number of documents in gitlab-production-NEW_INDEX equals the number in gitlab-production-OLD_INDEX
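Since writes are blocked, the two counts should match exactly:

curl "$CLUSTER_URL/gitlab-production-OLD_INDEX/_count"
curl "$CLUSTER_URL/gitlab-production-NEW_INDEX/_count"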
- Force merge the new index to remove all deleted docs:
curl -XPOST $CLUSTER_URL/gitlab-production-NEW_INDEX/_forcemerge
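The force merge call may time out client-side while the merge keeps running on the cluster; progress can be followed by watching docs.deleted shrink:

curl "$CLUSTER_URL/_cat/indices/gitlab-production-NEW_INDEX?v&h=index,docs.count,docs.deleted,store.size"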
- Add a comment to the issue with the new shard sizes:
curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-NEW_INDEX?v&s=store:desc&h=shard,prirep,docs,store,node"
- Set recovery max bytes back to default
curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
- Force expunge deletes
curl -XPOST $CLUSTER_URL/gitlab-production-NEW_INDEX/_forcemerge?only_expunge_deletes=true
- Record when the expunge deletes operation started:
- Wait for disk storage to shrink as deletes are cleared, until disk usage flatlines
- Record when the expunge deletes operation finished:
- Record how long the expunge deletes operation took:
- Add a comment to this issue with the new shard sizes:
curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-NEW_INDEX?v&s=store:desc&h=shard,prirep,docs,store,node"
- Note the size of the new index gitlab-production-NEW_INDEX
- Update the alias gitlab-production to point to the new index:
curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-NEW_INDEX","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-OLD_INDEX","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
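The alias swap can be verified with:

curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"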
- Confirm it works:
curl $CLUSTER_URL/gitlab-production/_count
- Test that searching still works.
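A minimal smoke-test search through the alias (a sketch):

curl -H 'Content-Type: application/json' -d '{"size":1,"query":{"match_all":{}}}' "$CLUSTER_URL/gitlab-production/_search"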
- Unblock writes to the new index:
curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-NEW_INDEX/_settings
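Confirm the block is gone:

curl "$CLUSTER_URL/gitlab-production-NEW_INDEX/_settings/index.blocks.write?flat_settings=true"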
- Unpause indexing
- Verify search is working
- For consistency (and in case we reindex later), update the number of shards setting in the admin UI to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
- Verify indexing is working
Pros
- may be faster than zero downtime reindexing
- lower chance of failure
Cons
- unknown time to complete
- manual process required