GitLab.com main index shard resize

Background

The average shard size for the main index on GitLab.com is growing by about 0.3 GB/day.

We will likely be back above 50 GB per shard within 1.5 months. If embeddings data is added to the index, it could be sooner.
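A back-of-the-envelope check of that timeline (the 0.3 GB/day rate is from above; the current shard size used here is a hypothetical starting point, not a measured value):

```python
# Rough estimate of when the average shard crosses 50 GB.
# current_gb is an assumed starting point for illustration only;
# the 0.3 GB/day growth rate comes from the text above.
def days_until(target_gb, current_gb, growth_gb_per_day):
    """Days until a linearly growing shard reaches target_gb."""
    return (target_gb - current_gb) / growth_gb_per_day

days = days_until(target_gb=50, current_gb=37, growth_gb_per_day=0.3)
print(f"~{days:.0f} days (~{days / 30:.1f} months)")  # → ~43 days (~1.4 months)
```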

This issue will track the work to resize the number of shards for the index.

Proposal/Options

Whatever path is chosen, I recommend we schedule this over a weekend to reduce impact. All indexing must be paused during any maintenance work, and doing it over a weekend will limit the impact to customers.

Zero downtime reindexing

Recent change request issue: Reindex main index to resize shards count (gitlab-com/gl-infra/production#18158 - closed)

There is an issue template to use if we go this route.

Pros

  • everything is done automatically by the reindexing feature
  • little manual effort

Cons

  • took a long time on the previous attempt
  • indexing must be paused
  • potential for failure

Split shards API

Recent issue: gitlab-com/gl-infra/production#2872 (closed)
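One pre-flight check worth scripting: the split API only accepts target shard counts that are an integer multiple of the source index's primary shard count, so the 500 used below assumes the source count divides it evenly. A sketch (the 250-shard source is a hypothetical example, not the measured count):

```python
def valid_split_targets(source_shards, max_shards):
    """Target primary-shard counts the _split API will accept:
    integer multiples of the source index's primary shard count."""
    return list(range(source_shards * 2, max_shards + 1, source_shards))

# e.g. a hypothetical index with 250 primaries could split to:
print(valid_split_targets(250, 1000))  # → [500, 750, 1000]
```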

  1. determine cluster storage required (TBD?)
  2. confirm the cluster has enough storage
  3. scale up the cluster if needed
    • pause indexing, scale up the cluster, validate search, unpause indexing, validate indexing
  4. pause indexing (again, if a cluster scale-up occurred) and verify writes are not happening
  5. Take a snapshot of the cluster
  6. Note size of gitlab-production-OLD_INDEX
  7. Note total number of documents in gitlab-production-OLD_INDEX
  8. Block writes to gitlab-production-OLD_INDEX:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-OLD_INDEX/_settings
  9. increase recovery max bytes to speed up replication:
    • curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  10. Trigger split from source index gitlab-production-OLD_INDEX to destination index gitlab-production-NEW_INDEX
    • curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 500}}' "$CLUSTER_URL/gitlab-production-OLD_INDEX/_split/gitlab-production-NEW_INDEX?copy_settings=true"
  11. Note time when the task started
  12. Track the progress of splitting using the Recovery API
    • curl "$CLUSTER_URL/_cat/recovery/gitlab-production-NEW_INDEX?v"
  13. Note the time when the split finishes:
  14. Note total time taken
  15. Verify the number of documents in gitlab-production-NEW_INDEX equals the count recorded for gitlab-production-OLD_INDEX
  16. Force merge the new index to remove all deleted docs:
    • curl -XPOST $CLUSTER_URL/gitlab-production-NEW_INDEX/_forcemerge
  17. Add a comment to the issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-NEW_INDEX?v&s=store:desc&h=shard,prirep,docs,store,node"
  18. Set recovery max bytes back to default
    • curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  19. Force expunge deletes
    • curl -XPOST $CLUSTER_URL/gitlab-production-NEW_INDEX/_forcemerge?only_expunge_deletes=true
  20. Record when this expunge deletes started:
  21. Wait for disk storage to shrink as deletes are cleared, until disk usage flatlines
  22. Record when this expunge deletes finishes:
  23. Record how long this expunge deletes takes:
  24. Add a comment to this issue with the new shard sizes:
    • curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-NEW_INDEX?v&s=store:desc&h=shard,prirep,docs,store,node"
  25. Note the size of the new index gitlab-production-NEW_INDEX
  26. Update the alias gitlab-production to point to the new index
    • curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-NEW_INDEX","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-OLD_INDEX","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
  27. Confirm the alias resolves:
    • curl $CLUSTER_URL/gitlab-production/_count
  28. Test that searching still works.
  29. Unblock writes to the new index:
    • curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-NEW_INDEX/_settings
  30. Unpause indexing
  31. Verify search is working
  32. For consistency (and in case we reindex later) update the number of shards setting in the admin UI to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
  33. Verify indexing is working
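The count check in step 15 can be scripted rather than eyeballed. A sketch that compares parsed `_count` responses — fetching is left out so the comparison logic is clear; `old` and `new` below are sample payloads standing in for the output of `curl $CLUSTER_URL/<index>/_count`:

```python
import json

def counts_match(old_resp: str, new_resp: str) -> bool:
    """Compare the `count` field of two _count API responses."""
    return json.loads(old_resp)["count"] == json.loads(new_resp)["count"]

# Sample payloads in the shape _count returns (values are made up):
old = '{"count": 123456, "_shards": {"total": 250, "successful": 250}}'
new = '{"count": 123456, "_shards": {"total": 500, "successful": 500}}'
print(counts_match(old, new))  # → True
```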

Pros

  • may be faster than zero downtime reindexing
  • less chance of failure

Cons

  • unknown time to complete
  • manual process required
Edited by Terri Chu