# Split shards in GitLab.com Global Search Elasticsearch cluster

Production Change - Criticality 3 (C3)
| Change Objective | Split the `gitlab-production` index into a larger number of shards to reduce shard size |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | GitLab.com, Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq |
| Change Team Members | @DylanGriffith |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | #2374 (comment 373938155) |
| Dry-run output | |
| Due Date | 2020-07-07 02:52 UTC |
| Time tracking | |
## Detailed steps for the change
Our cluster shards are getting too large. This can cause problems when recovering failed shards (recovery can take a long time), and it also means searches may not use our CPUs efficiently, since a search uses only one thread per shard.

We can increase the number of shards by using the Split Index API. This is expected to be very fast (just a few seconds), and in any case it should be much faster than a full reindex, which we have done several times before.
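
The Split Index API requires the target shard count to be a multiple of the source index's current primary shard count, so it is worth confirming the current value before choosing the target. A minimal check, using the source index name from the steps below:

```shell
# Show the current number of primary shards on the source index; the split
# target (60 in the steps below) must be an integer multiple of this value.
curl "$CLUSTER_URL/gitlab-production-202006290000/_settings?filter_path=*.settings.index.number_of_shards&pretty"
```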
### Pre-check

- [ ] Run all the steps on staging
- [ ] Make the cluster larger if necessary. It should be less than 25% full (more than 75% free).
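
  One way to check how full the data nodes are is the cat allocation API (a sketch; any existing disk-usage dashboard works equally well):

  ```shell
  # Disk used/available and percentage per data node
  curl "$CLUSTER_URL/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent"
  ```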
### Process
- [ ] Confirm the cluster storage is less than 25% full (more than 75% free)
- [ ] Let the SRE on call know in #production that we are triggering the split: @sre-oncall please note we are doing a "split index" on our production Global Search Elasticsearch cluster to increase the number of shards. We will pause indexing during the time it takes to split the index. Read more at https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2374
- [ ] Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- [ ] In any console, set `CLUSTER_URL` and confirm that it is the expected cluster with the expected indices:

  ```shell
  curl $CLUSTER_URL/_cat/indices
  ```
- [ ] Wait until we see index writes drop to 0 in Elasticsearch monitoring
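
  If the monitoring dashboards are unavailable, one way to confirm writes have stopped is to run the following a couple of times, a minute apart, and check that `index_total` is no longer increasing (a sketch using the index stats API):

  ```shell
  # Cumulative count of index operations on the source index; it should stop
  # increasing once indexing is fully paused.
  curl "$CLUSTER_URL/gitlab-production-202006290000/_stats/indexing?filter_path=_all.total.indexing.index_total&pretty"
  ```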
- [ ] Block writes to the source index:

  ```shell
  curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":true}}' $CLUSTER_URL/gitlab-production-202006290000/_settings
  ```
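
  To double-check the block took effect, the index settings can be read back (a sketch):

  ```shell
  # Should show "index.blocks.write": "true" for the source index
  curl "$CLUSTER_URL/gitlab-production-202006290000/_settings?filter_path=*.settings.index.blocks&pretty"
  ```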
- [ ] Take a snapshot of the cluster
- [ ] Note the total size of the source `gitlab-production-202006290000` index: 4.1 TB
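
  The size can be read from the cat indices API (a sketch; `store.size` includes replicas, `pri.store.size` is primaries only):

  ```shell
  curl "$CLUSTER_URL/_cat/indices/gitlab-production-202006290000?v&h=index,pri,rep,docs.count,pri.store.size,store.size"
  ```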
- [ ] Note the total number of documents in the source `gitlab-production-202006290000` index: 465932650 - `curl $CLUSTER_URL/gitlab-production-202006290000/_count`
- [ ] Add a comment to this issue with the shard sizes: `curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202006290000?v&s=store:desc&h=shard,prirep,docs,store,node"` - #2374 (comment 374610790)
- [ ] Increase recovery max bytes to speed up replication:

  ```shell
  curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  ```
- [ ] Trigger the split from source index `gitlab-production-202006290000` to destination index `gitlab-production-202007070000`:

  ```shell
  curl -X POST -H 'Content-Type: application/json' -d '{"settings":{"index.number_of_shards": 60}}' "$CLUSTER_URL/gitlab-production-202006290000/_split/gitlab-production-202007070000?copy_settings=true"
  ```
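
  The split call returns quickly; the new index then recovers its shards in the background. One way to block until all primary shards of the target index are allocated is the cluster health API (a sketch):

  ```shell
  # Waits up to 30 minutes for the new index to reach at least yellow status
  curl "$CLUSTER_URL/_cluster/health/gitlab-production-202007070000?wait_for_status=yellow&timeout=30m&pretty"
  ```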
- [ ] Note the time when the task started: 2020-07-07 02:52 UTC
- [ ] Track the progress of the split using the Recovery API: `curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202007070000?v"`
- [ ] Note the time when the split finishes: 2020-07-07 04:25 UTC
- [ ] Note the total time taken to split: 1 hr 33 m
- [ ] Verify that the number of documents in the destination index equals the number of documents in the source index
  - Be aware it may take up to 60s for the destination index to refresh
  - `curl $CLUSTER_URL/gitlab-production-202006290000/_count` => 465932650
  - `curl $CLUSTER_URL/gitlab-production-202007070000/_count` => 465932650
- [ ] Force merge the new index to remove all deleted docs: `curl -XPOST $CLUSTER_URL/gitlab-production-202007070000/_forcemerge`
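
  A force merge on an index this large can run for a long time and the HTTP call may appear to hang; if the connection drops, the merge keeps running server-side. One way to check whether it is still in progress (a sketch using the tasks API):

  ```shell
  # Lists any force-merge tasks currently running on the cluster
  curl "$CLUSTER_URL/_tasks?actions=*forcemerge*&detailed=true&pretty"
  ```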
- [ ] Wait until you see disk usage drop quite a bit (eventually it should get down to a similar size as the original index): `curl $CLUSTER_URL/_cat/indices`. This will likely take quite some time, but it can wait: we can re-enable indexing now, and the shards will slowly shrink in size as the deleted docs are eventually cleaned up.
- [ ] Add a comment to this issue with the new shard sizes: `curl -s "$CLUSTER_URL/_cat/shards/gitlab-production-202007070000?v&s=store:desc&h=shard,prirep,docs,store,node"`
- [ ] Set recovery max bytes back to the default:

  ```shell
  curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  ```
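
  To confirm the persistent override is gone, the cluster settings can be read back (a sketch):

  ```shell
  # The persistent settings should no longer contain indices.recovery.max_bytes_per_sec
  curl "$CLUSTER_URL/_cluster/settings?pretty"
  ```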
- [ ] Note the size of the destination `gitlab-production-202007070000` index: 6.9 TB
- [ ] Update the alias `gitlab-production` to point to the new index:

  ```shell
  curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202006290000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
  ```

- [ ] Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
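
  The alias membership can also be checked directly (a sketch using the cat aliases API); it should list only the new index:

  ```shell
  curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"
  ```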
- [ ] Test that searching still works
- [ ] Unblock writes to the destination index:

  ```shell
  curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202007070000/_settings
  ```
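
  As with the earlier block step, the setting can be read back to confirm writes are allowed again (a sketch):

  ```shell
  # "index.blocks.write" should now be "false" (or absent)
  curl "$CLUSTER_URL/gitlab-production-202007070000/_settings?filter_path=*.settings.index.blocks&pretty"
  ```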
- [ ] Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- [ ] For consistency (and in case we reindex later), update the number of shards setting in the admin UI to 60 to match the new index: Admin > Settings > Integrations > Elasticsearch > Number of Elasticsearch shards
- [ ] Wait until the backlog of incremental updates gets below 10,000
  - Chart: Global search incremental indexing queue depth (https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1)
- [ ] Create a comment somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
- [ ] Delete the old `gitlab-production-202006290000` index:

  ```shell
  curl -XDELETE $CLUSTER_URL/gitlab-production-202006290000
  ```

- [ ] Test again that searches work as expected
- [ ] Scale the cluster down again based on the current size
## Monitoring

### Key metrics to observe
- GitLab admin panel:
- Grafana:
  - Platform triage
  - Sidekiq:
    - Sidekiq SLO dashboard overview
    - Sidekiq Queue Lengths per Queue - expected to climb during initial indexing, with a sudden drop-off once initial indexing jobs are finished
    - Sidekiq Inflight Operations by Queue
    - Node Maximum Single Core Utilization per Priority - expected to be 100% during initial indexing
  - Redis-sidekiq:
    - Redis-sidekiq SLO dashboard overview
    - Memory Saturation - if memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs)
      - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
- Incremental updates queue:
  - Chart: Global search incremental indexing queue depth (https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1)
  - From the Rails console: `Elastic::ProcessBookkeepingService.queue_size`
### Other metrics to observe
- Elastic support diagnostics: in the event of any issues (e.g. like last time, seeing too many threads in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again), we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
  - Rails:
  - Postgres:
    - patroni SLO dashboard overview
    - postgresql overview
    - pgbouncer SLO dashboard overview
    - pgbouncer overview
    - "Waiting Sidekiq pgbouncer Connections"
      - If we see this increase to, say, 500 and stay that way, then we should be concerned and disable indexing at that point.
  - Gitaly:
    - Gitaly SLO dashboard overview
    - Gitaly latency
    - Gitaly saturation overview
    - Gitaly single node saturation
      - If any nodes on this graph are maxed out for a long period of time correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that it will help, and then stop if it's clearly correlated.
## Rollback steps
- [ ] Unblock writes to the source index:

  ```shell
  curl -X PUT -H 'Content-Type: application/json' -d '{"settings":{"index.blocks.write":false}}' $CLUSTER_URL/gitlab-production-202006290000/_settings
  ```

- [ ] If you got past the step of updating the alias, simply switch the alias back to point to the original index:

  ```shell
  curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202006290000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
  ```

- [ ] Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
- [ ] Ensure any updates that only went to the destination index are replayed against the source index, by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using `ProcessBookkeepingService#track`
## Changes checklist

- [ ] Detailed steps and rollback steps have been filled in prior to commencing work
- [ ] Person on-call has been informed prior to the change being rolled out