Reindex GitLab.com global search cluster with alias
Production Change - Criticality 3 (C3)
| Change Objective | Reindex the GitLab.com global search index into a new index in the same Elasticsearch cluster and cut over to it with an alias |
|---|---|
| Change Type | ConfigurationChange | 
| Services Impacted | GitLab.com, Elasticsearch cluster | 
| Change Team Members | @DylanGriffith @aamarsanaa | 
| Change Criticality | C3 | 
| Change Reviewer or tested in staging | #1907 (comment 318876101) | 
| Dry-run output | |
| Due Date | 2020-04-07 XX:XX:00 UTC | 
| Time tracking | |
## Detailed steps for the change

### Pre-check

- Run all the steps on staging
### Process

- Let the SRE on-call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster which will re-index all of our production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be disabling sidekiq on the Elasticsearch sidekiq node, so the associated queues will grow and may cause alerts. This will increase load on the production cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1907
- Pause indexing writes (stop the Elasticsearch sidekiq node): `sudo gitlab-ctl stop sidekiq-cluster`
- Take a snapshot of the cluster
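  A minimal sketch of taking the snapshot via the snapshot API, assuming a snapshot repository is already registered on the cluster (the repository name `gitlab-backups` and the snapshot name below are illustrative placeholders, not the production values):

  ```shell
  # Placeholder repository/snapshot names - substitute the ones actually registered on the cluster
  curl -XPUT "$CLUSTER_URL/_snapshot/gitlab-backups/pre-reindex-20200407?wait_for_completion=false"

  # Check progress until "state" reports SUCCESS
  curl "$CLUSTER_URL/_snapshot/gitlab-backups/pre-reindex-20200407/_status"
  ```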
- Note the total size of the source `gitlab-production` index: 580.8 GB
 
- In any console, set `CLUSTER_URL` and confirm that it is the expected cluster with the expected indices:
  - `curl $CLUSTER_URL/_cat/indices`
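  A minimal sketch of setting the variable (the endpoint below is a placeholder, not the production URL; supply credentials however your console normally does):

  ```shell
  # Placeholder endpoint - substitute the real production Elasticsearch URL
  export CLUSTER_URL="https://user:password@elastic.example.com:9243"

  # Sanity check: list indices and confirm gitlab-production is present with the expected size
  curl "$CLUSTER_URL/_cat/indices?v"
  ```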
 
- Create index `gitlab-production-202004062333` with the correct settings/mappings:
  - Settings/mappings file: index-create.json
  - `curl -H 'Content-Type: application/json' -d @index-create.json -XPUT "$CLUSTER_URL/gitlab-production-202004062333?include_type_name=true"`
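  To double-check that the new index picked up the intended settings and mappings before starting the reindex, something like the following can be used (a verification sketch, not a step from the original plan):

  ```shell
  # Inspect the settings and mappings actually applied to the new index
  curl "$CLUSTER_URL/gitlab-production-202004062333/_settings?pretty"
  curl "$CLUSTER_URL/gitlab-production-202004062333/_mapping?pretty"
  ```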
 
- Set index settings on the destination index to optimize for writes:
  - `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"`
 
- Trigger the re-index from source index `gitlab-production` to destination index `gitlab-production-202004062333`:
  - `curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production" }, "dest": { "index": "gitlab-production-202004062333" } }' -X POST "$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false"`
 
- Note the returned task ID from the above: `5RluDddRTgq7jdqmfAYagg:4459919`
- Note the time when the task started: 2020-04-07 06:55:23 UTC
- Track the progress of reindexing using the Tasks API: `curl $CLUSTER_URL/_tasks/$TASK_ID`
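  A sketch for polling the task until it completes, assuming `jq` is available on the console host (the `TASK_ID` value is the one noted above):

  ```shell
  # Poll the reindex task every 60s and print created vs. total documents
  export TASK_ID="5RluDddRTgq7jdqmfAYagg:4459919"
  while true; do
    status=$(curl -s "$CLUSTER_URL/_tasks/$TASK_ID")
    echo "$status" | jq '{completed, created: .task.status.created, total: .task.status.total}'
    [ "$(echo "$status" | jq -r '.completed')" = "true" ] && break
    sleep 60
  done
  ```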
- Note the time when the task finishes: 2020-04-07 09:59:42 UTC
- Note the total time taken to reindex: 3.1 hrs
- Change the `refresh_interval` setting on the destination index to `60s`:
  - `curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"`
 
- Verify that the number of documents in the destination index `gitlab-production-202004062333` equals the number of documents in the source index `gitlab-production`. Be aware it may take up to 60s (the refresh interval) for the destination count to settle:
  - `curl $CLUSTER_URL/gitlab-production/_count` => 24765325
  - `curl $CLUSTER_URL/gitlab-production-202004062333/_count` => 24765325
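  If you would rather not wait for the refresh interval, a refresh can be forced before counting (a sketch, not a step from the original plan):

  ```shell
  # Force a refresh so the count reflects everything the reindex has written so far
  curl -XPOST "$CLUSTER_URL/gitlab-production-202004062333/_refresh"
  curl "$CLUSTER_URL/gitlab-production-202004062333/_count"
  ```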
 
- Force-merge the destination index: `curl -XPOST $CLUSTER_URL/gitlab-production-202004062333/_forcemerge`
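  The force-merge call blocks until it finishes and can take a while on an index of this size; segment counts can be watched from a second terminal (a sketch, assuming the default behaviour with no explicit `max_num_segments`):

  ```shell
  # Watch per-shard segment counts shrink while the force merge runs
  curl "$CLUSTER_URL/_cat/segments/gitlab-production-202004062333?v" | head -n 20
  ```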
- Increase replication on the destination index to 1:
  - `curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"`
 
- Increase the recovery max bytes per second so the new replicas allocate faster:
  - `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings`
 
- Wait for cluster monitoring to show that the replication has completed
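  The same check can be done from the console (a sketch): cluster health returns to green once all replicas are allocated, and `_cat/recovery` shows per-shard progress.

  ```shell
  # Block until the cluster reports green (all primaries and replicas allocated), or time out after 30 minutes
  curl "$CLUSTER_URL/_cluster/health?wait_for_status=green&timeout=30m"

  # Alternatively, inspect ongoing shard recoveries for the new index
  curl "$CLUSTER_URL/_cat/recovery/gitlab-production-202004062333?v&active_only=true"
  ```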
- Set recovery max bytes back to the default:
  - `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings`
 
- Set `translog.durability` back to `request` on the destination index now that the bulk reindex writes are done:
  - `curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"`
 
- Note the size of the destination index `gitlab-production-202004062333`: 591.5 GB
- Delete the `gitlab-production` index:
  - `curl -XDELETE $CLUSTER_URL/gitlab-production`
 
- Create an alias `gitlab-production` that points to `gitlab-production-202004062333`:
  - `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202004062333","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
- Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
 
- Enable search with Elasticsearch in GitLab > Admin > Settings > Integrations
- Test that phrase searching in issues works: https://gitlab.com/search?project_id=278964&repository_ref=master&scope=issues&search=%22slack+integration%22&snippets=
- Disable search with Elasticsearch in GitLab > Admin > Settings > Integrations to allow indexing to catch up
- Re-enable indexing writes (start the Elasticsearch sidekiq node): `sudo gitlab-ctl start sidekiq-cluster`
- Wait until the backlog of incremental updates gets below 1000 - see the chart "Global search incremental indexing queue depth" at https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- Wait for the `elastic_indexer` queue to drop below 10k
- Enable search with Elasticsearch in GitLab > Admin > Settings > Integrations
- Create a comment somewhere, then search for it to ensure indexing still works (it can take a minute to catch up)
  - Confirm it's caught up by checking the "Global search incremental indexing queue depth" chart, or the source of truth via the rails console: `Elastic::ProcessBookkeepingService.queue_size`
 
- Test that phrase searching in issues works: https://gitlab.com/search?project_id=278964&repository_ref=master&scope=issues&search=%22slack+integration%22&snippets=
### Rollback steps
- If you have gotten past deleting the initial `gitlab-production` index then you'll need to restore it from the snapshot
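  A minimal restore sketch, assuming the snapshot taken at the start of this change lives in a registered repository (the repository and snapshot names below are placeholders, not the production values):

  ```shell
  # Restore only the gitlab-production index from the pre-reindex snapshot
  curl -H 'Content-Type: application/json' -XPOST \
    "$CLUSTER_URL/_snapshot/gitlab-backups/pre-reindex-20200407/_restore" \
    -d '{"indices": "gitlab-production"}'
  ```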
- Ensure any updates that only went to Destination Cluster are replayed against Source Cluster by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using `ProcessBookkeepingService#track`
## Monitoring

### Key metrics to observe
- Gitlab admin panel:
- Grafana:
- Platform triage
- Sidekiq:
  - Sidekiq SLO dashboard overview
  - Sidekiq Queue Lengths per Queue - expected to climb during initial indexing, with a sudden drop-off once initial indexing jobs are finished
  - Sidekiq Inflight Operations by Queue
  - Node Maximum Single Core Utilization per Priority - expected to be 100% during initial indexing
- Redis-sidekiq:
  - Redis-sidekiq SLO dashboard overview
  - Memory Saturation - if memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs)
    - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped
 
- Incremental updates queue:
  - Chart "Global search incremental indexing queue depth" at https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - From the rails console: `Elastic::ProcessBookkeepingService.queue_size`
### Other metrics to observe
- Elastic support diagnostics: in the event of any issues (e.g. like last time, seeing too many threads in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again) we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
- Rails:
- Postgres:
  - patroni SLO dashboard overview
  - postgresql overview
  - pgbouncer SLO dashboard overview
  - pgbouncer overview
  - "Waiting Sidekiq pgbouncer Connections"
    - If we see this increase to, say, 500 and stay that way then we should be concerned and disable indexing at that point
- Gitaly:
  - Gitaly SLO dashboard overview
  - Gitaly latency
  - Gitaly saturation overview
  - Gitaly single node saturation
    - If any nodes on this graph are maxed out for a long period of time correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that it will help and then stop if it's clearly correlated.
### Rollback steps
- If we got past the step "Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster" then:
  - Switch GitLab settings to point back to Source Cluster in GitLab > Admin > Settings > Integrations
- If we got past the step "Re-enable indexing writes":
  - Ensure any updates that only went to Destination Cluster are replayed against Source Cluster by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using `ProcessBookkeepingService#track`
- If we stopped before "Re-enable indexing writes (start Elasticsearch sidekiq node) `sudo gitlab-ctl start sidekiq-cluster`" then all you need to do to roll back is:
  - "Re-enable indexing writes (start Elasticsearch sidekiq node) `sudo gitlab-ctl start sidekiq-cluster`"
## Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out