Reindex GitLab.com Global Search Elasticsearch cluster with new regex pattern
C3
Production Change - Criticality 3

| Change Objective | Describe the objective of the change |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | GitLab.com, Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq |
| Change Team Members | @DylanGriffith |
| Change Criticality | C3 |
| Change Reviewer or tested in staging | #2408 (comment 384849906) |
| Dry-run output | |
| Due Date | 2020-07-27 04:02:36 UTC |
| Time tracking | |
Detailed steps for the change
This is to roll out the indexing changes in gitlab-org/gitlab!36255 (merged)
Pre-check
- Practice dry-runs on production with various reindexing settings until we can get the time down to less than 24 hrs
- Run all the steps on staging
- Make the cluster larger if necessary. It should be less than 33% full (more than 67% free)
Process
- Confirm the cluster storage is less than 33% full (more than 67% free) -- 63% free will be fine considering we ended up using less storage when we re-indexed staging
- Let the SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster, which will re-index all of our production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be pausing indexing to the cluster, which will cause the incremental updates queue to grow but should not cause alerts since we don't have any for that queue. This will increase load on the Elasticsearch cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2408
- Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
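If a UI-free alternative is ever needed, the same setting could also be flipped through the application settings API — a sketch only, assuming `elasticsearch_pause_indexing` is exposed there and that `$GITLAB_ADMIN_TOKEN` (a hypothetical variable) holds an admin personal access token:

```shell
# Sketch (not part of the runbook steps): pause Elasticsearch indexing via the settings API.
# $GITLAB_ADMIN_TOKEN is assumed to hold an admin personal access token.
curl --request PUT --header "PRIVATE-TOKEN: $GITLAB_ADMIN_TOKEN" \
  "https://gitlab.com/api/v4/application/settings?elasticsearch_pause_indexing=true"
```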
- In any console set `CLUSTER_URL` and confirm that it is the expected cluster with the expected indices: `curl $CLUSTER_URL/_cat/indices`
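For example (the endpoint and credentials below are placeholders, not the real production values):

```shell
# Placeholder endpoint/credentials -- substitute the real production cluster values.
export CLUSTER_URL="https://elastic:PASSWORD@example-cluster.es.example.com:9243"

# The gitlab-production-* indices should be listed in the output.
curl "$CLUSTER_URL/_cat/indices?v"
```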
- Note the total size of the source `gitlab-production-202007070000` index: 5.3 TB
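The size and document count can be read from the `_cat/indices` API, for example:

```shell
# Report document count and on-disk size for the source index.
curl "$CLUSTER_URL/_cat/indices/gitlab-production-202007070000?v&h=index,docs.count,store.size"
```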
- Note the total number of documents in the source `gitlab-production-202007070000` index: 661083549 (`curl $CLUSTER_URL/gitlab-production-202007070000/_count`)
- Create the new destination index `gitlab-production-202007270000` with the correct settings/mappings from the rails console: `Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202007270000' })`
- Set index settings in the destination index to optimize for writes: `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"`
- Trigger the re-index from source index `gitlab-production-202007070000` to destination index `gitlab-production-202007270000`: `curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007070000" }, "dest": { "index": "gitlab-production-202007270000" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"`
- Note the returned task ID from the above: JnbbPKxbRpqoWDmCj6fhcw:1169025
- Note the time when the task started: 2020-07-27 04:16:28 UTC

-------------------------------------------------------------------

- PER #2408 (comment 386406235) this failed so I'm restarting the reindexing with the following steps
- Unpause indexing
- Delete the new `gitlab-production-202007270000` index
- Wait until queues catch up

-------------------------------------------------------------------

- Pause indexing
- Note the total size of the source `gitlab-production-202007070000` index: 5.3 TB
- Note the total number of documents in the source `gitlab-production-202007070000` index: 662660754 (`curl $CLUSTER_URL/gitlab-production-202007070000/_count`)
- Create the new destination index `gitlab-production-202007270000` with the correct settings/mappings from the rails console: `Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202007270000' })`
- Set index settings in the destination index to optimize for writes: `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"`
- Trigger the re-index from source index `gitlab-production-202007070000` to destination index `gitlab-production-202007270000`: `curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007070000" }, "dest": { "index": "gitlab-production-202007270000" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"`
- Note the returned task ID from the above: Gyp_pWWaShSBKGhC1OiT8w:4184769
- Note the time when the task started: 2020-07-28 06:34:51 UTC

-------------------------------------------------------------------

- PER #2408 (comment 387128571) this failed so I'm restarting the reindexing with the following steps
- Delete the new `gitlab-production-202007270000` index

-------------------------------------------------------------------

- Note the total size of the source `gitlab-production-202007070000` index: 5.3 TB
- Note the total number of documents in the source `gitlab-production-202007070000` index: 662690754 (`curl $CLUSTER_URL/gitlab-production-202007070000/_count`)
- Create the new destination index `gitlab-production-202007270000` with the correct settings/mappings from the rails console: `Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202007270000' })`
- Set index settings in the destination index to optimize for writes: `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"`
- Trigger the re-index from source index `gitlab-production-202007070000` to destination index `gitlab-production-202007270000`: `curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007070000" }, "dest": { "index": "gitlab-production-202007270000" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"`
- Note the returned task ID from the above: Gyp_pWWaShSBKGhC1OiT8w:10843499
- Note the time when the task started: 2020-07-29 00:36:50 UTC
- Track the progress of reindexing using the Tasks API: `curl $CLUSTER_URL/_tasks/$TASK_ID`
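To turn the task status into a rough progress percentage (a sketch assuming `jq` is available; `created`, `updated`, `deleted` and `total` are the fields a reindex task reports in its status):

```shell
# Approximate progress: documents processed so far vs. total documents to copy.
curl -s "$CLUSTER_URL/_tasks/$TASK_ID" |
  jq '.task.status | ((.created + .updated + .deleted) / .total * 100)'
```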
- Note the time when the task finishes: read #2408 (comment 387249996), as this failed and was broken up into multiple manual slices multiple times
- Note the total time taken to reindex: read #2408 (comment 387249996), as this failed and was broken up into multiple manual slices multiple times
- Change the `refresh_interval` setting on the destination index to `60s`: `curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"`
- Verify the number of documents in the destination index equals the number of documents in the source index
  - Be aware it may take 60s for the destination index to refresh
  - `curl $CLUSTER_URL/gitlab-production-202007070000/_count` => 662690754
  - `curl $CLUSTER_URL/gitlab-production-202007270000/_count` => 662690754
- Force merge the index to speed up replication: `curl -XPOST $CLUSTER_URL/gitlab-production-202007270000/_forcemerge`
- Expunge the many deleted docs: `curl -XPOST $CLUSTER_URL/gitlab-production-202007070000/_forcemerge?only_expunge_deletes=true`
- Increase replication on the destination index to 1: `curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"`
- Increase recovery max bytes to speed up replication: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "400mb"}}' -XPUT $CLUSTER_URL/_cluster/settings`
- Wait for cluster monitoring to show the replication has completed (this can also be confirmed from the command line, as sketched below)
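A command-line sketch for checking that replication has caught up (these are standard cluster health and cat shards endpoints, not steps from the original plan):

```shell
# Cluster health should be green with no initializing or relocating shards.
curl "$CLUSTER_URL/_cluster/health?pretty"

# All shards of the destination index (primaries and replicas) should be STARTED.
curl "$CLUSTER_URL/_cat/shards/gitlab-production-202007270000?v&h=index,shard,prirep,state"
```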
- Set recovery max bytes back to the default: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings`
- Set `translog` durability on the destination index back to the default of `request`: `curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"`
- Note the size of the destination index `gitlab-production-202007270000`: 6.3 TB
  - => Note that due to retrying multiple times the index still contains many deleted docs that will be cleaned up over time. There are currently 71459720 deleted docs, which probably accounts for the large size increase.
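The deleted-document count behind that number can be read from `_cat/indices`, for example:

```shell
# docs.deleted shows how many deleted documents are still held in the index segments.
curl "$CLUSTER_URL/_cat/indices/gitlab-production-202007270000?v&h=index,docs.count,docs.deleted,store.size"
```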
- Update the alias `gitlab-production` to point to the new index: `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
- Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
- Test that searching for the new pattern works using https://gitlab.com/search?utf8=%E2%9C%93&snippets=false&scope=&repository_ref=master&search=hook_to_event&group_id=9970&project_id=278964 which should now find a match for https://gitlab.com/gitlab-org/gitlab/-/blob/c086bd8f75537a7f73fc386f2053962b389dc990/app/services/web_hook_service.rb#L20
- Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
- Wait until the backlog of incremental updates gets below 10,000
  - Chart: Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- Create a comment somewhere then search for it to ensure indexing still works (can take up to 2 minutes before it shows up in the search results)
- Delete the old `gitlab-production-202007070000` index: `curl -XDELETE $CLUSTER_URL/gitlab-production-202007070000`
- Test again that searches work as expected
- Scale the cluster down again based on the current size => Leaving scaled up for a few days to get a baseline to see if extra capacity helps performance
Monitoring
Key metrics to observe
- Gitlab admin panel:
- Grafana:
  - Platform triage
  - Sidekiq:
    - Sidekiq SLO dashboard overview
    - Sidekiq Queue Lengths per Queue - expected to climb during initial indexing and a sudden drop-off once initial indexing jobs are finished
    - Sidekiq Inflight Operations by Queue
    - Node Maximum Single Core Utilization per Priority - expected to be 100% during initial indexing
  - Redis-sidekiq:
    - Redis-sidekiq SLO dashboard overview
    - Memory Saturation
      - If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
      - If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
- Incremental updates queue:
  - Chart: Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  - From rails console: `Elastic::ProcessBookkeepingService.queue_size`
Other metrics to observe
- Elastic support diagnostics: In the event of any issues (e.g. like last time, seeing too many threads in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again) we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
- Grafana:
  - Rails:
  - Postgres:
    - patroni SLO dashboard overview
    - postgresql overview
    - pgbouncer SLO dashboard overview
    - pgbouncer overview
    - "Waiting Sidekiq pgbouncer Connections"
      - If we see this increase to, say, 500 and stay that way then we should be concerned and disable indexing at that point
  - Gitaly:
    - Gitaly SLO dashboard overview
    - Gitaly latency
    - Gitaly saturation overview
    - Gitaly single node saturation
      - If any nodes on this graph are maxed out for a long period of time correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that it will help and then stop if it's clearly correlated.
Rollback steps
- If the ongoing reindex is consuming too many resources it is possible to throttle the running reindex:
  - You can check the index write throughput in ES monitoring to determine a sensible throttle. Since it defaults to no throttling at all, it's safe to just set some throttle and observe the impact.
  - `curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"`
- If you get past the step of updating the alias then simply switch the alias to point back to the original index: `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
- Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
- Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using `ProcessBookkeepingService#track`
Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to the change being rolled out