Reindex GitLab.com Global Search Elasticsearch cluster main index
Production Change
Change Summary
We have a list of changes we want to apply to GitLab.com main Advanced Search index:
- gitlab-org/gitlab#349099 (closed) (gitlab-org/gitlab!77226 (merged))
- gitlab-org/gitlab#346914 (closed) (gitlab-org/gitlab!96785 (merged))
- gitlab-org/gitlab#371988 (closed)
This can be done by reindexing the index.
Change Details
- Services Impacted - Elasticsearch global search
- Change Technician - @dgruzd (EMEA) @john-mason (AMER)
- Change Reviewer - @terrichu
- Time tracking - 2880m
- Downtime Component - No downtime, but Advanced Search indexing will be paused for the duration of reindexing
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 30m
- Run all the steps on staging
- Set label `change::in-progress` on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 2880m
- Add a silence via https://alerts.gitlab.net/#/silences/new with matchers on `env="gprd"` and the alert names `alertname="SearchServiceElasticsearchIndexingTrafficAbsent"`, `alertname="gitlab_search_indexing_queue_backing_up"`, and `alertname="SidekiqServiceGlobalSearchIndexingApdexSLOViolation"`. Link the comment field back to this Change Request issue. => https://alerts.gitlab.net/#/silences/c69d2b4e-72e4-44b8-b504-baa52d7fd0d9
- Let the SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of one of our production Elasticsearch cluster indices, which will re-index all of our main index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be pausing indexing to the cluster, which will cause the incremental updates queue to grow. We have added a silence for the SearchServiceElasticsearchIndexingTrafficAbsent alert. This will increase load on the Elasticsearch cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6116
- Pause indexing writes: `ApplicationSetting.current.update!(elasticsearch_pause_indexing: true)` => https://gitlab.slack.com/archives/C101F3796/p1666609373933799?thread_ts=1666608961.045649&cid=C101F3796
- In any console, set `CLUSTER_URL` and confirm that it is the expected cluster with the expected indices: `curl $CLUSTER_URL/_cat/indices`
- Note the total size of the source `gitlab-production-202012160000` index: 6.85 TB
- Ensure there's enough capacity for the copy of the `gitlab-production` index. Note free space for the cluster: 43311989678080 bytes free (43.3 TB)
  - If there isn't enough capacity, add data nodes to the cluster using the Elastic Cloud console and wait until resizing is complete
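The headroom check above can be scripted rather than eyeballed. This is a hedged sketch: the `check_capacity` helper and the 20% safety margin are illustrative assumptions, not part of the runbook; the byte counts would come from `curl "$CLUSTER_URL/_cat/indices?bytes=b"` and `curl "$CLUSTER_URL/_cat/allocation?bytes=b"`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify free disk can hold a full copy of the
# source index plus a safety margin (replicas are 0 during the reindex).
check_capacity() {
  local src_bytes=$1 free_bytes=$2 margin_pct=${3:-20}
  local needed=$(( src_bytes + src_bytes * margin_pct / 100 ))
  if [ "$free_bytes" -ge "$needed" ]; then
    echo "OK: $free_bytes bytes free >= $needed bytes needed"
  else
    echo "INSUFFICIENT: $needed bytes needed, only $free_bytes free"
    return 1
  fi
}

# Figures from this change: ~6.85 TB source index, 43.3 TB free
check_capacity 7530000000000 43311989678080
```

With this change's figures the helper reports ample headroom; a non-zero exit would be the cue to add data nodes via the Elastic Cloud console first.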
- Take a screenshot of the index advanced metrics for the last 30 days and last 7 days and attach it to a comment on this issue
- Note the total number of documents in the source `gitlab-production-202012160000` index: `curl $CLUSTER_URL/gitlab-production-202012160000/_count` => `{"count":826555716,"_shards":{"total":120,"successful":120,"skipped":0,"failed":0}}`
- Create the new destination index with the correct settings/mappings from a Rails console: `Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false)`
- Replace the `gitlab-production-NEW_INDEX_SUFFIX` index names below with the newly created index name
- Confirm the newly created index has the new settings and mappings: `curl $CLUSTER_URL/gitlab-production-20221024-1119/_settings`
- Set index settings on the destination index to optimize for writes: `curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20221024-1119/_settings"`
- Increase recovery max bytes to speed up replication: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "400mb"}}' -XPUT $CLUSTER_URL/_cluster/settings`
- Trigger the re-index from source index `gitlab-production-202012160000` to destination index `gitlab-production-20221024-1119`: `curl -H 'Content-Type: application/json' -d '{ "conflicts": "proceed", "source": { "index": "gitlab-production-202012160000" }, "dest": { "index": "gitlab-production-20221024-1119" } }' -X POST "$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false&timeout=72h"`
- Note the returned task ID from the above: `kbL5gZn-RKi2F3k_IaHQzA:3540867`
- Note the time the task started: 2022-10-24 12:04 UTC
- Track the progress of reindexing using the Tasks API: `curl $CLUSTER_URL/_tasks/$TASK_ID`
  - If failures happen only in some slices, it's possible to retry only those slices, following the steps used last time
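To turn the Tasks API response into a progress figure, something like the following can help. This is a hedged sketch: `reindex_progress` is an illustrative name, and the counters come from the `status.total`, `status.created`, `status.updated`, and `status.deleted` fields of the `curl $CLUSTER_URL/_tasks/$TASK_ID` response.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compute percent complete from the reindex task's
# status counters (total vs. created + updated + deleted).
reindex_progress() {
  local total=$1 created=$2 updated=$3 deleted=$4
  local processed=$(( created + updated + deleted ))
  echo "$(( processed * 100 / total ))% ($processed of $total docs)"
}

# Example: halfway through this change's 826,555,716 documents
reindex_progress 826555716 413277858 0 0   # prints "50% (413277858 of 826555716 docs)"
```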
- Note the time the task finished: 2022-10-26 23:25 UTC
- Note the total time taken to reindex: 3561 minutes (~60 hours)
- Change the `refresh_interval` and number of replicas settings on the destination index back to defaults: `curl -XPUT -d '{"index":{"number_of_replicas":"1", "refresh_interval": null}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20221024-1119/_settings"`
- Verify that the number of documents in the destination index equals the number of documents in the source index
  - Be aware it may take 60s for the refresh on the destination index
  - `curl $CLUSTER_URL/gitlab-production-202012160000/_count` => `{"count":826555716,"_shards":{"total":120,"successful":120,"skipped":0,"failed":0}}`
  - `curl $CLUSTER_URL/gitlab-production-20221024-1119/_count` => `{"count":826555716,"_shards":{"total":200,"successful":200,"skipped":0,"failed":0}}`
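The document count comparison can also be scripted so a mismatch fails loudly. This is a hedged sketch: `counts_match` is an illustrative name; in practice the two numbers come from the `_count` calls above, e.g. piped through `jq .count`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare source and destination document counts
# and exit non-zero on any difference.
counts_match() {
  local src=$1 dst=$2
  if [ "$src" -eq "$dst" ]; then
    echo "MATCH: $src documents in both indices"
  else
    echo "MISMATCH: source=$src destination=$dst (diff=$(( src - dst )))"
    return 1
  fi
}

# Counts recorded in this change
counts_match 826555716 826555716   # prints "MATCH: 826555716 documents in both indices"
```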
- Wait for cluster monitoring to show that replication has completed
- Set recovery max bytes back to the default: `curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings`
- Set `translog` durability on the destination index back to the default `request`: `curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20221024-1119/_settings"`
- Note the size of the destination index `gitlab-production-20221024-1119`: 5.1 TB
- Update the alias `gitlab-production` to point to the new index: `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-20221024-1119","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202012160000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
- Confirm it works: `curl $CLUSTER_URL/gitlab-production/_settings`
- Verify that there are no code search regressions
- Unpause indexing writes: `ApplicationSetting.current.update!(elasticsearch_pause_indexing: false)` https://gitlab.slack.com/archives/C101F3796/p1666828526346619
- Wait until the backlog of incremental updates gets below 10,000
  - Chart: Global search incremental indexing queue depth, https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- Create a file somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
- Remove the alert silences: https://alerts.gitlab.net/#/silences/c69d2b4e-72e4-44b8-b504-baa52d7fd0d9, https://alerts.gitlab.net/#/silences/6df29b7f-c2b2-416e-896f-554e5a622f00
- Delete the previous index `gitlab-production-202012160000` when it is safe to do so
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 1m
- Set label `change::complete`: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 60
- If the ongoing reindex is consuming too many resources, it is possible to throttle the running reindex:
  - You can check the index write throughput in ES monitoring to determine a sensible throttle. Since reindex defaults to no throttling at all, it's safe to just set some throttle and observe the impact: `curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"`
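One way to pick a `requests_per_second` value is to throttle relative to the indexing rate observed in ES monitoring. This is a hedged sketch: the `throttle_for` helper, the example rate, and the 50% target are illustrative assumptions, not runbook values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a requests_per_second throttle as a
# percentage of the currently observed reindex rate (docs/sec).
throttle_for() {
  local observed_docs_per_sec=$1 target_pct=${2:-50}
  echo $(( observed_docs_per_sec * target_pct / 100 ))
}

# Example: observed ~1000 docs/sec, throttle to roughly half of that
rps=$(throttle_for 1000 50)
echo "requests_per_second=$rps"
```

Per the Elasticsearch documentation, a rethrottle that speeds the task up takes effect immediately, while one that slows it down applies after the current batch completes, so the value can be adjusted again after observing the impact.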
- If you got past the step of updating the alias, simply switch the alias to point back to the original index:
  - `curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202012160000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-20221024-1119","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases`
  - Confirm it works: `curl $CLUSTER_URL/gitlab-production/_count`
  - Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using `ProcessBookkeepingService#track`, as well as any updates that went through the Sidekiq workers `ElasticCommitIndexerWorker` and `ElasticDeleteProjectWorker`.
Monitoring
Key metrics to observe
- Metric: Elasticsearch cluster health
- Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:HdF5sKvcT5WQHHyYR_EDcw)
- What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
- Metric: Elasticsearch monitoring in Grafana
- Metric: Indexing queues
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: After unpausing the indexing is failing and the queues are constantly growing
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. `change::unscheduled`, `change::scheduled`) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.
Edited by Dmitry Gruzd