[GitLab.com] Split shards for main index in Elasticsearch cluster
Production Change
Change Summary
Double the number of shards in the main index of the gprd-indexing-20220523 cluster. This is to improve performance as the average shard size is nearing 50GiB.
main index name: gitlab-production-20240624-1635-reindex-1000118-0
main index alias: gitlab-production
Related to gitlab-org/search-team/team-tasks#182 (closed)
We want to resize the gitlab-production-20240624-1635-reindex-1000118-0 index shard size using the Elasticsearch Split Index API. The index will be split from 300 to 600 shards.
From the Elasticsearch docs:

Indices can only be split if they satisfy the following requirements:

- The target index must not exist.
- The source index must have fewer primary shards than the target index.
- The number of primary shards in the target index must be a multiple of the number of primary shards in the source index.
- The node handling the split process must have sufficient free disk space to accommodate a second copy of the existing index.
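For orientation, a minimal sketch of the underlying Split Index API call that the Rails console steps below wrap. The endpoint is a placeholder; the actual change uses ::Gitlab::Elastic::Helper from a production console (see Change Steps).

```ruby
# Sketch only: illustrates the Split Index API shape for this change (300 -> 600 shards).
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200') # placeholder endpoint

# The source index must have writes blocked before it can be split.
client.indices.put_settings(
  index: 'gitlab-production-20240624-1635-reindex-1000118-0',
  body: { 'index.blocks.write' => true }
)

# The target shard count must be a multiple of the source shard count.
client.indices.split(
  index: 'gitlab-production-20240624-1635-reindex-1000118-0',
  target: 'gitlab-production-20240624-1635-reindex-1000118-0-split',
  body: { settings: { 'index.number_of_shards' => 600 } }
)
```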
Change Details
- Services Impacted - ServiceElasticsearch
- Change Technician - @terrichu
- Change Reviewer - @dgruzd
- Time tracking - 1440
Detailed steps for the change
Pre-Change Steps

- Check if there are any active high severity incidents
- Ping @sre-oncall in Slack to let them know about the change request
- Determine if the cluster needs to be scaled

  ```ruby
  elastic_helper = ::Gitlab::Elastic::Helper.default
  target_classes = [Repository]
  current_size = target_classes.sum do |klass|
    name = elastic_helper.klass_to_alias_name(klass: klass)
    elastic_helper.index_size_bytes(index_name: name)
  end
  expected_free_size = current_size * 2
  elastic_helper.cluster_free_size_bytes
  elastic_helper.cluster_free_size_bytes > expected_free_size # if true, the cluster does not need to scale
  ```
If scaling needs to occur, perform the following steps:

- In the Elastic Cloud UI, click Edit for the production deployment
- Take a screenshot of the existing and proposed settings and add it in an internal comment
Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 60 mins

- Run all the steps on staging, doubling the number of shards
- Make the cluster larger if necessary. It should have enough space to contain double the size of the main index.
Change Steps - steps to take to execute the change
- Set label change::in-progress: /label ~change::in-progress
- Add silences via https://alerts.gitlab.net/#/silences/new with a matcher on env and alert name for each pair:
  - env="gprd",alertname="SearchServiceElasticsearchIndexingTrafficAbsent"
  - env="gprd",alertname="gitlab_search_indexing_queue_backing_up"
  - env="gprd",alertname="SidekiqServiceGlobalSearchIndexingApdexSLOViolation"
  - env="gprd",alertname="SearchServiceGlobalSearchIndexingTrafficCessation"
- (optional) Scale the cluster using the settings determined above
  - In the Elastic Cloud UI, click Edit for the production deployment
  - Select the required specs and click Apply. Wait until the changes have applied successfully
  - Verify search operations:
    - Ensure Enable exact code search is disabled in your user preference settings
    - Search for code
    - Search for comments
- For each index: take a screenshot of the ES monitoring cluster index advanced metrics for the last 4 days and attach it to an internal comment on this issue
- For each index, find the current number of shards and attach it to a comment on this issue

  ```ruby
  elastic_helper = ::Gitlab::Elastic::Helper.default
  alias_name = elastic_helper.klass_to_alias_name(klass: Repository)
  current_shards = Elastic::IndexSetting.find_by(alias_name: alias_name).number_of_shards
  ```

- For each index, update the number of shards for the affected indices to the new number of shards

  ```ruby
  Elastic::IndexSetting.find_by(alias_name: alias_name).update!(number_of_shards: current_shards * 2)
  ```

- Pause indexing and wait at least 2 minutes for the queues to drain

  ```ruby
  ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
  ```
- Take a snapshot of the cluster
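  A minimal sketch of triggering the snapshot from the same console; the repository name found-snapshots is an assumption (the Elastic Cloud default) and should be verified before running:

  ```ruby
  # Sketch only: repository and snapshot names are assumptions, not taken from this issue.
  repository = 'found-snapshots' # default snapshot repository on Elastic Cloud deployments
  snapshot_name = "pre-split-#{Time.now.utc.strftime('%Y%m%d-%H%M')}"

  elastic_helper.client.snapshot.create(repository: repository, snapshot: snapshot_name)

  # Poll until the snapshot reports SUCCESS
  elastic_helper.client.snapshot.status(repository: repository, snapshot: snapshot_name)
  ```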
- Verify the queues are growing in the Global Search Sidekiq graphs via Grafana
- Note the size of the index in a private comment

  ```ruby
  elastic_helper.index_size_bytes(index_name: alias_name)
  ```

- Note the total number of documents in the index in a private comment

  ```ruby
  elastic_helper.documents_count(index_name: alias_name, refresh: true)
  ```

- Block writes to gitlab-production-20240624-1635-reindex-1000118-0:

  ```ruby
  updated_index_setting = { "index.blocks.write": true }
  elastic_helper.update_settings(settings: updated_index_setting, index_name: alias_name)
  elastic_helper.get_settings # validate the setting worked
  ```

- Increase recovery max bytes to speed up replication:

  ```ruby
  updated_cluster_setting = { "persistent": { "indices.recovery.max_bytes_per_sec": "200mb" } }
  elastic_helper.client.cluster.put_settings(body: updated_cluster_setting)
  elastic_helper.client.cluster.get_settings # validate the setting worked
  ```

- Trigger the split from the source index to the destination index

  ```ruby
  new_index_name = elastic_helper.index_name_with_timestamp(alias_name, suffix: '-split')
  old_index_name = elastic_helper.target_index_name(target: alias_name)
  new_index_settings = elastic_helper.get_settings(index_name: alias_name)
                                     .to_hash
                                     .merge('number_of_shards' => "#{current_shards * 2}")
                                     .except('creation_date', 'uuid', 'provided_name', 'version')
  elastic_helper.client.indices.split(index: old_index_name, target: new_index_name, body: { settings: { index: new_index_settings } })
  ```
- Note the time when the task started: 2024-11-07 11:30 UTC
- Track the progress of the split using the Recovery API

  ```ruby
  Hash.new(0).tap do |hsh|
    elastic_helper.client.indices.recovery[new_index_name]['shards'].each { |shard| hsh[shard['stage']] += 1 }
  end
  ```

- Note the time when the split finishes:
- Verify that the number of documents in the old index equals the new index

  ```ruby
  old_index_count = elastic_helper.documents_count(index_name: old_index_name, refresh: true)
  new_index_count = elastic_helper.documents_count(index_name: new_index_name, refresh: true)
  old_index_count == new_index_count # should be true
  ```
- Force merge the new index to remove all deleted docs:

  ```ruby
  elastic_helper.client.indices.forcemerge(index: new_index_name)
  ```

- Add a comment to the issue with the new shard sizes:

  ```ruby
  puts elastic_helper.client.cat.shards(index: new_index_name, h: "shard,prirep,docs,store,node", v: true)
  ```

- Set recovery max bytes back to the default

  ```ruby
  updated_cluster_setting = { "persistent": { "indices.recovery.max_bytes_per_sec": nil } }
  elastic_helper.client.cluster.put_settings(body: updated_cluster_setting)
  elastic_helper.client.cluster.get_settings # validate the setting worked
  ```

- Force expunge deletes

  ```ruby
  elastic_helper.client.indices.forcemerge(index: new_index_name, only_expunge_deletes: true)
  ```

- Record when this expunge deletes started:
- You can look at the Disk (GB) graph in the Elastic Cloud monitoring for the new index. You may see the disk storage shrink as deletes are cleared, and the disk usage may flatline
- Record when this expunge deletes finishes:
- Add a comment to this issue with the new shard sizes:

  ```ruby
  puts elastic_helper.client.cat.shards(index: new_index_name, h: "shard,prirep,docs,store,node", v: true)
  ```

- Note the size of the new index:

  ```ruby
  elastic_helper.index_size_bytes(index_name: new_index_name)
  ```

- Update the alias gitlab-production to point to the new index

  ```ruby
  elastic_helper.switch_alias(to: new_index_name, from: old_index_name, alias_name: alias_name)
  ```
- Test that searching still works.
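  One way to spot-check this from the console (a sketch; the query is an arbitrary match_all and only confirms that the alias resolves and returns hits):

  ```ruby
  # Sketch: query through the alias, which should now point at the split index.
  results = elastic_helper.client.search(index: alias_name, body: { query: { match_all: {} }, size: 1 })
  results['hits']['total'] # total hits; should be non-zero
  ```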
- Unblock writes to the new index:

  ```ruby
  updated_index_setting = { "index.blocks.write": false }
  elastic_helper.update_settings(settings: updated_index_setting, index_name: new_index_name)
  elastic_helper.get_settings # validate the setting worked
  ```

- Unpause indexing

  ```ruby
  ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
  ```
- Wait until the backlog of incremental updates gets below 10,000
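  The backlog can be checked from the console; a sketch, assuming the queue_size helpers on the GitLab EE bookkeeping services (not quoted from this issue):

  ```ruby
  # Sketch: incremental update backlog that needs to drop below 10,000 before cleanup.
  Elastic::ProcessBookkeepingService.queue_size        # incremental updates
  Elastic::ProcessInitialBookkeepingService.queue_size # initial indexing backlog
  ```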
- Manually delete the old indices. The newer indices will have a newer date in the suffix. Before deleting, confirm in the Elastic Cloud monitoring cluster that the index being removed is not receiving any search or indexing traffic

  ```ruby
  elastic_helper.client.indices.delete(index: old_index_name)
  ```

- Scale the cluster back down to the original settings (this requires pausing indexing again)
  - Pause indexing

    ```ruby
    ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
    ```

  - In the Elastic Cloud UI, click Edit for the production deployment
  - Select the original specs and click Apply. Wait until the changes have applied successfully
  - Verify search operations:
    - Ensure Enable exact code search is disabled in your user preference settings
    - Search for code
  - Unpause indexing

    ```ruby
    ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
    ```

  - Verify indexing:
    - Add code to a test project and verify it is searchable (this may take time depending on how backed up indexing is)
- Remove the alert silences
- Set label change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 60
If you've finished the whole process but want to revert for performance reasons:

- Create a new change request doing all of these steps again, but using the Shrink Index API to shrink the index back to 300 shards
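  A rough sketch of the reverse operation (a new change request would mirror the split steps above, including blocking writes and the monitoring steps; the target index name here is hypothetical):

  ```ruby
  # Sketch only: shrink the split index back to 300 primary shards.
  # Prerequisites (write block, relocating a copy of every shard to one node) are omitted.
  shrink_target = "#{new_index_name}-shrink" # hypothetical target index name

  elastic_helper.client.indices.shrink(
    index: new_index_name,
    target: shrink_target,
    body: { settings: { 'index.number_of_shards' => 300, 'index.blocks.write' => nil } }
  )
  ```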
If you've already updated the alias gitlab-production:

- Pause indexing (if it's not paused already)
- Switch the alias back to the original index
- Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using ProcessBookkeepingService#track, as well as any updates that went through Sidekiq workers such as ElasticDeleteProjectWorker (see the sketch after this list)
- Delete the newly created index

  ```ruby
  elastic_helper.client.indices.delete(index: new_index_name)
  ```
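The alias switch and replay steps above might look roughly like this once the affected records have been identified from the logs (a sketch; the record list is illustrative, switch_alias is the same helper used earlier, and the track! name follows the linked service):

```ruby
# Sketch only: the records to replay are hypothetical and must come from the indexing logs.
# Point the alias back at the original index (the "switch the alias back" step above).
elastic_helper.switch_alias(to: old_index_name, from: new_index_name, alias_name: alias_name)

# Re-queue incremental updates that were applied solely to the abandoned index.
records = [Project.find(123), Issue.find(456)] # illustrative
::Elastic::ProcessBookkeepingService.track!(*records)

# Deletes that originally went through ElasticDeleteProjectWorker would need to be
# re-enqueued for the affected projects as well, using that worker's perform arguments.
```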
If you have not switched the alias yet:

- Delete the newly created index

  ```ruby
  elastic_helper.client.indices.delete(index: new_index_name)
  ```

- Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Elasticsearch cluster health
- Location: https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/overview?_g=(cluster_uuid:nkvJVBhsSwWfoqyHIA_raQ,refreshInterval:(pause:!f,value:10000),time:(from:now-15m,to:now)))
- What changes to this metric should prompt a rollback: Unhealthy nodes/indices that do not recover
- Metric: Elasticsearch monitoring in Grafana
- Metric: Sidekiq Queues (Global Search) Indexing queues
- Location: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
- What changes to this metric should prompt a rollback: After unpausing, indexing is failing and the queues are constantly growing
Change Reviewer checklist
C3:
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - There are currently no active incidents that are severity::1 or severity::2.