[GitLab.com] Split shards for main index in Elasticsearch cluster

Production Change

Change Summary

Double the number of shards in the main index of the gprd-indexing-20220523 cluster. This is to improve performance as the average shard size is nearing 50GiB.

main index name: gitlab-production-20240624-1635-reindex-1000118-0
main index alias: gitlab-production

Related to gitlab-org/search-team/team-tasks#182 (closed)

We want to reduce the per-shard size of the gitlab-production-20240624-1635-reindex-1000118-0 index using the Elasticsearch Split Index API. The index will be split from 300 to 600 primary shards.

From the docs

Indices can only be split if they satisfy the following requirements:

  • The target index must not exist
  • The source index must have fewer primary shards than the target index.
  • The number of primary shards in the target index must be a multiple of the number of primary shards in the source index.
  • The node handling the split process must have sufficient free disk space to accommodate a second copy of the existing index.
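
A quick sanity check of these requirements from a Rails console is sketched below. It reuses the elastic_helper calls from this plan; the -split target name is only illustrative, since the real target index name is generated during the change.

    elastic_helper = ::Gitlab::Elastic::Helper.default

    source_index = 'gitlab-production-20240624-1635-reindex-1000118-0'
    target_index = "#{source_index}-split" # illustrative name only

    # The target index must not exist yet
    elastic_helper.client.indices.exists?(index: target_index) # expect false

    # 600 target primaries is a multiple of (and larger than) the source's 300
    source_shards = elastic_helper.get_settings(index_name: source_index).to_hash['number_of_shards'].to_i
    (source_shards * 2) % source_shards == 0 # expect true

    # The cluster needs enough free disk for a second copy of the index
    elastic_helper.cluster_free_size_bytes > elastic_helper.index_size_bytes(index_name: source_index) * 2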

Change Details

  1. Services Impacted - Service::Elasticsearch
  2. Change Technician - @terrichu
  3. Change Reviewer - @dgruzd
  4. Time tracking - 1440

Detailed steps for the change

Pre-Change Steps

  • Check if there are any active high severity incidents
  • Ping @sre-oncall in Slack to let them know about the change request
  • Determine if cluster needs to be scaled
      elastic_helper = ::Gitlab::Elastic::Helper.default
      target_classes = [Repository]
      
      current_size = target_classes.sum do |klass|
        name = elastic_helper.klass_to_alias_name(klass: klass)
        elastic_helper.index_size_bytes(index_name: name)
      end
      
      expected_free_size = current_size * 2
      elastic_helper.cluster_free_size_bytes # note the current free size
      elastic_helper.cluster_free_size_bytes > expected_free_size # if true, the cluster does not need to be scaled

If scaling needs to occur, perform the following steps:

  • In the Elastic Cloud UI, click Edit for the production deployment
  • Take a screenshot of the existing and proposed settings and add in an internal comment

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 60 mins

  1. Run all the steps on staging, doubling the number of shards there
  2. Make the cluster larger if necessary. It should have enough space to contain double the size of the main index.

Change Steps - steps to take to execute the change


  • Add silences via https://alerts.gitlab.net/#/silences/new with a matcher on env and alert name for each pair:
    • env="gprd", alertname="SearchServiceElasticsearchIndexingTrafficAbsent"
    • env="gprd", alertname="gitlab_search_indexing_queue_backing_up"
    • env="gprd", alertname="SidekiqServiceGlobalSearchIndexingApdexSLOViolation"
    • env="gprd", alertname="SearchServiceGlobalSearchIndexingTrafficCessation"
  • (optional) Scale the cluster using the settings determined above
    • In the Elastic Cloud UI, click Edit for the production deployment
    • Select the required specs and click Apply. Wait until the changes have applied successfully
    • Verify search operations:
      • Ensure Enable exact code search is disabled in your user preference setting
      • Search for code
  • For each index: Take a screenshot of the ES monitoring cluster's index advanced metrics for the last 4 days and attach it to an internal comment on this issue
  • For each index, find the current number of shards and attach to a comment on this issue
     elastic_helper = ::Gitlab::Elastic::Helper.default
     alias_name = elastic_helper.klass_to_alias_name(klass: Repository)
     current_shards = Elastic::IndexSetting.find_by(alias_name: alias_name).number_of_shards
  • For each index, update the number of shards for the affected indices to the new number of shards (double the current value)
     Elastic::IndexSetting.find_by(alias_name: alias_name).update!(number_of_shards: current_shards * 2)
  • Pause indexing, wait at least 2 minutes for the queues to drain
      ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
  • Take a snapshot of the cluster
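    If taking the snapshot from the console rather than the Elastic Cloud UI, one approach is sketched below; found-snapshots is Elastic Cloud's default repository name and the snapshot name is only an example.
      snapshot_name = "pre-split-#{Time.now.utc.strftime('%Y%m%d%H%M')}" # example name
      elastic_helper.client.snapshot.create(repository: 'found-snapshots', snapshot: snapshot_name, wait_for_completion: false)
      elastic_helper.client.snapshot.status(repository: 'found-snapshots', snapshot: snapshot_name) # poll until the state is SUCCESS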
  • Verify the queues are growing in the Global Search Sidekiq graphs via Grafana
  • Note size of the index in a private comment
      elastic_helper.index_size_bytes(index_name: alias_name)
  • Note total number of documents of the index in a private comment
      elastic_helper.documents_count(index_name: alias_name, refresh: true)
  • Block writes to gitlab-production-20240624-1635-reindex-1000118-0:
     updated_index_setting = { "index.blocks.write": true }
     elastic_helper.update_settings(settings: updated_index_setting, index_name: alias_name)
     elastic_helper.get_settings # validate the setting worked
  • Increase recovery max bytes to speed up replication:
     updated_cluster_setting = {"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}
     elastic_helper.client.cluster.put_settings(body: updated_cluster_setting)
     elastic_helper.client.cluster.get_settings # validate the setting worked
  • Trigger split from source index to destination index
      new_index_name = elastic_helper.index_name_with_timestamp(alias_name, suffix: '-split')
      old_index_name = elastic_helper.target_index_name(target: alias_name)
      new_index_settings = elastic_helper.get_settings(index_name: alias_name)
        .to_hash
        .merge('number_of_shards' => "#{current_shards * 2}")
        .except('creation_date', 'uuid', 'provided_name', 'version')
      elastic_helper.client.indices.split(index: old_index_name, target: new_index_name, body: { settings: { index: new_index_settings } } )
  • Note time when the task started: 2024-11-07 11:30 UTC
  • Track the progress of splitting using the Recovery API
     # Tally of shards by recovery stage; the split is complete when all shards report DONE
     Hash.new(0).tap { |hsh| elastic_helper.client.indices.recovery[new_index_name]['shards'].each { |shard| hsh[shard['stage']] += 1 } }
  • Note the time when the split finishes:
  • Verify number of documents in the old index equals new index
     old_index_count = elastic_helper.documents_count(index_name: old_index_name, refresh: true)
     new_index_count = elastic_helper.documents_count(index_name: new_index_name, refresh: true)
     old_index_count == new_index_count # should be true before proceeding
  • Force merge the new index to remove all deleted docs:
      elastic_helper.client.indices.forcemerge(index: new_index_name)
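    Force merge requests can run for a long time and the HTTP call may time out before they finish; one way to check whether a force merge (or the expunge-deletes run later in this plan) is still running is the tasks API:
      elastic_helper.client.tasks.list(actions: '*forcemerge*', detailed: true)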
  • Add a comment to the issue with the new shard sizes:
     puts elastic_helper.client.cat.shards(index: new_index_name, h: "shard,prirep,docs,store,node", v: true)
  • Set recovery max bytes back to default
     updated_cluster_setting = {"persistent":{"indices.recovery.max_bytes_per_sec": nil}}
     elastic_helper.client.cluster.put_settings(body: updated_cluster_setting)
     elastic_helper.client.cluster.get_settings # validate the setting worked
  • Force expunge deletes
      elastic_helper.client.indices.forcemerge(index: new_index_name, only_expunge_deletes: true)
  • Record when this expunge deletes started:
  • You can look at the Disk (GB) graph in the ElasticCloud monitoring for the new index. You may see the disk storage shrink as deletes are cleared and the disk usage may flatline
  • Record when this expunge deletes finishes:
  • Add a comment to this issue with the new shard sizes:
     puts elastic_helper.client.cat.shards(index: new_index_name, h: "shard,prirep,docs,store,node", v: true)
  • Note the size of the new index:
     elastic_helper.index_size_bytes(index_name: new_index_name)
  • Update the alias gitlab-production to point to the new index
     elastic_helper.switch_alias(to: new_index_name, from: old_index_name, alias_name: alias_name)
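    To confirm the switch took effect, the aliases API can be read back (this only inspects what switch_alias just changed):
      elastic_helper.client.indices.get_alias(name: alias_name) # should now list only new_index_name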
  • Test that searching still works.
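    In addition to UI searches, a minimal console-level query against the alias can confirm the new index is serving search traffic (the match_all query is only an example):
      elastic_helper.client.search(index: alias_name, body: { query: { match_all: {} }, size: 1 })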
  • Unblock writes to the new index:
      updated_index_setting = { "index.blocks.write": false }
      elastic_helper.update_settings(settings: updated_index_setting, index_name: new_index_name)
      elastic_helper.get_settings # validate the setting worked
  • Unpause indexing
      ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
  • Wait until the backlog of incremental updates gets below 10,000
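    The backlog can be watched from the console via the bookkeeping queue sizes; these class names are assumed to match the advanced search bookkeeping services and should be double-checked before relying on them.
      Elastic::ProcessBookkeepingService.queue_size        # incremental updates backlog
      Elastic::ProcessInitialBookkeepingService.queue_size # initial indexing backlog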
  • Manually delete the old indices. The newer indices will have a newer date in their suffix. Before deleting, confirm in the ElasticCloud monitoring cluster that the index being removed is not receiving any search or indexing traffic
      elastic_helper.client.indices.delete(index: old_index_name)
  • Scale the cluster back down to the original settings (this requires pausing indexing again)
    • Pause indexing
        ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
    • In the Elastic Cloud UI, click Edit for the production deployment
    • Select the original specs and click Apply. Wait until the changes have applied successfully
    • Verify search operations:
      • Ensure Enable exact code search is disabled in your user preference setting
      • Search for code
    • Unpause indexing
        ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
    • Verify indexing:
      • Add code to a test project, verify it is searchable (may take time depending on how backed up indexing is)

  • Remove the alert silences
  • Set the change::complete label (/label ~change::complete)

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 60

If you've finished the whole process but want to revert for performance reasons

  • Create a new change request repeating these steps, but use the Shrink Index API to shrink the index back to 300 shards

If you've already updated the alias gitlab-production
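
  • A likely rollback sketch (variable names as defined in the change steps): switch the alias back to the old index and unblock writes on it, then delete the newly created index as in the next section

        elastic_helper.switch_alias(to: old_index_name, from: new_index_name, alias_name: alias_name)
        elastic_helper.update_settings(settings: { "index.blocks.write": false }, index_name: old_index_name)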

If you have not switched the alias yet

  • Delete the newly created index

        elastic_helper.client.indices.delete(index: new_index_name)
  • Set the change::aborted label (/label ~change::aborted)

Monitoring

Key metrics to observe

Change Reviewer checklist

C3:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • There are currently no active incidents that are severity1 or severity2