[GitLab.com] Split shards for main index in Elasticsearch cluster

Production Change

Change Summary

Double the number of shards in the main index of the gprd-indexing-20220523 cluster. This is to improve performance as the average shard size is nearing 50GiB.

main index name: gitlab-production-20240624-1635-reindex-1000118-0
main index alias: gitlab-production

Related to gitlab-org/search-team/team-tasks#182 (closed)

We want to reduce the per-shard size of the gitlab-production-20240624-1635-reindex-1000118-0 index using the Elasticsearch Split Index API. The index will be split from 300 to 600 primary shards.

From the docs

Indices can only be split if they satisfy the following requirements:

  • The target index must not exist
  • The source index must have fewer primary shards than the target index.
  • The number of primary shards in the target index must be a multiple of the number of primary shards in the source index.
  • The node handling the split process must have sufficient free disk space to accommodate a second copy of the existing index.
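
A quick sanity check of these requirements from a Rails console is sketched below. It reuses the elastic_helper calls from this plan; the -split target name is only illustrative, since the real target index name is generated during the change.

    elastic_helper = ::Gitlab::Elastic::Helper.default

    source_index = 'gitlab-production-20240624-1635-reindex-1000118-0'
    target_index = "#{source_index}-split" # illustrative name only

    # The target index must not exist yet
    elastic_helper.client.indices.exists?(index: target_index) # expect false

    # 600 target primaries is a multiple of (and larger than) the source's 300
    source_shards = elastic_helper.get_settings(index_name: source_index).to_hash['number_of_shards'].to_i
    (source_shards * 2) % source_shards == 0 # expect true

    # The cluster needs enough free disk for a second copy of the index
    elastic_helper.cluster_free_size_bytes > elastic_helper.index_size_bytes(index_name: source_index) * 2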

Change Details

  1. Services Impacted - Service::Elasticsearch
  2. Change Technician - @terrichu
  3. Change Reviewer - @dgruzd
  4. Time tracking - 1440

Detailed steps for the change

Pre-Change Steps

  • Check if there are any active high severity incidents
  • Ping @sre-oncall in Slack to let them know about the change request
  • Determine if cluster needs to be scaled
      elastic_helper = ::Gitlab::Elastic::Helper.default
      target_classes = [Repository]
      
      current_size = target_classes.sum do |klass|
        name = elastic_helper.klass_to_alias_name(klass: klass)
        elastic_helper.index_size_bytes(index_name: name)
      end
      
      expected_free_size = current_size * 2
      elastic_helper.cluster_free_size_bytes # note the current free size
      elastic_helper.cluster_free_size_bytes > expected_free_size # if true, the cluster does not need to be scaled

If scaling needs to occur, perform the following steps:

  • In the Elastic Cloud UI, click Edit for the production deployment
  • Take a screenshot of the existing and proposed settings and add in an internal comment

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 60 mins

  1. Run all the steps on staging, doubling the number of shards there
  2. Make the cluster larger if necessary. It should have enough space to contain double the size of the main index.

Change Steps - steps to take to execute the change


  • Add silences via https://alerts.gitlab.net/#/silences/new with a matcher on env and alert name for each pair:
    • env="gprd", alertname="SearchServiceElasticsearchIndexingTrafficAbsent"
    • env="gprd", alertname="gitlab_search_indexing_queue_backing_up"
    • env="gprd", alertname="SidekiqServiceGlobalSearchIndexingApdexSLOViolation"
    • env="gprd", alertname="SearchServiceGlobalSearchIndexingTrafficCessation"
  • (optional) Scale the cluster using the settings determined above
    • In the Elastic Cloud UI, click Edit for the production deployment
    • Select the required specs and click Apply. Wait until the changes have applied successfully
    • Verify search operations:
      • Ensure Enable exact code search is disabled in your user preference setting
      • Search for code
  • For each index: Take a screenshot of the ES monitoring cluster's index advanced metrics for the last 4 days and attach it to an internal comment on this issue
  • For each index, find the current number of shards and attach to a comment on this issue
     elastic_helper = ::Gitlab::Elastic::Helper.default
     alias_name = elastic_helper.klass_to_alias_name(klass: Repository)
     current_shards = Elastic::IndexSetting.find_by(alias_name: alias_name).number_of_shards
  • For each index, update the number of shards for the affected indices to the new number of shards (double the current value)
     Elastic::IndexSetting.find_by(alias_name: alias_name).update!(number_of_shards: current_shards * 2)
  • Pause indexing, wait at least 2 minutes for the queues to drain
      ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
  • Take a snapshot of the cluster
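    If taking the snapshot from the console rather than the Elastic Cloud UI, one approach is sketched below; found-snapshots is Elastic Cloud's default repository name and the snapshot name is only an example.
      snapshot_name = "pre-split-#{Time.now.utc.strftime('%Y%m%d%H%M')}" # example name
      elastic_helper.client.snapshot.create(repository: 'found-snapshots', snapshot: snapshot_name, wait_for_completion: false)
      elastic_helper.client.snapshot.status(repository: 'found-snapshots', snapshot: snapshot_name) # poll until the state is SUCCESS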
  • Verify the queues are growing in the Global Search Sidekiq graphs via Grafana
  • Note size of the index in a private comment
      elastic_helper.index_size_bytes(index_name: alias_name)
  • Note total number of documents of the index in a private comment
      elastic_helper.documents_count(index_name: alias_name, refresh: true)
  • Block writes to gitlab-production-20240624-1635-reindex-1000118-0:
     updated_index_setting = { "index.blocks.write": true }
     elastic_helper.update_settings(settings: updated_index_setting, index_name: alias_name)
     elastic_helper.get_settings # validate the setting worked
  • Increase recovery max bytes to speed up replication:
     updated_cluster_setting = {"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}
     elastic_helper.client.cluster.put_settings(body: updated_cluster_setting)
     elastic_helper.client.cluster.get_settings # validate the setting worked
  • Trigger split from source index to destination index
      new_index_name = elastic_helper.index_name_with_timestamp(alias_name, suffix: '-split')
      old_index_name = elastic_helper.target_index_name(target: alias_name)
      new_index_settings = elastic_helper.get_settings(index_name: alias_name)
        .to_hash
        .merge('number_of_shards' => "#{current_shards * 2}")
        .except('creation_date', 'uuid', 'provided_name', 'version')
      elastic_helper.client.indices.split(index: old_index_name, target: new_index_name, body: { settings: { index: new_index_settings } } )
  • Note time when the task started: 2024-11-07 11:30 UTC
  • Track the progress of splitting using the Recovery API
     # Tally of shards by recovery stage; the split is complete when all shards report DONE
     Hash.new(0).tap { |hsh| elastic_helper.client.indices.recovery[new_index_name]['shards'].each { |shard| hsh[shard['stage']] += 1 } }
  • Note the time when the split finishes:
  • Verify number of documents in the old index equals new index
     old_index_count = elastic_helper.documents_count(index_name: old_index_name, refresh: true)
     new_index_count = elastic_helper.documents_count(index_name: new_index_name, refresh: true)
     old_index_count == new_index_count # should be true before proceeding
  • Force merge the new index to remove all deleted docs:
      elastic_helper.client.indices.forcemerge(index: new_index_name)
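    Force merge requests can run for a long time and the HTTP call may time out before they finish; one way to check whether a force merge (or the expunge-deletes run later in this plan) is still running is the tasks API:
      elastic_helper.client.tasks.list(actions: '*forcemerge*', detailed: true)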
  • Add a comment to the issue with the new shard sizes:
     puts elastic_helper.client.cat.shards(index: new_index_name, h: "shard,prirep,docs,store,node", v: true)
  • Set recovery max bytes back to default
     updated_cluster_setting = {"persistent":{"indices.recovery.max_bytes_per_sec": nil}}
     elastic_helper.client.cluster.put_settings(body: updated_cluster_setting)
     elastic_helper.client.cluster.get_settings # validate the setting worked
  • Force expunge deletes
      elastic_helper.client.indices.forcemerge(index: new_index_name, only_expunge_deletes: true)
  • Record when this expunge deletes started:
  • You can look at the Disk (GB) graph in the ElasticCloud monitoring for the new index. You may see the disk storage shrink as deletes are cleared and the disk usage may flatline
  • Record when this expunge deletes finishes:
  • Add a comment to this issue with the new shard sizes:
     puts elastic_helper.client.cat.shards(index: new_index_name, h: "shard,prirep,docs,store,node", v: true)
  • Note the size of the new index:
     elastic_helper.index_size_bytes(index_name: new_index_name)
  • Update the alias gitlab-production to point to the new index
     elastic_helper.switch_alias(to: new_index_name, from: old_index_name, alias_name: alias_name)
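    To confirm the switch took effect, the aliases API can be read back (this only inspects what switch_alias just changed):
      elastic_helper.client.indices.get_alias(name: alias_name) # should now list only new_index_name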
  • Test that searching still works.
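    In addition to UI searches, a minimal console-level query against the alias can confirm the new index is serving search traffic (the match_all query is only an example):
      elastic_helper.client.search(index: alias_name, body: { query: { match_all: {} }, size: 1 })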
  • Unblock writes to the new index:
      updated_index_setting = { "index.blocks.write": false }
      elastic_helper.update_settings(settings: updated_index_setting, index_name: new_index_name)
      elastic_helper.get_settings # validate the setting worked
  • Unpause indexing
      ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
  • Wait until the backlog of incremental updates gets below 10,000
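    The backlog can be watched from the console via the bookkeeping queue sizes; these class names are assumed to match the advanced search bookkeeping services and should be double-checked before relying on them.
      Elastic::ProcessBookkeepingService.queue_size        # incremental updates backlog
      Elastic::ProcessInitialBookkeepingService.queue_size # initial indexing backlog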
  • Manually delete the old indices. The newer indices will have a newer date in their suffix. Before deleting, confirm in the ElasticCloud monitoring cluster that the index being removed is not receiving any search or indexing traffic
      elastic_helper.client.indices.delete(index: old_index_name)
  • Scale the cluster back down to the original settings (this requires pausing indexing again)
    • Pause indexing
        ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
    • In the Elastic Cloud UI, click Edit for the production deployment
    • Select the original specs and click Apply. Wait until the changes have applied successfully
    • Verify search operations:
      • Ensure Enable exact code search is disabled in your user preference setting
      • Search for code
    • Unpause indexing
        ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
    • Verify indexing:
      • Add code to a test project, verify it is searchable (may take time depending on how backed up indexing is)

  • Remove the alert silences
  • Set the change::complete label (/label ~change::complete)

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 60

If you've finished the whole process but want to revert for performance reasons

  • Create a new change request repeating these steps, but use the Shrink Index API to shrink the index back to 300 shards

If you've already updated the alias gitlab-production
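
  • A likely rollback sketch (variable names as defined in the change steps): switch the alias back to the old index and unblock writes on it, then delete the newly created index as in the next section

        elastic_helper.switch_alias(to: old_index_name, from: new_index_name, alias_name: alias_name)
        elastic_helper.update_settings(settings: { "index.blocks.write": false }, index_name: old_index_name)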

If you have not switched the alias yet

  • Delete the newly created index

        elastic_helper.client.indices.delete(index: new_index_name)
  • Set the change::aborted label (/label ~change::aborted)

Monitoring

Key metrics to observe

Change Reviewer checklist

C3:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • There are currently no active incidents that are severity1 or severity2