
Reindex main index to resize shard count

Production Change

Change Summary

Related to https://gitlab.com/gitlab-org/search-team/team-tasks/-/issues/173+

We want to resize the main index shards (by changing the shard count) using the zero-downtime reindexing feature.

This is the second attempt; we are introducing a few code changes in gitlab-org/gitlab!156519 (merged) and changes to the zero-downtime reindexing configuration.

Attempt 1: #18080 (closed)

Change Details

  1. Services Impacted - Service::Elasticsearch
  2. Change Technician - @terrichu
  3. Change Reviewer - @dgruzd
  4. Time tracking - 240
  5. Downtime Component - no downtime

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes


  • Add silences via https://alerts.gitlab.net/#/silences/new with a matcher on env and alert name for each pair:
    • env="gprd", alertname="SearchServiceElasticsearchIndexingTrafficAbsent"
    • env="gprd", alertname="gitlab_search_indexing_queue_backing_up"
    • env="gprd", alertname="SidekiqServiceGlobalSearchIndexingApdexSLOViolation"
    • env="gprd", alertname="SearchServiceGlobalSearchIndexingTrafficCessation"
  • Scale the cluster up to support the reindexing operation; we need more space per #18080 (comment 1933936051)
    • Pause indexing
      ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: true)
    • In the Elastic Cloud UI, click Edit for the production deployment
    • Select the required specs (see image in #18080 (comment 1933936051)), click the Apply button
    • Test searching:
      • Ensure Enable exact code search is disabled in your user preference settings
      • Search for code
      • Search for notes
    • Unpause indexing
      ::Gitlab::CurrentSettings.update!(elasticsearch_pause_indexing: false)
    • Test indexing:
      • Add code to a test project, verify it is searchable (may take time depending on how backed up indexing is)
      • Add a comment to a test issue, verify it is searchable (may take time depending on how backed up indexing is)
  • For each index: Take a screenshot of the ES monitoring cluster index advanced metrics for the last 7 days and attach it to an Internal comment on this issue
  • For each index, find the current number of shards and attach to a comment on this issue
     Elastic::IndexSetting.find_by(alias_name: 'gitlab-production').number_of_shards
  • Update the number of shards for the affected indices to 300 (from https://gitlab.com/gitlab-org/search-team/team-tasks/-/issues/173#main-index)
    ::Elastic::IndexSetting.find_by(alias_name: 'gitlab-production').update!(number_of_shards: 300)
  • Trigger re-index with the max_slices_running and slice_multiplier values shown in the command below
    Elastic::ReindexingTask.create!(targets: %w[Repository], max_slices_running: 20, slice_multiplier: 1)
  • Note the timestamp when it was triggered -> 2024-06-24 15:40 UTC
  • Monitor the status of the reindexing through the Rails console with Elastic::ReindexingTask.current (see the console sketch after this list)
  • Ensure that it has finished successfully
  • Note the time when the task finishes -> 2024-06-27 02:21 UTC
  • Wait until the backlog of incremental updates gets below 10,000 (the console sketch after this list shows one way to check the queue size)
  • Create a file somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
  • Remove the alert silences
  • Re-enable the slowlog for each index by using the Dev Console to issue the command below:
    PUT /<index_name>/_settings
    {
       "index.search.slowlog.threshold.query.warn": "30s",
       "index.search.slowlog.threshold.query.info": "10s",
       "index.search.slowlog.threshold.query.debug": "-1",
       "index.search.slowlog.threshold.query.trace": "-1",
       "index.search.slowlog.threshold.fetch.warn": "30s",
       "index.search.slowlog.threshold.fetch.info": "10s",
       "index.search.slowlog.threshold.fetch.debug": "-1",
       "index.search.slowlog.threshold.fetch.trace": "-1"
    }
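
The monitoring and backlog checks above can be run from the production Rails console. A minimal sketch, assuming the standard GitLab EE classes Elastic::ReindexingTask and Elastic::ProcessBookkeepingService are available there (the state values and the queue-size helper are assumptions, not part of this plan):

    # Current zero-downtime reindexing task and its state (for example reindexing, success, failure)
    task = Elastic::ReindexingTask.current
    task&.state

    # Approximate backlog of incremental updates waiting to be indexed;
    # proceed once this drops below 10,000
    Elastic::ProcessBookkeepingService.queue_size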

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 60

  • If the ongoing reindex is consuming too many resources, it is possible to throttle the running reindex:
  • You can check the index write throughput in the ES monitoring cluster to determine a sensible throttle. Since reindexing defaults to no throttling, it's safe to just set some throttle and observe the impact (see the console sketch after this list for one way to find $TASK_ID)
    • curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"
  • If the reindexing task fails, it will automatically revert to the original index
    • Pause indexing (if it's not paused already)
    • Scale the cluster back down to avoid paying additional running costs
  • If the reindexing task completes but you need to roll back:
  • Delete incomplete indices by running the command below (the console sketch after this list shows one way to confirm which physical index the alias currently points to before deleting anything)
    curl -XDELETE "$CLUSTER_URL/gitlab-production-20240613-1540-reindex-1000114-0" # (The suffix number will be different)
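
For the rethrottle and cleanup steps above, a hedged Rails console sketch for finding the reindex task ID and confirming which physical index the gitlab-production alias points to, assuming Gitlab::Elastic::Helper.default.client exposes the standard elasticsearch-ruby client (method names not shown elsewhere in this issue are assumptions):

    helper = Gitlab::Elastic::Helper.default

    # List running reindex tasks; a task ID from here can be used as $TASK_ID in the _rethrottle call
    helper.client.tasks.list(actions: '*reindex', detailed: true)

    # Show which physical indices sit behind the alias; only delete an index
    # the alias no longer points to
    helper.client.indices.get_alias(name: 'gitlab-production').keys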

Monitoring

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity1 or severity2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.