Reindex GitLab.com Global Search Elasticsearch cluster main index

Production Change

Change Summary

We have a list of changes that we want to apply to the GitLab.com main Advanced Search index.

These changes can be applied by reindexing the index.

Change Details

  1. Services Impacted - Elasticsearch global search
  2. Change Technician - @dgruzd (EMEA) @john-mason (AMER)
  3. Change Reviewer - @terrichu
  4. Time tracking - 2880m
  5. Downtime Component - No downtime, but Advanced Search indexing will be paused for the duration of reindexing

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 30m

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 2880m

  1. Add a silence via https://alerts.gitlab.net/#/silences/new with matchers on env="gprd", alertname="SearchServiceElasticsearchIndexingTrafficAbsent", alertname="gitlab_search_indexing_queue_backing_up", and alertname="SidekiqServiceGlobalSearchIndexingApdexSLOViolation". Link the comment field back to the Change Request Issue. => https://alerts.gitlab.net/#/silences/c69d2b4e-72e4-44b8-b504-baa52d7fd0d9
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of one of our production Elasticsearch cluster indices which will re-index all of our main index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we’ll be pausing indexing to the cluster which will cause the incremental updates queue to grow. We have added a silence for the SearchServiceElasticsearchIndexingTrafficAbsent alert. This will increase load on the Elasticsearch cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6116
  3. Pause indexing writes: ApplicationSetting.current.update!(elasticsearch_pause_indexing: true) => https://gitlab.slack.com/archives/C101F3796/p1666609373933799?thread_ts=1666608961.045649&cid=C101F3796
  4. In any console, set CLUSTER_URL and confirm that it is the expected cluster with the expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  5. Note the total size of the source gitlab-production-202012160000 index:
    1. 6.85TB
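    2. One way to check the size, assuming the _cat API is reachable on the cluster, is: curl "$CLUSTER_URL/_cat/indices/gitlab-production-202012160000?v&h=index,pri,docs.count,store.size"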
  6. Ensure there's enough capacity for a copy of the gitlab-production index. Note the free space for the cluster:
    1. 43311989678080 bytes free (43.3TB)
    2. If there isn't enough capacity, add data nodes to the cluster using Elastic Cloud console and wait until resizing is completed
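    3. A possible way to check the free disk space per data node, again assuming the _cat API is available: curl "$CLUSTER_URL/_cat/allocation?v&h=node,disk.avail,disk.used,disk.total"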
  7. Take a screenshot of the index advanced metrics for the last 30 days and the last 7 days, and attach it to a comment on this issue
    1. #6116 (comment 1146240300)
  8. Note the total number of documents in the source gitlab-production-202012160000 index:
    1. curl $CLUSTER_URL/gitlab-production-202012160000/_count
    2. {"count":826555716,"_shards":{"total":120,"successful":120,"skipped":0,"failed":0}}
  9. Create the new destination index with the correct settings/mappings from the Rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false)
  10. Replace the gitlab-production-NEW_INDEX_SUFFIX index names below with the newly created index name
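    1. If the new index name needs to be looked up, it should appear in the index listing, e.g. (assuming the _cat API is available): curl "$CLUSTER_URL/_cat/indices/gitlab-production-*?v&s=creation.date"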
  11. Confirm the newly created index has new mappings
    1. curl $CLUSTER_URL/gitlab-production-20221024-1119/_settings
  12. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20221024-1119/_settings"
  13. Increase recovery max bytes to speed up replication:
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "400mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  14. Trigger re-index from source index gitlab-production-202012160000 to destination index gitlab-production-20221024-1119
    1. curl -H 'Content-Type: application/json' -d '{ "conflicts": "proceed", "source": { "index": "gitlab-production-202012160000" }, "dest": { "index": "gitlab-production-20221024-1119" } }' -X POST "$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false&timeout=72h"
  15. Note the returned task ID from the above: kbL5gZn-RKi2F3k_IaHQzA:3540867
  16. Note the time when the task started: 2022-10-24 12:04 UTC
  17. Track the progress of reindexing using the Tasks API: curl $CLUSTER_URL/_tasks/$TASK_ID
    1. If failures happen only in some slices, it is possible to retry just those slices, following the steps used last time
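    2. A small sketch for pulling out just the progress counters, assuming jq is installed on the machine running curl: curl -s "$CLUSTER_URL/_tasks/$TASK_ID" | jq '.task.status | {total, created, updated, deleted, version_conflicts}'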
  18. Note the time when the task finishes: 2022-10-26 23:25 UTC
  19. Note the total time taken to reindex: 3561 minutes (just under 60 hours)
  20. Change the refresh_interval and number_of_replicas settings on the destination index back to the defaults
    1. curl -XPUT -d '{"index":{"number_of_replicas":"1", "refresh_interval": null}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20221024-1119/_settings"
  21. Verify that the number of documents in the destination index matches the number of documents in the source index
    1. Be aware it may take up to 60s for the destination index to refresh
    2. curl $CLUSTER_URL/gitlab-production-202012160000/_count => {"count":826555716,"_shards":{"total":120,"successful":120,"skipped":0,"failed":0}}
    3. curl $CLUSTER_URL/gitlab-production-20221024-1119/_count => {"count":826555716,"_shards":{"total":200,"successful":200,"skipped":0,"failed":0}}
  22. Wait for cluster monitoring to show the replication has completed
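    1. One way to watch this from a console, as an alternative to the monitoring UI, is to wait for the index health to reach green: curl "$CLUSTER_URL/_cluster/health/gitlab-production-20221024-1119?wait_for_status=green&timeout=60s"; ongoing shard recoveries can also be listed with curl "$CLUSTER_URL/_cat/recovery/gitlab-production-20221024-1119?v&active_only=true"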
  23. Set recovery max bytes back to default
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  24. Set translog durability on the destination index back to the default (request):
    1. curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-20221024-1119/_settings"
  25. Note the size of the destination index gitlab-production-20221024-1119 index: 5.1 TB
  26. Update the alias gitlab-production to point to the new index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-20221024-1119","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202012160000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
  27. Confirm it works: curl $CLUSTER_URL/gitlab-production/_settings
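    1. The alias target can also be checked directly, assuming the _cat API is available: curl "$CLUSTER_URL/_cat/aliases/gitlab-production?v"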
  28. Verify that there are no code search regressions
  29. Unpause indexing writes: ApplicationSetting.current.update!(elasticsearch_pause_indexing: false) => https://gitlab.slack.com/archives/C101F3796/p1666828526346619
  30. Wait until the backlog of incremental updates gets below 10,000
    1. Chart Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
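    2. The queue depth can also be checked from a Rails console; a minimal sketch, assuming this helper is available in the deployed GitLab version: Elastic::ProcessBookkeepingService.queue_size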
  31. Create a file somewhere, then search for it to ensure indexing still works (it can take up to 2 minutes before it shows up in the search results)
    1. https://gitlab.com/gitlab-org/search-team/test-project/-/blob/71fd8fa84a462cc72da31762408f59b792b6ac64/searchable.txt#L30
    2. Found via https://gitlab.com/search?group_id=9970&repository_ref=master&scope=blobs&search=searchablecomment5
  32. Remove the alert silences https://alerts.gitlab.net/#/silences/c69d2b4e-72e4-44b8-b504-baa52d7fd0d9 and https://alerts.gitlab.net/#/silences/6df29b7f-c2b2-416e-896f-554e5a622f00
  33. Delete the previous index gitlab-production-202012160000 when it is safe to do so
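    1. A hedged example of the deletion command (double-check the index name first, as this is irreversible): curl -XDELETE "$CLUSTER_URL/gitlab-production-202012160000"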

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 1m

  1. Set the change::complete label: /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 60

  1. If the ongoing reindex is consuming too many resources, it is possible to throttle the running reindex:
    1. You can check the index write throughput in ES monitoring to determine a sensible throttle. Since it defaults to no throttling at all, it's safe to set some throttle and observe the impact
    2. curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"
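    3. Once load allows, the throttle can be removed again by setting requests_per_second to -1 (Elasticsearch's value for "unthrottled"): curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=-1"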
    • If the alias update step has already been executed, simply switch the alias back to point to the original index
      1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202012160000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-20221024-1119","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
      2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
    • Ensure that any updates which only went to the destination index are replayed against the source index: search the logs for the updates (https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and trigger them again using ProcessBookkeepingService#track, as well as any updates that went through the Sidekiq workers ElasticCommitIndexerWorker and ElasticDeleteProjectWorker; see the sketch below.
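      1. A rough sketch of replaying a missed update from a Rails console, assuming the standard entry points and with the IDs below standing in for whatever is found in the logs: Elastic::ProcessBookkeepingService.track!(Issue.find(ISSUE_ID)) for database records, and ElasticCommitIndexerWorker.perform_async(PROJECT_ID) for repository/commit updates.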

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc?

Summary of the above

Change Reviewer checklist

C4 C3 C2 C1:

  • The scheduled day and time of execution of the change is appropriate.
  • The change plan is technically accurate.
  • The change plan includes estimated timing values based on previous testing.
  • The change plan includes a viable rollback plan.
  • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

Change Technician checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • This Change Issue is linked to the appropriate Issue and/or Epic.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.