Reindex GitLab.com Global Search Elasticsearch cluster

Production Change - Criticality 3 C3

Change Objective Describe the objective of the change
Change Type ConfigurationChange
Services Impacted GitLab.com, Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq
Change Team Members @DylanGriffith
Change Criticality C3
Change Reviewer or tested in staging
Dry-run output
Due Date 2020-06-04 00:32 UTC
Time tracking

Detailed steps for the change

Pre-check

  1. Run all the steps on staging
  2. Confirm the cluster storage is less than 33% full (more than 67% free)
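
The storage pre-check can be scripted roughly as follows. This is a minimal sketch, not part of the original procedure: the helper names percent_used and storage_ok are hypothetical, and it assumes the input comes from the _cat/allocation API with the disk.used and disk.total columns in bytes. It reports the aggregate across nodes, so an individual node that is much fuller than the rest would still need a separate look.

```shell
#!/bin/sh
# Sketch: compute cluster-wide disk usage and compare it with the 33% limit.
# Assumed input: one "used_bytes total_bytes" pair per line, e.g. from
#   curl -s "$CLUSTER_URL/_cat/allocation?h=disk.used,disk.total&bytes=b"

percent_used() {
  # Reads "used total" lines on stdin and prints the integer percent used.
  # Rows without exactly two columns (such as an UNASSIGNED row) are skipped.
  awk 'NF == 2 { used += $1; total += $2 }
       END { if (total > 0) printf "%d\n", (used * 100) / total; else print "unknown" }'
}

storage_ok() {
  # Succeeds only if the percent used ($1) is strictly below the limit ($2, default 33).
  pct=$1
  limit=${2:-33}
  case $pct in
    ''|*[!0-9]*) return 2 ;;  # not a number, treat as a failed check
  esac
  [ "$pct" -lt "$limit" ]
}

# Example against a live cluster (requires CLUSTER_URL to be set):
# pct=$(curl -s "$CLUSTER_URL/_cat/allocation?h=disk.used,disk.total&bytes=b" | percent_used)
# if storage_ok "$pct"; then echo "OK: ${pct}% used"; else echo "STOP: ${pct}% used"; fi
```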

Process

  1. Confirm the cluster storage is less than 33% full (more than 67% free)
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster which will re-index all of our production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be pausing indexing to the cluster which will cause the incremental updates queue to grow but should not cause alerts since we don't have any for that queue. This will increase load on the Elasticsearch cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2213
  3. Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
  4. Take a snapshot of the cluster
  5. In any console set CLUSTER_URL and confirm that it is the expected cluster with expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  6. Note the total size of the source gitlab-production-202004062333 index: 4.9 TB
  7. Note the total number of documents in the source gitlab-production-202004062333 index: 127170026
    • curl $CLUSTER_URL/gitlab-production-202004062333/_count
  8. Create new destination index gitlab-production-202006040000 with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new(target_name: 'gitlab-production-202006040000').create_empty_index(with_alias: false)
  9. Set index settings in destination index to optimize for writes:
    • curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202006040000/_settings"
  10. Trigger re-index from source index gitlab-production-202004062333 to destination index gitlab-production-202006040000
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202004062333" }, "dest": { "index": "gitlab-production-202006040000" }, "script": { "source": "ctx._source.remove(\"file_name\"); ctx._source.remove(\"content\");" } }' -X POST "$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false"
  11. Note the returned task ID from the above: kYs7t1EZTcGOQ_FTrLVubQ:1485991
  12. Note the time when the task started: 2020-06-04 01:09:22 UTC
  13. Track the progress of reindexing using the Tasks API: curl $CLUSTER_URL/_tasks/$TASK_ID
  14. Note the time when the task finishes: 2020-06-04 08:33:10 UTC
  15. Note the total time taken to reindex: 7.4 hrs
  16. Change the refresh_interval setting on the destination index to 60s
    • curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202006040000/_settings"
  17. Verify number of documents in Destination index = number of documents in Source index
    • Be aware it may take 60s for the destination index to refresh
    • curl $CLUSTER_URL/gitlab-production-202004062333/_count => 127170026
    • curl $CLUSTER_URL/gitlab-production-202006040000/_count => 127170026
  18. Force merge the index to speed up replication: curl -XPOST $CLUSTER_URL/gitlab-production-202006040000/_forcemerge
  19. Increase replication on Destination index to 1:
    • curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202006040000/_settings"
  20. Increase recovery max bytes to speed up replication:
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  21. Wait for cluster monitoring to show the replication has completed
  22. Set recovery max bytes back to default
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  23. Set translog durability on the destination index back to the default (request):
    • curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202006040000/_settings"
  24. Note the size of the destination gitlab-production-202006040000 index: 1.1 TB
  25. Update the alias gitlab-production to point to the new index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202006040000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202004062333","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  26. Test that prefix searching now behaves as expected per gitlab-org/gitlab#27918 (comment 324005806)
  27. Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > uncheck Pause Elasticsearch indexing
  28. Wait until the backlog of incremental updates gets below 10,000
  29. Create a comment somewhere then search for it to ensure indexing still works (can take up to 2 minutes before it shows up in the search results)
  30. Delete the old gitlab-production-202004062333 index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production-202004062333
  31. Test again that searches work as expected
  32. Scale the cluster down again based on the current size
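
The task tracking and document count verification steps above could be scripted along these lines. This is a sketch under assumptions, not the tooling used for this change: the function names are hypothetical, the JSON is parsed with sed and grep instead of a real JSON parser, and wait_for_task keeps polling indefinitely if the cluster is unreachable.

```shell
#!/bin/sh
# Sketch: helpers for polling the reindex task and comparing document counts.

extract_count() {
  # Reads a _count response on stdin and prints the value of its "count" field.
  sed -n 's/.*"count"[ ]*:[ ]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

task_completed() {
  # Reads a GET _tasks/<id> response on stdin; succeeds if it reports "completed": true.
  grep -q '"completed"[ ]*:[ ]*true'
}

counts_match() {
  # Succeeds only if both counts are non-empty and identical.
  [ -n "$1" ] && [ "$1" = "$2" ]
}

wait_for_task() {
  # Polls the Tasks API every $2 seconds (default 60) until the task completes.
  task_id=$1
  interval=${2:-60}
  until curl -s "$CLUSTER_URL/_tasks/$task_id" | task_completed; do
    sleep "$interval"
  done
}

# Example against a live cluster (requires CLUSTER_URL and TASK_ID to be set):
# wait_for_task "$TASK_ID"
# src=$(curl -s "$CLUSTER_URL/gitlab-production-202004062333/_count" | extract_count)
# dst=$(curl -s "$CLUSTER_URL/gitlab-production-202006040000/_count" | extract_count)
# if counts_match "$src" "$dst"; then echo "counts match: $src"; else echo "MISMATCH: $src vs $dst"; fi
```

Requiring non-empty counts means a failed curl call is reported as a mismatch instead of passing silently.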

Monitoring

Key metrics to observe

Other metrics to observe

Rollback steps

  1. If you have already updated the alias, switch the alias to point back to the original index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202004062333","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202006040000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  2. Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using ProcessBookkeepingService#track
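
Because the alias request body is long and easy to mistype by hand, one option is to generate it from the index names. The sketch below assumes a hypothetical helper named alias_swap_body. It emits both actions in a single _aliases request, which Elasticsearch applies atomically, so the alias never points at neither index or at both.

```shell
#!/bin/sh
# Sketch: build the JSON body that moves an alias from one index to another.

alias_swap_body() {
  # $1 = alias name, $2 = index to add the alias to, $3 = index to remove it from
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}},{"remove":{"index":"%s","alias":"%s"}}]}\n' \
    "$2" "$1" "$3" "$1"
}

# Rollback example against a live cluster (requires CLUSTER_URL to be set):
# curl -XPOST -H 'Content-Type: application/json' \
#   -d "$(alias_swap_body gitlab-production gitlab-production-202004062333 gitlab-production-202006040000)" \
#   "$CLUSTER_URL/_aliases"
```

Swapping the second and third arguments gives the body for the forward cutover.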

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out
Edited by Dylan Griffith