
Reindex GitLab.com global search cluster with alias

Production Change - Criticality 3 C3

Change Objective: Describe the objective of the change
Change Type: ConfigurationChange
Services Impacted: GitLab.com, Elasticsearch cluster
Change Team Members: @DylanGriffith @aamarsanaa
Change Criticality: C3
Change Reviewer or tested in staging: #1907 (comment 318876101)
Dry-run output:
Due Date: 2020-04-07 XX:XX:00 UTC
Time tracking:

Detailed steps for the change

Pre-check

  1. Run all the steps on staging

Process

  1. Let the SRE on call know in #production that we are triggering the re-index: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster, which will copy the production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be disabling sidekiq on the Elasticsearch sidekiq node, so the associated queues will grow and may cause alerts. This will increase load on the production cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1907
  2. Pause indexing writes (stop Elasticsearch sidekiq node): sudo gitlab-ctl stop sidekiq-cluster
  3. Take a snapshot of the cluster
  4. Note the total size of the source gitlab-production index:
    • 580.8 GB
  5. In any console, set CLUSTER_URL and validate that it is the expected cluster with the expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  6. Create index gitlab-production-202004062333 with correct settings/mappings:
    1. index-create.json => index-create.json
    2. curl -H 'Content-Type: application/json' -d @index-create.json -XPUT "$CLUSTER_URL/gitlab-production-202004062333?include_type_name=true"
  7. Set index settings in destination index to optimize for writes:
    • curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"
  8. Trigger re-index from source index gitlab-production to destination index gitlab-production-202004062333
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production" }, "dest": { "index": "gitlab-production-202004062333" } }' -X POST "$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false"
  9. Note the returned task ID from the above: 5RluDddRTgq7jdqmfAYagg:4459919
  10. Note the time when the task started: 2020-04-07 06:55:23 UTC
  11. Track the progress of reindexing using the Tasks API: curl $CLUSTER_URL/_tasks/$TASK_ID
  12. Note the time when the task finishes: 2020-04-07 09:59:42 UTC
  13. Note the total time taken to reindex: 3.1 hrs
  14. Change the refresh_interval setting on the destination index to 60s:
    • curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"
  15. Verify that the number of documents in the destination index gitlab-production-202004062333 equals the number of documents in the source index gitlab-production:
    • Be aware it may take 60s for the destination index to refresh
    • curl $CLUSTER_URL/gitlab-production/_count => 24765325
    • curl $CLUSTER_URL/gitlab-production-202004062333/_count => 24765325
  16. Force-merge the destination index: curl -XPOST $CLUSTER_URL/gitlab-production-202004062333/_forcemerge
  17. Increase replication on the destination index to 1:
    • curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"
  18. Increase recovery max bytes:
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "200mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  19. Wait for cluster monitoring to show the replication has completed
  20. Set recovery max bytes back to default
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  21. Set translog durability on the destination index back to the default, request, so that writes are fsynced before they are acknowledged:
    • curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202004062333/_settings"
  22. Note the size of the destination index gitlab-production-202004062333: 591.5 GB
  23. Delete the gitlab-production index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production
  24. Create an alias gitlab-production that points to gitlab-production-202004062333
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202004062333","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  25. Enable search with Elasticsearch in GitLab > Admin > Settings > Integrations
  26. Test that phrase searching in issues works: https://gitlab.com/search?project_id=278964&repository_ref=master&scope=issues&search=%22slack+integration%22&snippets=
  27. Disable search with Elasticsearch in GitLab > Admin > Settings > Integrations to allow indexing to catch up
  28. Re-enable indexing writes (start Elasticsearch sidekiq node): sudo gitlab-ctl start sidekiq-cluster
  29. Wait until the backlog of incremental updates gets below 1000
  30. Wait for elastic_indexer queue to drop below 10k
  31. Enable search with Elasticsearch in GitLab > Admin > Settings > Integrations
  32. Create a comment somewhere then search for it to ensure indexing still works (can take a minute to catch up)
    1. Confirm it's caught up by checking Global search incremental indexing queue depth or the source of truth via rails console: Elastic::ProcessBookkeepingService.queue_size
  33. Test that phrase searching in issues works: https://gitlab.com/search?project_id=278964&repository_ref=master&scope=issues&search=%22slack+integration%22&snippets=
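
The tracking and verification steps above (11 to 15) can be scripted. What follows is a minimal sketch, not the tooling used for this change: the helper names are hypothetical, the 60-second poll interval is an arbitrary assumption, elapsed_hours assumes GNU date, and the grep/sed parsing is deliberately crude (a JSON parser would be more robust). The parsing helpers read API responses from stdin so they can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the "track progress" and "verify counts" steps.

# Succeeds when a Tasks API response (GET _tasks/<id>) contains "completed": true.
task_completed() {
  grep -Eq '"completed"[[:space:]]*:[[:space:]]*true'
}

# Prints the number from a _count response such as {"count":123,"_shards":{...}}.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Prints the hours between two UTC timestamps, rounded to one decimal place.
# Assumes GNU date (the -d option).
elapsed_hours() {
  local start end
  start=$(date -u -d "$1 UTC" +%s)
  end=$(date -u -d "$2 UTC" +%s)
  awk -v s="$start" -v e="$end" 'BEGIN { printf "%.1f\n", (e - s) / 3600 }'
}

# Blocks until the reindex task reports completion.
# Needs CLUSTER_URL and TASK_ID set as in the steps above.
wait_for_task() {
  until curl -s "$CLUSTER_URL/_tasks/$TASK_ID" | task_completed; do
    sleep 60  # arbitrary poll interval (assumption)
  done
}

# Succeeds when both indices report the same, non-empty document count.
# Run this only after refresh_interval has been restored on the destination.
counts_match() {
  local src_count dest_count
  src_count=$(curl -s "$CLUSTER_URL/$1/_count" | extract_count)
  dest_count=$(curl -s "$CLUSTER_URL/$2/_count" | extract_count)
  [ -n "$src_count" ] && [ "$src_count" = "$dest_count" ]
}
```

With CLUSTER_URL and TASK_ID exported, a run would be wait_for_task followed, after the refresh_interval change, by counts_match gitlab-production gitlab-production-202004062333. elapsed_hours '2020-04-07 06:55:23' '2020-04-07 09:59:42' reproduces the 3.1 hrs noted in step 13.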

Rollback steps

  1. If you have gotten past deleting the initial gitlab-production index then you'll need to restore from the snapshot
  2. Ensure any updates that only went to the destination index are replayed against the restored source index by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using ProcessBookkeepingService#track
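
For the snapshot restore in rollback step 1, the request could look roughly like the sketch below. It uses the Elasticsearch snapshot restore endpoint (_snapshot/<repository>/<snapshot>/_restore). SNAPSHOT_REPO and SNAPSHOT_NAME are hypothetical placeholders, since this issue does not record the repository or snapshot name, and the helper functions are illustrative only. Because step 24 leaves an alias named gitlab-production, that alias would presumably have to be removed before an index of the same name can be restored.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; SNAPSHOT_REPO and SNAPSHOT_NAME are placeholders
# that do not appear in the change issue.

# Prints a restore request body that restores only the named index.
restore_body() {
  printf '{"indices":"%s","include_global_state":false}\n' "$1"
}

# Prints the restore endpoint for a given repository and snapshot.
restore_url() {
  printf '%s/_snapshot/%s/%s/_restore\n' "$CLUSTER_URL" "$1" "$2"
}

# Example invocation, commented out so nothing runs by accident:
# curl -XPOST -H 'Content-Type: application/json' \
#   -d "$(restore_body gitlab-production)" \
#   "$(restore_url "$SNAPSHOT_REPO" "$SNAPSHOT_NAME")"
```

Building the body and URL separately makes it easy to print and review the exact request before sending it to production.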

Monitoring

Key metrics to observe

Other metrics to observe

Rollback steps

  1. If we got past the step "Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster" then:
    • Switch GitLab settings to point back to Source Cluster in GitLab > Admin > Settings > Integrations
  2. If we got past the step "Re-enable indexing writes":
  • If we stopped before "Re-enable indexing writes", all you need to do to roll back is restart the Elasticsearch sidekiq node: sudo gitlab-ctl start sidekiq-cluster

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out
Edited by Dylan Griffith