Reindex GitLab.com Global Search Elasticsearch cluster with new regex pattern

Production Change - Criticality 3 (C3)

Change Objective Reindex the GitLab.com Global Search Elasticsearch cluster to roll out the new regex indexing pattern from gitlab-org/gitlab!36255
Change Type Configuration Change
Services Impacted GitLab.com, Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq
Change Team Members @DylanGriffith
Change Criticality C3
Change Reviewer or tested in staging #2408 (comment 384849906)
Dry-run output
Due Date 2020-07-27 04:02:36 UTC
Time tracking

Detailed steps for the change

This is to roll out the indexing changes in gitlab-org/gitlab!36255 (merged)

Pre-check

  1. Practice a dry run on production with various reindexing settings until we can get the time down to less than 24 hrs
  2. Run all the steps on staging
  3. Make the cluster larger if necessary. It should be less than 33% full (more than 67% free)
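The storage pre-check above can be scripted. A minimal sketch, using sample `_cat/allocation` output (the node names and percentages below are illustrative; in practice pipe in `curl -s "$CLUSTER_URL/_cat/allocation?h=node,disk.percent"`):

```shell
# Sample output of: curl -s "$CLUSTER_URL/_cat/allocation?h=node,disk.percent"
# (illustrative values, not taken from the real cluster)
sample_allocation='
instance-0000000001 31
instance-0000000002 29
instance-0000000003 32
'

# Find the most-full node; every node must be under the 33% threshold.
max_used=$(printf '%s\n' "$sample_allocation" | awk 'NF {if ($2 > max) max = $2} END {print max+0}')
echo "max disk.percent across nodes: $max_used"
if [ "$max_used" -lt 33 ]; then
  echo "OK: cluster is less than 33% full"
else
  echo "WARNING: cluster is ${max_used}% full; consider scaling up first"
fi
```

Checking the maximum across all nodes (rather than the cluster average) matters because a single hot node can fail the reindex even when the cluster as a whole looks healthy.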

Process

  1. Confirm the cluster storage is less than 33% full (more than 67% free) -- 63% free will be fine, considering we ended up using less storage when we re-indexed staging
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster which will re-index all of our production global search index to another index in the same cluster using the Elasticsearch reindex API. During the reindex we'll be pausing indexing to the cluster which will cause the incremental updates queue to grow but should not cause alerts since we don't have any for that queue. This will increase load on the Elasticsearch cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2408
  3. Pause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
  4. In any console set CLUSTER_URL and confirm that it is the expected cluster with the expected indices:
    1. curl $CLUSTER_URL/_cat/indices
  5. Note the total size of the source gitlab-production-202007070000 index: 5.3 TB
  6. Note the total number of documents in the source gitlab-production-202007070000 index: 661083549
    1. curl $CLUSTER_URL/gitlab-production-202007070000/_count
  7. Create new destination index gitlab-production-202007270000 with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202007270000' })
  8. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"
  9. Trigger re-index from source index gitlab-production-202007070000 to destination index gitlab-production-202007270000
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007070000" }, "dest": { "index": "gitlab-production-202007270000" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  10. Note the returned task ID from the above: JnbbPKxbRpqoWDmCj6fhcw:1169025
  11. Note the time when the task started: 2020-07-27 04:16:28 UTC
  12. -------------------------------------------------------------------
  13. Per #2408 (comment 386406235) this attempt failed, so the reindex is being restarted with the following steps
  14. Unpause indexing
  15. Delete the new gitlab-production-202007270000 index
  16. Wait until queues catch up
  17. -------------------------------------------------------------------
  18. Pause indexing
  19. Note the total size of the source gitlab-production-202007070000 index: 5.3 TB
  20. Note the total number of documents in the source gitlab-production-202007070000 index: 662660754
    1. curl $CLUSTER_URL/gitlab-production-202007070000/_count
  21. Create new destination index gitlab-production-202007270000 with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202007270000' })
  22. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"
  23. Trigger re-index from source index gitlab-production-202007070000 to destination index gitlab-production-202007270000
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007070000" }, "dest": { "index": "gitlab-production-202007270000" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  24. Note the returned task ID from the above: Gyp_pWWaShSBKGhC1OiT8w:4184769
  25. Note the time when the task started: 2020-07-28 06:34:51 UTC
  26. -------------------------------------------------------------------
  27. Per #2408 (comment 387128571) this attempt failed, so the reindex is being restarted with the following steps
  28. Delete the new gitlab-production-202007270000 index
  29. -------------------------------------------------------------------
  30. Note the total size of the source gitlab-production-202007070000 index: 5.3 TB
  31. Note the total number of documents in the source gitlab-production-202007070000 index: 662690754
    1. curl $CLUSTER_URL/gitlab-production-202007070000/_count
  32. Create new destination index gitlab-production-202007270000 with correct settings/mappings from rails console:
    1. Gitlab::Elastic::Helper.new.create_empty_index(with_alias: false, options: { index_name: 'gitlab-production-202007270000' })
  33. Set index settings in destination index to optimize for writes:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"0","refresh_interval":"-1","translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"
  34. Trigger re-index from source index gitlab-production-202007070000 to destination index gitlab-production-202007270000
    1. curl -H 'Content-Type: application/json' -d '{ "source": { "index": "gitlab-production-202007070000" }, "dest": { "index": "gitlab-production-202007270000" } }' -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
  35. Note the returned task ID from the above: Gyp_pWWaShSBKGhC1OiT8w:10843499
  36. Note the time when the task started: 2020-07-29 00:36:50 UTC
  37. Track the progress of reindexing using the Tasks API: curl $CLUSTER_URL/_tasks/$TASK_ID
  38. Note the time when the task finishes:
  39. Note the total time taken to reindex:
  40. Change the refresh_interval setting on Destination Index to 60s
    1. curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"
  41. Verify number of documents in Destination index = number of documents in Source index
    1. Be aware it may take 60s to refresh on the destination index
    2. curl $CLUSTER_URL/gitlab-production-202007070000/_count => 662690754
    3. curl $CLUSTER_URL/gitlab-production-202007270000/_count => 662690754
  42. Force merge the index to speed up replication:
    1. curl -XPOST $CLUSTER_URL/gitlab-production-202007270000/_forcemerge
  43. Expunge the many deleted docs:
    1. curl -XPOST $CLUSTER_URL/gitlab-production-202007070000/_forcemerge?only_expunge_deletes=true
  44. Increase replication on Destination index to 1:
    1. curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"
  45. Increase recovery max bytes to speed up replication:
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": "400mb"}}' -XPUT $CLUSTER_URL/_cluster/settings
  46. Wait for cluster monitoring to show the replication has completed
  47. Set recovery max bytes back to default
    1. curl -H 'Content-Type: application/json' -d '{"persistent":{"indices.recovery.max_bytes_per_sec": null}}' -XPUT $CLUSTER_URL/_cluster/settings
  48. Set translog durability on the destination index back to the default (request):
    1. curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$CLUSTER_URL/gitlab-production-202007270000/_settings"
  49. Note the size of the destination index gitlab-production-202007270000: 6.3 TB => Due to retrying multiple times, the index still contains many deleted docs that will be cleaned up over time. There are currently 71459720 deleted docs, which probably accounts for the large size increase.
  50. Update the alias gitlab-production to point to the new index
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  51. Test that searching for the new pattern works using https://gitlab.com/search?utf8=%E2%9C%93&snippets=false&scope=&repository_ref=master&search=hook_to_event&group_id=9970&project_id=278964 which should now find a match for https://gitlab.com/gitlab-org/gitlab/-/blob/c086bd8f75537a7f73fc386f2053962b389dc990/app/services/web_hook_service.rb#L20
  52. Unpause indexing writes: Admin > Settings > Integrations > Elasticsearch > Pause Elasticsearch indexing
  53. Wait until the backlog of incremental updates gets below 10,000
    1. Chart Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  54. Create a comment somewhere then search for it to ensure indexing still works (can take up to 2 minutes before it shows up in the search results)
  55. Delete the old gitlab-production-202007070000 index
    1. curl -XDELETE $CLUSTER_URL/gitlab-production-202007070000
  56. Test again that searches work as expected
  57. Scale the cluster down again based on the current size => Leaving it scaled up for a few days to gather a baseline and see whether the extra capacity helps performance
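The `_reindex` request body used repeatedly above is easy to get wrong by hand (a stray brace makes it malformed JSON). A minimal sketch that builds the body from the source/destination names used in this change and validates it locally before POSTing; the `python3 -m json.tool` check is an assumption about what is available on the console host:

```shell
# Source and destination indices from this change.
SRC=gitlab-production-202007070000
DST=gitlab-production-202007270000

# Build the _reindex request body.
body=$(printf '{"source":{"index":"%s"},"dest":{"index":"%s"}}' "$SRC" "$DST")

# Validate locally before sending; json.tool exits non-zero on malformed JSON.
printf '%s' "$body" | python3 -m json.tool > /dev/null && echo "body OK"

# Then trigger the reindex (not run here):
#   curl -H 'Content-Type: application/json' -d "$body" \
#     -X POST "$CLUSTER_URL/_reindex?slices=180&wait_for_completion=false&scroll=1h"
```

Validating the payload before each retry would have caught the malformed-body variant of this command at step time rather than at the cluster.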

Monitoring

Key metrics to observe

Other metrics to observe

Rollback steps

  1. If the ongoing reindex is consuming too many resources, it is possible to throttle the running reindex:
    1. You can check the index write throughput in ES monitoring to determine a sensible throttle. Since the default is no throttling at all, it's safe to set some throttle and observe the impact
    2. curl -XPOST "$CLUSTER_URL/_reindex/$TASK_ID/_rethrottle?requests_per_second=500"
  2. If you get past the step of updating the alias, then simply switch the alias to point back to the original index:
    1. curl -XPOST -H 'Content-Type: application/json' -d '{"actions":[{"add":{"index":"gitlab-production-202007070000","alias":"gitlab-production"}}, {"remove":{"index":"gitlab-production-202007270000","alias":"gitlab-production"}}]}' $CLUSTER_URL/_aliases
    2. Confirm it works: curl $CLUSTER_URL/gitlab-production/_count
  3. Ensure any updates that only went to the destination index are replayed against the source index by searching the logs for the updates https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23 and triggering those updates again using ProcessBookkeepingService#track
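The rollback alias swap is the same shape as the forward swap in the Process section, with the indices reversed; both actions go in a single `_aliases` call so the switch is atomic. A minimal sketch that builds and locally validates the rollback payload (the `python3 -m json.tool` check is an assumption about the console host):

```shell
# Indices from this change: rollback points the alias back at the old index.
OLD=gitlab-production-202007070000
NEW=gitlab-production-202007270000

# Single atomic _aliases call: add the old index, remove the new one.
payload=$(printf '{"actions":[{"add":{"index":"%s","alias":"gitlab-production"}},{"remove":{"index":"%s","alias":"gitlab-production"}}]}' "$OLD" "$NEW")

# Validate locally before sending.
printf '%s' "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

# Apply with (not run here):
#   curl -XPOST -H 'Content-Type: application/json' -d "$payload" "$CLUSTER_URL/_aliases"
```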

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out