
Reindex GitLab.com global search cluster

Production Change - Criticality 3 (C3)

Change Objective: Reindex the GitLab.com global search index from the existing production Elasticsearch cluster into a new cluster using the Elasticsearch reindex API
Change Type: ConfigurationChange
Services Impacted: GitLab.com, Elasticsearch cluster
Change Team Members: @DylanGriffith @aamarsanaa
Change Criticality: C3
Change Reviewer or tested in staging: #1902 (comment 318008031)
Dry-run output:
Due Date: 2020-04-06 XX:XX:00 UTC
Time tracking:

Detailed steps for the change

Pre-check

  1. Run all the steps on staging

Process

  1. Create a new Elasticsearch cluster referred to as Destination Cluster. We will refer to the existing cluster as Source Cluster throughout the rest of the steps.
  2. Let the SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster which will re-index all of our production global search index into another Elasticsearch cluster using the Elasticsearch reindex API. This will increase search load on the production cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1902
  3. Pause indexing writes (stop Elasticsearch sidekiq node): sudo gitlab-ctl stop sidekiq-cluster
  4. Note the size of the source cluster gitlab-production index:
    • 577.6 GB
  5. In any console set SOURCE_CLUSTER_URL and DESTINATION_CLUSTER_URL and validate they are both returning the expected data:
    1. curl $SOURCE_CLUSTER_URL/_cat/indices
    2. curl $DESTINATION_CLUSTER_URL/_cat/indices
  6. Set translog durability to async on the destination cluster to speed up writes:
    • curl -XPUT -d '{"index":{"translog":{"durability":"async"}}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  7. Trigger the re-index from the source cluster to the destination cluster: gitlab-rake gitlab:elastic:reindex_to_another_cluster[$SOURCE_CLUSTER_URL,$DESTINATION_CLUSTER_URL] (these are the full URLs, including the basic auth credentials as entered via the GitLab admin settings)
  8. Note the returned task ID from the above: g2hyA2s9RTixfzpYwtWgcQ:2444
  9. Note the time when the task started: 2020-04-06 03:54 UTC
  10. Track the progress of reindexing using the Tasks API: curl $DESTINATION_CLUSTER_URL/_tasks/$TASK_ID (a polling sketch is included after this list)
  11. Note the time when the task finishes: 2020-04-07 XX:XX UTC
  12. Note the total time taken to reindex: XX hrs
  13. Change the refresh_interval setting on Destination Cluster to 60s
    • curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  14. Verify that the number of documents in the Destination Cluster gitlab-production index equals the number of documents in the Source Cluster gitlab-production index (a count-comparison sketch follows this list)
    • Be aware it may take up to 60s for the index to refresh on the destination cluster
    • curl $SOURCE_CLUSTER_URL/gitlab-production/_count => XX
    • curl $DESTINATION_CLUSTER_URL/gitlab-production/_count => XX
  15. Force merge the destination index: curl -XPOST $DESTINATION_CLUSTER_URL/gitlab-production/_forcemerge
  16. Increase replication on Destination Cluster to 1:
    • curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  17. Wait for cluster monitoring to show the replication has completed
  18. Set translog durability back to request on the destination cluster to restore full write durability:
    • curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  19. Note the size of the destination cluster gitlab-production index: XX GB
  20. Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster
  21. Re-enable indexing writes (start Elasticsearch sidekiq node): sudo gitlab-ctl start sidekiq-cluster
  22. Wait until the backlog of incremental updates gets below 1000
  23. Wait for the elastic_indexer Sidekiq queue to drop below 10k (a sketch for checking both backlogs follows this list)
  24. Enable search with Elasticsearch in GitLab > Admin > Settings > Integrations
  25. Create a comment somewhere then search for it to ensure indexing still works (can take a minute to catch up)
    1. Confirm it has caught up by checking the Global search incremental indexing queue depth metric, or the source of truth via the rails console: Elastic::ProcessBookkeepingService.queue_size
  26. Test that phrase searching in issues works: https://gitlab.com/search?project_id=278964&repository_ref=master&scope=issues&search=%22slack+integration%22&snippets=
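
Optional helper sketches for the steps above. For step 10, a minimal polling loop for the reindex task; this assumes jq is installed and that DESTINATION_CLUSTER_URL and TASK_ID are exported, and reads the status fields the Elasticsearch Tasks API returns for reindex tasks:

```shell
# Poll the reindex task on the destination cluster until it reports completed.
while true; do
  response=$(curl -s "$DESTINATION_CLUSTER_URL/_tasks/$TASK_ID")
  completed=$(echo "$response" | jq -r '.completed')
  created=$(echo "$response" | jq -r '.task.status.created')
  total=$(echo "$response" | jq -r '.task.status.total')
  echo "$(date -u '+%Y-%m-%d %H:%M UTC') completed=$completed created=$created/$total"
  [ "$completed" = "true" ] && break
  sleep 300  # re-check every 5 minutes; reindexing ~600 GB takes hours
done
```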
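
For step 14, a sketch that compares the document counts directly (again assuming jq is available); keep in mind the destination count can lag by up to the 60s refresh interval set in step 13:

```shell
# Compare document counts of the gitlab-production index on both clusters.
src_count=$(curl -s "$SOURCE_CLUSTER_URL/gitlab-production/_count" | jq -r '.count')
dst_count=$(curl -s "$DESTINATION_CLUSTER_URL/gitlab-production/_count" | jq -r '.count')
echo "source=$src_count destination=$dst_count"
if [ "$src_count" = "$dst_count" ]; then
  echo "Counts match"
else
  echo "Counts differ - wait for the refresh interval to elapse and re-check"
fi
```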
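
For steps 22-23 and the catch-up check in step 25, a sketch for reading both backlogs from a Rails node. Elastic::ProcessBookkeepingService.queue_size is the source-of-truth counter already referenced in step 25, and Sidekiq::Queue#size is the standard Sidekiq API for queue length; this assumes the commands are run on a node where gitlab-rails is available:

```shell
# Incremental indexing backlog (wait for this to drop below 1000)
sudo gitlab-rails runner 'puts Elastic::ProcessBookkeepingService.queue_size'

# elastic_indexer Sidekiq queue depth (wait for this to drop below 10k)
sudo gitlab-rails runner "puts Sidekiq::Queue.new('elastic_indexer').size"
```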

Rollback steps

  1. Switch GitLab settings to point back to Source Cluster
  2. Ensure any updates that only reached the Destination Cluster are replayed against the Source Cluster: search the logs for those updates (see https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and trigger them again using ProcessBookkeepingService#track

Monitoring

Key metrics to observe

Other metrics to observe

Rollback steps

  1. If we got past the step "Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster":
    • Switch GitLab settings to point back to Source Cluster in GitLab > Admin > Settings > Integrations
  2. If we stopped before "Re-enable indexing writes (start Elasticsearch sidekiq node) sudo gitlab-ctl start sidekiq-cluster", then all that is needed to roll back is:
    • Re-enable indexing writes (start Elasticsearch sidekiq node): sudo gitlab-ctl start sidekiq-cluster
  3. If we got past "Re-enable indexing writes", additionally replay any updates that only reached the Destination Cluster against the Source Cluster, as described in the rollback steps above

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out