
Reindex GitLab.com global search cluster

Production Change - Criticality 3 (C3)

Change Objective Reindex the GitLab.com global search index from the existing production Elasticsearch cluster to a new cluster using the Elasticsearch reindex API
Change Type ConfigurationChange
Services Impacted GitLab.com, Redis SharedState, Redis Sidekiq, Elasticsearch cluster, Sidekiq
Change Team Members @DylanGriffith @aamarsanaa
Change Criticality C3
Change Reviewer or tested in staging Tested in staging at gitlab-com/runbooks!2017 (comment 311596689)
Dry-run output
Due Date 2020-04-01 05:20:26 UTC

Detailed steps for the change

Pre-check

  1. Run all the steps on staging
    • gitlab-com/runbooks!2017 (comment 311596689)
  2. Run a dry run in production in which all we do is reindex the data to the Destination Cluster
    • #1862 (comment 313510680)
    • #1862 (comment 314402159)

Process

  1. Create a new Elasticsearch cluster referred to as Destination Cluster. We will refer to the existing cluster as Source Cluster throughout the rest of the steps.
    • prod-gitlab-com indexing-20200330
    • CLUSTER_URL=https://80085d1a595e43ea9a53dd22adb4f406.us-central1.gcp.cloud.es.io:9243
    • USER=elastic
    • PASSWORD=<redacted>
  2. Let SRE on call know that we are triggering the re-index in #production: @sre-oncall please note we are doing a reindex of our production Elasticsearch cluster which will re-index all of our production global search cluster to another Elasticsearch cluster using the Elasticsearch reindex API. This will increase search load on the production cluster but should not impact any other systems. https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1862
  3. Disable search with Elasticsearch in GitLab > Admin > Settings > Integrations
  4. Pause indexing writes (stop Elasticsearch sidekiq node): sudo gitlab-ctl stop sidekiq-cluster
  5. Note the size of the source cluster gitlab-production index: 556.1 GB
  6. In any console, set SOURCE_CLUSTER_URL and DESTINATION_CLUSTER_URL and validate that they both return the expected data (see the URL setup sketch after this list):
    1. curl $SOURCE_CLUSTER_URL/_cat/indices
    2. curl $DESTINATION_CLUSTER_URL/_cat/indices
  7. Trigger the re-index from the Source Cluster to the Destination Cluster (a sketch of the underlying reindex API call follows this list):
    • From production rails console monkey patch Gitlab::Elastic::Helper from gitlab-org/gitlab!28488 (diffs)
    • From rails console:
      • source_cluster_url = ... (as above)
      • destination_cluster_url = ... (as above)
      • task_id = Gitlab::Elastic::Helper.reindex_to_another_cluster(source_cluster_url, destination_cluster_url, 6000) (first attempt, batch size 6000)
      • task_id = Gitlab::Elastic::Helper.reindex_to_another_cluster(source_cluster_url, destination_cluster_url, 1000) (retry with batch size 1000 after the first attempt failed; see step 8)
  8. Note the returned task ID from the above:
    • CpFaAeAsQL6SOWYjJs1kLg:1319927 => failed with "Remote responded with a chunk that was too large. Use a smaller batch size."
    • Triggered again with batch size of 1000 => AiFJYPk1Rn6eVaHTxVMTag:1198080
  9. Note the time when the task started:
    • 2020-04-01 05:20:26 UTC => failed with "Remote responded with a chunk that was too large. Use a smaller batch size."
    • Triggered again with batch size of 1000 at 2020-04-01 15:10:46 UTC
  10. Track the progress of reindexing using the Tasks API: curl $DESTINATION_CLUSTER_URL/_tasks/$TASK_ID (see the polling sketch after this list)
  11. Note the time when the task finishes: 2020-04-02 06:05 UTC
  12. Note the total time taken to reindex: 14.9 hours
  13. Change the refresh_interval setting on the Destination Cluster to 60s
    • curl -XPUT -d '{"index":{"refresh_interval":"60s"}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  14. Verify number of documents in Destination Cluster gitlab-production index = number of documents in Source Cluster gitlab-production index
    • Be aware it may take 60s to refresh on the destination cluster
    • curl $SOURCE_CLUSTER_URL/gitlab-production/_count => 23496604
    • curl $DESTINATION_CLUSTER_URL/gitlab-production/_count => 23496604
  15. Force-merge the Destination Cluster index: curl -XPOST $DESTINATION_CLUSTER_URL/gitlab-production/_forcemerge
  16. Increase replication on Destination Cluster to 1:
    • curl -XPUT -d '{"index":{"number_of_replicas":"1"}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  17. Wait for cluster monitoring to show the replication has completed (a cluster health check sketch follows this list)
  18. Note the size of the destination cluster gitlab-production index: 272.1 GB
  19. Update the translog durability setting back to the default value (request)
    • curl -XPUT -d '{"index":{"translog":{"durability":"request"}}}' -H 'Content-Type: application/json' "$DESTINATION_CLUSTER_URL/gitlab-production/_settings"
  20. Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster
  21. Re-enable indexing writes (start Elasticsearch sidekiq node): sudo gitlab-ctl start sidekiq-cluster
  22. Wait until the backlog of incremental updates gets below 1000
    • Chart Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
  23. Wait for the elastic_indexer queue to drop below 10k
  24. Enable search with Elasticsearch in GitLab > Admin > Settings > Integrations
  25. Create a comment somewhere, then search for it to ensure indexing still works (it can take a minute to catch up)
    1. Confirm it's caught up by checking Global search incremental indexing queue depth or the source of truth via rails console: Elastic::ProcessBookkeepingService.queue_size
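
For step 6, a minimal sketch of how the cluster URLs could be exported and validated from a shell, assuming the elastic user credentials from step 1 are embedded in each URL as basic auth (the source hostname and the password below are placeholders):

  # Placeholders: substitute the real source hostname and the redacted password.
  export SOURCE_CLUSTER_URL="https://elastic:<redacted>@<source-cluster-host>:9243"
  export DESTINATION_CLUSTER_URL="https://elastic:<redacted>@80085d1a595e43ea9a53dd22adb4f406.us-central1.gcp.cloud.es.io:9243"

  # Both clusters should list the expected indices (including gitlab-production).
  curl "$SOURCE_CLUSTER_URL/_cat/indices?v"
  curl "$DESTINATION_CLUSTER_URL/_cat/indices?v"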
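
For step 7, per the note in step 2 the helper from gitlab-org/gitlab!28488 uses the Elasticsearch reindex API. A rough sketch of the equivalent raw reindex-from-remote call, assuming the Destination Cluster whitelists the source host via reindex.remote.whitelist; the host and credentials shown are placeholders, and the batch size matches the 1000 used in the retry:

  # Submitted to the *destination* cluster, which pulls documents from the remote source.
  # wait_for_completion=false makes Elasticsearch return a task ID to poll via the Tasks API.
  curl -XPOST -H 'Content-Type: application/json' \
    "$DESTINATION_CLUSTER_URL/_reindex?wait_for_completion=false" -d '{
      "source": {
        "remote": {
          "host": "https://<source-cluster-host>:9243",
          "username": "elastic",
          "password": "<redacted>"
        },
        "index": "gitlab-production",
        "size": 1000
      },
      "dest": { "index": "gitlab-production" }
    }'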
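
For steps 8-10, a sketch of polling the Tasks API until the reindex task completes, assuming jq is available wherever the check runs; TASK_ID is the value recorded in step 8:

  TASK_ID="AiFJYPk1Rn6eVaHTxVMTag:1198080"   # task ID returned by the reindex call

  # Poll once a minute until the Tasks API reports the task as completed.
  # created + updated versus total gives a rough progress indication.
  until curl -s "$DESTINATION_CLUSTER_URL/_tasks/$TASK_ID" | jq -e '.completed' > /dev/null; do
    curl -s "$DESTINATION_CLUSTER_URL/_tasks/$TASK_ID" | jq '.task.status | {total, created, updated, batches}'
    sleep 60
  done
  echo "Reindex task $TASK_ID reported as completed"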
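
For step 17, besides the monitoring dashboards, the cluster health API on the Destination Cluster can confirm that the new replicas are allocated; a minimal check, assuming we simply want the cluster to reach green status:

  # Waits up to 30 minutes for all primary and replica shards to be allocated (green),
  # then prints the health summary.
  curl "$DESTINATION_CLUSTER_URL/_cluster/health?wait_for_status=green&timeout=30m&pretty"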

Monitoring

Key metrics to observe

  • GitLab admin panel:
    • queue for ElasticIndexerWorker
    • queue for ElasticCommitIndexerWorker
  • Grafana:
    • Platform triage
    • Sidekiq:
      • Sidekiq SLO dashboard overview
        • Sidekiq Queue Lengths per Queue
          • Expected to climb during initial indexing, with a sudden drop-off once the initial indexing jobs are finished.
        • Sidekiq Inflight Operations by Queue
        • Node Maximum Single Core Utilization per Priority
          • Expected to be 100% during initial indexing
    • Redis-sidekiq:
      • Redis-sidekiq SLO dashboard overview
        • Memory Saturation
          • If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
          • If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
  • Incremental updates queue:
    • Chart Global search incremental indexing queue depth https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1
    • From rails console Elastic::ProcessBookkeepingService.queue_size

Other metrics to observe

  • Elastic support diagnostics: in the event of any issues (e.g. seeing too many threads, as in https://support.elastic.co/customers/s/case/5004M00000cL8IJ/cluster-crashing-under-high-load-possibly-due-to-jvm-heap-size-too-large-again last time), we can grab the support diagnostics per the instructions at https://support.elastic.co/customers/s/article/support-diagnostics
  • Grafana:
    • Rails:
      • Search controller performance
    • Postgres:
      • patroni SLO dashboard overview
      • postgresql overview
      • pgbouncer SLO dashboard overview
      • pgbouncer overview
      • "Waiting Sidekiq pgbouncer Connections"
        • If we see this increase to, say, 500 and stay there, we should be concerned and should disable indexing at that point
    • Gitaly:
      • Gitaly SLO dashboard overview
      • Gitaly latency
      • Gitaly saturation overview
      • Gitaly single node saturation
        • If any nodes on this graph are maxed out for a long period of time, correlated with enabling this, we should disable it. We should first confirm that shutting down ElasticCommitIndexerWorker will help, and then stop it if the saturation is clearly correlated.

Rollback steps

  1. If we got past the step "Change settings in GitLab > Admin > Settings > Integrations to point to Destination Cluster" then:
    • Switch GitLab settings to point back to Source Cluster in GitLab > Admin > Settings > Integrations
  2. If we got past the step "Re-enable indexing writes":
    • Ensure any updates that only went to the Destination Cluster are replayed against the Source Cluster by searching the logs for the updates (see https://gitlab.com/gitlab-org/gitlab/-/blob/e8e2c02a6dbd486fa4214cb8183d428102dc1156/ee/app/services/elastic/process_bookkeeping_service.rb#L23) and triggering those updates again using ProcessBookkeepingService#track
  3. If we stopped before "Re-enable indexing writes (start Elasticsearch sidekiq node): sudo gitlab-ctl start sidekiq-cluster", then all we need to do to roll back is:
    • Re-enable indexing writes (start Elasticsearch sidekiq node): sudo gitlab-ctl start sidekiq-cluster

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out