Tune Zero downtime reindexing for large instances

What does this MR do and why?

Zero downtime reindexing failed for the main index in change request gitlab-com/gl-infra/production#18080 (closed).

This MR changes a few settings to tune the feature for larger instances. These changes let us dogfood the feature in production and make it more stable for large self-managed (SM) instances. Settings changed:

  1. Increase the maximum number of times a task can fail from 10 to 20
    • Reason: one task out of 600 caused the entire reindexing to fail near the end of the run
  2. Set the scroll context expiration to 2h (the default is 5 min)
    • Reason: the failures reported "scroll context expired", and raising this setting is the recommended response to that error
  3. Run the ElasticClusterReindexingCronWorker every 5 minutes instead of every 10 minutes
    • Reason: running the cron worker more often speeds up the process. The worker changes GitLab settings (pausing indexing), kicks off tasks on the Elasticsearch cluster, and asks the cluster for task status. Most of the work happens in the Elasticsearch cluster, so we feel this change is safe.
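To illustrate the first setting, here is a minimal sketch of how a per-task failure cap decides between retrying and failing the whole reindexing. The names MAX_TASK_FAILURES, SliceTask, and next_step are hypothetical and are not taken from the GitLab codebase, and the exact comparison used by the real service is an assumption.

```ruby
# Hypothetical constant mirroring the new limit in this MR (previously 10).
MAX_TASK_FAILURES = 20

# Hypothetical stand-in for one Elasticsearch reindexing task and its failure count.
SliceTask = Struct.new(:id, :failure_count)

# Returns :retry while the task is under the cap and :abort once it reaches it.
# Assumption: the cap is inclusive (>=); the real service may compare differently.
def next_step(task, max_failures: MAX_TASK_FAILURES)
  task.failure_count >= max_failures ? :abort : :retry
end

flaky = SliceTask.new(1, 12)
puts next_step(flaky)                   # under the new cap of 20
puts next_step(flaky, max_failures: 10) # same task under the old cap of 10
```

With a cap of 10, a single task that has failed 12 times ends the whole run; with a cap of 20 the same task is still retried.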

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A: these are all backend changes.

How to set up and validate locally

  1. Set up GDK for Elasticsearch, enable advanced search, and run gitlab:elastic:reindex to create and populate all indices
  2. Set the ELASTIC_CLIENT_DEBUG environment variable to true
    export ELASTIC_CLIENT_DEBUG=1
  3. Start the Rails console: rails c
  4. Create a new reindexing task
    Elastic::ReindexingTask.create!(targets: %w[Repository], max_slices_running: 30, slice_multiplier: 3)
  5. Ensure the task is in the initial state
    Elastic::ReindexingTask.current
    => #<Elastic::ReindexingTask:0x0000000170822460
     id: 15,
     created_at: Mon, 17 Jun 2024 14:40:13.683304000 UTC +00:00,
     updated_at: Mon, 17 Jun 2024 14:40:13.683304000 UTC +00:00,
     state: "initial",
     in_progress: true,
     error_message: nil,
     delete_original_index_at: nil,
     max_slices_running: 30,
     slice_multiplier: 3,
     targets: ["Repository"],
     options: {}>
  6. Move the process along so that the Elasticsearch tasks get created. The task should move to indexing_paused
    service = Elastic::ClusterReindexingService.new
    service.execute
  7. Move the process along again with service.execute. Note that the tasks should be created with scroll=2h as a URL parameter
    TRANSACTION (1.1ms)  COMMIT /*application:console,db_config_name:main,console_hostname:terrichus-MBP.localdomain,console_username:terrichu,line:/lib/gitlab/database.rb:392:in `commit'*/
    2024-06-17 10:42:47 -0400: POST http://localhost:9200/_reindex?scroll=2h&wait_for_completion=false [status:200, request:0.009s, query:n/a]
    2024-06-17 10:42:47 -0400: > {"source":{"index":"gitlab-development-20240606-2327","slice":{"id":7,"max":15}},"dest":{"index":"gitlab-development-20240617-1442-reindex-15-0"}}
    2024-06-17 10:42:47 -0400: < {"task":"_kTIXOaWQbGaNYF5opwUqA:2298846"}
  8. Run service.execute until it returns true. When Elastic::ReindexingTask.current returns nil, the reindexing is done
    > Elastic::ReindexingTask.current
    Elastic::ReindexingTask Load (0.6ms)  SELECT "elastic_reindexing_tasks".* FROM "elastic_reindexing_tasks" WHERE "elastic_reindexing_tasks"."in_progress" = TRUE ORDER BY "elastic_reindexing_tasks"."id" DESC LIMIT 1 /*application:console,db_config_name:main,console_hostname:terrichus-MBP.localdomain,console_username:terrichu,line:/ee/app/models/elastic/reindexing_task.rb:29:in `current'*/
    => nil
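When checking the debug output from step 7, it can help to extract the scroll parameter from a logged request line rather than reading it by eye. The following is a small sketch using only the Ruby standard library; scroll_param is a hypothetical helper written for this validation and is not part of GitLab, and it assumes the log line format shown in step 7.

```ruby
require 'uri'

# Hypothetical helper: returns the value of the scroll URL parameter found in
# an ELASTIC_CLIENT_DEBUG log line, or nil if the line has no URL or no such parameter.
def scroll_param(log_line)
  url = log_line[%r{https?://\S+}] # first URL in the line, if any
  return nil unless url

  query = URI.parse(url).query
  return nil unless query

  URI.decode_www_form(query).to_h['scroll']
end

line = '2024-06-17 10:42:47 -0400: POST http://localhost:9200/_reindex?scroll=2h&wait_for_completion=false [status:200, request:0.009s, query:n/a]'
puts scroll_param(line)
```

For the request line logged in step 7, the helper returns "2h", which is the value this MR sets.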
Edited by Terri Chu