Tune Zero downtime reindexing for large instances (!156519)

Terri Chu requested to merge tchu-updates-for-zero-downtime-reindexing-on-large-instances into master Jun 17, 2024

What does this MR do and why?

Zero downtime reindexing failed for the main index in change request: gitlab-com/gl-infra/production#18080 (closed)

This MR changes a few settings to tune the feature for larger instances. These changes will let us dogfood the feature in production and make it more stable for large self-managed (SM) instances. Settings changed:

Increase the maximum number of times a task can fail from 10 to 20
- Reason: one task out of 600 caused the entire reindexing to fail near the end of the run
Set the scroll context expiration to 2h (the default is 5 min)
- Reason: the failures reported a "scroll context expired" error, and changing this setting is the recommended response to that error
Change how often the ElasticClusterReindexingCronWorker runs from every 10 minutes to every 5 minutes
- Reason: we would like to run the cron worker more often to speed up the process. The worker is responsible for changing GitLab settings (pausing indexing), kicking off tasks on the Elasticsearch cluster, and asking the cluster for task status. Most of the work happens in the Elasticsearch cluster, so we feel this change is safe.
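
To illustrate why the failure cap matters, here is a minimal sketch of a per-task failure budget. It is not the code changed in this MR: `MAX_FAILURES`, `Slice`, and `next_action` are hypothetical names, and the assumption is that a task is retried until it exhausts its budget, after which the whole reindexing is aborted.

```ruby
# Hypothetical constant: the MR raises the cap from 10 to 20.
MAX_FAILURES = 20

# Hypothetical stand-in for one Elasticsearch reindex task.
Slice = Struct.new(:id, :failures) do
  def record_failure!
    self.failures += 1
  end
end

# Assumed policy: retry while the task is within its failure budget,
# abort the whole reindexing once the budget is used up.
def next_action(slice, limit: MAX_FAILURES)
  slice.failures < limit ? :retry : :abort
end

slice = Slice.new(7, 0)
12.times { slice.record_failure! }

puts next_action(slice, limit: 10) # old cap: this one task aborts the run
puts next_action(slice)            # new cap: the task is retried
```

With 600 tasks, a single flaky task can sink the run under the lower cap, which matches the failure described above.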

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

N/A, these are all backend changes

How to set up and validate locally

Set up GDK for Elasticsearch, enable advanced search, and run gitlab:elastic:reindex to create and populate all indices
Set the ELASTIC_CLIENT_DEBUG environment variable to a true value
```
export ELASTIC_CLIENT_DEBUG=1
```
Start a Rails console: rails c

Create a new reindexing task

Elastic::ReindexingTask.create!(targets: %w[Repository], max_slices_running: 30, slice_multiplier: 3)

Ensure the task is in the initial state

Elastic::ReindexingTask.current
=> #<Elastic::ReindexingTask:0x0000000170822460
 id: 15,
 created_at: Mon, 17 Jun 2024 14:40:13.683304000 UTC +00:00,
 updated_at: Mon, 17 Jun 2024 14:40:13.683304000 UTC +00:00,
 state: "initial",
 in_progress: true,
 error_message: nil,
 delete_original_index_at: nil,
 max_slices_running: 30,
 slice_multiplier: 3,
 targets: ["Repository"],
 options: {}>
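
The two slicing options above can be read roughly as follows. This is a sketch under assumptions, not the service's actual code: it assumes the total number of slices is the source index's primary shard count times slice_multiplier, and that max_slices_running only caps how many slices run at once. The shard count of 5 is an assumption chosen to match the `"max":15` seen in the debug log later in these steps; `total_slices` and `initially_running` are hypothetical helpers.

```ruby
# Assumption: total slices = primary shards * slice_multiplier.
def total_slices(primary_shards:, slice_multiplier:)
  primary_shards * slice_multiplier
end

# Assumption: max_slices_running caps concurrency, so at most this many
# slices are started at once and the rest wait for a free slot.
def initially_running(total, max_slices_running:)
  [total, max_slices_running].min
end

total = total_slices(primary_shards: 5, slice_multiplier: 3)
puts total
puts initially_running(total, max_slices_running: 30)
```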

Move the process along so that the Elasticsearch tasks get created. The task should move to indexing_paused
```
service = Elastic::ClusterReindexingService.new
service.execute
```

Move the process along again with the service.execute command. Note that the tasks should be created with scroll=2h as a URL parameter

  TRANSACTION (1.1ms)  COMMIT /*application:console,db_config_name:main,console_hostname:terrichus-MBP.localdomain,console_username:terrichu,line:/lib/gitlab/database.rb:392:in `commit'*/
2024-06-17 10:42:47 -0400: POST http://localhost:9200/_reindex?scroll=2h&wait_for_completion=false [status:200, request:0.009s, query:n/a]
2024-06-17 10:42:47 -0400: > {"source":{"index":"gitlab-development-20240606-2327","slice":{"id":7,"max":15}},"dest":{"index":"gitlab-development-20240617-1442-reindex-15-0"}}
2024-06-17 10:42:47 -0400: < {"task":"_kTIXOaWQbGaNYF5opwUqA:2298846"}
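
For reference, the request shown in the log above can be reconstructed with the Ruby standard library alone. This is only a sketch of the request shape, not how the GitLab client builds it: `reindex_request` is a hypothetical helper, and the index names and slice values are copied from the log.

```ruby
require "json"
require "uri"

# Hypothetical helper: builds the URL and JSON body of one sliced
# _reindex request, with the scroll keep-alive passed as a URL parameter.
def reindex_request(base_url:, source:, dest:, slice_id:, max_slices:, scroll: "2h")
  uri = URI.join(base_url, "/_reindex")
  uri.query = URI.encode_www_form(scroll: scroll, wait_for_completion: false)
  body = {
    source: { index: source, slice: { id: slice_id, max: max_slices } },
    dest: { index: dest }
  }
  [uri.to_s, JSON.generate(body)]
end

url, body = reindex_request(
  base_url: "http://localhost:9200",
  source: "gitlab-development-20240606-2327",
  dest: "gitlab-development-20240617-1442-reindex-15-0",
  slice_id: 7,
  max_slices: 15
)
puts "POST #{url}"
puts body
```

When validating, the part to check in the real log is the `scroll=2h` query parameter; the body is unchanged by this MR.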

Run service.execute until it returns true. When Elastic::ReindexingTask.current returns nil, the reindexing is done

  > Elastic::ReindexingTask.current
  Elastic::ReindexingTask Load (0.6ms)  SELECT "elastic_reindexing_tasks".* FROM "elastic_reindexing_tasks" WHERE "elastic_reindexing_tasks"."in_progress" = TRUE ORDER BY "elastic_reindexing_tasks"."id" DESC LIMIT 1 /*application:console,db_config_name:main,console_hostname:terrichus-MBP.localdomain,console_username:terrichu,line:/ee/app/models/elastic/reindexing_task.rb:29:in `current'*/
=> nil
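
The "keep calling service.execute until it returns true" step can be sketched as a small loop. `FakeReindexingService` is a hypothetical stand-in so the example runs outside a Rails console; the `reindexing` and `success` state names are assumptions, since only `initial` and `indexing_paused` appear in the steps above.

```ruby
# Hypothetical stand-in for Elastic::ClusterReindexingService. Each call to
# execute advances one state and returns true once the final state is reached.
class FakeReindexingService
  # "reindexing" and "success" are assumed state names.
  STATES = %w[initial indexing_paused reindexing success].freeze

  def initialize
    @index = 0
  end

  def state
    STATES[@index]
  end

  def execute
    @index += 1 if @index < STATES.length - 1
    state == "success"
  end
end

# Calls execute until it returns true and reports how many calls were made.
# max_calls guards against looping forever if the task never completes.
def run_until_done(service, max_calls: 100)
  calls = 0
  loop do
    calls += 1
    return calls if service.execute
    raise "gave up after #{max_calls} calls" if calls >= max_calls
  end
end

puts run_until_done(FakeReindexingService.new)
```

In a real console, pause between calls (for example with `sleep`), because each call only checks on work that the Elasticsearch cluster performs asynchronously.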

Edited Jun 17, 2024 by Terri Chu
