Tune Zero downtime reindexing for large instances
What does this MR do and why?
Zero downtime reindexing failed for the main index in change request: gitlab-com/gl-infra/production#18080 (closed)
This MR changes a few settings to tune the feature for larger instances. These changes will allow us to dogfood the feature in production and make it more stable for large self-managed (SM) instances. Settings changed:
- Increase max times a task can fail from 10 to 20
- Reason: one task out of 600 caused the entire reindexing to fail (near the end of completion)
- Set scroll context expiration to 2h (the default is 5 min)
- Reason: The failed tasks reported a "scroll context expired" error, and raising this setting is the recommended fix for that error
- Change how often `ElasticClusterReindexingCronWorker` runs from every 10 minutes to every 5 minutes
- Reason: We would like to run the cron worker more often to speed up the process. The worker is responsible for changing GitLab settings (pausing indexing), kicking off tasks in the Elasticsearch cluster, and asking the cluster for task status. The main portion of the work happens in the Elasticsearch cluster, so we feel this change is safe.
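To make the scroll setting concrete: the `scroll=2h` value travels as a URL parameter on the `_reindex` request, as the debug log in the validation steps shows. The following is a minimal sketch of how such a request could be assembled using only the Ruby standard library. The method name `build_reindex_request` and its keyword arguments are hypothetical and chosen for illustration; this is not the code path GitLab uses. The index names and slice values are copied from the debug log.

```ruby
require 'json'
require 'uri'

# Hypothetical helper (not GitLab's implementation): builds the path and JSON
# body of an Elasticsearch _reindex call for a single slice.
def build_reindex_request(source_index:, dest_index:, slice_id:, max_slices:, scroll: '2h')
  # scroll controls how long Elasticsearch keeps the scroll context alive;
  # wait_for_completion=false makes the call return a task ID immediately.
  query = URI.encode_www_form(scroll: scroll, wait_for_completion: false)
  body = {
    source: { index: source_index, slice: { id: slice_id, max: max_slices } },
    dest: { index: dest_index }
  }
  { path: "/_reindex?#{query}", body: JSON.generate(body) }
end

request = build_reindex_request(
  source_index: 'gitlab-development-20240606-2327',
  dest_index: 'gitlab-development-20240617-1442-reindex-15-0',
  slice_id: 7,
  max_slices: 15
)
puts request[:path]
puts request[:body]
```

Because `wait_for_completion=false` returns right away, the scroll context has to outlive the whole slice, which is presumably why a 5 minute expiration was too short for a large index.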
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
N/A: these are all backend changes.
How to set up and validate locally
- Set up GDK for Elasticsearch, enable advanced search, and run `gitlab:elastic:reindex` to create and populate all indices
- Set the `ELASTIC_CLIENT_DEBUG` environment variable to true:
export ELASTIC_CLIENT_DEBUG=1
- Start a Rails console:
rails c
- Create a new reindexing task:
Elastic::ReindexingTask.create!(targets: %w[Repository], max_slices_running: 30, slice_multiplier: 3)
- Ensure the task is in the `initial` state:
> Elastic::ReindexingTask.current
=> #<Elastic::ReindexingTask:0x0000000170822460 id: 15, created_at: Mon, 17 Jun 2024 14:40:13.683304000 UTC +00:00, updated_at: Mon, 17 Jun 2024 14:40:13.683304000 UTC +00:00, state: "initial", in_progress: true, error_message: nil, delete_original_index_at: nil, max_slices_running: 30, slice_multiplier: 3, targets: ["Repository"], options: {}>
- Move the process along so that the Elasticsearch tasks get created. The task should move to `indexing_paused`:
> service = Elastic::ClusterReindexingService.new
> service.execute
- Move the process along again with `service.execute`. Note that the tasks should be created with `scroll=2h` as a URL parameter:
TRANSACTION (1.1ms) COMMIT /*application:console,db_config_name:main,console_hostname:terrichus-MBP.localdomain,console_username:terrichu,line:/lib/gitlab/database.rb:392:in `commit'*/
2024-06-17 10:42:47 -0400: POST http://localhost:9200/_reindex?scroll=2h&wait_for_completion=false [status:200, request:0.009s, query:n/a]
2024-06-17 10:42:47 -0400: > {"source":{"index":"gitlab-development-20240606-2327","slice":{"id":7,"max":15}},"dest":{"index":"gitlab-development-20240617-1442-reindex-15-0"}}
2024-06-17 10:42:47 -0400: < {"task":"_kTIXOaWQbGaNYF5opwUqA:2298846"}
- Run `service.execute` repeatedly until it returns `true`. When `Elastic::ReindexingTask.current` returns `nil`, the reindexing is done:
> Elastic::ReindexingTask.current
Elastic::ReindexingTask Load (0.6ms) SELECT "elastic_reindexing_tasks".* FROM "elastic_reindexing_tasks" WHERE "elastic_reindexing_tasks"."in_progress" = TRUE ORDER BY "elastic_reindexing_tasks"."id" DESC LIMIT 1 /*application:console,db_config_name:main,console_hostname:terrichus-MBP.localdomain,console_username:terrichu,line:/ee/app/models/elastic/reindexing_task.rb:29:in `current'*/
=> nil
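The manual steps above, calling `service.execute` repeatedly until it returns `true`, can be wrapped in a small loop for local validation. The sketch below is a convenience built on assumptions and is not part of this MR: `drive_reindexing` and `FakeService` are hypothetical names, and the fake stands in for `Elastic::ClusterReindexingService` so the snippet runs outside a Rails console. The only contract it relies on is the one described in the steps: `execute` eventually returns `true`.

```ruby
# Hypothetical helper: calls `execute` on the given service until it returns
# true or the call budget runs out. Returns the number of calls made, or nil
# if the budget was exhausted before completion.
def drive_reindexing(service, max_calls: 100, pause_seconds: 0)
  1.upto(max_calls) do |calls|
    return calls if service.execute == true

    # Give Elasticsearch time to make progress between calls when requested.
    sleep(pause_seconds) if pause_seconds.positive?
  end
  nil
end

# Stand-in for Elastic::ClusterReindexingService so the sketch runs on its
# own. It reports completion once the configured number of calls is reached.
class FakeService
  def initialize(calls_until_done)
    @remaining = calls_until_done
  end

  def execute
    @remaining -= 1
    @remaining <= 0
  end
end

calls = drive_reindexing(FakeService.new(4))
puts "finished after #{calls} calls"
```

In a Rails console, one would pass `Elastic::ClusterReindexingService.new` instead of the fake, set a non-zero `pause_seconds` because the Elasticsearch tasks run asynchronously, and then confirm that `Elastic::ReindexingTask.current` returns `nil`.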
Edited by Terri Chu