Zero-downtime reindexing: make setting async durability optional (!226356)

What does this MR do and why?

This MR fixes a bug in the zero-downtime reindexing feature for Advanced Search.

During reindexing, GitLab pauses indexing and creates new target indices with translog: { durability: async } as a write performance optimization. However, some search backends reject this setting — including AWS OpenSearch Service (which restricts async durability at the infrastructure level) and certain self-managed OpenSearch clusters with cluster.remote_store.index.restrict.async-durability=true enabled. When index creation failed with BadRequest, there was no error handling, leaving the reindex task paused indefinitely with no way to recover.

This MR:

Detects whether the connected search backend supports async translog durability before attempting to use it
Skips the setting for AWS OpenSearch and self-managed OpenSearch clusters with the restriction enabled
Adds BadRequest error handling in indexing_paused! so a failed index creation aborts cleanly instead of leaving the system stuck
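The third point can be illustrated with a minimal sketch. It is not the MR's actual code: `SketchTask`, `create_target_index`, and the local `BadRequest` class are hypothetical stand-ins (the real error is `Elasticsearch::Transport::Transport::Errors::BadRequest`), and `:failure` is an assumed name for the aborted state.

```ruby
# Stand-in for Elasticsearch::Transport::Transport::Errors::BadRequest so the
# sketch runs without the elasticsearch gem.
class BadRequest < StandardError; end

# Hypothetical, simplified reindexing task; not GitLab's actual model.
class SketchTask
  attr_reader :state, :error_message

  def initialize
    @state = :indexing_paused
  end

  # Move on to the next step once the target index exists.
  def advance!
    @state = :reindexing
  end

  # Record the failure so the task does not stay paused forever.
  def abort!(message)
    @state = :failure # assumed state name
    @error_message = message
  end
end

# Runs the index-creation callable. On BadRequest the task is aborted cleanly
# instead of being left in :indexing_paused.
def create_target_index(task, create_index)
  create_index.call
  task.advance!
  true
rescue BadRequest => e
  task.abort!("Index creation failed: #{e.message}")
  false
end
```

Passing the creation step as a callable keeps the sketch independent of any client library; in practice the rescue would wrap the real index-creation call.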

References

Related to Allow translog durability mode to be configurab... (#552633 - closed)

| Engine | Can restrict `async` natively? | Mechanism | Condition | Documentation / Supporting Links |
|---|---|---|---|---|
| OpenSearch (self-managed, remote store enabled) | ✅ Yes | Set `cluster.remote_store.index.restrict.async-durability: true` in `opensearch.yml` at node startup; detectable via `GET /_nodes/settings` | OpenSearch ≥ 2.11, remote store must be enabled | PR #10189 · Remote-backed storage docs |
| OpenSearch (self-managed, no remote store) | ❌ No | Setting exists but is a no-op without remote store | - | PR #10189 |
| AWS OpenSearch Service | ✅ Effectively yes | AWS manages remote store internally; async durability is restricted at infrastructure level; not user-configurable or detectable via API | - | Supported operations |
| Elasticsearch (self-managed / Cloud Hosted) | ❌ No | No native mechanism; RBAC + monitoring only | - | Translog settings |
| Elasticsearch Serverless | ✅ Effectively yes | `index.translog.durability` is not in the Serverless allowed settings; attempting to set it returns `BadRequest` | Serverless only | Serverless index settings · Differences from other offerings |
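As a rough illustration, the table can be read as a decision rule for whether to send the `async` durability setting. The sketch below encodes the table only, not the MR's implementation; the method name and all keyword arguments are hypothetical, and treating an unknown backend as "do not send" is an assumption.

```ruby
# Returns true when `index.translog.durability: async` can be sent at index
# creation, according to the table above. All names here are hypothetical.
def send_async_durability?(distribution:, aws: false, serverless: false,
                           remote_store: false, restrict_flag: false)
  case distribution
  when :opensearch
    # AWS OpenSearch Service restricts async durability at infrastructure
    # level and the restriction is not detectable via API.
    return false if aws

    # Self-managed: the restrict flag only takes effect with remote store.
    !(remote_store && restrict_flag)
  when :elasticsearch
    # Serverless rejects the setting; self-managed / Cloud Hosted accept it.
    !serverless
  else
    # Unknown backend: conservative default (an assumption).
    false
  end
end
```

For example, `send_async_durability?(distribution: :opensearch, remote_store: true, restrict_flag: true)` returns `false`, matching the first row.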

How to set up and validate locally

Prerequisites:

Docker and Docker Compose installed (I use Colima for this).
GDK must be set up for Elasticsearch.

Note: AWS OpenSearch cannot be tested locally. The OpenSearch with remote store restriction scenario can be tested using Docker as described below.

Setup

Create the following files in a working directory:

`docker-compose.yml`


services:
  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - "9010:9000"
      - "9011:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    networks:
      - opensearch-net

  opensearch:
    build: .
    container_name: opensearch-remote
    ports:
      - "9202:9200"
      - "9601:9600"
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true
      - cluster.remote_store.index.restrict.async-durability=true
      - node.attr.remote_store.segment.repository=my-repo
      - node.attr.remote_store.translog.repository=my-repo
      - node.attr.remote_store.state.repository=my-repo
      - node.attr.remote_store.repository.my-repo.type=s3
      - node.attr.remote_store.repository.my-repo.settings.bucket=opensearch-remote
      - node.attr.remote_store.repository.my-repo.settings.endpoint=http://minio:9000
      - node.attr.remote_store.repository.my-repo.settings.path_style_access=true
      - node.attr.remote_store.repository.my-repo.settings.region=us-east-1

    networks:
      - opensearch-net
    depends_on:
      - minio

networks:
  opensearch-net:

`Dockerfile`

FROM opensearchproject/opensearch:2.11.0

RUN /usr/share/opensearch/bin/opensearch-plugin install --batch repository-s3

RUN /usr/share/opensearch/bin/opensearch-keystore create && \
    echo minioadmin | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.access_key && \
    echo minioadmin | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.secret_key

Build and start MinIO:

docker-compose build opensearch
docker-compose up -d minio

Create the MinIO bucket:

Open http://localhost:9011 (login: minioadmin/minioadmin)
Create a bucket named opensearch-remote

Start OpenSearch:

docker-compose up -d opensearch
docker-compose logs -f opensearch  # wait for GREEN

Verify the restriction is active:

curl -s "http://localhost:9202/_nodes/settings?flat_settings=true" | grep async
# Expected: "cluster.remote_store.index.restrict.async-durability": "true"
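The same check can be done programmatically. This is a minimal sketch, assuming the `flat_settings=true` response nests each node's settings under `nodes.<node_id>.settings`; the method name, constant, and sample payload are illustrative, not taken from the MR.

```ruby
require 'json'

RESTRICT_KEY = 'cluster.remote_store.index.restrict.async-durability'

# True when at least one node reports the restriction as enabled.
# Expects the raw JSON body of GET /_nodes/settings?flat_settings=true.
def async_durability_restricted?(nodes_settings_json)
  nodes = JSON.parse(nodes_settings_json).fetch('nodes', {})
  nodes.values.any? { |node| node.dig('settings', RESTRICT_KEY).to_s == 'true' }
end

# Illustrative payload (node id and name are made up).
sample = <<~JSON
  {"nodes": {"abc123": {"name": "node-1", "settings":
    {"cluster.remote_store.index.restrict.async-durability": "true"}}}}
JSON

puts async_durability_restricted?(sample) # prints "true"
```

Against the local container, the body could be fetched with `Net::HTTP.get(URI('http://localhost:9202/_nodes/settings?flat_settings=true'))` after `require 'net/http'`.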

Configure GDK to use this OpenSearch instance

In the Rails console:

ApplicationSetting.current.update!(
  elasticsearch_url: 'http://localhost:9202',
  elasticsearch_indexing: true,
  elasticsearch_search: true,
  elasticsearch_aws: false
)

Index the instance:

bundle exec rake gitlab:elastic:index

# monitor indexing
bundle exec rake gitlab:elastic:info

Once indexing is complete, trigger zero-downtime reindexing:

bundle exec rake gitlab:elastic:reindex_cluster

Monitor reindexing in log/elasticsearch.log.

# Move through states using the worker
ElasticClusterReindexingCronWorker.new.perform
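Calling the worker by hand for each state can be wrapped in a small loop. The helper below is a hypothetical convenience, not part of GitLab; `failure` as a terminal state name and the default step budget are assumptions.

```ruby
# Calls `step` until `current_state` reports a terminal state or the step
# budget runs out. Returns the last observed state.
def drive_until_terminal(step:, current_state:, terminal: %w[success failure],
                         max_steps: 20, pause_seconds: 0)
  max_steps.times do
    state = current_state.call
    return state if terminal.include?(state)

    step.call
    sleep(pause_seconds) # give the backend time to make progress
  end
  current_state.call
end

# In the Rails console this could be used as:
#
#   drive_until_terminal(
#     step: -> { ElasticClusterReindexingCronWorker.new.perform },
#     current_state: -> { Search::Elastic::ReindexingTask.last.state },
#     pause_seconds: 5
#   )
```

If the helper returns `indexing_paused` after exhausting its budget, the task is stuck, which is the symptom this MR addresses.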

What to verify

With the fix: reindexing progresses through its states without getting stuck. Verify that the created indices do not have translog.durability: async in their settings:

curl -s "http://localhost:9202/*/_settings" | grep -i "durability"
# Should return nothing (setting omitted entirely)

# In the Rails console, verify that the last task succeeded
Search::Elastic::ReindexingTask.last.state
# state should be success
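The durability check above can also be scripted rather than grepped. A minimal sketch, assuming the `GET /*/_settings` response nests the value under `<index>.settings.index.translog.durability`; the method name and sample payload are illustrative.

```ruby
require 'json'

# Returns the names of indices whose settings explicitly set translog
# durability to "async". Expects the raw JSON body of GET /*/_settings.
def indices_with_async_durability(settings_json)
  JSON.parse(settings_json).select do |_name, body|
    body.dig('settings', 'index', 'translog', 'durability') == 'async'
  end.keys
end

# Illustrative payload with made-up index names.
sample_settings = <<~JSON
  {
    "index-a": {"settings": {"index": {"number_of_shards": "1"}}},
    "index-b": {"settings": {"index": {"translog": {"durability": "async"}}}}
  }
JSON

p indices_with_async_durability(sample_settings) # prints ["index-b"]
```

With the fix applied against the restricted cluster, the returned list should be empty.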

Without the fix (to confirm the bug existed):

Check out the master branch.

Reproduce the failure directly:

# Rails console
helper = Gitlab::Elastic::Helper.default
helper.client.indices.create(
  index: 'test-async-bug',
  body: { settings: { 'index.translog.durability' => 'async' } }
)
# => Elasticsearch::Transport::Transport::Errors::BadRequest: [400]
#    index setting [index.translog.durability=async] is not allowed as
#    cluster setting [cluster.remote_store.index.restrict.async-durability=true]

Repeat zero-downtime reindexing — the reindex task should get stuck in indexing_paused state with a BadRequest error in the logs.

# In the Rails console, check the state of the last task
Search::Elastic::ReindexingTask.last.state
# state is `indexing_paused`, this is the bug

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited Mar 12, 2026 by Terri Chu
