Zero-downtime reindexing: make setting async durability optional

What does this MR do and why?

This MR fixes a bug in the zero-downtime reindexing feature for Advanced Search.

During reindexing, GitLab pauses indexing and creates new target indices with translog: { durability: async } as a write performance optimization. However, some search backends reject this setting — including AWS OpenSearch Service (which restricts async durability at the infrastructure level) and certain self-managed OpenSearch clusters with cluster.remote_store.index.restrict.async-durability=true enabled. When index creation failed with BadRequest, there was no error handling, leaving the reindex task paused indefinitely with no way to recover.

This MR:

  • Detects whether the connected search backend supports async translog durability before attempting to use it
  • Skips the setting for AWS OpenSearch and self-managed OpenSearch clusters with the restriction enabled
  • Adds BadRequest error handling in indexing_paused! so a failed index creation aborts cleanly instead of leaving the system stuck
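
The behaviour listed above can be sketched in plain Ruby. This is an illustration under stated assumptions, not the code in this MR: `BadRequest`, `FakeClient`, `target_index_settings`, and `create_target_index` are hypothetical names invented for the example, and the fake client only imitates a cluster that rejects async durability.

```ruby
# Hypothetical sketch; the names below are illustrative, not GitLab's actual code.

# Stand-in for the 400 error a search client raises when a setting is rejected.
class BadRequest < StandardError; end

# Fake client that rejects async durability, as a restricted cluster would.
class FakeClient
  def create_index(name:, settings:)
    if settings.dig(:translog, :durability) == 'async'
      raise BadRequest, 'index.translog.durability=async is not allowed'
    end

    name
  end
end

# Add the async translog optimisation only when the backend accepts it.
def target_index_settings(base_settings, async_supported:)
  return base_settings unless async_supported

  base_settings.merge(translog: { durability: 'async' })
end

# Create the index; on BadRequest, report a failure instead of staying paused.
def create_target_index(client, name, settings)
  client.create_index(name: name, settings: settings)
  :created
rescue BadRequest => e
  warn "index creation failed: #{e.message}"
  :failure
end

client = FakeClient.new
without_async = target_index_settings({ number_of_shards: 1 }, async_supported: false)
with_async    = target_index_settings({ number_of_shards: 1 }, async_supported: true)

puts create_target_index(client, 'target-a', without_async) # prints "created"
puts create_target_index(client, 'target-b', with_async)    # prints "failure"
```

Returning an explicit failure value gives the caller something to act on, for example marking the task as failed and unpausing indexing, rather than letting the exception escape while the task is still paused.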

References

| Engine | Can restrict async natively? | Mechanism | Condition | Documentation / Supporting Links |
| --- | --- | --- | --- | --- |
| OpenSearch (self-managed, remote store enabled) | Yes | Set cluster.remote_store.index.restrict.async-durability: true in opensearch.yml at node startup; detectable via GET /_nodes/settings | OpenSearch ≥ 2.11, remote store must be enabled | PR #10189 · Remote-backed storage docs |
| OpenSearch (self-managed, no remote store) | No | Setting exists but is a no-op without remote store | | PR #10189 |
| AWS OpenSearch Service | Effectively yes | AWS manages remote store internally; async durability is restricted at infrastructure level; not user-configurable or detectable via API | | Supported operations |
| Elasticsearch (self-managed / Cloud Hosted) | No | No native mechanism; RBAC + monitoring only | | Translog settings |
| Elasticsearch Serverless | Effectively yes | index.translog.durability is not in the Serverless allowed settings; attempting to set it returns BadRequest | Serverless only | Serverless index settings · Differences from other offerings |
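
To make the detection rules in the table concrete, here is a minimal Ruby sketch. It assumes the caller has already fetched and parsed the body of `GET /_nodes/settings?flat_settings=true`, and that AWS OpenSearch Service is identified by configuration (the restriction there is not detectable via the API). The method name `async_durability_supported?` and its parameters are hypothetical, the detection in this MR may differ, and Elasticsearch Serverless is not covered.

```ruby
# Hypothetical helper; names and signature are assumptions for illustration.
RESTRICT_KEY = 'cluster.remote_store.index.restrict.async-durability'

# nodes_settings: parsed JSON body of GET /_nodes/settings?flat_settings=true
# aws: whether the instance is configured for AWS OpenSearch Service
def async_durability_supported?(nodes_settings:, aws: false)
  # AWS restricts async durability at infrastructure level and does not
  # expose it via the API, so rely on configuration instead of detection.
  return false if aws

  nodes = nodes_settings.fetch('nodes', {})
  # Supported only if no node reports the restriction as enabled.
  nodes.values.none? do |node|
    node.fetch('settings', {})[RESTRICT_KEY].to_s == 'true'
  end
end

restricted   = { 'nodes' => { 'node-1' => { 'settings' => { RESTRICT_KEY => 'true' } } } }
unrestricted = { 'nodes' => { 'node-1' => { 'settings' => { 'cluster.name' => 'docker-cluster' } } } }

puts async_durability_supported?(nodes_settings: restricted)              # prints false
puts async_durability_supported?(nodes_settings: unrestricted)            # prints true
puts async_durability_supported?(nodes_settings: unrestricted, aws: true) # prints false
```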

How to set up and validate locally

Prerequisites:

  • Docker and Docker Compose installed (I use Colima for this).
  • GDK must be set up for Elasticsearch.

Note: AWS OpenSearch cannot be tested locally. The self-managed OpenSearch scenario with the remote store restriction can be tested using Docker as described below.

Setup

  1. Create the following files in a working directory:
`docker-compose.yml`

services:
  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - "9010:9000"
      - "9011:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    networks:
      - opensearch-net

  opensearch:
    build: .
    container_name: opensearch-remote
    ports:
      - "9202:9200"
      - "9601:9600"
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true
      - cluster.remote_store.index.restrict.async-durability=true
      - node.attr.remote_store.segment.repository=my-repo
      - node.attr.remote_store.translog.repository=my-repo
      - node.attr.remote_store.state.repository=my-repo
      - node.attr.remote_store.repository.my-repo.type=s3
      - node.attr.remote_store.repository.my-repo.settings.bucket=opensearch-remote
      - node.attr.remote_store.repository.my-repo.settings.endpoint=http://minio:9000
      - node.attr.remote_store.repository.my-repo.settings.path_style_access=true
      - node.attr.remote_store.repository.my-repo.settings.region=us-east-1

    networks:
      - opensearch-net
    depends_on:
      - minio

networks:
  opensearch-net:
`Dockerfile`
FROM opensearchproject/opensearch:2.11.0

RUN /usr/share/opensearch/bin/opensearch-plugin install --batch repository-s3

RUN /usr/share/opensearch/bin/opensearch-keystore create && \
    echo minioadmin | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.access_key && \
    echo minioadmin | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.secret_key
  1. Build the OpenSearch image and start MinIO:
docker-compose build opensearch
docker-compose up -d minio
  1. Create the MinIO bucket:
  1. Start OpenSearch:
docker-compose up -d opensearch
docker-compose logs -f opensearch  # wait until the cluster reports GREEN
  1. Verify the restriction is active:
curl -s "http://localhost:9202/_nodes/settings?flat_settings=true" | grep async
# Expected: "cluster.remote_store.index.restrict.async-durability": "true"
  1. Configure GDK to use this OpenSearch instance

In the Rails console:

ApplicationSetting.current.update!(
  elasticsearch_url: 'http://localhost:9202',
  elasticsearch_indexing: true,
  elasticsearch_search: true,
  elasticsearch_aws: false
)
  1. Index the instance:
bundle exec rake gitlab:elastic:index

# monitor indexing
bundle exec rake gitlab:elastic:info
  1. Once indexing is complete, trigger zero-downtime reindexing:
bundle exec rake gitlab:elastic:reindex_cluster
  1. Monitor reindexing in log/elasticsearch.log:
# In the Rails console, move through the states by running the worker
ElasticClusterReindexingCronWorker.new.perform

What to verify

With the fix: reindexing progresses through its states without getting stuck. Verify that the created indices do not have translog.durability: async in their settings:

curl -s "http://localhost:9202/*/_settings" | grep -i "durability"
# Should return nothing (setting omitted entirely)
# In the Rails console, verify that the last task succeeded
Search::Elastic::ReindexingTask.last.state
# state should be success
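
As an alternative to grepping, the same check can be done structurally. The sketch below is a hypothetical helper (the name `indices_with_async_durability` and the sample index names are invented here) that takes the raw JSON body of `GET /*/_settings` and returns the indices that explicitly set async durability. With the fix applied against a restricted cluster, the expectation is an empty list.

```ruby
require 'json'

# Hypothetical helper for illustration only.
# settings_json: raw body returned by GET /*/_settings
def indices_with_async_durability(settings_json)
  matching = JSON.parse(settings_json).select do |_name, body|
    body.dig('settings', 'index', 'translog', 'durability') == 'async'
  end
  matching.keys
end

# Made-up sample response with one index of each kind.
sample = <<~JSON
  {
    "example-index-a": { "settings": { "index": { "number_of_shards": "1" } } },
    "example-index-b": { "settings": { "index": { "translog": { "durability": "async" } } } }
  }
JSON

p indices_with_async_durability(sample) # prints ["example-index-b"]
```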

Without the fix (to confirm the bug existed):

Check out the master branch, then reproduce the failure directly:

# Rails console
helper = Gitlab::Elastic::Helper.default
helper.client.indices.create(
  index: 'test-async-bug',
  body: { settings: { 'index.translog.durability' => 'async' } }
)
# => Elasticsearch::Transport::Transport::Errors::BadRequest: [400]
#    index setting [index.translog.durability=async] is not allowed as
#    cluster setting [cluster.remote_store.index.restrict.async-durability=true]

Repeat the zero-downtime reindexing steps. The reindex task should get stuck in the indexing_paused state, with a BadRequest error in the logs.

# In the Rails console, check the state of the last task
Search::Elastic::ReindexingTask.last.state
# The state stays at `indexing_paused`; this is the bug

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Terri Chu
