Zero-downtime reindexing: make setting async durability optional
What does this MR do and why?
This MR fixes a bug in the zero-downtime reindexing feature for Advanced Search.
During reindexing, GitLab pauses indexing and creates new target indices with translog: { durability: async } as a write performance optimization. However, some search backends reject this setting — including AWS OpenSearch Service (which restricts async durability at the infrastructure level) and certain self-managed OpenSearch clusters with cluster.remote_store.index.restrict.async-durability=true enabled.
When index creation failed with BadRequest, there was no error handling, leaving the reindex task paused indefinitely with no way to recover.
This MR:
- Detects whether the connected search backend supports async translog durability before attempting to use it
- Skips the setting for AWS OpenSearch and self-managed OpenSearch clusters with the restriction enabled
- Adds BadRequest error handling in indexing_paused! so a failed index creation aborts cleanly instead of leaving the system stuck
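The behavior described in the list above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this MR: `BadRequest`, `RestrictedClient`, `target_index_settings`, `create_target_index`, and the `:created` / `:aborted` states are hypothetical stand-ins for the real client error, search client, and reindexing service.

```ruby
# Hypothetical stand-in for the search client's HTTP 400 error class.
class BadRequest < StandardError; end

# Hypothetical client that mimics a backend which rejects async durability.
class RestrictedClient
  def create_index(_name, settings)
    if settings.dig(:translog, :durability) == 'async'
      raise BadRequest, 'index setting [index.translog.durability=async] is not allowed'
    end

    true
  end
end

# Build settings for a new target index. The async translog setting is
# only included when the backend is known to accept it.
def target_index_settings(supports_async_durability:)
  settings = {}
  settings[:translog] = { durability: 'async' } if supports_async_durability
  settings
end

# Create the target index. A rejected request aborts the task with an
# error message instead of leaving it paused forever.
def create_target_index(client, name, supports_async_durability:)
  settings = target_index_settings(supports_async_durability: supports_async_durability)
  client.create_index(name, settings)
  { state: :created }
rescue BadRequest => e
  { state: :aborted, error_message: e.message }
end

client = RestrictedClient.new
with_async = create_target_index(client, 'target-index', supports_async_durability: true)
without_async = create_target_index(client, 'target-index', supports_async_durability: false)
puts with_async[:state]    # aborted
puts without_async[:state] # created
```

Omitting the setting entirely, rather than setting it to another value, leaves the backend's default durability in place.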
References
| Engine | Can restrict async natively? | Mechanism | Condition | Documentation / Supporting Links |
|---|---|---|---|---|
| OpenSearch (self-managed, remote store enabled) | | Set `cluster.remote_store.index.restrict.async-durability: true` in `opensearch.yml` at node startup; detectable via `GET /_nodes/settings` | OpenSearch ≥ 2.11, remote store must be enabled | PR #10189 · Remote-backed storage docs |
| OpenSearch (self-managed, no remote store) | | Setting exists but is a no-op without remote store | - | PR #10189 |
| AWS OpenSearch Service | | AWS manages remote store internally; async durability is restricted at infrastructure level; not user-configurable or detectable via API | - | Supported operations |
| Elasticsearch (self-managed / Cloud Hosted) | | No native mechanism; RBAC + monitoring only | - | Translog settings |
| Elasticsearch Serverless | | `index.translog.durability` is not in the Serverless allowed settings; attempting to set it returns BadRequest | Serverless only | Serverless index settings · Differences from other offerings |
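For the self-managed OpenSearch row, detection from the `GET /_nodes/settings?flat_settings=true` response might look like the sketch below. It assumes the response has already been parsed into a Hash with the usual `nodes` → node ID → `settings` layout. The method names and the `aws:` flag are hypothetical, and the sketch deliberately ignores whether remote store is actually enabled, which the table notes is required for the setting to take effect.

```ruby
require 'json'

RESTRICT_KEY = 'cluster.remote_store.index.restrict.async-durability'

# True when any node reports the restriction setting as "true".
def async_durability_restricted?(nodes_settings)
  nodes_settings.fetch('nodes', {}).values.any? do |node|
    node.fetch('settings', {})[RESTRICT_KEY].to_s == 'true'
  end
end

# AWS OpenSearch Service cannot be probed through the API, so the
# (hypothetical) aws flag short-circuits the check.
def supports_async_durability?(aws:, nodes_settings:)
  return false if aws

  !async_durability_restricted?(nodes_settings)
end

# Sample response body, trimmed to the relevant keys.
response = JSON.parse(<<~BODY)
  {
    "nodes": {
      "node-1": {
        "settings": {
          "cluster.remote_store.index.restrict.async-durability": "true"
        }
      }
    }
  }
BODY

puts supports_async_durability?(aws: false, nodes_settings: response) # false
```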
How to set up and validate locally
Prerequisites:
- Docker and Docker Compose installed. I use Colima to do this.
- GDK must be set up for Elasticsearch.
Note: AWS OpenSearch cannot be tested locally. The OpenSearch with remote store restriction scenario can be tested using Docker as described below.
Setup
- Create the following files in a working directory:
`docker-compose.yml`
services:
  minio:
    image: minio/minio:latest
    container_name: minio
    ports:
      - "9010:9000"
      - "9011:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    networks:
      - opensearch-net
  opensearch:
    build: .
    container_name: opensearch-remote
    ports:
      - "9202:9200"
      - "9601:9600"
    environment:
      - discovery.type=single-node
      - DISABLE_SECURITY_PLUGIN=true
      - cluster.remote_store.index.restrict.async-durability=true
      - node.attr.remote_store.segment.repository=my-repo
      - node.attr.remote_store.translog.repository=my-repo
      - node.attr.remote_store.state.repository=my-repo
      - node.attr.remote_store.repository.my-repo.type=s3
      - node.attr.remote_store.repository.my-repo.settings.bucket=opensearch-remote
      - node.attr.remote_store.repository.my-repo.settings.endpoint=http://minio:9000
      - node.attr.remote_store.repository.my-repo.settings.path_style_access=true
      - node.attr.remote_store.repository.my-repo.settings.region=us-east-1
    networks:
      - opensearch-net
    depends_on:
      - minio
networks:
  opensearch-net:
`Dockerfile`
FROM opensearchproject/opensearch:2.11.0
RUN /usr/share/opensearch/bin/opensearch-plugin install --batch repository-s3
RUN /usr/share/opensearch/bin/opensearch-keystore create && \
    echo minioadmin | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.access_key && \
    echo minioadmin | /usr/share/opensearch/bin/opensearch-keystore add --stdin s3.client.default.secret_key
- Build the OpenSearch image and start MinIO:
docker-compose build opensearch
docker-compose up minio -d
- Create the MinIO bucket:
  - Open http://localhost:9011 (login: minioadmin/minioadmin)
  - Create a bucket named opensearch-remote
- Start OpenSearch:
docker-compose up opensearch -d
docker-compose logs -f opensearch # wait for GREEN
- Verify the restriction is active:
curl -s "http://localhost:9202/_nodes/settings?flat_settings=true" | grep async
# Expected: "cluster.remote_store.index.restrict.async-durability": "true"
- Configure GDK to use this OpenSearch instance
In the Rails console:
ApplicationSetting.current.update!(
  elasticsearch_url: 'http://localhost:9202',
  elasticsearch_indexing: true,
  elasticsearch_search: true,
  elasticsearch_aws: false
)
- Index the instance:
bundle exec rake gitlab:elastic:index
# monitor indexing
bundle exec rake gitlab:elastic:info
- Once indexing is complete, trigger zero-downtime reindexing:
bundle exec rake gitlab:elastic:reindex_cluster
- Monitor reindexing in `log/elasticsearch.log`:
# Move through states using the worker
ElasticClusterReindexingCronWorker.new.perform
What to verify
With the fix: reindexing progresses through its states without getting stuck. Verify that the created indices do not have translog.durability: async in their settings:
curl -s "http://localhost:9202/*/_settings" | grep -i "durability"
# Should return nothing (setting omitted entirely)
# in rails console verify the last task is successful
Search::Elastic::ReindexingTask.last.state
# state should be success
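The same settings check can be done in Ruby instead of with grep. The sketch below assumes the `GET /*/_settings` response has already been parsed into a Hash keyed by index name; fetching it is not shown, and the helper name and sample index names are made up for illustration.

```ruby
# Return the names of indices whose settings still request async
# translog durability.
def indices_with_async_durability(settings_response)
  settings_response.select do |_name, body|
    body.dig('settings', 'index', 'translog', 'durability') == 'async'
  end.keys
end

# Sample parsed response with made-up index names.
sample = {
  'example-index-a' => { 'settings' => { 'index' => { 'number_of_shards' => '5' } } },
  'example-index-b' => { 'settings' => { 'index' => { 'translog' => { 'durability' => 'async' } } } }
}

puts indices_with_async_durability(sample).inspect # ["example-index-b"]
```

With the fix applied against the restricted cluster, the returned list should be empty.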
Without the fix (to confirm the bug existed):
Check out the master branch.
Reproduce the rejected index creation:
# Rails console
helper = Gitlab::Elastic::Helper.default
helper.client.indices.create(
  index: 'test-async-bug',
  body: { settings: { 'index.translog.durability' => 'async' } }
)
# => Elasticsearch::Transport::Transport::Errors::BadRequest: [400]
# index setting [index.translog.durability=async] is not allowed as
# cluster setting [cluster.remote_store.index.restrict.async-durability=true]
Repeat zero-downtime reindexing. The reindex task should get stuck in the indexing_paused state, with a BadRequest error in the logs.
# in rails console verify the last task
Search::Elastic::ReindexingTask.last.state
# state is `indexing_paused`, this is the bug
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.