fix: prevent data integrity issues in ResolveReindexing, add logs
What does this MR do and why?
This MR updates the `elastic.ResolveReindexing` function:
- To ensure data integrity
  - `deleteByQuery` and `updateByQuery` wait for completion
  - `deleteByQuery` refreshes the index to ensure that the subsequent `updateByQuery` operates on the latest version of the index
- For monitoring
  - log the number of deleted and updated documents with `log_level=INFO`
- Other updates
  - add routing to the `deleteByQuery` and `updateByQuery`
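
As a rough illustration of the request options listed above (not the code in this MR), here is a minimal Go sketch. It assumes the standard Elasticsearch `_delete_by_query` / `_update_by_query` query parameters `wait_for_completion`, `refresh`, and `routing`. The helper name `byQueryURL`, the index name, and the routing value are hypothetical and chosen only for the example.

```go
package main

import (
	"fmt"
	"net/url"
)

// byQueryURL is a hypothetical helper that builds the URL for an
// Elasticsearch by-query operation ("_delete_by_query" or "_update_by_query").
// It always waits for completion, optionally sets routing, and optionally
// asks Elasticsearch to refresh the affected shards when the request finishes.
func byQueryURL(base, index, op, routing string, refresh bool) string {
	q := url.Values{}
	q.Set("wait_for_completion", "true")
	if routing != "" {
		q.Set("routing", routing)
	}
	if refresh {
		q.Set("refresh", "true")
	}
	// Encode() sorts parameters by key, so the output is deterministic.
	return fmt.Sprintf("%s/%s/%s?%s", base, url.PathEscape(index), op, q.Encode())
}

func main() {
	base := "http://localhost:9200"
	index := "example_index" // assumption: placeholder index name
	routing := "75"          // assumption: placeholder routing value

	// Delete first, with a refresh, so the following update sees the latest index state.
	fmt.Println(byQueryURL(base, index, "_delete_by_query", routing, true))
	// Then update; both calls block until the operation completes.
	fmt.Println(byQueryURL(base, index, "_update_by_query", routing, false))
}
```

The sketch only shows how the three options combine on each request. The actual request bodies and client calls are those in the MR diff.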
References
Related issue: Address elastic.ResolveReindexing data integrit... (#175 - closed)
How to set up and validate locally
We are testing this against gitlab-org/gitlab to validate the changes around data integrity:
Setup
- Obtain a copy of the `gitlab` repo in your local GDK, and make sure it allows force-pushes (go to Settings -> Repository -> Branch Rules)
- Follow these setup and validation steps, making sure you are setting it up for your local `gitlab` project
- Run initial indexing on your local `gitlab` project, the key parameters being `from_sha=""`, `to_sha=<latest commit or "">`, `force_reindex=false`

  Example command:

      make && \
        GITLAB_INDEXER_MODE=chunk \
        GITLAB_INDEXER_DEBUG_LOGGING=1 \
        ./bin/gitlab-elasticsearch-indexer \
        -adapter "elasticsearch" \
        -connection '{"url": ["http://localhost:9200"]}' \
        -options '{
          "timeout": "30m",
          "chunk_size": 1000,
          "gitaly_batch_size": 1000,
          "from_sha": "",
          "to_sha": "cffa80231d2b1b4ca0ee3f2c355fdb1b1560f140",
          "force_reindex": false,
          "project_id": 75,
          "partition_name": "gitlab_active_context_code",
          "partition_number": 0,
          "gitaly_config": {
            "address": "unix:/Users/pamartiaga/Code/gitlab-development-kit/praefect.socket",
            "storage": "default",
            "relative_path": "@hashed/f3/69/f369cb89fc627e668987007d121ed1eacdc01db9e28f8bb26f358b7d8c4f08ac.git",
            "project_path": "gitlab-duo/gitlab"
          }
        }'
Testing
- Create a new commit in your local `gitlab` repo, ensure that you delete some files and update some files
- Run the indexer with `from_sha=""`, `to_sha=<the last commit you created>`, `force_reindex=true`

  Example command:

      make && \
        GITLAB_INDEXER_MODE=chunk \
        GITLAB_INDEXER_DEBUG_LOGGING=1 \
        ./bin/gitlab-elasticsearch-indexer \
        -adapter "elasticsearch" \
        -connection '{"url": ["http://localhost:9200"]}' \
        -options '{
          "timeout": "30m",
          "chunk_size": 1000,
          "gitaly_batch_size": 1000,
          "from_sha": "",
          "to_sha": "b1e3f635afbf82550b9dc7361f85bac340fbd4dd",
          "force_reindex": true,
          "project_id": 75,
          "partition_name": "gitlab_active_context_code",
          "partition_number": 0,
          "gitaly_config": {
            "address": "unix:/Users/pamartiaga/Code/gitlab-development-kit/praefect.socket",
            "storage": "default",
            "relative_path": "@hashed/f3/69/f369cb89fc627e668987007d121ed1eacdc01db9e28f8bb26f358b7d8c4f08ac.git",
            "project_path": "gitlab-duo/gitlab"
          }
        }'

- Check the `resolve_reindexing` logs. These should be the last logs to be outputted:

      {"time":"2025-09-19T10:00:51.280386+10:00","level":"DEBUG","msg":"resolve_reindexing refreshing index before delete"}
      {"time":"2025-09-19T10:00:51.438544+10:00","level":"DEBUG","msg":"resolve_reindexing purging files not in reindex"}
      # this shows info about the deleted documents
      {"time":"2025-09-19T10:00:51.476813+10:00","level":"INFO","msg":"resolve_reindexing deleted documents","batches":1,"total":5,"updated":0,"created":0,"deleted":5,"noops":0,"took":37,"timed_out":false}
      {"time":"2025-09-19T10:00:51.476853+10:00","level":"DEBUG","msg":"resolve_reindexing set all documents back to reindexing=false"}
      # this shows info about the updated documents
      # verify that `updated` is equal to `total`, ie all documents are updated
      {"time":"2025-09-19T10:01:17.820686+10:00","level":"INFO","msg":"resolve_reindexing updated documents","batches":411,"total":410729,"updated":410729,"created":0,"deleted":0,"noops":0,"took":26342,"timed_out":false}
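
The `updated` equals `total` check on the final log line can also be scripted. The following is a minimal Go sketch, assuming the JSON log fields shown in the sample output above (`msg`, `total`, `updated`, `timed_out`). The type `byQueryLog` and the function `checkUpdateLog` are hypothetical names for this example and are not part of the indexer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// byQueryLog holds the subset of log fields needed for the check.
// Field names follow the sample log lines in this description.
type byQueryLog struct {
	Msg      string `json:"msg"`
	Total    int64  `json:"total"`
	Updated  int64  `json:"updated"`
	TimedOut bool   `json:"timed_out"`
}

// checkUpdateLog returns nil when the line is the "updated documents" entry,
// the operation did not time out, and every matched document was updated.
func checkUpdateLog(line string) error {
	var e byQueryLog
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return fmt.Errorf("parse log line: %w", err)
	}
	if e.Msg != "resolve_reindexing updated documents" {
		return fmt.Errorf("unexpected msg %q", e.Msg)
	}
	if e.TimedOut {
		return fmt.Errorf("update by query timed out")
	}
	if e.Updated != e.Total {
		return fmt.Errorf("updated=%d does not match total=%d", e.Updated, e.Total)
	}
	return nil
}

func main() {
	// Last log line from the sample output above.
	line := `{"time":"2025-09-19T10:01:17.820686+10:00","level":"INFO","msg":"resolve_reindexing updated documents","batches":411,"total":410729,"updated":410729,"created":0,"deleted":0,"noops":0,"took":26342,"timed_out":false}`
	if err := checkUpdateLog(line); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok: all documents were updated")
}
```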