feat: handle force reindex
What does this MR do and why?
Related issue: [Code indexing pipeline] Handle force_reindex (#172 - closed)
While writing the ActiveContext Code Embeddings runbook, we realized that force-pushes are not yet handled in the Embeddings Indexing pipeline. We eventually agreed that a force-push should be handled like any other reindex.
This MR makes the Go Indexer perform a reindex when it is called with the options from_sha="" and force_reindex=true.
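As a minimal sketch of how that option combination could be detected, assuming a hypothetical `IndexerOptions` struct and `IsForceReindex` helper (neither name is taken from the Go Indexer source; only the JSON keys come from the commands in this description):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// IndexerOptions mirrors a subset of the -options JSON passed to the indexer.
// The struct and its method are hypothetical and exist only for illustration.
type IndexerOptions struct {
	FromSHA      string `json:"from_sha"`
	ToSHA        string `json:"to_sha"`
	ForceReindex bool   `json:"force_reindex"`
	ProjectID    int64  `json:"project_id"`
}

// ParseOptions decodes the options JSON string; unknown keys are ignored.
func ParseOptions(raw string) (IndexerOptions, error) {
	var opts IndexerOptions
	err := json.Unmarshal([]byte(raw), &opts)
	return opts, err
}

// IsForceReindex reports whether the run should be treated as a reindex:
// an empty from_sha combined with force_reindex=true.
func (o IndexerOptions) IsForceReindex() bool {
	return o.FromSHA == "" && o.ForceReindex
}

func main() {
	opts, err := ParseOptions(`{"from_sha": "", "to_sha": "15b8bcd618b1ee84664a73f9be2f2c169dc4f580", "force_reindex": true, "project_id": 79}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(opts.IsForceReindex())
}
```

With force_reindex=false and the same empty from_sha (the initial indexing case), the helper returns false, so the usual indexing path would run without the cleanup step.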
Solution Summary
- Add a reindexing field to the gitlab_active_context_code index (MR: gitlab!201428 (merged))
- Handle force reindexing in the Go Indexer (this MR)
  - follow the usual indexing process, but mark all found documents with reindexing=true
  - after the usual indexing process: 1.) delete all documents with reindexing=false, 2.) update the remaining documents to set reindexing=false after deletion
- In Rails, when there is a force-push, call the Go Indexer with options from_sha="" and force_reindex=true (Issue: gitlab#560713 (closed))
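The mark-then-sweep flow in the second bullet can be sketched with an in-memory stand-in for the index. This is an illustration under assumptions, not the Go Indexer's implementation: the `Doc` and `Index` types, the method names, and the use of one document per file path are all hypothetical simplifications (the real index stores chunks in a vector store).

```go
package main

import (
	"fmt"
	"sort"
)

// Doc is a simplified, hypothetical stand-in for an indexed document.
type Doc struct {
	Path       string
	Reindexing bool
}

// Index is an in-memory stand-in for the vector store, keyed by document ID.
type Index map[string]*Doc

// IndexFiles upserts one document per path and tags it with the given flag.
// During a force reindex, every document found in the tree gets reindexing=true.
func (idx Index) IndexFiles(paths []string, reindexing bool) {
	for _, p := range paths {
		idx[p] = &Doc{Path: p, Reindexing: reindexing}
	}
}

// ResolveReindexing runs the two post-indexing steps:
//  1. delete documents still flagged reindexing=false (not seen in this run),
//  2. reset the flag on the remaining documents to reindexing=false.
// It returns the number of deleted documents.
func (idx Index) ResolveReindexing() int {
	deleted := 0
	for id, d := range idx {
		if !d.Reindexing {
			delete(idx, id) // deleting during range is safe in Go
			deleted++
		}
	}
	for _, d := range idx {
		d.Reindexing = false
	}
	return deleted
}

// Paths returns the indexed paths in sorted order.
func (idx Index) Paths() []string {
	paths := make([]string, 0, len(idx))
	for _, d := range idx {
		paths = append(paths, d.Path)
	}
	sort.Strings(paths)
	return paths
}

func main() {
	idx := Index{}
	// Initial indexing: four files, force_reindex=false.
	idx.IndexFiles([]string{"a.rb", "b.rb", "c.rb", "d.rb"}, false)
	// After a force-push, d.rb is gone; reindex with force_reindex=true.
	idx.IndexFiles([]string{"a.rb", "b.rb", "c.rb"}, true)
	deleted := idx.ResolveReindexing()
	fmt.Println(deleted, idx.Paths()) // prints: 1 [a.rb b.rb c.rb]
}
```

Resetting the flag only after the deletion matters: if the order were reversed, every document would carry reindexing=false and the sweep would delete the freshly indexed ones too.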
Also see proposal: #172 (closed)
Solution Discussion
See thread: !704 (comment 2687171065)
How to set up and validate locally
Note: this validation specifically tests a force-push scenario, but it should apply to any scenario where the Go Indexer options are from_sha="", force_reindex=true
1- Follow the setup for running the Go Indexer in chunk mode
Setup steps: gitlab#550418 (comment 2594554040)
2- On your local GDK, create a test project and add files
Example project
In this example, I've created the project gitlab-duo/force-push-test with the following commits and files:
[Screenshots not available: Initial Commits, Initial Files]
3- Run the initial indexing
Go Indexer command
# note that for `to_sha`, we are using the last commit sha of the initial tree
# we have also set `force_reindex=false`
make && \
  GITLAB_INDEXER_MODE=chunk \
  GITLAB_INDEXER_DEBUG_LOGGING=1 \
  ./bin/gitlab-elasticsearch-indexer \
    -adapter "elasticsearch" \
    -connection '{"url": ["http://localhost:9200"]}' \
    -options '{
      "timeout": "30m",
      "chunk_size": 1000,
      "gitaly_batch_size": 1000,
      "from_sha": "",
      "to_sha": "1cd70bdd72f37ea6fe0f4a7337dcd0381c9e50a6",
      "force_reindex": false,
      "project_id": 79,
      "partition_name": "gitlab_active_context_code",
      "partition_number": 0,
      "gitaly_config": {
        "address": "unix:/Users/pamartiaga/Code/gitlab-development-kit/praefect.socket",
        "storage": "default",
        "relative_path": "@hashed/98/a3/98a3ab7c340e8a033e7b37b6ef9428751581760af67bbab2b9e05d4964a8874a.git",
        "project_path": "gitlab-duo/force-push-test"
      }
    }'
4- Update the test project with force pushes
Make sure to delete one file for testing
5- Run the Go Indexer in reindexing mode (from_sha="", force_reindex=true)
Go Indexer Command
# note that for `to_sha`, we are using the last commit sha of the new tree
# we have also set `force_reindex=true`
make && \
  GITLAB_INDEXER_MODE=chunk \
  GITLAB_INDEXER_DEBUG_LOGGING=1 \
  ./bin/gitlab-elasticsearch-indexer \
    -adapter "elasticsearch" \
    -connection '{"url": ["http://localhost:9200"]}' \
    -options '{
      "timeout": "30m",
      "chunk_size": 1000,
      "gitaly_batch_size": 1000,
      "from_sha": "",
      "to_sha": "15b8bcd618b1ee84664a73f9be2f2c169dc4f580",
      "force_reindex": true,
      "project_id": 79,
      "partition_name": "gitlab_active_context_code",
      "partition_number": 0,
      "gitaly_config": {
        "address": "unix:/Users/pamartiaga/Code/gitlab-development-kit/praefect.socket",
        "storage": "default",
        "relative_path": "@hashed/98/a3/98a3ab7c340e8a033e7b37b6ef9428751581760af67bbab2b9e05d4964a8874a.git",
        "project_path": "gitlab-duo/force-push-test"
      }
    }'
Verify that the files are reindexed correctly
The logs show that:
- only 3 files in total were indexed
- a "resolve reindexing" step was called to delete the documents for files that are no longer in the git tree
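Beyond reading the logs, one way to double-check the result is to count documents that are still flagged reindexing=true for the project. After a successful resolve step that count should be 0, since the remaining documents are reset to reindexing=false. The sketch below only builds the request body for Elasticsearch's `_count` API. The `project_id` field name is an assumption about the index mapping, and the helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// termFilter builds a bool/filter query matching one project and one value of
// the reindexing flag. The "project_id" field name is an assumption.
func termFilter(projectID int64, reindexing bool) map[string]any {
	return map[string]any{
		"bool": map[string]any{
			"filter": []any{
				map[string]any{"term": map[string]any{"project_id": projectID}},
				map[string]any{"term": map[string]any{"reindexing": reindexing}},
			},
		},
	}
}

// CountLeftoverBody returns the JSON body for POST /<index>/_count that
// matches documents still flagged reindexing=true for the given project.
func CountLeftoverBody(projectID int64) (string, error) {
	b, err := json.Marshal(map[string]any{"query": termFilter(projectID, true)})
	return string(b), err
}

func main() {
	body, err := CountLeftoverBody(79)
	if err != nil {
		panic(err)
	}
	// Send this body to the _count endpoint of the index that backs the
	// gitlab_active_context_code partition; the exact index name depends
	// on the local setup.
	fmt.Println(body)
}
```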
Files on the vector store:
Closes #172 (closed)






