[Code indexing pipeline] Handle force_reindex
Context
While writing the ActiveContext Code Embeddings runbook, we realized that force-pushes are not yet handled in the Embeddings Indexing pipeline.
From @terrichu in gitlab-com/runbooks!9235 (comment 2667767913):
What happens when a force push happens and the commit SHA/history is changed? Does the comparison for the last_commit SHA detect this and reindex the whole thing from zero-SHA?
It doesn't sound like there is any difference between indexing being enabled and indexing being paused
From @dgruzd in gitlab-com/runbooks!9235 (comment 2669315655):
I believe that we should test the force pushes with the chunk mode. The indexer should treat it exactly the same as initial indexing and perform a full indexing 🤝
Continued conversation in initial MR
From @partiaga in !698 (comment 2679690769):
Thinking about this further, what we can do in the Go Indexer instead is handle the scenario where the "total" changes of a force-push deletes some files. Because if we send in something like from_sha="" & to_sha=latestSha from Rails, since git is reading the new tree from the force-push, it won't be able to "know" which files in the old tree are actually no longer in the new tree.
From @partiaga in !698 (comment 2682013166):
If it's a force-push, and we send in from_sha="", to_sha=latest_commit, gitaly itself will not know what the files in the old tree are, so it won't know which files are for deletion. The gitaly.EachFileChangeBatched function will ultimately summarize the changes of ""...latest_sha as just a list of additions/updates without deletions.
The chunk-mode indexer's internal/mode/chunk/indexer/elasticsearch/indexer.go does not support deleting chunks with non-existent file paths. It has deleteOrphanedChunks, but this is for deleting chunks in an updated file.
So we still need to update the chunk-mode indexer to support deleting chunks with file paths that are no longer in git
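The problem described in the comment above can be illustrated with a minimal sketch. It does not use Gitaly or the real indexer: `Tree`, `Change`, `DiffTrees` and `CountByStatus` are hypothetical names made up for this example, and a tree is modelled as a plain map of file path to blob ID. The point is only that a diff taken from an empty "from" side can never report a deletion.

```go
package main

import (
	"fmt"
	"sort"
)

// Tree is a hypothetical stand-in for a git tree: file path -> blob ID.
type Tree map[string]string

// Change is a hypothetical stand-in for one entry of a tree diff.
type Change struct {
	Path   string
	Status string // "added", "modified" or "deleted"
}

// DiffTrees lists what changed going from the "from" tree to the "to" tree.
// The result is sorted by path so the output is deterministic.
func DiffTrees(from, to Tree) []Change {
	var changes []Change
	for path, blob := range to {
		oldBlob, existed := from[path]
		switch {
		case !existed:
			changes = append(changes, Change{path, "added"})
		case oldBlob != blob:
			changes = append(changes, Change{path, "modified"})
		}
	}
	// Deletions can only be discovered by walking the old tree.
	for path := range from {
		if _, ok := to[path]; !ok {
			changes = append(changes, Change{path, "deleted"})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Path < changes[j].Path })
	return changes
}

// CountByStatus counts the changes that carry the given status.
func CountByStatus(changes []Change, status string) int {
	n := 0
	for _, c := range changes {
		if c.Status == status {
			n++
		}
	}
	return n
}

func main() {
	indexed := Tree{"a.go": "1", "old.go": "2"} // tree the index was built from
	latest := Tree{"a.go": "1", "new.go": "3"}  // tree after the force-push

	// With the old tree available, the removal of old.go is visible.
	fmt.Println(DiffTrees(indexed, latest)) // [{new.go added} {old.go deleted}]

	// With from_sha="" the old tree is empty, so everything looks like an addition.
	fmt.Println(DiffTrees(Tree{}, latest)) // [{a.go added} {new.go added}]
}
```

In the second call nothing tells the indexer that chunks for old.go are still sitting in the index, which is why the indexer itself needs a way to remove them.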
References
- See solution discussion: !704 (comment 2687171065)
- Chosen solution: !704 (comment 2687184492)
Proposal
We can make use of a reindexing field in the documents such that:
- Add a reindexing field to indicate a document is in the process of reindexing
- In the Go Indexer (this issue)
  - When running the Go Indexer with the options from_sha="" and force_reindex=true, this should run in reindexing mode
    - During the reindexing, follow the usual indexing process, but mark all found documents with reindexing=true
    - After the usual indexing process
      - delete all documents with reindexing=false (essentially deleting all documents that were not found during reindexing)
      - after deletion, update the remaining documents to set reindexing=false (indicating that we are no longer in reindexing mode)
- In Rails, when there is a force-push, call the Go Indexer with options from_sha="" and force_reindex=true
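The proposed flow is essentially a mark-and-sweep over the index. A minimal sketch of the three steps follows, using an in-memory map in place of the real index. `Doc`, `Chunk`, `Store`, `Upsert` and `ForceReindex` are hypothetical names invented for this illustration and are not taken from the Go Indexer. In a real Elasticsearch-backed indexer the sweep and reset steps would presumably be issued as delete-by-query and update-by-query requests, but that mapping is an assumption.

```go
package main

import "fmt"

// Doc is a hypothetical stand-in for an indexed chunk document.
type Doc struct {
	Path       string
	Content    string
	Reindexing bool // the proposed marker field
}

// Chunk is a hypothetical stand-in for a chunk produced while walking the repository.
type Chunk struct {
	ID      string
	Path    string
	Content string
}

// Store is a hypothetical in-memory index keyed by document ID.
type Store map[string]*Doc

// Upsert writes one document and sets its reindexing marker.
func (s Store) Upsert(id, path, content string, reindexing bool) {
	s[id] = &Doc{Path: path, Content: content, Reindexing: reindexing}
}

// ForceReindex sketches the flow for from_sha="" and force_reindex=true.
// It returns the number of stale documents that were removed.
func ForceReindex(s Store, chunks []Chunk) (deleted int) {
	// 1. Usual indexing process, marking every document that is found.
	for _, c := range chunks {
		s.Upsert(c.ID, c.Path, c.Content, true)
	}
	// 2. Sweep: documents still at reindexing=false were not found in the new tree.
	for id, d := range s {
		if !d.Reindexing {
			delete(s, id) // deleting during range is safe in Go
			deleted++
		}
	}
	// 3. Reset the marker so the index is no longer in reindexing mode.
	for _, d := range s {
		d.Reindexing = false
	}
	return deleted
}

func main() {
	// State of the index before the force-push.
	store := Store{}
	store.Upsert("a.go#0", "a.go", "old content", false)
	store.Upsert("old.go#0", "old.go", "stale content", false)

	// Chunks found in the tree after the force-push: old.go is gone.
	chunks := []Chunk{
		{ID: "a.go#0", Path: "a.go", Content: "new content"},
		{ID: "new.go#0", Path: "new.go", Content: "fresh content"},
	}

	deleted := ForceReindex(store, chunks)
	fmt.Println(deleted, len(store)) // 1 2
}
```

One property worth noting in this sketch: the stale documents stay searchable until the sweep in step 2, so the index is never empty during a force reindex.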