[Code indexing pipeline] Handle force_reindex
Context
While writing the ActiveContext Code Embeddings runbook, we realized that force-pushes are not yet handled in the Embeddings Indexing pipeline.
From @terrichu
in gitlab-com/runbooks!9235 (comment 2667767913):
What happens when a force push happens and the commit SHA/history is changed? Does the comparison for the `last_commit` SHA detect this and reindex the whole thing from zero-SHA?
It doesn't sound like there is any difference between indexing being enabled and indexing being paused.
From @dgruzd
in gitlab-com/runbooks!9235 (comment 2669315655):
I believe that we should test the force pushes with the `chunk` mode. The indexer should treat it exactly the same as initial indexing and perform a full indexing. 🤝
Continued conversation in initial MR
From @partiaga
in !698 (comment 2679690769):
Thinking about this further, what we can do in the Go Indexer instead is handle the scenario where the "total" changes of a force-push deletes some files. Because if we send in something like `from_sha="" & to_sha=latestSha` from Rails, since git is reading the new tree from the force-push, it won't be able to "know" which files in the old tree are actually no longer in the new tree.
From @partiaga
in !698 (comment 2682013166):
If it's a force-push, and we send in `from_sha="", to_sha=latest_commit`, `gitaly` itself will not know what the files in the old tree are, so it won't know which files are for deletion. The `gitaly.EachFileChangeBatched` function will ultimately summarize the changes of `""...latest_sha` as just a list of additions/updates without deletions.
The chunk-mode indexer's `internal/mode/chunk/indexer/elasticsearch/indexer.go` does not support deleting chunks with non-existent file paths. It has `deleteOrphanedChunks`, but this is for deleting chunks in an updated file.
So we still need to update the chunk-mode indexer to support deleting chunks with file paths that are no longer in git.
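To illustrate the point above, here is a minimal sketch of why a diff that starts from an empty SHA cannot report deletions. It does not use the real gitaly API: `Tree`, `Changes`, and `diffTrees` are hypothetical names introduced only for this example, and a git tree is modeled as a plain map from file path to blob ID.

```go
package main

import (
	"fmt"
	"sort"
)

// Tree is a hypothetical stand-in for a git tree: file path -> blob ID.
type Tree map[string]string

// Changes is a hypothetical summary of a diff between two trees.
type Changes struct {
	Upserts []string // files added or modified
	Deletes []string // files present in the old tree but not in the new one
}

// diffTrees compares two trees. Passing a nil `from` models from_sha="".
func diffTrees(from, to Tree) Changes {
	var c Changes
	for path, blob := range to {
		// Reading from a nil map is safe in Go and reports "not found".
		if old, ok := from[path]; !ok || old != blob {
			c.Upserts = append(c.Upserts, path)
		}
	}
	for path := range from {
		if _, ok := to[path]; !ok {
			c.Deletes = append(c.Deletes, path)
		}
	}
	sort.Strings(c.Upserts)
	sort.Strings(c.Deletes)
	return c
}

func main() {
	oldTree := Tree{"a.go": "1", "b.go": "2"}
	newTree := Tree{"a.go": "3", "c.go": "4"} // tree after the force-push

	// With the old tree known, the removal of b.go is visible.
	withOld := diffTrees(oldTree, newTree)
	fmt.Printf("upserts=%v deletes=%v\n", withOld.Upserts, withOld.Deletes)

	// With from_sha="" there is no old tree, so nothing can be marked as deleted.
	fromEmpty := diffTrees(nil, newTree)
	fmt.Printf("upserts=%v deletes=%v\n", fromEmpty.Upserts, fromEmpty.Deletes)
}
```

In the second call `b.go` never appears in the result, so any chunks already indexed for it would stay in the index unless the indexer removes them by some other means.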
References
- See solution discussion: !704 (comment 2687171065)
- Chosen solution: !704 (comment 2687184492)
Proposal
We can make use of a `reindexing` field in the documents such that:
- Add a `reindexing` field to indicate a document is in the process of reindexing
- In the Go Indexer (this issue)
  - When running the Go Indexer with the options `from_sha=""` and `force_reindex=true`, this should run in `reindexing` mode
    - During the reindexing, follow the usual indexing process, but mark all found documents with `reindexing=true`
    - After the usual indexing process:
      - delete all documents with `reindexing=false` (essentially deleting all documents that were not found during reindexing)
      - after deletion, update the remaining documents to set `reindexing=false` (indicating that we are no longer in `reindexing` mode)
- In Rails, when there is a force-push, call the Go Indexer with options `from_sha=""` and `force_reindex=true`
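The proposal amounts to a mark-and-sweep pass over the indexed documents. The following is a rough sketch of that flow against an in-memory map, not the actual chunk-mode indexer or Elasticsearch client: `Doc`, `Index`, `Options`, and `Run` are hypothetical names, and each file is assumed to produce a single document to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// Doc is a hypothetical stand-in for an indexed chunk document.
type Doc struct {
	Path       string
	Content    string
	Reindexing bool // the proposed `reindexing` field
}

// Index is an in-memory stand-in for the document store, keyed by file path.
type Index map[string]*Doc

// Options mirrors the two options named in the proposal.
type Options struct {
	FromSHA      string
	ForceReindex bool
}

// reindexingMode is true only for from_sha="" combined with force_reindex=true.
func (o Options) reindexingMode() bool {
	return o.FromSHA == "" && o.ForceReindex
}

// Run indexes the given files (path -> content) and, in reindexing mode,
// sweeps away every document that was not found during this run.
func Run(idx Index, files map[string]string, opts Options) {
	reindexing := opts.reindexingMode()

	// Usual indexing process; documents found in this run are marked.
	for path, content := range files {
		idx[path] = &Doc{Path: path, Content: content, Reindexing: reindexing}
	}
	if !reindexing {
		return
	}

	// Sweep: delete all documents still carrying reindexing=false.
	for id, doc := range idx {
		if !doc.Reindexing {
			delete(idx, id) // deleting while ranging over a map is safe in Go
		}
	}

	// Reset the flag on the remaining documents to leave reindexing mode.
	for _, doc := range idx {
		doc.Reindexing = false
	}
}

func main() {
	// State before the force-push: a.go and b.go are indexed.
	idx := Index{
		"a.go": {Path: "a.go", Content: "old a"},
		"b.go": {Path: "b.go", Content: "old b"},
	}

	// The tree after the force-push no longer contains b.go.
	newTree := map[string]string{"a.go": "new a", "c.go": "new c"}
	Run(idx, newTree, Options{FromSHA: "", ForceReindex: true})

	paths := make([]string, 0, len(idx))
	for p := range idx {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	fmt.Println(paths) // [a.go c.go]
}
```

Against a real document store, the sweep and the flag reset would presumably be bulk delete and update queries filtered on the `reindexing` field rather than loops over documents; that mapping is an assumption and not something the discussion above specifies.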