[Code indexing pipeline] Handle force_reindex

Context

While writing the ActiveContext Code Embeddings runbook, we realized that force-pushes are not yet handled in the Embeddings Indexing pipeline.

From @terrichu in gitlab-com/runbooks!9235 (comment 2667767913):

What happens when a force push happens and the commit SHA/history is changed? Does the comparison for the last_commit SHA detect this and reindex the whole thing from zero-SHA?

It doesn't sound like there is any difference between indexing being enabled and indexing being paused

From @dgruzd in gitlab-com/runbooks!9235 (comment 2669315655):

I believe that we should test the force pushes with the chunk mode. The indexer should treat it exactly the same as initial indexing and perform a full indexing 🤝

Continued conversation in initial MR

From @partiaga in !698 (comment 2679690769):

Thinking about this further, what we can do in the Go Indexer instead is handle the scenario where the "total" changes of a force-push delete some files. If we send in something like from_sha="" & to_sha=latestSha from Rails, git only reads the new tree from the force-push, so it won't be able to "know" which files in the old tree are no longer in the new tree.

From @partiaga in !698 (comment 2682013166):

If it's a force-push, and we send in from_sha="", to_sha=latest_commit, gitaly itself will not know what the files in the old tree are, so it won't know which files are for deletion. The gitaly.EachFileChangeBatched function will ultimately summarize the changes of ""...latest_sha as just a list of additions/updates without deletions.

The chunk-mode indexer's internal/mode/chunk/indexer/elasticsearch/indexer.go does not support deleting chunks with non-existent file paths. It has deleteOrphanedChunks, but this is for deleting chunks in an updated file.

So we still need to update the chunk-mode indexer to support deleting chunks with file paths that are no longer in git

Proposal

We can use a reindexing field in the documents, as follows:

  • Add a reindexing field to indicate a document is in the process of reindexing

  • In the Go Indexer (this issue)

    1. When the Go Indexer runs with the options from_sha="" and force_reindex=true, it should run in reindexing mode
    2. During reindexing, follow the usual indexing process, but mark all found documents with reindexing=true
    3. After the usual indexing process completes:
      • delete all documents with reindexing=false (that is, all documents that were not found during reindexing)
      • after the deletion, update the remaining documents to set reindexing=false (indicating that we are no longer in reindexing mode)
  • In Rails, when there is a force-push, call the Go Indexer with options from_sha="" and force_reindex=true

Edited by Pam Artiaga