feat: handle force reindex

What does this MR do and why?

Related issue: [Code indexing pipeline] Handle force_reindex (#172 - closed)

While writing the ActiveContext Code Embeddings runbook, we realized that force-pushes are not yet handled in the Embeddings Indexing pipeline. We eventually agreed that a force-push should be handled like any other reindex.

This MR implements reindexing for the case where the Go Indexer is called with the options from_sha="" and force_reindex=true.
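
As a minimal sketch of how that option combination could be detected, the snippet below decodes a subset of the `-options` JSON used in the validation commands further down. The field names come from those commands; the `indexerOptions` type and the `parseOptions` and `isFullReindex` helpers are hypothetical names for illustration and are not taken from the Go Indexer source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexerOptions is a hypothetical struct mirroring a subset of the
// -options JSON shown in the validation commands. Other keys are ignored.
type indexerOptions struct {
	FromSHA      string `json:"from_sha"`
	ToSHA        string `json:"to_sha"`
	ForceReindex bool   `json:"force_reindex"`
	ProjectID    int64  `json:"project_id"`
}

// parseOptions decodes the raw JSON string passed via -options.
func parseOptions(raw string) (indexerOptions, error) {
	var o indexerOptions
	err := json.Unmarshal([]byte(raw), &o)
	return o, err
}

// isFullReindex reports whether the options describe the case this MR
// handles: an empty from_sha combined with force_reindex=true.
func (o indexerOptions) isFullReindex() bool {
	return o.FromSHA == "" && o.ForceReindex
}

func main() {
	// Initial indexing: from_sha is empty, but force_reindex is false.
	initial, err := parseOptions(`{"from_sha": "", "force_reindex": false, "project_id": 79}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(initial.isFullReindex()) // prints "false"

	// Reindex after a force-push: from_sha is empty and force_reindex is true.
	reindex, err := parseOptions(`{"from_sha": "", "force_reindex": true, "project_id": 79}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(reindex.isFullReindex()) // prints "true"
}
```

Both runs start from an empty from_sha, so in this sketch only the force_reindex flag distinguishes a first-time index from a reindex.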

Solution Summary

  1. Add a reindexing field to the gitlab_active_context_code index (MR: gitlab!201428 (merged))
  2. Handle force reindexing in the Go Indexer (this MR)
    1. Follow the usual indexing process, but mark every document found with reindexing=true.
    2. After the usual indexing process: (1) delete all documents with reindexing=false, then (2) update the remaining documents to set reindexing=false.
  3. In Rails, when there is a force-push, call the Go Indexer with options from_sha="" and force_reindex=true (Issue: gitlab#560713 (closed))
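
The mark-then-resolve flow in step 2 can be illustrated with a small in-memory sketch. This is not the Go Indexer implementation: the `doc` and `store` types and the `indexTree` and `resolveReindexing` functions are hypothetical stand-ins, and a plain map takes the place of the vector store. It only shows why documents for files that are no longer in the tree end up deleted.

```go
package main

import (
	"fmt"
	"sort"
)

// doc is a hypothetical stand-in for one indexed document.
type doc struct {
	Path       string
	Reindexing bool
}

// store is a hypothetical in-memory stand-in for the index, keyed by path.
type store map[string]*doc

// indexTree upserts one document per path found in the git tree.
// During a force reindex, every document found is marked reindexing=true.
func (s store) indexTree(paths []string, forceReindex bool) {
	for _, p := range paths {
		s[p] = &doc{Path: p, Reindexing: forceReindex}
	}
}

// resolveReindexing runs after indexing. It deletes documents that were not
// touched by the reindex (reindexing=false), then resets the flag on the
// documents that remain. It returns the number of deleted documents.
func (s store) resolveReindexing() int {
	deleted := 0
	for id, d := range s {
		if !d.Reindexing {
			delete(s, id) // deleting during range is safe in Go
			deleted++
		}
	}
	for _, d := range s {
		d.Reindexing = false
	}
	return deleted
}

// paths returns the stored document paths in sorted order.
func (s store) paths() []string {
	out := make([]string, 0, len(s))
	for p := range s {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

func main() {
	s := store{}

	// Initial indexing with force_reindex=false: four files in the tree.
	s.indexTree([]string{"a.rb", "b.rb", "c.rb", "d.rb"}, false)

	// After a force-push, d.rb is gone. Reindex with force_reindex=true.
	s.indexTree([]string{"a.rb", "b.rb", "c.rb"}, true)
	deleted := s.resolveReindexing()

	fmt.Println(deleted, s.paths()) // prints "1 [a.rb b.rb c.rb]"
}
```

Resetting the flag only after the delete means the next force reindex starts from a state where every document has reindexing=false.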

Also see proposal: #172 (closed)

Solution Discussion

See thread: !704 (comment 2687171065)

How to set up and validate locally

Note: this validation tests a force-push scenario specifically, but it applies to any scenario where the Go Indexer options are from_sha="" and force_reindex=true.

1- Follow the setup for running the Go Indexer in chunk mode

Setup steps: gitlab#550418 (comment 2594554040)

2- On your local GDK, create a test project and add files

In this example, I've created the project gitlab-duo/force-push-test with the following commits and files:

Initial Commits Initial Files
initial_commits initial_files

3- Run the initial indexing

Go Indexer command
# note that for `to_sha`, we are using the last commit sha of the initial tree
# we have also set `force_reindex=false`
make && \
GITLAB_INDEXER_MODE=chunk \
GITLAB_INDEXER_DEBUG_LOGGING=1 \
./bin/gitlab-elasticsearch-indexer \
-adapter "elasticsearch" \
-connection '{"url": ["http://localhost:9200"]}' \
-options '{
  "timeout": "30m",
  "chunk_size": 1000,
  "gitaly_batch_size": 1000,
  "from_sha": "",
  "to_sha": "1cd70bdd72f37ea6fe0f4a7337dcd0381c9e50a6",
  "force_reindex": false,
  "project_id": 79,
  "partition_name": "gitlab_active_context_code",
  "partition_number": 0,
  "gitaly_config": {
    "address": "unix:/Users/pamartiaga/Code/gitlab-development-kit/praefect.socket",
    "storage": "default",
    "relative_path": "@hashed/98/a3/98a3ab7c340e8a033e7b37b6ef9428751581760af67bbab2b9e05d4964a8874a.git",
    "project_path": "gitlab-duo/force-push-test"
  }
}'

Verify that the files are indexed correctly: initial_indexed_files

4- Update the test project with force pushes

Make sure the force push deletes at least one file, so that the reindex has a stale document to remove.

Example changes:
Commits Files
forcepush_commits forcepush_files

5- Run the Go Indexer in reindexing mode (from_sha="", force_reindex=true)

Go Indexer command
# note that for `to_sha`, we are using the last commit sha of the new tree
# we have also set `force_reindex=true`
make && \
GITLAB_INDEXER_MODE=chunk \
GITLAB_INDEXER_DEBUG_LOGGING=1 \
./bin/gitlab-elasticsearch-indexer \
-adapter "elasticsearch" \
-connection '{"url": ["http://localhost:9200"]}' \
-options '{
  "timeout": "30m",
  "chunk_size": 1000,
  "gitaly_batch_size": 1000,
  "from_sha": "",
  "to_sha": "15b8bcd618b1ee84664a73f9be2f2c169dc4f580",
  "force_reindex": true,
  "project_id": 79,
  "partition_name": "gitlab_active_context_code",
  "partition_number": 0,
  "gitaly_config": {
    "address": "unix:/Users/pamartiaga/Code/gitlab-development-kit/praefect.socket",
    "storage": "default",
    "relative_path": "@hashed/98/a3/98a3ab7c340e8a033e7b37b6ef9428751581760af67bbab2b9e05d4964a8874a.git",
    "project_path": "gitlab-duo/force-push-test"
  }
}'

Verify that the files are reindexed correctly

The logs show that:

  • only 3 files were indexed in total
  • a "resolve reindexing" step was called to delete the documents for files that are no longer in the git tree

force_reindex_logs

Files on the vector store:

force_reindexed_files

Closes #172 (closed)

Edited by Pam Artiaga
