Zoekt indexer does not detect force pushes correctly, causing stale search results
## Summary
The Zoekt indexer uses a commit **existence check** (`FindCommit`) instead of an **ancestor check** (`CommitIsAncestor`) when determining whether to perform an incremental or full reindex. This means that after a force push, the indexer incorrectly performs an incremental index against a commit that is no longer in the branch's history, resulting in stale or missing content in search results until git garbage collection removes the old commits.
## Steps to reproduce
1. Have a repository indexed by Zoekt with commits `A → B → C` on `main` (Zoekt has indexed up to commit `C`)
2. Force push a rebase to `main`: `A → B' → C'` (rewritten history)
3. Wait for the Zoekt indexer to process the repository
4. Search for content that was changed or removed during the rebase
## What is the current *bug* behavior?
After the force push:
1. The indexer reads the last-indexed SHA (`C`) from Zoekt index metadata
2. It calls [`IsValidSHA(C)`](https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/blob/main/internal/gitaly/gitaly.go#L128-137) which uses Gitaly's `FindCommit` — this returns `true` because the old commit `C` still exists in the git object store (pre-GC)
3. The indexer [sets `FromHash = C`](https://gitlab.com/gitlab-org/gitlab-zoekt-indexer/-/blob/main/internal/indexer/indexer.go#L114-122) and performs an **incremental index** from `C` to `C'`
4. The diff between the old `C` and the new `C'` does not correctly capture all changes introduced by the rewrite — files changed in rewritten commits may be missed
This results in:
- **Stale content**: Files deleted during a history rewrite remain searchable
- **Missing content**: Files changed during the rebase may not be re-indexed
- **Security concern**: Sensitive content removed via force push (e.g., secrets scrubbed from history) remains searchable until git GC runs
The bug window persists from the force push until git garbage collection removes the unreachable objects, which can be **hours to days** on GitLab.com.
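The difference between the two checks can be illustrated with a toy model. The following is a minimal sketch, not the indexer's code: `commitGraph`, `exists`, and `isAncestor` are hypothetical names, and a plain map stands in for the git object store before GC.

```go
package main

import "fmt"

// commitGraph maps a commit ID to its parent IDs. It is a hypothetical
// stand-in for the git object store: unreachable commits stay in the map
// until garbage collection would remove them.
type commitGraph map[string][]string

// exists mimics an existence check: it only asks whether the object is
// present, not whether any branch still reaches it.
func (g commitGraph) exists(sha string) bool {
	_, ok := g[sha]
	return ok
}

// isAncestor reports whether ancestor is reachable from child by following
// parent links. A commit counts as its own ancestor.
func (g commitGraph) isAncestor(ancestor, child string) bool {
	if !g.exists(ancestor) || !g.exists(child) {
		return false
	}
	seen := map[string]bool{}
	stack := []string{child}
	for len(stack) > 0 {
		sha := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if sha == ancestor {
			return true
		}
		if seen[sha] {
			continue
		}
		seen[sha] = true
		stack = append(stack, g[sha]...)
	}
	return false
}

func main() {
	// History before the force push: A <- B <- C.
	// History after the force push:  A <- B' <- C'.
	// B and C are unreachable from main but still stored (pre-GC).
	g := commitGraph{
		"A":  nil,
		"B":  {"A"},
		"C":  {"B"},
		"B'": {"A"},
		"C'": {"B'"},
	}
	fmt.Println("C exists:", g.exists("C"))
	fmt.Println("C is ancestor of C':", g.isAncestor("C", "C'"))
}
```

In this model the existence check still succeeds for `C`, while the ancestor check against `C'` fails, which is the case the indexer currently misses.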
## What is the expected *correct* behavior?
After a force push, the indexer should detect that the previously-indexed commit is **not an ancestor** of the new HEAD and trigger a **full reindex**, ensuring search results accurately reflect the current state of the repository.
This is how the Elasticsearch indexer correctly handles this scenario — it uses [`repository.ancestor?(from_sha, to_sha)`](https://gitlab.com/gitlab-org/gitlab/-/blob/705719dd492411bffa31c88c8b301b0006150db7/ee/lib/gitlab/elastic/indexer.rb#L239-246) to verify the ancestor relationship before proceeding with an incremental index.
## Relevant logs and/or screenshots
### Current Zoekt indexer logic (buggy)
```go
// internal/indexer/indexer.go lines 114-122
if !ok || i.ForceReindex {
	i.ForceReindex = true
	i.gitalyClient.FromHash = ""
} else if i.gitalyClient.IsValidSHA(zoektSHA) { // ← existence check only
	i.gitalyClient.FromHash = zoektSHA
} else {
	i.ForceReindex = true
	i.gitalyClient.FromHash = ""
}
```
```go
// internal/gitaly/gitaly.go lines 128-137
func (gc *GitalyClient) IsValidSHA(SHA string) bool {
	request := &pb.FindCommitRequest{
		Repository: gc.repository,
		Revision:   []byte(SHA),
	}
	commit, err := gc.commitServiceClient.FindCommit(gc.ctx, request)
	return err == nil && commit.Commit != nil
}
```
### Comparison: ES-indexer logic (correct)
```ruby
# ee/lib/gitlab/elastic/indexer.rb lines 239-246
def last_commit_ancestor_of?(to_sha)
  return true if Gitlab::Git.blank_ref?(from_sha)
  return false unless repository_contains_last_indexed_commit?

  from_sha == repository.empty_tree_id || repository.ancestor?(from_sha, to_sha)
end
```
### Impact comparison
| Scenario | ES-Indexer | Zoekt Indexer |
|----------|-----------|---------------|
| Force push (pre-GC) | ✅ Detects via ancestor check → full reindex | ❌ Incorrect incremental index |
| Force push (post-GC) | ✅ Detects via commit existence check | ✅ Detects via `IsValidSHA` |
| Normal push | ✅ Incremental | ✅ Incremental |
## Possible fixes
Replace the `IsValidSHA` existence check with an ancestor check backed by Gitaly's `CommitIsAncestor` RPC in `internal/indexer/indexer.go`:
```go
} else if i.gitalyClient.IsAncestor(zoektSHA, i.TargetSHA) {
	i.gitalyClient.FromHash = zoektSHA
} else {
	i.ForceReindex = true
	i.gitalyClient.FromHash = ""
}
```
Add a new `IsAncestor` method to `GitalyClient` in `internal/gitaly/gitaly.go`:
```go
func (gc *GitalyClient) IsAncestor(ancestorSHA, childSHA string) bool {
	request := &pb.CommitIsAncestorRequest{
		Repository: gc.repository,
		AncestorId: ancestorSHA,
		ChildId:    childSHA,
	}
	response, err := gc.commitServiceClient.CommitIsAncestor(gc.ctx, request)
	return err == nil && response.Value
}
```
The `CommitIsAncestor` RPC is already available via the existing `commitServiceClient` in `GitalyClient`, so no new gRPC connections or dependencies are needed. If `ancestorSHA` no longer exists (for example, after GC), the call returns either `false` or an error; `IsAncestor` returns `false` in both cases, correctly triggering a full reindex.
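To make the new branch logic easy to unit-test, the decision could be pulled out into a pure function. This is only a sketch under assumptions: `ancestryChecker`, `fakeChecker`, and `planIndex` are hypothetical names that do not exist in the indexer, and the real code mutates `i.ForceReindex` and `i.gitalyClient.FromHash` instead of returning values.

```go
package main

import "fmt"

// ancestryChecker is a hypothetical interface that the proposed
// GitalyClient.IsAncestor method would satisfy.
type ancestryChecker interface {
	IsAncestor(ancestorSHA, childSHA string) bool
}

// fakeChecker answers from a fixed set of {ancestor, child} pairs, so the
// decision logic can be exercised without a Gitaly server.
type fakeChecker map[[2]string]bool

func (f fakeChecker) IsAncestor(ancestorSHA, childSHA string) bool {
	return f[[2]string{ancestorSHA, childSHA}]
}

// planIndex mirrors the proposed branch: it returns the hash to index from
// and whether a full reindex is required.
func planIndex(c ancestryChecker, indexedSHA string, haveIndexedSHA bool, targetSHA string, force bool) (fromHash string, fullReindex bool) {
	if !haveIndexedSHA || force {
		return "", true
	}
	if c.IsAncestor(indexedSHA, targetSHA) {
		return indexedSHA, false // normal push: incremental index
	}
	return "", true // force push or missing commit: full reindex
}

func main() {
	checker := fakeChecker{
		{"C", "D"}: true, // normal push: D was added on top of C
	}
	from, full := planIndex(checker, "C", true, "D", false)
	fmt.Printf("normal push: from=%q full=%v\n", from, full)

	from, full = planIndex(checker, "C", true, "C'", false)
	fmt.Printf("force push:  from=%q full=%v\n", from, full)
}
```

With a fake checker like this, the three rows of the impact table (normal push, force push pre-GC, force push post-GC) can each be covered by a table-driven test.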
### Schema version bump (full reindex)
Fixing the ancestor check only prevents **future** force pushes from being handled incorrectly. Repositories that have already been incrementally indexed after a force push will remain stale. To correct these, we need to bump the `schemaVersion` constant in `internal/task_request/task_request.go` (currently `2531`) to trigger a full reindex of all repositories:
```go
// internal/task_request/task_request.go
const schemaVersion = YYWW // bump from 2531 to current YYWW
```
This causes every Zoekt node to report the new version in its heartbeat. Rails then finds every repository whose `zoekt_repositories.schema_version` is lower than its node's `zoekt_nodes.schema_version` and automatically schedules bulk reindex tasks for them.
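For picking the new value, here is a small sketch of how a `YYWW` number could be derived from a date. It assumes that `YYWW` means the two-digit ISO year followed by the two-digit ISO week number, which is consistent with `2531` but is not confirmed by the source; `yywwOn` is a hypothetical helper, not part of the indexer.

```go
package main

import (
	"fmt"
	"time"
)

// yywwOn encodes a calendar date as YYWW.
// Assumption: YY is the ISO week-numbering year modulo 100 and WW is the
// ISO 8601 week number, as returned by time.Time.ISOWeek.
func yywwOn(year int, month time.Month, day int) int {
	t := time.Date(year, month, day, 0, 0, 0, 0, time.UTC)
	isoYear, isoWeek := t.ISOWeek()
	return (isoYear%100)*100 + isoWeek
}

func main() {
	now := time.Now().UTC()
	fmt.Println(yywwOn(now.Year(), now.Month(), now.Day()))
}
```

Under that assumption, 30 July 2025 falls in ISO week 31 and maps to `2531`, the current value of the constant.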