Fix deduplication to use repository context
What does this MR do and why?
Fix Zoekt search result deduplication to respect repository boundaries
This MR fixes a bug in gitlab-zoekt-indexer v1.7.0 where search result deduplication was incorrectly removing files from different repositories that had identical content. The original implementation only used file checksums for deduplication, causing files with the same checksum across different repositories to be treated as duplicates.
Zoekt: Problem with the deduplication logic (gitlab#573331 - closed)
Problem
When searching across multiple projects with identical file content:
- Only half the expected search results were returned (5 instead of 10)
- File counts were incorrect (2 instead of 4)
- Search results were missing files from some repositories entirely
This affected the following GitLab specs:
- Search::Zoekt::SearchResults#objects finds blobs by regex search
- Search::Zoekt::SearchResults#objects sets file_count on the instance equal to the count of files with matches
- Search::Zoekt::SearchResults#objects correctly handles pagination
Root Cause
The deduplication logic in internal/search/search.go used only file checksums as the deduplication key:
checksumKey := fmt.Sprintf("%x", f.Checksum)
This caused files with identical content from different repositories (e.g., project_1 and project_2 using the same repository fixtures) to be incorrectly deduped, since they had the same checksum.
Solution
Include the repository name in the deduplication key to ensure files are only deduped within the same repository:
checksumKey := fmt.Sprintf("%s:%x", f.Repository, f.Checksum)
The Repository field is already available in zoekt.FileMatch and is set during gRPC response conversion in internal/search/grpc.go:236.
References
Related commits in gitlab-zoekt-indexer:
- Original deduplication: c7b1860e15eed4e81b1bf585fb3866cfff5ce608
- Revert (merged to main): 1bc0acc92bdb00aff98fe141ee5d3d968edab038
- This fix: branch jm-dedupe-zoekt-results-fixed
Changes
gitlab-zoekt-indexer:
- internal/search/search.go - Updated deduplication key to include repository context
- internal/search/search_test.go - Added test to verify files from different repos are not deduped, updated existing tests
- internal/search/search_test_helpers.go - Updated helper to include Repository field
- internal/mode/webserver/webserver_v2_test.go - Updated test fixtures
How to set up and validate locally
- Update gitlab-zoekt-indexer to the fixed version:
    cd /path/to/gitlab-zoekt-indexer
    git fetch origin
    git checkout jm-dedupe-zoekt-results-fixed
- Update GITLAB_ZOEKT_VERSION in GitLab:
    cd /path/to/gitlab
    # Get the commit SHA from the fixed branch
    echo "jm-dedupe-zoekt-results-fixed" > GITLAB_ZOEKT_VERSION
- Rebuild and restart Zoekt:
    # In your GDK or test environment
    gdk restart zoekt
- Run the failing specs:
    bin/rspec ee/spec/lib/search/zoekt/search_results_spec.rb:48
    bin/rspec ee/spec/lib/search/zoekt/search_results_spec.rb:53
    bin/rspec ee/spec/lib/search/zoekt/search_results_spec.rb:100
  All specs should now pass ✓
- Verify deduplication behavior:
  - Search across multiple projects with identical files
  - Confirm all results from all projects are returned
  - Verify files with the same content from the same repository are still deduped correctly
Test Results
All tests pass in gitlab-zoekt-indexer:
$ go test -v ./internal/search -run "TestSearch.*Deduplicate"
=== RUN TestSearchDeduplicatesIdenticalFiles_GRPC
--- PASS: TestSearchDeduplicatesIdenticalFiles_GRPC (0.00s)
=== RUN TestSearchDoesNotDeduplicateFilesDifferentRepos_GRPC
--- PASS: TestSearchDoesNotDeduplicateFilesDifferentRepos_GRPC (0.00s)
PASS
MR acceptance checklist
This MR has been evaluated against the MR acceptance checklist.
Specific considerations:
- ✅ Correctness: Fixes the bug while maintaining proper deduplication within repositories
- ✅ Performance: Minimal impact - only changes the map key format
- ✅ Reliability: Comprehensive test coverage including new edge case test
- ✅ Testing: Added test for cross-repository deduplication behavior
- ✅ Backward compatibility: Maintains deduplication functionality, just scoped correctly
- ✅ Code quality: Follows existing patterns and Go conventions