
Fix deduplication to use repository context

What does this MR do and why?

Fix Zoekt search result deduplication to respect repository boundaries

This MR fixes a bug in gitlab-zoekt-indexer v1.7.0 where search result deduplication was incorrectly removing files from different repositories that had identical content. The original implementation only used file checksums for deduplication, causing files with the same checksum across different repositories to be treated as duplicates.

Related issue: Zoekt: Problem with the deduplication logic (gitlab#573331 - closed)

Problem

When searching across multiple projects with identical file content:

  • Only half the expected search results were returned (5 instead of 10)
  • File counts were incorrect (2 instead of 4)
  • Search results were missing files from some repositories entirely

This affected the GitLab specs:

  • Search::Zoekt::SearchResults#objects finds blobs by regex search
  • Search::Zoekt::SearchResults#objects sets file_count on the instance equal to the count of files with matches
  • Search::Zoekt::SearchResults#objects correctly handles pagination

Root Cause

The deduplication logic in internal/search/search.go used only file checksums as the deduplication key:

checksumKey := fmt.Sprintf("%x", f.Checksum)

This caused files with identical content from different repositories (e.g., project_1 and project_2 using the same repository fixtures) to be incorrectly deduped since they had the same checksum.
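The failure mode can be reproduced in isolation. The following is a minimal sketch, not the code from internal/search/search.go: `fileMatch` is a hypothetical, trimmed-down stand-in for zoekt.FileMatch, `dedupeByChecksum` is an illustrative name, and keeping the first occurrence of each key is an assumption about the original behavior.

```go
package main

import "fmt"

// fileMatch is a hypothetical stand-in for zoekt.FileMatch,
// reduced to the fields that matter for deduplication.
type fileMatch struct {
	Repository string
	FileName   string
	Checksum   []byte
}

// dedupeByChecksum keeps the first file seen for each checksum and
// ignores the repository, mirroring the checksum-only key shown above.
func dedupeByChecksum(files []fileMatch) []fileMatch {
	seen := make(map[string]struct{}, len(files))
	out := make([]fileMatch, 0, len(files))
	for _, f := range files {
		checksumKey := fmt.Sprintf("%x", f.Checksum)
		if _, dup := seen[checksumKey]; dup {
			continue // same content already seen, possibly in another repository
		}
		seen[checksumKey] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	sum := []byte{0xde, 0xad, 0xbe, 0xef}
	files := []fileMatch{
		{Repository: "project_1", FileName: "README.md", Checksum: sum},
		{Repository: "project_2", FileName: "README.md", Checksum: sum},
	}
	// Only the project_1 entry survives; project_2 is dropped as a "duplicate".
	for _, f := range dedupeByChecksum(files) {
		fmt.Println(f.Repository, f.FileName)
	}
}
```

With two projects built from the same fixtures, every file in the second project collides with one in the first, which is consistent with the halved result counts described in the Problem section.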

Solution

Include the repository name in the deduplication key to ensure files are only deduped within the same repository:

checksumKey := fmt.Sprintf("%s:%x", f.Repository, f.Checksum)

The Repository field is already available in zoekt.FileMatch and is set during gRPC response conversion in internal/search/grpc.go:236.
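To illustrate the scoped key, here is a minimal sketch under the same assumptions as before: `fileMatch` is a hypothetical stand-in for zoekt.FileMatch, `dedupeByRepoAndChecksum` is an illustrative name rather than the function in internal/search/search.go, and the first occurrence of each key is kept.

```go
package main

import "fmt"

// fileMatch is a hypothetical stand-in for zoekt.FileMatch,
// reduced to the fields that matter for deduplication.
type fileMatch struct {
	Repository string
	FileName   string
	Checksum   []byte
}

// dedupeByRepoAndChecksum keeps the first file seen for each
// (repository, checksum) pair, so identical content is only
// collapsed within a single repository.
func dedupeByRepoAndChecksum(files []fileMatch) []fileMatch {
	seen := make(map[string]struct{}, len(files))
	out := make([]fileMatch, 0, len(files))
	for _, f := range files {
		checksumKey := fmt.Sprintf("%s:%x", f.Repository, f.Checksum)
		if _, dup := seen[checksumKey]; dup {
			continue // same content already seen in this repository
		}
		seen[checksumKey] = struct{}{}
		out = append(out, f)
	}
	return out
}

func main() {
	sum := []byte{0xde, 0xad, 0xbe, 0xef}
	files := []fileMatch{
		{Repository: "project_1", FileName: "a.txt", Checksum: sum},
		{Repository: "project_1", FileName: "copy_of_a.txt", Checksum: sum},
		{Repository: "project_2", FileName: "a.txt", Checksum: sum},
	}
	// Keeps project_1/a.txt and project_2/a.txt; drops project_1/copy_of_a.txt.
	for _, f := range dedupeByRepoAndChecksum(files) {
		fmt.Println(f.Repository, f.FileName)
	}
}
```

Because `%x` emits only hexadecimal digits, the last colon in the key always separates the repository name from the checksum, so the key stays unambiguous even if a repository name itself contains a colon.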

References

Related commits in gitlab-zoekt-indexer:

  • Original deduplication: c7b1860e15eed4e81b1bf585fb3866cfff5ce608
  • Revert (merged to main): 1bc0acc92bdb00aff98fe141ee5d3d968edab038
  • This fix: branch jm-dedupe-zoekt-results-fixed

Changes

gitlab-zoekt-indexer:

  1. internal/search/search.go - Updated deduplication key to include repository context
  2. internal/search/search_test.go - Added test to verify files from different repos are not deduped, updated existing tests
  3. internal/search/search_test_helpers.go - Updated helper to include Repository field
  4. internal/mode/webserver/webserver_v2_test.go - Updated test fixtures

How to set up and validate locally

  1. Update gitlab-zoekt-indexer to the fixed version:

    cd /path/to/gitlab-zoekt-indexer
    git fetch origin
    git checkout jm-dedupe-zoekt-results-fixed
  2. Update GITLAB_ZOEKT_VERSION in GitLab:

    cd /path/to/gitlab
    # Point GitLab at the fixed branch by name
    echo "jm-dedupe-zoekt-results-fixed" > GITLAB_ZOEKT_VERSION
  3. Rebuild and restart Zoekt:

    # In your GDK or test environment
    gdk restart zoekt
  4. Run the failing specs:

    bin/rspec ee/spec/lib/search/zoekt/search_results_spec.rb:48
    bin/rspec ee/spec/lib/search/zoekt/search_results_spec.rb:53
    bin/rspec ee/spec/lib/search/zoekt/search_results_spec.rb:100

    All specs should now pass ✓

  5. Verify deduplication behavior:

    • Search across multiple projects with identical files
    • Confirm all results from all projects are returned
    • Verify that files with the same content in the same repository are still deduped

Test Results

All tests pass in gitlab-zoekt-indexer:

$ go test -v ./internal/search -run "TestSearch.*Deduplicate"
=== RUN   TestSearchDeduplicatesIdenticalFiles_GRPC
--- PASS: TestSearchDeduplicatesIdenticalFiles_GRPC (0.00s)
=== RUN   TestSearchDoesNotDeduplicateFilesDifferentRepos_GRPC
--- PASS: TestSearchDoesNotDeduplicateFilesDifferentRepos_GRPC (0.00s)
PASS

MR acceptance checklist

This MR has been evaluated against the MR acceptance checklist.

Specific considerations:

  • Correctness: Fixes the bug while maintaining proper deduplication within repositories
  • Performance: Minimal impact - only changes the map key format
  • Reliability: Comprehensive test coverage, including a new edge-case test
  • Testing: Added test for cross-repository deduplication behavior
  • Backward compatibility: Maintains deduplication functionality, just scoped correctly
  • Code quality: Follows existing patterns and Go conventions
Edited by Dmitry Gruzd
