Zoekt: Fix incorrect project filtering in Zoekt initial indexing
What does this MR do and why?
This MR fixes a critical bug in the Zoekt initial indexing process: a project could be incorrectly excluded from an index if it already had a repository in a different index.
The current implementation uses the without_zoekt_repositories scope, which checks whether a project has any Zoekt repository at all, regardless of which index that repository belongs to. As a result, a project that should be indexed in multiple indices is only indexed in the first index that processes it; every later index skips it because a repository already exists elsewhere.
This MR introduces a new scope without_zoekt_repositories_for_index that uses a NOT EXISTS query to efficiently find only projects that don't have repositories for the specific index being processed. This ensures that projects are correctly indexed in all required indices.
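The difference between the two filters can be sketched outside of Rails. The example below is only an illustration in Python with an in-memory SQLite database, not the scope code from this MR; the table layout (projects, and zoekt_repositories with project_id and zoekt_index_id columns) and the function names are assumptions made for the sketch.

```python
import sqlite3

# Assumed, simplified schema: the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (id INTEGER PRIMARY KEY);
    CREATE TABLE zoekt_repositories (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        zoekt_index_id INTEGER NOT NULL
    );
    INSERT INTO projects (id) VALUES (1), (2), (3);
    -- Index 10 has already processed projects 1 and 2.
    INSERT INTO zoekt_repositories (project_id, zoekt_index_id)
    VALUES (1, 10), (2, 10);
""")


def missing_any_index(conn):
    """Old behaviour: projects with no Zoekt repository in ANY index."""
    rows = conn.execute("""
        SELECT p.id FROM projects p
        WHERE NOT EXISTS (
            SELECT 1 FROM zoekt_repositories r
            WHERE r.project_id = p.id
        )
        ORDER BY p.id
    """)
    return [row[0] for row in rows]


def missing_for_index(conn, index_id):
    """New behaviour: projects with no repository in the GIVEN index."""
    rows = conn.execute("""
        SELECT p.id FROM projects p
        WHERE NOT EXISTS (
            SELECT 1 FROM zoekt_repositories r
            WHERE r.project_id = p.id
              AND r.zoekt_index_id = ?
        )
        ORDER BY p.id
    """, (index_id,))
    return [row[0] for row in rows]


# Candidates when processing a second index (id 20):
old_candidates = missing_any_index(conn)      # projects 1 and 2 are skipped
new_candidates = missing_for_index(conn, 20)  # all three projects are picked up
```

With the old filter, projects 1 and 2 never become candidates for index 20 because they already have a repository in index 10. The per-index filter only excludes a project when the repository belongs to the index being processed.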
References
How to set up and validate locally
- Create multiple Zoekt indices in your development environment
- Add several projects to your GitLab instance
- Run the initial indexing process for one index
- Verify that projects are correctly added to the zoekt_repositories table
- Run the initial indexing process for a second index
- Verify that all projects are indexed in both indices (before the fix, some would be missing)
Performance Considerations
I tested several approaches against our production replica to determine the most efficient solution:
- NOT IN: Using a subquery with where.not
- LEFT JOIN: Using a conditional left join with null check
- NOT EXISTS: Using a SQL NOT EXISTS condition
The NOT EXISTS approach was the most performant for our dataset and has been implemented in this MR.
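For reference, the three query shapes can be written as plain SQL. The sketch below uses Python with an in-memory SQLite database and an assumed, simplified schema (projects, and zoekt_repositories with project_id and zoekt_index_id); it only checks that the three forms return the same projects on toy data. It says nothing about their relative performance, which depends on the PostgreSQL planner and the production dataset.

```python
import sqlite3


def build_db():
    """Create a toy database; table and column names are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE projects (id INTEGER PRIMARY KEY);
        CREATE TABLE zoekt_repositories (
            id INTEGER PRIMARY KEY,
            project_id INTEGER NOT NULL,
            zoekt_index_id INTEGER NOT NULL
        );
        INSERT INTO projects (id) VALUES (1), (2), (3), (4);
        INSERT INTO zoekt_repositories (project_id, zoekt_index_id)
        VALUES (1, 10), (2, 10), (2, 20);
    """)
    return conn


# Each query selects projects that have no repository in :index_id.
QUERIES = {
    "not_in": """
        SELECT p.id FROM projects p
        WHERE p.id NOT IN (
            SELECT r.project_id FROM zoekt_repositories r
            WHERE r.zoekt_index_id = :index_id
        )
        ORDER BY p.id
    """,
    "left_join": """
        SELECT p.id FROM projects p
        LEFT JOIN zoekt_repositories r
               ON r.project_id = p.id
              AND r.zoekt_index_id = :index_id
        WHERE r.id IS NULL
        ORDER BY p.id
    """,
    "not_exists": """
        SELECT p.id FROM projects p
        WHERE NOT EXISTS (
            SELECT 1 FROM zoekt_repositories r
            WHERE r.project_id = p.id
              AND r.zoekt_index_id = :index_id
        )
        ORDER BY p.id
    """,
}


def run_all(conn, index_id):
    """Run every query shape for one index and collect the project ids."""
    return {
        name: [row[0] for row in conn.execute(sql, {"index_id": index_id})]
        for name, sql in QUERIES.items()
    }


results = run_all(build_db(), 20)
```

All three entries in results contain the same list of project ids, so on this data the choice between them is about the query plan rather than correctness.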