Skip to content

Zoekt integration for code search

Dylan Griffith requested to merge zoekt-code-search into master

What does this MR do and why?

This MR introduces a new backend Zoekt to power code searches in GitLab. Previously we supported Elasticsearch and Gitaly backed searches. We found Gitaly didn't scale for searching across large groups and Elasticsearch has limitations in how indexing and searching works that makes it a poor choice for code. Zoekt provides a more "grep-like" user experience and solves a lot of the problems we had with Elasticsearch. You can read more about the motivation at !107891 (merged) . Since the rollout of this will be gradual and it is somewhat experimental it will not be the default and we will not be removing support for Elasticsearch any time soon.

Notes for reviewers

This is a fairly large change as it covers a complete MVC of all of the following necessary to integrate Zoekt as a code search database:

  1. Enabling Zoekt per namespace
  2. Indexing
  3. Search
  4. Docs

It may be beneficial to review those parts independently 1 at a time as it will likely be easier to understand. As such I'll give a brief rundown on each below and you can use that to determine which files to review when trying to understand this behaviour. Given the size of this merge request you may want to take a break between reviewing the different parts.

It will be useful to know that Zoekt is a simple search engine that only understands how to index repositories of code. The API for searching simply takes the user's query string and returns all the matching lines of code. We have expanded the API to allow filtering by repository ids. All interactions with Zoekt are via JSON HTTP APIs.

Enabling Zoekt per namespace

The following files are the main important pieces:

  1. ee/app/models/zoekt/shard.rb
  2. ee/app/models/zoekt/indexed_namespace.rb
  3. Project#use_zoekt? and Namespace#use_zoekt?

The first 2 are the new database models that allow enabling namespaces to use zoekt.

The Zoekt::Shard stores information about the URLs needed for indexing and searching a Zoekt database. We needed to store this in some table and I considered just storing this as 2 columns in application_settings like we do for Elasticsearch but we are going to quickly need many instances of Zoekt in the near future so I wanted to avoid forcing myself to migrate the 2 columns later on to a new table that allows many Zoekt shards in future.

The Zoekt::IndexedNamespace is a join table between a top-level namespace and a Zoekt::Shard. Basically this serves 2 purposes:

  1. Indicates that Zoekt is enabled for this namespace
  2. Provides a join to the correct Shard that will be used for indexing and search for this namespace

This data modelling was probably also a little over the top for what we needed as we won't actually implement all the necessary parts of supporting multiple shards in this first MVC. It might have sufficed to use a feature flag for enabling namespaces, but again it seemed I'd need to migrate to this model soon enough anyway.

The Project#use_zoekt? and Namespace#use_zoekt? methods simply delegate to Zoekt::IndexedNamespace to check if a record exists for the root_ancestor. Enabling a Zoekt::IndexedNamespace for a subgroup won't work properly as we only ever delegate to the root_ancestor. This is by design to simplify the MVC as we would otherwise need to recursiely look up the hierachy which is far more complicated.

Indexing

The following files are the main important pieces:

  1. ee/app/models/concerns/zoekt/searchable_repository.rb
  2. ee/app/services/ee/git/branch_push_service.rb
  3. ee/app/workers/zoekt/indexer_worker.rb

Indexing is built on a new API we have added to Zoekt in a fork with an open PR. The API accepts a repository URL and repository ID and then Zoekt is responsible for cloning this repository (or fetching updates if it has already cloned) and indexing it. The ID we send it the projects.id in our database which is later used as a filter in searches.

The Zoekt API request looks like:

$ curl -XPOST -d '{"CloneUrl":"https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer.git","RepoId":2953390}' -H 'Content-Type: application/json' http://127.0.0.1:6060/index

The code that calls this indexing API is in ee/app/models/concerns/zoekt/searchable_repository.rb which is included into the Repository model so that you can call Repository#update_zoekt_index! and it will call the API. Apart from tests, this is only ever called async from the ::Zoekt::IndexerWorker. The ::Zoekt::IndexerWorker is called on any push for a repository from EE::Git::BranchPushService.

sequenceDiagram
   participant user as User
   participant gitlab_git as GitLab Git
   participant gitlab_sidekiq as GitLab Sidekiq
   participant zoekt as Zoekt
   user->>gitlab_git: git push git@gitlab.com:gitlab-org/gitlab.git
   gitlab_git->>gitlab_sidekiq: ZoektIndexerWorker.perform_async(278964)
   gitlab_sidekiq->>zoekt: POST /index {"CloneUrl":"https://gitlab.com/gitlab-org/gitlab.git","RepoId":278964}'
   zoekt->>gitlab_git: git clone https://gitlab.com/gitlab-org/gitlab.git

Search

The following files are the main important pieces:

  1. ee/app/services/ee/search/group_service.rb
  2. ee/app/services/ee/search/project_service.rb
  3. ee/app/services/concerns/search/zoekt_searchable.rb
  4. ee/lib/gitlab/zoekt/search_results.rb

Search follows a very similar code organization to our Elasticsearch code. We hook into the existing EE::Search::GroupService and EE::Search::ProjectService by including ::Search::ZoektSearchable. Before returning results we check if use_zoekt? and if so we return Zoekt::SearchResults. The Zoekt::SearchResults contains all the logic for searching. Searching uses a Zoekt API we have in a branch that allows filtering by repository ID. The API takes the repository IDs as arguments and the query string. Searches to this API look like:

curl -XPOST -d '{"Q":"username_regex","RepoIds":[7,9],"Opts":{"TotalMaxMatchCount":40,"NumContextLines":1}}' 'http://127.0.0.1:6070/api/search'

The Zoekt::SearchResults transforms the Zoekt API response into the same ::Gitlab::Search::FoundBlob objects that are used in our Elasticsearch code so that there was no need to change any presentation logic for presenting search results.

flowchart LR
   SearchController --> SearchService --> Search::GroupService --> Search::ZoektSearchable --> Zoekt::SearchResults

TODO to extract to issues

TODO
  1. Sidekiq or other process to index projects (similar to current indexer but call http://127.0.0.1:6060/index)
  2. Move all Zoekt files to a separate set of files that aren't elastic/
  3. Actually filter by the current scoped projects for group search
  4. Implement project searches
  5. Remove hardcoded Zoekt URL
  6. Switch from feature flag to using the same data model as indexed namespaces/projects for Elasticsearch. Maybe defer this until after the first successful trial with 1 group or maybe do it upfront since the code will be similar to Elasticsearch and it may simplify handling enabling
  7. Test and development need to be configured with different Zoekt URL and data dir so that tests can cleanup Zoekt data without impacting development
  8. Remove feature flag and only use the DB model
  9. One end-to-end test in spec/features
  10. Dry up tests doing ::Zoekt::IndexedNamespace.create! and ::Zoekt::Shard.create!, before and after project_1.repository.truncate_zoekt_index!, repository.update_zoekt_index!(use_local_disk_path: true)
  11. Allow truncating an index using the shard like Repository.truncate_index(shard)
  12. Minimal docs stating it's alpha status and that it can only be configured manually from the Rails console
  13. Test coverage for all new files
  14. Easily deleteable tests for all Feature.enabled?(:zoekt_for_code_search) occurrences
  15. Add license check to use_zoekt?
  16. Trigger indexing for imported projects
  17. Either skip Zoekt tests in CI or make it work by running Zoekt somehow
  18. Pass credentials to Zoekt for cloning the repos so that it works for private repos
  19. Handle syntax errors like unterminated quotes. Probably easiest to just handle non-200 status code and return empty result set
  20. Would it be safer to add a single feature flag around all the code paths as an extra safety net for turning off misbehaving code?
  21. Support both regex and non-regex mode. Should this be done in Rails or in Zoekt? If it's already supported in Zoekt it would be more efficient probably
  22. Case sensisitive and insensitive switch (probably later)
  23. Decide if we need to handle filter by language (see commit 74d453d7 for removing test). The feature flag search_blobs_language_aggregation is not currently enabled globally anyway so we can maybe defer this
  24. Implement Zoekt global search or create a separate issue
  25. Perhaps we can first ship an iteration where Zoekt is enabled globally and takes precedence over Elasticsearch when enabled then move all the Feature flag (or other) logic to a separate MR
  26. Appropriate caching of methods that are called very frequently (like in hooks on push)
  27. Should we add security to the Zoekt server or will this be at the infra level?
  28. Group scoped search still only works for projects with Elasticsearch enabled even if Zoekt is enabled. We should allow it to run if Zoekt is enabled. I was hoping changing EE::Group::SearchService#allowed_scopes would be enough but it seems the logic in app/helpers/search_helper.rb does not use this and it may be a little more complicated. Probably the helper should delegate to allowed_scopes though...
  29. Determine how/if we can support aggregations with Zoekt
  30. Need to ensure that project scoped searches when selecting ref uses gitaly instead of Zoekt
  31. Indicator in the UI to show that the results are served as part of new "Exact Code Search" feature
  32. Some way for user to toggle between Zoekt and non-Zoekt searches where available
  33. Add ability to pause Zoekt indexing without disabling the feature flag and losing updates (similar to Elasticsearch pausing)
  34. All existing projects should be queued for indexing at the same time as creating Zoekt::IndexedNamespace
  35. Implement more pagination in Zoekt search. Right now we just don't use zoekt if they request for a page other than the first
  36. Developer docs for setting up Zoekt in GDK
  37. GDK improvements to automate some/all of Zoekt setup
  38. Separate log file for Zoekt and log Faraday errors
  39. Handle deletes so that we purge data from Zoekt when a project is deleted (requires a new API in Zoekt)
  40. Handle project/group transfers if an indexed project is moved out of an indexed namespace or a non-indexed one is moved into an indexed namespace
  41. Check HTTP reponses codes and raise error so Sidekiq can retry when updating index
  42. Handle default branch changing. If Zoekt has already cloned the bare repo and just does a git fetch then does HEAD change correctly or do we end up with the wrong code in the index? https://stackoverflow.com/questions/29296232/why-does-bare-repository-have-head

Migration output

migrate:up
main: == 20230104201524 AddZoektShardsAndIndexedNamespaces: migrating ===============
main: -- create_table(:zoekt_shards, {})
main: -- quote_column_name(:index_base_url)
main:    -> 0.0000s
main: -- quote_column_name(:search_base_url)
main:    -> 0.0000s
main:    -> 0.0042s
main: -- create_table(:zoekt_indexed_namespaces, {})
main:    -> 0.0043s
main: == 20230104201524 AddZoektShardsAndIndexedNamespaces: migrated (0.0134s) ======

main: == 20230107125328 AddZoektIndexedNamespacesForeignKey: migrating ==============
main: -- transaction_open?()
main:    -> 0.0000s
main: -- foreign_keys(:zoekt_indexed_namespaces)
main:    -> 0.0874s
main: -- transaction_open?()
main:    -> 0.0001s
main: -- execute("ALTER TABLE zoekt_indexed_namespaces\nADD CONSTRAINT fk_3bebdb4efc\nFOREIGN KEY (namespace_id)\nREFERENCES namespaces (id)\nON DELETE CASCADE\nNOT VALID;\n")
main:    -> 0.0013s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0001s
main: -- execute("ALTER TABLE zoekt_indexed_namespaces VALIDATE CONSTRAINT fk_3bebdb4efc;")
main:    -> 0.0017s
main: -- execute("RESET statement_timeout")
main:    -> 0.0007s
main: == 20230107125328 AddZoektIndexedNamespacesForeignKey: migrated (0.0955s) =====

ci: == 20230104201524 AddZoektShardsAndIndexedNamespaces: migrating ===============
ci: -- create_table(:zoekt_shards, {})
ci: -- quote_column_name(:index_base_url)
ci:    -> 0.0000s
ci: -- quote_column_name(:search_base_url)
ci:    -> 0.0000s
ci:    -> 0.0032s
ci: -- create_table(:zoekt_indexed_namespaces, {})
ci:    -> 0.0028s
ci: == 20230104201524 AddZoektShardsAndIndexedNamespaces: migrated (0.0106s) ======

ci: == 20230107125328 AddZoektIndexedNamespacesForeignKey: migrating ==============
ci: -- transaction_open?()
ci:    -> 0.0000s
ci: -- foreign_keys(:zoekt_indexed_namespaces)
ci:    -> 0.0019s
ci: -- transaction_open?()
ci:    -> 0.0000s
ci: -- execute("ALTER TABLE zoekt_indexed_namespaces\nADD CONSTRAINT fk_3bebdb4efc\nFOREIGN KEY (namespace_id)\nREFERENCES namespaces (id)\nON DELETE CASCADE\nNOT VALID;\n")
ci:    -> 0.0008s
ci: -- execute("SET statement_timeout TO 0")
ci:    -> 0.0005s
ci: -- execute("ALTER TABLE zoekt_indexed_namespaces VALIDATE CONSTRAINT fk_3bebdb4efc;")
ci:    -> 0.0023s
ci: -- execute("RESET statement_timeout")
ci:    -> 0.0004s
ci: == 20230107125328 AddZoektIndexedNamespacesForeignKey: migrated (0.0113s) =====
migrate:down
$ bin/rake db:migrate:down:main VERSION=20230107125328
main: == 20230107125328 AddZoektIndexedNamespacesForeignKey: reverting ==============
main: -- transaction_open?()
main:    -> 0.0000s
main: -- remove_foreign_key(:zoekt_indexed_namespaces, {:column=>:namespace_id})
main:    -> 0.0862s
main: == 20230107125328 AddZoektIndexedNamespacesForeignKey: reverted (0.1017s) =====

$ bin/rake db:migrate:down:main VERSION=20230104201524
main: == 20230104201524 AddZoektShardsAndIndexedNamespaces: reverting ===============
main: -- drop_table(:zoekt_indexed_namespaces, {})
main:    -> 0.0016s
main: -- drop_table(:zoekt_shards, {})
main:    -> 0.0007s
main: == 20230104201524 AddZoektShardsAndIndexedNamespaces: reverted (0.0040s) ======

Screenshots or screen recordings

Zoekt_Demo

How to set up and validate locally

Zoekt installation from forks

  1. Install zoekt-indexserver, zoekt-webserver, universal-ctags
    1. brew install universal-ctags
    2. git clone git@github.com:DylanGriffith/zoekt.git
    3. cd zoekt
    4. git checkout dynamic-indexserver-api
    5. go install github.com/sourcegraph/zoekt/cmd/zoekt-git-index
    6. go install github.com/sourcegraph/zoekt/cmd/zoekt-git-clone
    7. go install github.com/sourcegraph/zoekt/cmd/zoekt-dynamic-indexserver
    8. git checkout search-api-with-reposbranches-filter
    9. go install github.com/sourcegraph/zoekt/cmd/zoekt-webserver

Running tests

  1. In GDK directory run the indexserver
    1. zoekt-dynamic-indexserver -data_dir zoekt/test/data -index_dir zoekt/test/index -listen :6060
  2. In GDK directory run the webserver
    1. zoekt-webserver -index zoekt/test/index -rpc -listen :6070
  3. Run some specs
    1. bin/rspec ee/spec/lib/gitlab/zoekt/search_results_spec.rb ee/spec/features/search/zoekt/search_spec.rb ee/spec/models/concerns/zoekt/searchable_repository_spec.rb

Running development

  1. In GDK directory run the indexserver
    1. zoekt-dynamic-indexserver -data_dir zoekt/dev/data -index_dir zoekt/dev/index -listen :6080
  2. In GDK directory run the webserver
    1. zoekt-webserver -index zoekt/dev/index -rpc -listen :6090
  3. Using the example group below or a different one in your instance ensure that the group and all projects in it are public
  4. In the rails console:
    ::Feature.enable(:index_code_with_zoekt)
    ::Feature.enable(:search_code_with_zoekt)
    zoekt_shard = ::Zoekt::Shard.find_or_create_by!(index_base_url: 'http://127.0.0.1:6080/', search_base_url: 'http://127.0.0.1:6090/')
    namespace = Namespace.find_by_full_path("flightjs") # Some namespace you want to enable
    ::Zoekt::IndexedNamespace.find_or_create_by!(shard: zoekt_shard, namespace: namespace.root_ancestor)
  5. Make code changes to any project in the namespace you indexed to ensure the index gets updated
  6. Perform a code search at the group or project level and results should be coming from Zoekt. You can confirm by trying a regex search.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dylan Griffith

Merge request reports