We have a strong integration with Sourcegraph: https://docs.gitlab.com/ee/integration/sourcegraph.html. Along with advanced features like Code Intelligence, Sourcegraph also provides Code Navigation (jump to definition, etc.). For Code Search, we integrate with Elasticsearch.
Presently, if a user wants both code navigation and global search, they need to operate both Elasticsearch and Sourcegraph. This is expensive and unwieldy.
Proposal
We should determine whether Sourcegraph's search features are comparable to Elasticsearch's. If so, it may be possible to simplify our stack and significantly reduce operational costs when both code search and navigation are required.
Considerations
Access control / multi-tenancy
Ability to consume data types other than code (wiki, snippets, issues, comments, etc.)
We should determine whether Sourcegraph's search features are comparable to Elasticsearch's
@joshlambert we're always going to need to index things that aren't code (e.g. issues, comments, merge requests, etc.). Is Sourcegraph ever going to be able to replace Elasticsearch for all of that? If not, then I don't see how we can reduce the operational overhead here, since we wouldn't be fully replacing Elasticsearch.
I can't see anything about indexing arbitrary documents in Sourcegraph's documentation. All I'm seeing relates to searching repositories, commits, code, etc.
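For concreteness, this is roughly what indexing non-code content involves today: issues, comments, merge requests, and wiki pages are just JSON documents pushed into a search index. A minimal sketch, assuming the elasticsearch-py 8.x client; the index name and field layout here are illustrative, not GitLab's actual schema:

```python
# Illustrative sketch: indexing a non-code document (an issue) into Elasticsearch.
# Index name and fields are made up for the example, not GitLab's real mapping.
from elasticsearch import Elasticsearch  # elasticsearch-py 8.x style client

es = Elasticsearch("http://localhost:9200")

issue_doc = {
    "type": "issue",
    "project_id": 1234,
    "iid": 42,
    "title": "Search is slow on large groups",
    "description": "Steps to reproduce: ...",
    "confidential": False,
}

# Issues, comments, MRs, wiki pages, etc. are plain documents; any replacement
# for Elasticsearch would need an equivalent ingestion path for these types.
es.index(
    index="gitlab-issues",
    id=f"issue-{issue_doc['project_id']}-{issue_doc['iid']}",
    document=issue_doc,
)

# Full-text query across those documents:
hits = es.search(index="gitlab-issues", query={"match": {"title": "slow"}})
print(hits["hits"]["total"])
```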
Given that I don't think we'll ever be able to operate Sourcegraph easily across all types of GitLab installations, I'd rather work with the Source Code group to see where they land on Find Ref/Jump to Def and whether we can leverage that.
It feels like, long term, we're going to need one path for Code Search and one path for everything else.
@joshlambert I think we'll need something that's more focused on code search and provides more value there (like Jump to Def and Find Ref). Given the overhead of operating Elasticsearch, if it ends up not being capable of giving us those features, I'd expect we'd want to shift our search path for code to whatever is capable. Mostly this is an operational problem: as it stands today, Elasticsearch duplicates the code, which contributes to some of our index size woes, and early testing of language server implementations also requires a copy of the code.
This means that in a scenario where we had two things performing code functions, we'd have the actual source on Gitaly, a copy for Code Ref, and a copy in Elasticsearch, which is a 3x storage requirement. If you wanted HA, you'd have a 6x requirement, which I think is just too large.
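A back-of-the-envelope sketch of that multiplier, using an assumed repository size purely for illustration:

```python
# Rough sketch of the storage multiplier described above.
# repo_size_gb is an assumed illustrative figure, not a measured GitLab number.
repo_size_gb = 500

copies = {
    "Gitaly (source of truth)": repo_size_gb,
    "Code Ref / language server copy": repo_size_gb,
    "Elasticsearch source index": repo_size_gb,
}

total = sum(copies.values())   # ~3x the raw repository size
total_with_ha = total * 2      # naive doubling for an HA replica of each copy

print(f"single copy set: {total} GB (~{total // repo_size_gb}x)")
print(f"with HA:         {total_with_ha} GB (~{total_with_ha // repo_size_gb}x)")
```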
However, if there were a way to generate the language server information directly from the Gitaly nodes as something consumed for Code Ref, then it would make sense to try to also consume that for search.
This would leave us with some kind of code search solution, and then potentially Elasticsearch for everything else (comments, issues, etc.), although it might not be the best tool for that and maybe something else is better suited there.
All of this is to say: I don't think we know enough yet about what Code Ref is going to take and how we might be able to leverage or lead that effort.
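To make the "extract once next to Gitaly, consume for both Code Ref and search" idea concrete, here is a purely hypothetical sketch; none of these functions or data shapes exist in GitLab today, and the extraction step is a placeholder for whatever language-aware indexer Code Ref ends up using:

```python
# Purely hypothetical sketch: extract symbol data once, next to the repository
# copy that already lives on Gitaly, and feed the same artifact to both
# code navigation and code search.
from dataclasses import dataclass

@dataclass
class SymbolRecord:
    path: str
    name: str
    kind: str   # "definition" or "reference"
    line: int

def extract_symbols(repo_path: str) -> list[SymbolRecord]:
    """Placeholder for a language-aware indexer running alongside Gitaly,
    so no additional copy of the repository is needed."""
    return [
        SymbolRecord("app/models/user.rb", "User#admin?", "definition", 87),
        SymbolRecord("app/policies/base_policy.rb", "User#admin?", "reference", 12),
    ]

symbols = extract_symbols("/var/opt/gitlab/git-data/repositories/group/project.git")

# The same artifact feeds both consumers:
definitions = {s.name: s for s in symbols if s.kind == "definition"}  # Jump to Def / Find Ref
search_docs = [vars(s) for s in symbols]                              # shipped to code search

print(definitions["User#admin?"].path)
print(len(search_docs))
```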
I was searching for more information about Sourcegraph, trying to see which search engine it's built on, and I came across our own discussion, gitlab-foss#41450 (moved). It seems we looked at some options for code search, including Zoekt, which is used by Sourcegraph. But I don't see any conclusion about Gitaly + Zoekt. @joshlambert, do you recall anything about it?
@changzhengliu - I do not; @phikai may know. My guess is that we've been shipping ES for two years and it is the current "boring solution".
@joshlambert I think we'll need something that's more focused on code search and provides more value there (like Jump to Def and Find Ref). Given the overhead of operating Elasticsearch, if it ends up not being capable of giving us those features, I'd expect we'd want to shift our search path for code to whatever is capable. Mostly this is an operational problem: as it stands today, Elasticsearch duplicates the code, which contributes to some of our index size woes, and early testing of language server implementations also requires a copy of the code.
I think the reason this particular comparison is interesting is that, if Sourcegraph's search features are comparable, it may make sense to see whether a deeper partnership of some kind would be mutually beneficial, to bring this to GitLab in a more out-of-the-box way.
@joshlambert, @jramsay - we are following this and &2327 (closed)/&1576 (closed) and would love to help with the evaluation. Can you add us to the shared docs with the criteria? Also, we can change our pricing and packaging so that a deeper integration of Sourcegraph + GitLab makes sense, because this is important to us on both the search and the code nav/intelligence side.
@joshlambert @phikai, I tried Sourcegraph and it's pretty impressive. I can see it will be very beneficial for code review. Previously, as a developer, I had to check out the code locally and use a local editor plus some code-referencing tool to navigate through class/method definitions and references. Now, with Sourcegraph, such tasks can be done right in the web browser, which will save time and increase productivity. I am not sure about the global search use case, and I need to understand how Sourcegraph's scalability compares to Elasticsearch's.
@joshlambert, nice writeup. It sounds right to me. It seems that Sourcegraph uses Zoekt for code indexing and search. It's like Lucene is to Elasticsearch? Beyond indexing, Sourcegraph uses the Language Server Protocol for code intelligence. It'll be nice to know how the Sourcegraph integration with GitLab goes.
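For reference, this is what a "go to definition" request looks like in the Language Server Protocol; the method name and framing come from the LSP spec, while the file URI and position are illustrative:

```python
import json

# "Go to definition" as a JSON-RPC request, per the Language Server Protocol.
# The file URI and position are illustrative.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///workspace/app/models/user.rb"},
        "position": {"line": 86, "character": 6},  # zero-based line/character
    },
}

# LSP messages are framed with a Content-Length header over stdio or a socket.
body = json.dumps(request)
print(f"Content-Length: {len(body)}\r\n\r\n{body}")
```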
For ranking in ES, I don't know of any particular algorithm beyond the default we use. I think other folks can give more information on this.
The security piece of this is particularly interesting, as it basically assumes everything is public. We'd likely need to build all of our logic into that tool as well, and we've already seen how problematic that might be.
Yes, but in some ways it may also be easier. Rather than duplicating all of our logic into ES, we simply rely on our existing checks in Rails as a SSOT to filter.
@joshlambert well, we could already do that with Elasticsearch (and kind of do with the redaction logic), but the fundamental problem is that you need to request a page of results at a time from the search server. If a user performs a search with thousands of results and the first thousand of them are in private projects (very likely, depending on the context and type of search), then there is no performant way to skip past those thousand results and get straight to their result, because we need to query the database to check permissions for every single one.
This is fundamentally why we duplicate permissions in Elasticsearch today. I wish there were a better way, but I haven't seen any other proposal yet.
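A minimal sketch of that post-filtering problem, assuming a hypothetical `can_read` check backed by the Rails authorization logic (the SSOT) and the elasticsearch-py 8.x client; it shows why paging past thousands of private-project hits is so expensive:

```python
# Sketch of post-filtering search results with an application-side permission
# check. can_read() and the index/field names are hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
PAGE_SIZE = 100

def can_read(user, project_id) -> bool:
    """Placeholder for the authorization check that lives in Rails."""
    raise NotImplementedError

def first_visible_page(user, query: str, wanted: int = 20) -> list[dict]:
    visible, offset = [], 0
    while len(visible) < wanted:
        page = es.search(
            index="gitlab-code",
            query={"match": {"content": query}},
            from_=offset,
            size=PAGE_SIZE,
        )["hits"]["hits"]
        if not page:
            break
        # Every hit needs a permission check against the database; thousands of
        # private-project hits mean thousands of checks before a single page of
        # visible results can be returned -- which is why permissions end up
        # duplicated into the index instead.
        visible.extend(h for h in page if can_read(user, h["_source"]["project_id"]))
        offset += PAGE_SIZE
    return visible[:wanted]
```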
I wonder if moving to a model where search clusters were group-specific would be the answer here. If we just did individual clusters for paid groups, with an explicit disclaimer along the lines of "everyone in your group can search everything", that might work. Could be worth floating that idea by a few people... I'm not sure if people who are using OpenGrok are replicating permissions, so it may be a non-issue on the code side for most orgs.
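For what it's worth, a hypothetical sketch of that per-group model, where the only permission check is group membership; the cluster URLs and the membership check are made up for illustration:

```python
# Hypothetical per-group search clusters: route each search to the group's own
# cluster and rely on membership alone, with no per-document redaction.
from elasticsearch import Elasticsearch

GROUP_CLUSTERS = {
    "gitlab-org": "http://search-gitlab-org.internal:9200",
    "acme-corp": "http://search-acme-corp.internal:9200",
}

def is_member(user, group: str) -> bool:
    """Placeholder for a simple group membership check."""
    raise NotImplementedError

def search_group(user, group: str, query: str):
    # "Everyone in your group can search everything in the group": the only
    # permission check is membership, so no per-document filtering is needed.
    if not is_member(user, group):
        raise PermissionError(f"{user} is not a member of {group}")
    es = Elasticsearch(GROUP_CLUSTERS[group])
    return es.search(index="code", query={"match": {"content": query}})
```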