Semantic Code Search / Intelligence (ML-powered natural language code search)

Problem

Currently, most code search solutions are limited to exact substring matching or, at best, regular expressions. This is limiting because you need to already know the code you are looking for, or at least part of it (a function name, an API call, and so on).
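To make the limitation concrete, here is a minimal sketch of substring and regex search over two snippets. The file names, snippet contents, and helper names (`substring_search`, `regex_search`) are hypothetical and only serve as illustration; they are not taken from GitLab's search implementation.

```python
import re

# Hypothetical snippets standing in for files in a repository.
SNIPPETS = {
    "storage.py": "client.put_object(Bucket=bucket, Key=key, Body=data)",
    "hashing.py": "digest = hashlib.md5(payload).hexdigest()",
}


def substring_search(query: str) -> list[str]:
    """Return the files whose text contains the query verbatim."""
    return [path for path, code in SNIPPETS.items() if query in code]


def regex_search(pattern: str) -> list[str]:
    """Return the files whose text matches the regular expression."""
    compiled = re.compile(pattern)
    return [path for path, code in SNIPPETS.items() if compiled.search(code)]


# Both work only when part of the code is already known.
print(substring_search("put_object"))      # matches storage.py
print(regex_search(r"hashlib\.\w+\("))     # matches hashing.py

# A description of intent shares no literal text with the code.
print(substring_search("upload a file to object storage"))  # matches nothing
```

The last query describes what `storage.py` does, but because none of its words appear verbatim in the snippet, a literal search returns no results.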

While this type of search is helpful, it would be valuable to be able to search without knowing the specific piece of code. For example:

  • Show me all locations where an MD5 hash is calculated
  • Show me where we upload a file to object storage
  • Show me where we authenticate users
  • Where do we call the GraphQL API?

To do this, GitLab needs to understand the meaning of the code.

Solution

Today's large language models are increasingly capable of understanding, explaining, and producing code. While the results are not always reliable, these models are maturing rapidly.

Some examples of LLMs that can do this today:

This is also an area of active research: https://paperswithcode.com/task/code-search.

Current gap

What is missing is a model that has ingested the corpus of code on the GitLab server and can provide a query interface over it.

As far as we currently understand, there are few publicly available projects or sources that can ingest a corpus of projects and then provide semantic / natural language search. For example, if you ask ChatGPT to search the GitLab repository, it will tell you that it is not connected to the internet and therefore cannot.

What can be done today is to send a code snippet to one of the above services and then ask it a question (a search or a request for an explanation) about just that snippet.
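The shape of the missing piece, a corpus ingested once and then queried many times, can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: `CodeIndex`, `embed`, `ingest`, and `query` are hypothetical names, and `embed` is a toy bag-of-words stand-in for a real NL-PL model. It only matches on shared words, so it does not capture meaning the way a learned embedding would.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an NL-PL model: a bag of lowercase word tokens.

    ASSUMPTION: a real system would call a learned embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CodeIndex:
    """Hypothetical index: ingest snippets once, then answer queries."""

    def __init__(self) -> None:
        self._entries: list[tuple[str, Counter]] = []

    def ingest(self, path: str, code: str) -> None:
        """Embed a snippet and store it alongside its path."""
        self._entries.append((path, embed(code)))

    def query(self, question: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Return up to top_k (path, score) pairs with a non-zero score."""
        q = embed(question)
        scored = [(path, cosine(q, vec)) for path, vec in self._entries]
        scored.sort(key=lambda item: item[1], reverse=True)
        return [(path, score) for path, score in scored[:top_k] if score > 0]


index = CodeIndex()
index.ingest(
    "storage.py",
    "def upload_file(bucket, key, data): "
    "client.put_object(Bucket=bucket, Key=key, Body=data)",
)
index.ingest(
    "hashing.py",
    "def md5_digest(payload): return hashlib.md5(payload).hexdigest()",
)

results = index.query("where do we upload a file to object storage")
print(results)  # storage.py is the only snippet sharing words with the query
```

Swapping `embed` for a learned code embedding model would leave the ingest and query interface unchanged, which is why the gap is mostly about the model and the ingestion pipeline, not the interface.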

Overlap with other NL-PL use cases

This type of ML model is sometimes referred to as NL-PL: a model that understands both natural language (NL) and programming languages (PL) and can convert between the two.

There are multiple other use cases within GitLab for this, including:

  • Code Intelligence. Ask GitLab to explain a piece of code or the call patterns.
  • Many of those outlined in the AI Assist group/page. These include Copilot-like functionality, as well as other use cases such as automatic documentation generation.