Investigate embedding storage and vector search for X-Ray

Background

The MVC of X-Ray laid a cornerstone for a RAG system by building automated documentation about repositories' 3rd-party libraries. However, X-Ray currently lacks any means of ranking different pieces of data to assess their relevance to the task at hand.

Goal

To expand X-Ray's ability to enhance the context of code generation requests, it needs a way to rank different pieces of information by relevance, which calls for semantic search over snippets. Providing semantic search in X-Ray requires the following:

  1. Process snippets with a model capable of generating embedding vectors
  2. Store the generated vectors
  3. Support kNN search over the stored vectors
  4. Introduce a chunking strategy to break big blocks of content into smaller snippets (to be delivered in a follow-up iteration)
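
Steps 2 and 3 above can be illustrated with a minimal sketch. It assumes vectors are already produced by some embeddings model (step 1) and uses a brute-force cosine-similarity scan in memory. The class `InMemoryVectorStore`, its methods, and the snippet IDs are hypothetical names chosen for illustration; a real deployment would more likely delegate storage and kNN search to a dedicated backend.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical store: keeps one vector per snippet ID (step 2)."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, snippet_id, vector):
        # Re-using an ID overwrites the previous vector.
        self._vectors[snippet_id] = list(vector)

    def knn(self, query_vector, k=3):
        # Brute-force kNN (step 3): score every stored vector, keep the k best.
        scored = (
            (cosine_similarity(query_vector, vector), snippet_id)
            for snippet_id, vector in self._vectors.items()
        )
        return [(snippet_id, score) for score, snippet_id in heapq.nlargest(k, scored)]


# Hand-written 3-dimensional vectors stand in for real model output.
store = InMemoryVectorStore()
store.upsert("utils.rb#add", [0.9, 0.1, 0.0])
store.upsert("utils.rb#subtract", [0.1, 0.9, 0.0])
store.upsert("http_client.rb#get", [0.0, 0.1, 0.9])

nearest = store.knn([1.0, 0.2, 0.0], k=2)
print([snippet_id for snippet_id, _ in nearest])
```

A linear scan like this costs one similarity computation per stored vector for every query, so it is only reasonable for small collections; it is shown here to make the ranking step concrete, not as a proposed design.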

Additional considerations

  1. While working on support for X-Ray, it would be good to at least consider cross-compatibility with GitLab Duo Chat, to enable broader use of the RAG system (&12362)
  2. When selecting an embeddings model, it is important to consider whether the model could also be used within the IDE extension to convert local user changes into embeddings (prospective npm packages: https://www.npmjs.com/package/ml-knn, https://www.npmjs.com/package/voy-search)

Example read / write paths

sequenceDiagram
    actor USR as  User
    participant IDE
    participant GLR as GitLabRails
    participant ES as ElasticSearch
    participant VER as VertexAI
    participant AIGW as AIGateway
    USR->>+GLR: commits changes to utils_spec.rb
    GLR->>GLR: triggers indexing job
    GLR->>ES: invalidates embeddings for chunks of utils_spec.rb
    GLR->>GLR: generate new chunks for utils_spec.rb
    GLR->>VER: fetches embeddings for new utils_spec.rb chunks
    VER->>GLR: embedding vectors for utils_spec.rb chunks
    GLR->>-ES: index new embeddings for utils_spec.rb chunks

    USR->>+IDE: types: describe ".add"
    IDE->>+GLR: trigger code generation for line `describe ".add"` 
    GLR->>VER: fetch embedding for instruction "in utils_spec.rb generate test for method add"
    VER->>GLR: embeddings vector for instruction
    GLR->>ES: fetches KNN chunks for instruction embeddings vector
    ES->>GLR: line "def add(a, b)" from utils.rb
    GLR->>AIGW: trigger code generation for line `describe ".add"` includes `def add(a, b)` from `utils.rb`
    AIGW->>GLR: "do \n it 'adds two numbers' do \n expect(add(1, 2)).to eq 3\nend"
    GLR->>-IDE:"do \n it 'adds two numbers' do \n expect(add(1, 2)).to eq 3\nend"
    IDE->>-USR: Show ghost text  "do \n it 'adds two numbers'...."
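
The write and read paths in the diagram can be sketched in a few lines. This sketch is written under loud assumptions: `ChunkIndex`, `chunk_lines`, and `keyword_embed` are hypothetical names, the index lives in memory rather than in a search backend, chunking is a naive fixed-line-count split, and `keyword_embed` is a toy stand-in for the embeddings model call, not a real model.

```python
import math


def chunk_lines(source, max_lines=3):
    """Naive chunking assumption: groups of max_lines non-empty lines."""
    lines = [line for line in source.splitlines() if line.strip()]
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def keyword_embed(text):
    """Toy embedding, NOT a real model: counts a few hand-picked keywords."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in ("add", "subtract", "http")]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ChunkIndex:
    """Hypothetical in-memory index keyed by (file path, chunk number)."""

    def __init__(self, embed):
        self._embed = embed  # callable: text -> vector
        self._entries = {}

    def reindex_file(self, path, source, max_lines=3):
        # Write path: invalidate stale chunks, re-chunk, embed, index.
        for key in [key for key in self._entries if key[0] == path]:
            del self._entries[key]
        chunks = chunk_lines(source, max_lines)
        for number, chunk in enumerate(chunks):
            self._entries[(path, number)] = (chunk, self._embed(chunk))
        return len(chunks)

    def nearest_chunks(self, instruction, k=1):
        # Read path: embed the instruction, rank stored chunks by similarity.
        query = self._embed(instruction)
        scored = [
            (_cosine(query, vector), path, text)
            for (path, _), (text, vector) in self._entries.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return [(path, text) for _, path, text in scored[:k]]


index = ChunkIndex(keyword_embed)
index.reindex_file("utils.rb", "def add(a, b)\n  a + b\nend\n\ndef subtract(a, b)\n  a - b\nend\n")
context = index.nearest_chunks("in utils_spec.rb generate test for method add", k=1)
print(context)
```

Deleting every entry for a path before re-indexing mirrors the invalidation step in the diagram and keeps chunks of removed code from surfacing in later searches, at the cost of re-embedding unchanged chunks on every commit.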
Edited by Mikołaj Wawrzyniak