Integrate X Ray report into GitLab RAG Platform
Background
The Global Search group is leading the effort to build a RAG platform for all of GitLab (see the related epic and the architecture blueprint proposal). With this global effort in place, it is in the spirit of the efficiency and collaboration values for the Code Creation group to join it and integrate Repository X Ray report data into the GitLab RAG platform. Doing so should not only result in more efficient resource allocation, but also enable other AI features to integrate and reuse Repository X Ray data. For example, users could ask GitLab Duo Chat questions that could be answered with X Ray report data.
Goal
Integrate the existing Repository X Ray scan flow with the RAG platform.
Proof of Concept
A proof of concept was built in !144715 (closed). It contains many implementation details that can be helpful during implementation.
Implementation
Desired outcome
After this effort is completed, a Repository X Ray scan will be processed in the following manner:
sequenceDiagram
actor USR as User
participant RN as GitLab Runner
participant GLR as GitLab Rails
participant ES as Elasticsearch
participant PG as GitLab PostgreSQL DB
participant AIGW as AI Gateway
USR->>GLR: commits changes <br> to a package manager file <br>e.g. Gemfile.lock
GLR->>+RN: triggers Repository X Ray CI scanner job
loop for each batch of packages
RN->>GLR: Request packages description by AI
GLR->>AIGW: Forward request for packages description
AIGW->>GLR: Packages description
GLR->>RN: Forwards packages description
end
RN->>-GLR: Repository X Ray report
GLR->>+GLR: triggers Repository X Ray ingestion background job
rect rgb(0, 223, 0, .1)
note right of RN: Embeddings flow
opt with Elasticsearch available on an instance
GLR->>ES: Remove all packages in old X Ray report from ES index
loop for each package in the new X Ray report
GLR->>AIGW: Request embeddings for package description
AIGW->>GLR: Embeddings for package description
GLR->>ES: Adds new package document into ES index
end
end
end
GLR->>-PG: upserts xray_reports record
Later, the Repository X Ray report will be used as follows:
sequenceDiagram
actor USR as User
participant IDE
participant GLR as GitLab Rails
participant ES as Elasticsearch
participant PG as PostgreSQL
participant AIGW as AI Gateway
USR->>+IDE: types: "#35; generate method that fetches <br>top charts from Spotify"
IDE->>+GLR: trigger code generation for "#35; generate method <br>that fetches top charts from Spotify"
alt with Elasticsearch available on an instance
rect rgb(0, 223, 0, .1)
note left of GLR: new embeddings flow
GLR->>AIGW: fetch embedding for instruction "in utils.js generate method that ...
AIGW->>GLR: embeddings vector for instruction
GLR->>ES: fetches KNN chunks for instruction embeddings vector
ES->>GLR: spotify/web-api-ts-sdk - A package that wraps ...
GLR->>AIGW: code generation request with prompt including <br>spotify/web-api-ts-sdk - A package that wraps...
end
else
rect rgb(128, 128, 128, .1)
note left of GLR: current flow as fallback
GLR->>PG: fetch X Ray report for project and language
PG->>GLR: xray_reports record
GLR->>AIGW: code generation request with prompt including first 50 <br> entities from X Ray report
end
end
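To make the "fetches KNN chunks" step more concrete, here is a minimal sketch of what a nearest-neighbour lookup over package embeddings does. It is a brute-force illustration only: in the proposed flow Elasticsearch would run the search inside its index, and the method names, document shape, and three-dimensional vectors below are all hypothetical.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Returns the k documents whose :embedding is most similar to query_vector,
# most similar first. Brute force; a stand-in for the Elasticsearch KNN query.
def nearest_packages(documents, query_vector, k: 3)
  documents.max_by(k) { |doc| cosine_similarity(doc[:embedding], query_vector) }
end

# Hypothetical toy index; real embeddings have many more dimensions.
index = [
  { name: 'spotify/web-api-ts-sdk', embedding: [0.9, 0.1, 0.0] },
  { name: 'lodash',                 embedding: [0.0, 1.0, 0.0] },
  { name: 'express',                embedding: [0.1, 0.2, 0.9] }
]
query = [1.0, 0.0, 0.0] # pretend embedding of the user's instruction

nearest_packages(index, query, k: 2).each { |doc| puts doc[:name] }
# prints spotify/web-api-ts-sdk, then express
```

Only the descriptions of the returned packages would then be added to the code generation prompt, instead of a fixed prefix of the report.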
**Current state**
Currently, a Repository X Ray scan is processed as shown in the diagram below:
sequenceDiagram
actor USR as User
participant RN as GitLab Runner
participant GLR as GitLab Rails
participant PG as GitLab PostgreSQL DB
participant AIGW as AI Gateway
USR->>GLR: commits changes <br> to a package manager file <br>e.g. Gemfile.lock
GLR->>+RN: triggers Repository X Ray CI scanner job
loop for each batch of packages
RN->>GLR: Request packages description by AI
GLR->>AIGW: Forward request for packages description
AIGW->>GLR: Packages description
GLR->>RN: Forwards packages description
end
RN->>-GLR: Repository X Ray report
GLR->>+GLR: triggers Repository X Ray ingestion background job
GLR->>-PG: upserts xray_reports record
The report is later used as shown below:
sequenceDiagram
actor USR as User
participant IDE
participant PG as GitLab PostgreSQL DB
participant GLR as GitLab Rails
participant AIGW as AI Gateway
USR->>+IDE: types: "#35; generate a function that transposes a matrix"
IDE->>+GLR: trigger code generation for line ` "#35; generate function `
GLR->>PG: fetch X Ray report for project and language
PG->>GLR: xray_reports record
GLR->>GLR: include first 50 entities from xray report into code generation prompt
GLR->>-AIGW: trigger code generation ` "#35; generate function `
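The "first 50 entities" step of the current flow can be sketched roughly as follows. This is an illustration under assumptions, not the actual GitLab Rails code: the report shape (a `libs` array of `name` / `description` pairs), the method name, and the prompt wording are hypothetical. Only the limit of 50 comes from the diagram above.

```ruby
# Builds the prompt section that lists libraries from an X Ray report.
# `report` is assumed to be a Hash with a 'libs' array of
# { 'name' => ..., 'description' => ... } entries (hypothetical shape).
def xray_prompt_section(report, limit: 50)
  # Take entities in stored order; relevance to the request is not considered.
  libs = report.fetch('libs', []).first(limit)
  return '' if libs.empty?

  lines = libs.map { |lib| "#{lib['name']}: #{lib['description']}" }
  "Libraries available in this project:\n#{lines.join("\n")}"
end

report = {
  'libs' => [
    { 'name' => 'rails', 'description' => 'Web application framework' },
    { 'name' => 'sidekiq', 'description' => 'Background job processing' }
  ]
}
puts xray_prompt_section(report)
```

Because the cut is positional, a relevant package listed after the 50th entry never reaches the prompt, which is the gap the embeddings flow is meant to close.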
Required changes
On X Ray write path
- Create a new AI Gateway endpoint that generates embeddings
- Make sure that Repository X Ray data can be saved into Elasticsearch within the existing integration:
  - Either #442197 (closed) gets delivered
  - Or the `xray_reports` table has to be migrated to a new structure in which each record represents a single entity (package/library) in the Repository X Ray report, which would be compatible with the current Elasticsearch upload pipeline
- Modify `Ai::StoreRepositoryXrayService` to perform the following actions:
  - Check if Elasticsearch is available on the instance; if not, the later steps can be skipped
  - Check if a Repository X Ray report for the given project and language exists. If it does, remove the documents that represent the old X Ray report from the Elasticsearch index
  - Request embeddings from AI Gateway for the entities in the new X Ray report
  - Add documents for the new X Ray report entities into the ES index
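The write-path steps above can be sketched as a single method. This is a simplified illustration, not the real `Ai::StoreRepositoryXrayService`: the Elasticsearch index is modelled as a plain Array (or `nil` when Elasticsearch is not available), `embed` is any callable standing in for the AI Gateway embeddings request, and the method name and document fields are hypothetical. The existing PostgreSQL upsert is left out.

```ruby
# Sketch of the embeddings part of X Ray report ingestion.
#
# index: Array of document Hashes standing in for the ES index,
#        or nil when Elasticsearch is not available (assumption).
# embed: callable(description) -> embeddings vector; in the proposal this
#        would be a request to the new AI Gateway endpoint.
def ingest_xray_report(project_id:, lang:, libs:, index:, embed:)
  return :skipped if index.nil? # no Elasticsearch: skip the embeddings flow

  # Remove documents that belong to the old report for this project/language.
  index.reject! { |doc| doc[:project_id] == project_id && doc[:lang] == lang }

  # Add one document per entity in the new report.
  libs.each do |lib|
    index << {
      project_id: project_id,
      lang: lang,
      name: lib['name'],
      description: lib['description'],
      embedding: embed.call(lib['description'])
    }
  end
  :indexed
end

# Usage with a fake embedder that returns a one-element vector.
fake_index = []
fake_embed = ->(text) { [text.length.to_f] }
libs = [{ 'name' => 'rails', 'description' => 'Web application framework' }]
ingest_xray_report(project_id: 1, lang: 'ruby', libs: libs,
                   index: fake_index, embed: fake_embed)
puts fake_index.size
```

Deleting before inserting keeps the index free of packages that were dropped from the dependency file. A real implementation would also need to decide what happens when an embeddings request fails partway through the loop.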
On X Ray read path
- When the generation trigger type is `comment`, the IDE / LS should send the code generation instruction (the content of the comment that triggered code generation)
- The GitLab Rails Code Suggestions API needs to add an optional string parameter `instruction`
- GitLab Rails needs to detect if Elasticsearch is available and:
  - When Elasticsearch is available, use semantic search to retrieve the most relevant context
  - When Elasticsearch is not available, fall back to the existing flow based on the PostgreSQL DB
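The read-path decision can be sketched as below. The method name and both callables are hypothetical stand-ins: `search` represents the semantic search against Elasticsearch and `fallback` represents reading the stored report from PostgreSQL. Falling back when no `instruction` was sent is an assumption of this sketch, made because the parameter is optional; the proposal does not spell that case out.

```ruby
# Chooses the X Ray context for a code generation request.
#
# search:   callable(instruction) -> Array of package descriptions, or nil
#           when Elasticsearch is not available (hypothetical interface).
# fallback: callable() -> Array of package descriptions from the stored report.
def xray_context(instruction:, search:, fallback:, limit: 50)
  if search && instruction && !instruction.empty?
    # New flow: semantic search for the packages most relevant to the instruction.
    search.call(instruction)
  else
    # Existing flow: first `limit` entities of the stored report.
    fallback.call.first(limit)
  end
end

stored = ['rails: Web application framework', 'sidekiq: Background jobs']
context = xray_context(
  instruction: 'generate method that fetches top charts from Spotify',
  search: nil, # pretend Elasticsearch is not available
  fallback: -> { stored }
)
puts context
```

Keeping both sources behind the same return shape means the prompt-building code does not need to know which path produced the context.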