WIP - Context and embeddings abstraction layer : Product requirements (#15265)

**Note:** This issue is meant to pair with the [Context and embeddings layer Blueprint MR](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/8567 "Added abstraction layer design doc"), providing the product requirements that the technical requirements map to.

## Problems to solve

The current approach to context creation and embeddings at GitLab lacks the flexibility and adaptability needed to support diverse AI-driven features across different domains. Specifically:

1. Domain teams need to be able to easily customize indexing and embedding processes for their specific use cases.
2. Creating and managing embeddings should be simple for domain teams, requiring no specialized knowledge.
3. The layer should support creating and managing rich meta-information, especially for repositories.
4. Retrieving embeddings should be straightforward, and multiple approaches should be supported (e.g., hybrid search).
5. Content processing for embeddings should be flexible, providing options for what and how to index.
6. The layer should support long-term context management for user interactions and workflows.

## Initial use cases

<details>
<summary>
\[Click to expand\] We have identified three initial use cases the abstraction layer will support, focused on source code context:
</summary>

1. Where does code not adhere to a standard (e.g., a style guide or API docs)?
2. Where can I reduce code duplication?
3. Where can a codebase be optimized?

These use cases also leverage our platform strengths, like access to customers' entire codebases.

### Other suggestions from the comments

From `@achueshev`:

* **Better CI log analysis:** We can apply semantic chunking to our CI logs to retrieve definitions of all existing problems and analyze them in isolation (tree search) before creating the final Duo Workflow plan.
* **Issues, epics, and MR content:** Searching for issues, epics, and MRs related to the given query.
  Duo Workflow can explore recently merged MRs and related issues to find potential solutions and similar problems.
* **Code Search:** For Duo Workflow, we potentially need to identify and analyze the context related to the problem. We can use several approaches to locate the right context: summarizing code files with an LLM, creating embeddings, and code parsing. I believe all three approaches are useful simultaneously.

From `@mikolaj_wawrzyniak`:

* **Broadly useful, "general" search tools:** "I believe there is great opportunity to get two goals for a price of one. What I mean is that any improvement to existing one or a new search tools available to GitLab users, can also almost seamlessly be plugged into Duo Workflow. In that context building Duo Workflow only search tools seems as wasted opportunity (at least for a majority of use cases)."

</details>

## Iteration plan

<details>
<summary>
\[Click to expand\] **Note:** Iteration plan discussion and decision documented [here](https://gitlab.com/gitlab-org/gitlab/-/issues/500108 "[Discussion] Vector Store and Abstraction Layer Iteration Plan").
</summary>

### Guidelines

1. Make sure that feature teams access underlying vector stores via an abstraction. We don't want to `build a direct connection to Elastic and then migrate to abstraction later`.
   1. The current plan is to proceed with Elastic as the initial store due to the Global Search team's experience and existing implementation efforts.
2. Ensure that we only embed information relevant to paying Duo users as opposed to all information across dot-com.
3. Timeline estimates have been provided in https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/9225#note_2171940021.

### Initial Iteration

1. Global Search will align with feature teams on a minimal set of context-related use cases for our initial iteration.
2. Global Search will identify / define:
   1. Underlying platform capabilities to be made available via the _abstraction layer_ in order to enable the prioritized use cases defined in (1)
   2. The fastest time-to-develop _vector store_ to enable the use cases from (1)
3. Global Search will align with feature teams on the stable interface for an abstraction layer that implements the agreed-upon platform capabilities from (2).
4. The definition of done for releasing our initial vector store features includes:
   1. Context-related features, built on top of an abstraction layer,
   2. AI abstraction layer, built on top of a single initial vector store, and
   3. One vector store (Elastic)

### Subsequent Iterations

1. Feature teams identify additional capabilities required from the abstraction layer.
2. Global Search prioritizes either:
   1. Additional capabilities extending the abstraction layer, and/or
   2. Additional vector stores (e.g., PGVector)

</details>

### Maturity stages

* **Minimal:** The smallest deliverable value that allows teams to develop against the abstraction layer.
* **Viable:** Once this stage is reached, the features developed against the abstraction layer should be functional at dot-com scale, and there should be a coherent story for self-managed customers to adopt it as well. At this stage, it should be possible for teams to self-serve future features, as long as minimal customization is required.
* **Competitive:** Each phase of development work to get us to this stage can be done in any order, should priorities change.

### Development phases

<table>
<tr>
<th>Phases to be completed</th>
<th>Minimal</th>
<th>Viable</th>
<th>Competitive</th>
</tr>
<tr>
<td>

[**Phase 1 : Foundational Capabilities**](https://gitlab.com/groups/gitlab-org/-/epics/15289 "Phase 1 : Foundational Capabilities")**:** Deliver capabilities for indexing, embedding generation, and retrieval across various content types using both keyword and semantic search methods.
* **Iteration 1:** GitLab documentation indexing and hybrid search
* **Iteration 2:** Code and repository context, including code embedding generation, indexing, and hybrid search
* **Iteration 3:** Merge request

</td>
<td> :large_green_circle: </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>

[**Phase 2 : Platformization**](https://gitlab.com/groups/gitlab-org/-/epics/15291 "Phase 2 : Platformization")**:** Deliver operational aspects of a vector search platform, including hybrid search functionality, real-time updates, scalability, security, API usability, and monitoring capabilities.

* **Iteration 1:** Expand API and interface for domain teams
* **Iteration 2:** Access control and security
* **Iteration 3:** Scalability and performance

</td>
<td> </td>
<td> :large_green_circle: </td>
<td> </td>
</tr>
<tr>
<td>

[**Phase 3 : Advanced Repository and Code Context Handling**](https://gitlab.com/groups/gitlab-org/-/epics/15293 "Phase 3 : Advanced Repository and Code Context Handling")**:** Deliver support for code-centric AI features, including LLM explanations for code elements, dependency tracking, hierarchical metadata management, and code quality metrics integration.

* **Iteration 1:** Natural language code search support

</td>
<td> </td>
<td> :large_green_circle: </td>
<td> </td>
</tr>
<tr>
<td>

[**Phase 4 : Customization and Adaptability**](https://gitlab.com/groups/gitlab-org/-/epics/15290 "Phase 4 : Customization and Adaptability")**:** Deliver facilities for content processing, including configurable indexing strategies, chunking mechanisms, preprocessing options, and selectable embedding models to optimize search quality for different content types.
</td>
<td> </td>
<td> </td>
<td> :large_green_circle: </td>
</tr>
<tr>
<td>

[**Phase 5 : Enrichment and Meta-context Generation**](https://gitlab.com/groups/gitlab-org/-/epics/15292 "Phase 5 : Enrichment and Meta-context Generation")**:** Deliver support for content enrichment features, including LLM-based summaries, metadata storage, and knowledge graph integration to enhance search context and relationships between content.

</td>
<td> </td>
<td> </td>
<td> :large_green_circle: </td>
</tr>
<tr>
<td>

[**Phase 6+ : Future Considerations**](https://gitlab.com/groups/gitlab-org/-/epics/15294 "Phase 6+ : Future Considerations")**:** Deliver user experience enhancements, including personalized search results, multilingual support, and integration capabilities with external platforms.

</td>
<td> </td>
<td> </td>
<td> :large_green_circle: </td>
</tr>
</table>

## Product requirements

_The below epics in the_ :point_down: **_Child items section_** :point_down: _are broken into numbered phases in rough priority order and subject to change based on stakeholder feedback. We will split these items into multiple iterations. We are targeting 2024-10-04 for an initial iteration plan, but in the meantime the numbered phases are an approximation._

_There are two cross-initiative epics at the top of the section:_ [**_Future Considerations_**](https://gitlab.com/groups/gitlab-org/-/epics/15294 "Phase 6+ : Future Considerations") _and_ [**_Phase TBD Requirements_**](https://gitlab.com/groups/gitlab-org/-/epics/15291 "Phase 2 : Platformization")_, which contain requirements that could feasibly be put in any phase, depending on business priority._ **_Phase TBD requirements_**_, specifically, are ones we think need to be completed to call the system "done", but depending on priorities there is some flexibility on the "when"._
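
### Illustrative sketch: abstraction layer with hybrid search

To make two of the requirements above more concrete (feature teams reaching the vector store only through an abstraction, and hybrid search combining keyword and semantic retrieval), here is a minimal Python sketch. It is not taken from the Blueprint MR: every name in it (`Document`, `VectorStore`, `InMemoryStore`, `hybrid_search`) is hypothetical, the in-memory backend merely stands in for a real store such as Elastic or PGVector, embeddings are assumed to be precomputed, and reciprocal rank fusion is just one common way to merge the two rankings.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    embedding: tuple[float, ...]  # assumed to be precomputed elsewhere


class VectorStore(Protocol):
    """Hypothetical interface that feature teams would code against."""

    def index(self, doc: Document) -> None: ...
    def keyword_search(self, query: str, limit: int) -> list[str]: ...
    def semantic_search(self, embedding: tuple[float, ...], limit: int) -> list[str]: ...


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_ids(scored: list[tuple[float, str]], limit: int) -> list[str]:
    # Highest score first; ties are broken by id so results are deterministic.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked[:limit]]


class InMemoryStore:
    """Toy backend; a real one would wrap Elastic, PGVector, or similar."""

    def __init__(self) -> None:
        self._docs: dict[str, Document] = {}

    def index(self, doc: Document) -> None:
        self._docs[doc.doc_id] = doc

    def keyword_search(self, query: str, limit: int) -> list[str]:
        # Score by the number of query terms that appear in the document.
        terms = set(query.lower().split())
        scored = [
            (float(len(terms & set(d.text.lower().split()))), d.doc_id)
            for d in self._docs.values()
        ]
        return top_ids([pair for pair in scored if pair[0] > 0], limit)

    def semantic_search(self, embedding: tuple[float, ...], limit: int) -> list[str]:
        scored = [(cosine(embedding, d.embedding), d.doc_id) for d in self._docs.values()]
        return top_ids(scored, limit)


def hybrid_search(
    store: VectorStore,
    query: str,
    embedding: tuple[float, ...],
    limit: int = 3,
    k: int = 60,
) -> list[str]:
    """Merge keyword and semantic rankings with reciprocal rank fusion."""
    fused: dict[str, float] = {}
    rankings = (store.keyword_search(query, limit), store.semantic_search(embedding, limit))
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return top_ids([(score, doc_id) for doc_id, score in fused.items()], limit)


# Usage: the caller only touches the VectorStore interface, never the backend.
store = InMemoryStore()
store.index(Document("a", "style guide for ruby code", (1.0, 0.0)))
store.index(Document("b", "reduce code duplication", (0.0, 1.0)))
store.index(Document("c", "optimize ci pipeline", (0.7, 0.7)))
results = hybrid_search(store, "code duplication", (0.0, 1.0))
print(results)
```

Because `hybrid_search` depends only on the `VectorStore` protocol, swapping the backend would not require changes in feature code, which is the property the iteration plan guidelines ask for.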
