DRAFT: LLM-Powered Smart File Context System
NOTE: This represents a combination of the `LLM Shortlist` and `Annotated File Tree` strategies in the [CodeChat Arena](https://codebase-chat.shekharpatnaik.uk/)
## Overview
The Smart File Context System aims to enhance Duo Workflow's ability to understand codebases by creating and maintaining a context-aware index of repository files. The index lets LLMs quickly identify relevant files and include them as context when responding to user queries, which should improve codebase comprehension tasks.
Rather than requiring users to know exact file locations or search through repositories manually, this system leverages the LLM's understanding of natural language queries to identify the most relevant files in a codebase. The result is a more intuitive and effective code assistance experience.
## Objective
* Enable Duo Workflow to efficiently identify relevant files for user queries
* Create a persistent, annotated index of repository files with summaries and tags
## Architecture Components
### Language Server
* **Primary Role:** Create and maintain the file context database
* **Responsibilities:**
  * Analyze file content and generate summaries via LLM calls (AI gateway calls)
    * The LLM will provide a concise summary for each file based on its content
  * Store metadata in a local PostgreSQL database in the repository
    * Using PostgreSQL provides robustness, concurrency support, and advanced query capabilities
  * Update indexes when files change
    * These tasks will be done incrementally in the background
    * No need for eager updates; the system can prioritize files based on access patterns
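The incremental update behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: it assumes change detection by content hash, uses an in-memory dict as a stand-in for the PostgreSQL table, and takes a hypothetical `summarize` callable in place of the AI gateway call. `FileRecord` and `index_files` are illustrative names only.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FileRecord:
    """One row of the (hypothetical) file context table."""
    path: str
    content_hash: str
    summary: str


def content_hash(content: str) -> str:
    """Fingerprint file content so unchanged files can be skipped."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def index_files(
    files: Dict[str, str],
    index: Dict[str, FileRecord],
    summarize: Callable[[str, str], str],
) -> List[str]:
    """Update `index` in place and return the paths that were (re)summarized.

    `files` maps path -> current content. `summarize(path, content)` is an
    assumed wrapper around the LLM call; it is only invoked for new or
    changed files, which keeps background indexing cheap.
    """
    updated: List[str] = []
    for path, content in files.items():
        digest = content_hash(content)
        existing = index.get(path)
        if existing is not None and existing.content_hash == digest:
            continue  # unchanged since the last pass
        index[path] = FileRecord(path, digest, summarize(path, content))
        updated.append(path)
    # Drop entries for files that no longer exist in the repository.
    for path in list(index):
        if path not in files:
            del index[path]
    return updated
```

Running `index_files` twice over the same content should trigger no second round of summaries, which is the property that makes lazy background indexing viable.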
### Duo Workflow Executor
* **Primary Role:** Provide secure access to the database for the Duo Workflow Service
* **Responsibilities:**
  * Check for the existence of the file context database
    * If no database exists, notify the Duo Workflow Service so that it does not include the file context tool
  * Expose bounded database access actions:
    * `GetFiles` - stream files back to the workflow service
* **Exposed Actions**
  * `GetFiles`
    * **Purpose**: Stream files with summaries back to the workflow service
    * **Parameters**: Optional search parameters
    * **Returns**: Stream of file path/summary pairs
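As a rough sketch of the `GetFiles` contract, the action can be modelled as a generator that yields path/summary pairs one at a time. The `search` parameter and its case-insensitive substring semantics are assumptions for illustration; the actual search parameters are not yet defined. `rows` stands in for a database cursor.

```python
from typing import Iterable, Iterator, Optional, Tuple


def get_files(
    rows: Iterable[Tuple[str, str]],
    search: Optional[str] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (path, summary) pairs lazily, optionally filtered.

    `rows` is any iterable of (path, summary) pairs, e.g. a database
    cursor. `search` is a hypothetical optional filter matched as a
    case-insensitive substring of the path or the summary.
    """
    needle = search.lower() if search else None
    for path, summary in rows:
        if needle is None or needle in path.lower() or needle in summary.lower():
            yield path, summary
```

Because the function is a generator, the consumer can start chunking before the full result set has been read.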
### Duo Workflow Service
* **Primary Role:** Provide intelligence layer for file discovery
* **Responsibilities:**
  * Process the streamed file results from the executor
  * Chunk files appropriately for LLM context limitations
  * Use LLM to evaluate relevance of files to the query
  * Synthesize findings into a coherent response
* **Exposed Tools**
  * `FindRelevantFiles`
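The chunking responsibility could look something like the sketch below. It assumes a simple character budget as a proxy for the LLM context limit (a real implementation would more likely count tokens); `chunk_summaries` and `max_chars` are illustrative names.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (file path, summary)


def chunk_summaries(pairs: Iterable[Pair], max_chars: int = 4000) -> Iterator[List[Pair]]:
    """Group streamed (path, summary) pairs into chunks under a size budget.

    The budget is measured in characters as a rough stand-in for tokens.
    A single pair larger than the budget is still emitted, alone in its
    own chunk, so no file is silently dropped.
    """
    chunk: List[Pair] = []
    size = 0
    for path, summary in pairs:
        cost = len(path) + len(summary)
        if chunk and size + cost > max_chars:
            yield chunk
            chunk, size = [], 0
        chunk.append((path, summary))
        size += cost
    if chunk:
        yield chunk
```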
## Flow Diagram
### Background Indexing Process (Language Server)
```mermaid
sequenceDiagram
participant LanguageServer as Language Server
participant DuoWorkflowExecutor as Duo Workflow Executor
Note over LanguageServer,DuoWorkflowExecutor: Background Indexing Process
LanguageServer->>DuoWorkflowExecutor: Scan repository for files to index
Note right of DuoWorkflowExecutor: Internally: List files in repo
DuoWorkflowExecutor-->>LanguageServer: File list
loop For each file
LanguageServer->>DuoWorkflowExecutor: Read file content
Note right of DuoWorkflowExecutor: Internally: Access file system
DuoWorkflowExecutor-->>LanguageServer: File content
LanguageServer->>LanguageServer: Generate summary via LLM
LanguageServer->>DuoWorkflowExecutor: Store file metadata and summary
Note right of DuoWorkflowExecutor: Internally: Update PostgreSQL DB
end
Note over LanguageServer,DuoWorkflowExecutor: Incremental updates
LanguageServer->>DuoWorkflowExecutor: Monitor for file changes
DuoWorkflowExecutor-->>LanguageServer: Change notifications
LanguageServer->>DuoWorkflowExecutor: Update affected entries
Note right of DuoWorkflowExecutor: Internally: Update DB entries
```
### User Query and File Discovery Process
```mermaid
sequenceDiagram
participant User
participant DuoWorkflowExecutor as Duo Workflow Executor
participant DuoWorkflowService as Duo Workflow Service
participant LLM
Note over User,LLM: File Discovery Process
User->>DuoWorkflowExecutor: "How does authentication work in this codebase?"
DuoWorkflowExecutor->>DuoWorkflowExecutor: Check if DB exists
Note right of DuoWorkflowExecutor: Internally: Check for PostgreSQL DB
DuoWorkflowExecutor->>DuoWorkflowService: Start workflow (with file context DB status)
alt File Context DB Available
DuoWorkflowService->>DuoWorkflowExecutor: GetFiles()
Note right of DuoWorkflowExecutor: Internally: Query PostgreSQL DB<br>for files with summaries
loop File streaming
DuoWorkflowExecutor-->>DuoWorkflowService: Stream file path/summary pairs
DuoWorkflowService->>DuoWorkflowService: Accumulate files for chunking
end
Note over DuoWorkflowService,LLM: Multi-stage filtering process based on summaries only
loop Process files in chunks
DuoWorkflowService->>LLM: Evaluate chunk relevance
Note right of LLM: Input: query + chunk of file summaries
LLM-->>DuoWorkflowService: Filtered candidates with relevance scores
end
DuoWorkflowService->>LLM: Final filtering round
Note right of LLM: Input: all promising candidates from previous rounds
LLM-->>DuoWorkflowService: Final ranked list of relevant files
DuoWorkflowService->>LLM: Generate synthesis
LLM-->>DuoWorkflowService: Coherent explanation of findings
else No File Context DB
Note over DuoWorkflowService: FindRelevantFiles tool is disabled
DuoWorkflowService->>DuoWorkflowService: Fall back to standard file operations
end
DuoWorkflowService-->>DuoWorkflowExecutor: Response with file analysis
DuoWorkflowExecutor-->>User: Relevant files with synthesis
```
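The multi-stage filtering shown in the diagram can be sketched as below. The `score_chunk` callable is a hypothetical wrapper around the LLM call that returns a relevance score between 0 and 1 per path; `threshold` and `top_k` are assumed tuning parameters, and the sketch assumes the surviving candidates fit into a single final prompt.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (file path, summary)
# Hypothetical scorer: wraps an LLM call, returns (path, relevance in [0, 1]).
Scorer = Callable[[str, Sequence[Pair]], List[Tuple[str, float]]]


def find_relevant_files(
    query: str,
    chunks: Iterable[Sequence[Pair]],
    score_chunk: Scorer,
    threshold: float = 0.5,
    top_k: int = 10,
) -> List[str]:
    """Two-stage filter over file summaries only (no file contents)."""
    # Stage 1: score each chunk independently and keep promising files.
    candidates: Dict[str, str] = {}
    for chunk in chunks:
        known = dict(chunk)
        for path, score in score_chunk(query, chunk):
            # Ignore paths the scorer invented and anything below the cut-off.
            if path in known and score >= threshold:
                candidates[path] = known[path]
    if not candidates:
        return []
    # Stage 2: re-score all survivors together to get one global ranking.
    final_scores = score_chunk(query, list(candidates.items()))
    ranked = sorted(
        (ps for ps in final_scores if ps[0] in candidates),
        key=lambda ps: ps[1],
        reverse=True,
    )
    return [path for path, _ in ranked[:top_k]]
```

The second round matters because scores from separate chunks are not necessarily comparable; ranking the finalists in one call gives the LLM a shared frame of reference.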
## Future Enhancements
### Tag-Based File Organization
As a potential follow-up improvement, we could implement a tag-based approach:
* **File Tagging System**:
  * Each file would be assigned multiple tags representing key concepts it contains
  * Tags would be automatically generated by the LLM based on file content
  * This would allow for more targeted file retrieval
* **Enhanced Query Flow**:
  * Get all available tags from the database
  * Use LLM to select the most relevant tags based on the user query
  * Retrieve files based on tag matches
  * This approach would provide a more semantic way to organize the codebase
* **Benefits**:
  * More efficient filtering before content analysis
  * Better handling of large codebases with thousands of files
  * Potentially less LLM processing required
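A minimal sketch of the tag-based query flow, assuming tags are stored as a set per file. The tag selection step (the LLM picking relevant tags) is left out; `all_tags` produces its input and `files_for_tags` consumes its output. Both function names and the ranking by number of matching tags are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set


def all_tags(file_tags: Dict[str, Set[str]]) -> List[str]:
    """Collect every distinct tag, sorted, to present to the LLM."""
    return sorted(set().union(*file_tags.values()))


def files_for_tags(file_tags: Dict[str, Set[str]], selected: Iterable[str]) -> List[str]:
    """Return files sharing at least one selected tag.

    Files matching more tags come first; ties are broken by path so the
    result is deterministic.
    """
    wanted = set(selected)
    matches = [
        (len(tags & wanted), path)
        for path, tags in file_tags.items()
        if tags & wanted
    ]
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [path for _, path in matches]
```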
### Full-Text Search Integration
Another potential enhancement would be to integrate PostgreSQL's full-text search capabilities:
* Use text similarity search to find files with content related to the query
* Combine with LLM analysis for a hybrid retrieval approach
* Potentially faster initial filtering with database-level search
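As a rough illustration of database-level prefiltering, the query below uses PostgreSQL's built-in `to_tsvector`, `plainto_tsquery`, and `ts_rank` functions over the stored summaries. The table name `file_context`, the columns `path` and `summary`, and the `%s` placeholder style (as used by psycopg) are assumptions; the real schema is not yet defined.

```python
from typing import Tuple

# Hypothetical schema: file_context(path TEXT, summary TEXT).
FULL_TEXT_QUERY = """
SELECT path, summary,
       ts_rank(to_tsvector('english', summary),
               plainto_tsquery('english', %s)) AS rank
FROM file_context
WHERE to_tsvector('english', summary) @@ plainto_tsquery('english', %s)
ORDER BY rank DESC
LIMIT %s
"""


def build_prefilter_query(user_query: str, limit: int = 200) -> Tuple[str, tuple]:
    """Return (sql, params) for a parameterized full-text prefilter.

    The user query is passed as a bound parameter, never interpolated
    into the SQL string, to avoid injection.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return FULL_TEXT_QUERY, (user_query, user_query, limit)
```

The rows returned by such a query would then feed the LLM relevance filtering, so the LLM only sees a prefiltered subset of summaries.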
### Code Structure Analysis (aka local knowledge graph)
If we integrate with a static analysis tool to extract structural information, we could enable retrieval based on architectural patterns, control flow, etc. This could satisfy queries like "find all callers of function X". Though this might belong as a separate tool, there may be justification for this belonging to the "find relevant files" use case. (cc: @michaelangeloio)
## Other Notes
### Fallback Mechanisms
If the file context database isn't available (e.g. the language server hasn't indexed the repository), the `FindRelevantFiles` tool should be disabled (TODO: I'm not sure that disabling tools has been implemented).
As a result, Duo Workflow will fall back to using existing methods.
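One possible shape for this fallback, assuming the executor reports database availability as a boolean when the workflow starts. The function name and the base tool names in the usage comment are hypothetical; only `FindRelevantFiles` comes from this proposal.

```python
from typing import List, Sequence

FILE_CONTEXT_TOOL = "FindRelevantFiles"


def available_tools(base_tools: Sequence[str], file_context_db_available: bool) -> List[str]:
    """Build the tool list for a workflow run.

    The file context tool is only offered when the executor reported
    that the database exists; otherwise the workflow keeps just its
    existing tools.
    """
    tools = list(base_tools)
    if file_context_db_available and FILE_CONTEXT_TOOL not in tools:
        tools.append(FILE_CONTEXT_TOOL)
    return tools


# Example with made-up base tool names:
# available_tools(["read_file", "list_dir"], file_context_db_available=False)
# keeps only the two existing tools.
```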