DRAFT: LLM-Powered Smart File Context System
NOTE: This represents a combination of the `LLM Shortlist` and `Annotated File Tree` strategies in the [CodeChat Arena](https://codebase-chat.shekharpatnaik.uk/)

## Overview

The Smart File Context System aims to enhance Duo Workflow's ability to understand codebases by creating and maintaining a context-aware index of repository files. This system lets LLMs quickly identify and include relevant files as context when responding to user queries, which improves codebase comprehension tasks.

Rather than requiring users to know exact file locations or search tediously through repositories, this system leverages the LLM's understanding of natural language queries to identify the most relevant files in a codebase. The result is a more intuitive and effective code assistance experience.

## Objective

* Enable Duo Workflow to efficiently identify relevant files for user queries
* Create a persistent, annotated index of repository files with summaries and tags

## Architecture Components

### Language Server

* **Primary Role:** Create and maintain the file context database
* **Responsibilities:**
  * Analyze file content and generate summaries via LLM calls (AI gateway calls)
    * The LLM will provide concise summaries for each file based on content
  * Store metadata in a local PostgreSQL database in the repository
    * Using PostgreSQL provides robustness, concurrency support, and advanced query capabilities
  * Update indexes when files change
    * These tasks will be done incrementally in the background
    * No need for eager updates; the system can prioritize files based on access patterns

### Duo Workflow Executor

* **Primary Role:** Provide secure access to the database for Duo Workflow Service
* **Responsibilities:**
  * Check for existence of the file context database
    * If no DB exists, notify the Duo Workflow Service so that it does not include the file context tool
  * Expose bounded DB access actions:
    * `GetFiles` - stream files back to the workflow service
* **Exposed Actions**
  * `GetFiles`
    * **Purpose**: Stream files with summaries back to the workflow service
    * **Parameters**: Optional search parameters
    * **Returns**: Stream of file path/summary pairs

### Duo Workflow Service

* **Primary Role:** Provide intelligence layer for file discovery
* **Responsibilities:**
  * Process the streamed file results from the executor
  * Chunk files appropriately for LLM context limitations
  * Use LLM to evaluate relevance of files to the query
  * Synthesize findings into a coherent response
* **Exposed Tools**
  * `FindRelevantFiles`

## Flow Diagram

### Background Indexing Process (Language Server)

```mermaid
sequenceDiagram
    participant LanguageServer as Language Server
    participant DuoWorkflowExecutor as Duo Workflow Executor

    Note over LanguageServer,DuoWorkflowExecutor: Background Indexing Process
    LanguageServer->>DuoWorkflowExecutor: Scan repository for files to index
    Note right of DuoWorkflowExecutor: Internally: List files in repo
    DuoWorkflowExecutor-->>LanguageServer: File list

    loop For each file
        LanguageServer->>DuoWorkflowExecutor: Read file content
        Note right of DuoWorkflowExecutor: Internally: Access file system
        DuoWorkflowExecutor-->>LanguageServer: File content
        LanguageServer->>LanguageServer: Generate summary via LLM
        LanguageServer->>DuoWorkflowExecutor: Store file metadata and summary
        Note right of DuoWorkflowExecutor: Internally: Update PostgreSQL DB
    end

    Note over LanguageServer,DuoWorkflowExecutor: Incremental updates
    LanguageServer->>DuoWorkflowExecutor: Monitor for file changes
    DuoWorkflowExecutor-->>LanguageServer: Change notifications
    LanguageServer->>DuoWorkflowExecutor: Update affected entries
    Note right of DuoWorkflowExecutor: Internally: Update DB entries
```

### User Query and File Discovery Process

```mermaid
sequenceDiagram
    participant User
    participant DuoWorkflowExecutor as Duo Workflow Executor
    participant DuoWorkflowService as Duo Workflow Service
    participant LLM

    Note over User,LLM: File Discovery Process
    User->>DuoWorkflowExecutor: "How does authentication work in this codebase?"
    DuoWorkflowExecutor->>DuoWorkflowExecutor: Check if DB exists
    Note right of DuoWorkflowExecutor: Internally: Check for PostgreSQL DB
    DuoWorkflowExecutor->>DuoWorkflowService: Start workflow (with file context DB status)

    alt File Context DB Available
        DuoWorkflowService->>DuoWorkflowExecutor: GetFiles()
        Note right of DuoWorkflowExecutor: Internally: Query PostgreSQL DB<br>for files with summaries

        loop File streaming
            DuoWorkflowExecutor-->>DuoWorkflowService: Stream file path/summary pairs
            DuoWorkflowService->>DuoWorkflowService: Accumulate files for chunking
        end

        Note over DuoWorkflowService,LLM: Multi-stage filtering process based on summaries only

        loop Process files in chunks
            DuoWorkflowService->>LLM: Evaluate chunk relevance
            Note right of LLM: Input: query + chunk of file summaries
            LLM-->>DuoWorkflowService: Filtered candidates with relevance scores
        end

        DuoWorkflowService->>LLM: Final filtering round
        Note right of LLM: Input: all promising candidates from previous rounds
        LLM-->>DuoWorkflowService: Final ranked list of relevant files

        DuoWorkflowService->>LLM: Generate synthesis
        LLM-->>DuoWorkflowService: Coherent explanation of findings
    else No File Context DB
        Note over DuoWorkflowService: FindRelevantFiles tool is disabled
        DuoWorkflowService->>DuoWorkflowService: Fall back to standard file operations
    end

    DuoWorkflowService-->>DuoWorkflowExecutor: Response with file analysis
    DuoWorkflowExecutor-->>User: Relevant files with synthesis
```

## Future Enhancements

### Tag-Based File Organization

As a potential follow-up improvement, we could implement a tag-based approach:

* **File Tagging System**:
  * Each file would be assigned multiple tags representing key concepts it contains
  * Tags would be automatically generated by the LLM based on file content
  * This would allow for more targeted file retrieval
* **Enhanced Query Flow**:
  * Get all available tags from the database
  * Use LLM to select the most relevant tags based on the user query
  * Retrieve files based on tag matches
  * This approach would provide a more semantic way to organize the codebase
* **Benefits**:
  * More efficient filtering before content analysis
  * Better handling of large codebases with thousands of files
  * Potentially less LLM processing required

### Full-Text Search Integration

Another potential enhancement would be to integrate PostgreSQL's full-text search capabilities:

* Use text similarity search to find files with content related to the query
* Combine with LLM analysis for a hybrid retrieval approach
* Potentially faster initial filtering with database-level search

### Code Structure Analysis (aka local knowledge graph)

If we integrate with a static analysis tool to extract structural information, we could enable retrieval based on architectural patterns, control flow, etc. This could satisfy queries like "find all callers of function X". Though this might belong in a separate tool, there may be justification for including it in the "find relevant files" use case. (cc: @michaelangeloio)

## Other Notes

### Fallback Mechanisms

If the file context database isn't available (e.g. the language server hasn't indexed the repository), the `FindRelevantFiles` tool should be disabled (TODO: I'm not sure that disabling tools has been implemented). As a result, Duo Workflow will fall back to its existing methods.
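
To make the fallback rule and the summary-only, chunked filtering from the query flow more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the planned implementation: `BASE_TOOLS`, `available_tools`, `chunked`, `find_relevant_files`, and `keyword_scorer` are hypothetical names, the default chunk size and threshold are arbitrary, and the LLM call is replaced by a pluggable scoring function so the example runs offline.

```python
from typing import Callable, Iterable, Iterator

# A (path, summary) pair, as the GetFiles action is described to stream them.
FileSummary = tuple[str, str]
# Stand-in for an LLM relevance call: (query, chunk) -> [(path, score), ...].
Scorer = Callable[[str, list[FileSummary]], list[tuple[str, float]]]

# Hypothetical names for the existing, non-indexed file tools.
BASE_TOOLS = ["read_file", "list_directory"]


def available_tools(db_exists: bool) -> list[str]:
    """Offer FindRelevantFiles only when the file context DB exists."""
    return BASE_TOOLS + (["FindRelevantFiles"] if db_exists else [])


def chunked(items: Iterable[FileSummary], size: int) -> Iterator[list[FileSummary]]:
    """Group a stream of file summaries into lists of at most `size` items."""
    chunk: list[FileSummary] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def find_relevant_files(
    query: str,
    files: Iterable[FileSummary],
    score_chunk: Scorer,
    chunk_size: int = 50,    # assumed value; would depend on context limits
    threshold: float = 0.5,  # assumed cut-off for a "promising" candidate
    top_k: int = 10,
) -> list[str]:
    """Two-stage filter: score each chunk, then re-rank the survivors together."""
    candidates: dict[str, str] = {}
    for chunk in chunked(files, chunk_size):
        known = dict(chunk)
        for path, score in score_chunk(query, chunk):
            # Ignore paths the scorer invented and low-scoring files.
            if path in known and score >= threshold:
                candidates[path] = known[path]
    if not candidates:
        return []
    # Final round: all promising candidates are scored in a single call.
    final = score_chunk(query, list(candidates.items()))
    ranked = sorted(final, key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in ranked if path in candidates][:top_k]


def keyword_scorer(query: str, chunk: list[FileSummary]) -> list[tuple[str, float]]:
    """Toy scorer: fraction of query terms that appear in each summary."""
    terms = set(query.lower().split())
    return [
        (path, len(terms & set(summary.lower().split())) / max(len(terms), 1))
        for path, summary in chunk
    ]
```

In a real deployment the scorer would wrap an LLM request, and the final round might itself need chunking if too many candidates survive the first pass.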