Trigger Code Generation API for Eval Dataset Generation
Problem to solve
As part of building a unit test server to generate enterprise-grade codebase datasets for code generation evaluations, we need to call the production code generation API on selected projects to mimic how the feature is used in IDEs.
Current State
- We have existing evaluations that call the API, but they're not based on enterprise-grade codebases
- The current approach doesn't replicate the sophisticated context-gathering logic that IDEs use
Proposed Solution
- Mimic IDE behavior: replicate the logic IDEs use when calling the code generation API (see the sketch after this list)
- Gather comprehensive context:
  - Content above the cursor position
  - Context from open tabs/files (if applicable)
  - Multi-file context as used in production
- Integrate with the evaluation pipeline, i.e. push results to Langsmith for analysis
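A minimal sketch of what the IDE-like context gathering and API call could look like. The payload field names (`content_above_cursor`, `open_tabs`, etc.), the endpoint URL, and the response shape are illustrative assumptions, not the actual production API contract.

```python
"""Sketch: gather IDE-like context and call the code generation API.

Field names and endpoint are hypothetical placeholders for this issue.
"""
from dataclasses import dataclass, field
from pathlib import Path

import requests


@dataclass
class CompletionContext:
    content_above_cursor: str                     # prefix text in the active file
    content_below_cursor: str                     # suffix text, if the API accepts it
    open_tab_snippets: list[dict] = field(default_factory=list)  # other open files


def build_context(file_path: Path, cursor_offset: int, open_tabs: list[Path]) -> CompletionContext:
    """Split the active file at the cursor and collect snippets from open tabs,
    roughly mirroring what an IDE sends alongside a completion request."""
    source = file_path.read_text(encoding="utf-8")
    snippets = [
        {"name": str(tab), "content": tab.read_text(encoding="utf-8")}
        for tab in open_tabs
        if tab != file_path
    ]
    return CompletionContext(
        content_above_cursor=source[:cursor_offset],
        content_below_cursor=source[cursor_offset:],
        open_tab_snippets=snippets,
    )


def request_completion(api_url: str, token: str, ctx: CompletionContext) -> str:
    """POST the gathered context to the code generation API (hypothetical schema)."""
    response = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {token}"},
        json={
            "content_above_cursor": ctx.content_above_cursor,
            "content_below_cursor": ctx.content_below_cursor,
            "open_tabs": ctx.open_tab_snippets,
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json().get("completion", "")
```

The exact fields (e.g. whether suffix content or open-tab snippets are accepted) should be confirmed against how the IDE extensions actually build their requests, which is the investigation step listed under Implementation Requirements.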
Implementation Requirements
- Investigate and document how IDEs gather context beyond cursor position
- Recreate the sophisticated context-gathering logic for evaluation purposes
- Integrate with the existing evaluation framework, pushing results to Langsmith (see the sketch after this list)
- Test with real enterprise codebases to ensure realistic evaluation datasets
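A minimal sketch of the Langsmith integration step, assuming the official `langsmith` Python client with an API key in the `LANGSMITH_API_KEY` environment variable; the dataset name and example structure are illustrative.

```python
"""Sketch: push generated examples to Langsmith as an evaluation dataset."""
from langsmith import Client


def push_examples(dataset_name: str, examples: list[dict]) -> None:
    """Create a dataset and add one example per generated completion.

    Each item in `examples` is assumed to look like:
    {"inputs": {"prefix": "...", "open_tabs": [...]}, "outputs": {"completion": "..."}}
    """
    client = Client()  # reads LANGSMITH_API_KEY from the environment
    dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Code generation eval examples from enterprise-grade codebases",
    )
    for example in examples:
        client.create_example(
            inputs=example["inputs"],
            outputs=example["outputs"],
            dataset_id=dataset.id,
        )
```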
Success Criteria
- Evaluation datasets generated from real codebases using production-like context
- Context gathering matches IDE behavior patterns
- Results successfully integrated into Langsmith for analysis
- Improved evaluation accuracy through realistic code generation scenarios
Links / references
Relates to gitlab-org/gitlab#553392