Trigger Code Generation API for Eval Dataset Generation

Problem to solve

As part of building a unit test server that generates enterprise-grade codebase datasets for code generation evaluations, we need to call the production code generation API on selected projects and mimic how the feature is used in IDEs.

Current State

  • We have existing evaluations that call the API, but they're not based on enterprise-grade codebases
  • The current approach doesn't replicate the sophisticated context-gathering logic that IDEs use

Proposed Solution

  • Mimic IDE behavior: replicate the logic IDEs use when calling the code generation API
  • Gather comprehensive context (see the sketch after this list):
    • Content above the cursor position
    • Context from open tabs/files (if applicable)
    • Multi-file context as used in production
  • Integrate with the evaluation pipeline, i.e. push results to Langsmith for analysis
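A minimal sketch of what the context-gathering step could produce, assuming a payload shape along these lines. The class, function, and field names below are illustrative placeholders, not the actual production API schema; the investigation step should confirm the real request format.

```python
from dataclasses import dataclass, field


@dataclass
class CodeGenContext:
    """Hypothetical container for the context an IDE gathers for one request."""
    content_above_cursor: str                  # prefix: everything above the cursor
    content_below_cursor: str = ""             # suffix, if the IDE sends one
    open_files: dict[str, str] = field(default_factory=dict)  # path -> snippet from open tabs
    file_path: str = ""                        # path of the file being edited


def build_request_payload(ctx: CodeGenContext, instruction: str) -> dict:
    """Assemble a request body resembling what an IDE might send.

    Field names here are placeholders; swap them for the production API's
    actual schema once it has been documented.
    """
    return {
        "file_name": ctx.file_path,
        "content_above_cursor": ctx.content_above_cursor,
        "content_below_cursor": ctx.content_below_cursor,
        "context": [
            {"type": "open_tab", "name": path, "content": snippet}
            for path, snippet in ctx.open_files.items()
        ],
        "instruction": instruction,
    }
```

Keeping the gathered context in one structure makes it straightforward to reuse the same inputs for both the API call and the evaluation dataset record.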

Implementation Requirements

  • Investigate and document how IDEs gather context beyond cursor position
  • Recreate the sophisticated context-gathering logic for evaluation purposes
  • Integrate with existing evaluation framework (results pushed to Langsmith; see the sketch after this list)
  • Test with real enterprise codebases to ensure realistic evaluation datasets
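For the Langsmith integration, one possible approach is to write each generated example into a Langsmith dataset via the `langsmith` Python client. The dataset name and record fields below are placeholders; the exact shape should follow whatever the existing evaluation framework expects.

```python
from langsmith import Client

# Hypothetical example data; in the real pipeline these would come from the
# context-gathering step and the code generation API response.
request_payload = {"file_name": "app/models/user.rb", "content_above_cursor": "class User\n"}
api_response = {"generated_code": "  def full_name\n    [first_name, last_name].join(' ')\n  end\n"}

# Assumes LANGSMITH_API_KEY is set in the environment.
client = Client()

# Placeholder dataset name; the naming convention is still to be decided.
dataset = client.create_dataset(
    dataset_name="enterprise-codegen-eval",
    description="Code generation examples gathered with production-like IDE context",
)

# One example per API call: the gathered context as inputs, the generated code as outputs.
client.create_example(
    dataset_id=dataset.id,
    inputs={"payload": request_payload},
    outputs=api_response,
)
```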

Success Criteria

  • Evaluation datasets generated from real codebases using production-like context
  • Context gathering matches IDE behavior patterns
  • Results successfully integrated into Langsmith for analysis
  • Improved evaluation accuracy through realistic code generation scenarios

Links / references

Relates to gitlab-org/gitlab#553392
