Trigger Code Generation API for Eval Dataset Generation

Problem to solve

As part of building a unit test server that generates enterprise-grade codebase datasets for code generation evaluations, we need to call the production code generation API on selected projects and mimic how the feature is used in IDEs.

Current State

  • We have existing evaluations that call the API, but they're not based on enterprise-grade codebases
  • The current approach doesn't replicate the sophisticated context-gathering logic that IDEs use

Proposed Solution

  • Mimic IDE behavior: replicate the logic IDEs use when calling the code generation API
  • Gather comprehensive context (see the sketch after this list):
    • Content above the cursor position
    • Context from open tabs/files (if applicable)
    • Multi-file context as used in production
  • Integrate with the evaluation pipeline, i.e. push results to Langsmith for analysis
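A minimal sketch of what the context-gathering step could produce, assuming a payload shape along these lines. The class, function, and field names below are illustrative placeholders, not the actual production API schema; the investigation step should confirm the real request format.

```python
from dataclasses import dataclass, field


@dataclass
class CodeGenContext:
    """Hypothetical container for the context an IDE gathers for one request."""
    content_above_cursor: str                  # prefix: everything above the cursor
    content_below_cursor: str = ""             # suffix, if the IDE sends one
    open_files: dict[str, str] = field(default_factory=dict)  # path -> snippet from open tabs
    file_path: str = ""                        # path of the file being edited


def build_request_payload(ctx: CodeGenContext, instruction: str) -> dict:
    """Assemble a request body resembling what an IDE might send.

    Field names here are placeholders; swap them for the production API's
    actual schema once it has been documented.
    """
    return {
        "file_name": ctx.file_path,
        "content_above_cursor": ctx.content_above_cursor,
        "content_below_cursor": ctx.content_below_cursor,
        "context": [
            {"type": "open_tab", "name": path, "content": snippet}
            for path, snippet in ctx.open_files.items()
        ],
        "instruction": instruction,
    }
```

Keeping the gathered context in one structure makes it straightforward to reuse the same inputs for both the API call and the evaluation dataset record.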

Implementation Requirements

  • Investigate and document how IDEs gather context beyond cursor position
  • Recreate the sophisticated context-gathering logic for evaluation purposes
  • Integrate with existing evaluation framework (results pushed to Langsmith; see the sketch after this list)
  • Test with real enterprise codebases to ensure realistic evaluation datasets
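For the Langsmith integration, one possible approach is to write each generated example into a Langsmith dataset via the `langsmith` Python client. The dataset name and record fields below are placeholders; the exact shape should follow whatever the existing evaluation framework expects.

```python
from langsmith import Client

# Hypothetical example data; in the real pipeline these would come from the
# context-gathering step and the code generation API response.
request_payload = {"file_name": "app/models/user.rb", "content_above_cursor": "class User\n"}
api_response = {"generated_code": "  def full_name\n    [first_name, last_name].join(' ')\n  end\n"}

# Assumes LANGSMITH_API_KEY is set in the environment.
client = Client()

# Placeholder dataset name; the naming convention is still to be decided.
dataset = client.create_dataset(
    dataset_name="enterprise-codegen-eval",
    description="Code generation examples gathered with production-like IDE context",
)

# One example per API call: the gathered context as inputs, the generated code as outputs.
client.create_example(
    dataset_id=dataset.id,
    inputs={"payload": request_payload},
    outputs=api_response,
)
```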

Success Criteria

  • Evaluation datasets generated from real codebases using production-like context
  • Context gathering matches IDE behavior patterns
  • Results successfully integrated into Langsmith for analysis
  • Improved evaluation accuracy through realistic code generation scenarios

Links / references

Relates to gitlab-org/gitlab#553392
