Automatic Evals Dataset: Enterprise Codebase Focused (Code Generation)
Overview
Related issue: #508167 (closed)
This POC aims to create a more realistic testing environment for code generation by simulating real-world enterprise codebases rather than relying on isolated code snippets.
Background
Our current code generation evaluation relies on standalone problem-solution pairs that don't adequately reflect how our customers interact with the feature in enterprise environments. Users typically need generated code that integrates with existing modules and infrastructure (e.g., databases) and follows established patterns within their codebase.
Objectives
- Validate a reverse-engineering approach to dataset creation
- Establish a reproducible methodology for creating enterprise-relevant test scenarios
- Create a functional testing framework for LLM-generated code during evaluations
Proposed Approach
We'll build a POC using:
- A simple public repository (proposed: ELI5)
- A single programming language (proposed: Python)
- Unit tests to validate code functionality
This combination offers an ideal testing environment because:
- ELI5 is a compact "non-production" project with infrastructure similar to our target model
- Python ranks among the most popular languages our customers use with code generation
- Team members already have Python experience
Implementation Plan
- Set up a test server with the selected repository, including the runtimes needed to run functional unit tests
- Use LLMs (e.g., ChatGPT) to generate problem-output pairs based on the actual infrastructure in the test server (see the example record below)
- Use another LLM (e.g., Claude) to develop unit tests for each problem (see the example test below)
- Establish a methodology for injecting the LLM-generated code into the server and running the unit tests against it (see the test-runner sketch below)
- Define and capture relevant evaluation metrics, e.g. pass/fail rate
- Document the process for future scaling
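
As a rough illustration of the problem-output pairs mentioned above, each pair could be stored as one JSONL record. The field names, the example task, and the referenced file paths below are hypothetical and would be refined during the POC:

```python
import json

# Hypothetical record format for a single problem-output pair; every field
# name and value below is illustrative, not a fixed schema.
record = {
    "problem_id": "eli5-0001",
    "repository": "ELI5",
    "language": "python",
    # Natural-language task that references real modules in the test server.
    "problem": (
        "Add a helper that loads a stored explanation from the existing "
        "persistence layer and returns its feature importances."
    ),
    # Files in the repository the generated code is expected to integrate with.
    "context_files": ["eli5/base.py", "eli5/formatters/__init__.py"],
    # Reference solution produced while reverse-engineering the problem.
    "reference_output": "def load_feature_importances(path):\n    ...",
    # Unit test that validates any candidate solution for this problem.
    "test_file": "tests/test_eli5_0001.py",
}

with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```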
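
The unit test developed for the example problem above might look roughly like this (the module name `eli5_0001_solution` and the function `load_feature_importances` are placeholders; the real tests would ideally reuse the repository's own fixtures and helpers):

```python
# Hypothetical pytest test for problem eli5-0001; all names are placeholders.
import json

import pytest

# The injected candidate solution is imported like any other module in the repo.
from eli5_0001_solution import load_feature_importances


def test_returns_feature_importances_from_stored_explanation(tmp_path):
    # Arrange: a minimal stand-in for the repository's real persistence layer.
    explanation_path = tmp_path / "explanation.json"
    explanation_path.write_text(json.dumps({"feature_a": 0.7, "feature_b": 0.3}))

    # Act: call the generated code.
    importances = load_feature_importances(str(explanation_path))

    # Assert: functional behaviour, not string-matching against a reference output.
    assert importances == pytest.approx({"feature_a": 0.7, "feature_b": 0.3})
```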
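
One possible injection and test-runner flow for the last steps is sketched below. It assumes the test server exposes the repository as a plain working directory with pytest installed; the paths, helper names, and directory layout are illustrative, not a settled design:

```python
import json
import subprocess
from pathlib import Path

# Illustrative location of the checked-out ELI5 repository on the test server.
REPO_ROOT = Path("/srv/eli5-test-server")


def evaluate_candidate(problem: dict, generated_code: str) -> bool:
    """Inject one LLM-generated solution into the repo and run its unit tests."""
    # 1. Write the candidate where the problem's test expects to import it from.
    module_name = problem["problem_id"].replace("-", "_") + "_solution.py"
    (REPO_ROOT / module_name).write_text(generated_code, encoding="utf-8")

    # 2. Run only this problem's tests inside the repository environment.
    result = subprocess.run(
        ["pytest", problem["test_file"], "-q"],
        cwd=REPO_ROOT,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # per-problem pass/fail


def pass_rate(dataset_path: str, outputs: dict) -> float:
    """Aggregate pass/fail across the dataset; `outputs` maps problem_id to generated code."""
    with open(dataset_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    passed = sum(evaluate_candidate(r, outputs[r["problem_id"]]) for r in records)
    return passed / len(records)
```

The per-problem boolean result rolls up into the pass/fail rate captured as the basic eval metric.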
Key Questions to Address
- What's the most efficient way to inject LLM-generated code snippets into the server?
- If using open-source projects, do we have proper authorization?
- How can we standardize this process for other AI features requiring functional tests?
Success Criteria
- Generated code is properly tested against real-world infrastructure
- Unit tests can validate the functionality of generated code
- Process is documented