Automatic Evals Dataset: Enterprise Codebase Focused (Code Generation)
Overview
Related issue: #508167 (closed)
This POC aims to create a more realistic testing environment for code generation by simulating real-world enterprise codebases rather than relying on isolated code snippets.
Background
Our current code generation evaluation relies on standalone problem-solution pairs that don't adequately reflect how our customers interact with the feature in enterprise environments. Users typically need generated code that integrates with existing modules and infrastructure (e.g., databases) and follows established patterns within their codebase.
Objectives
- Validate a reverse-engineering approach to dataset creation
- Establish a reproducible methodology for creating enterprise-relevant test scenarios
- Create a functional testing framework for LLM-generated code during evaluations
Proposed Approach
We'll build a POC using:
- A simple public repository (proposed: ELI5)
- A single programming language (proposed: Python)
- Unit tests to validate code functionality
This combination offers an ideal testing environment because:
- ELI5 is a compact "non-production" project with infrastructure similar to our target model
- Python ranks among the most popular languages our customers use with code generation
- Team members already have Python experience
Implementation Plan
- Set up a test server with the selected repository, including the runtimes needed to run functional unit tests
- Use LLMs (e.g., ChatGPT) to generate problem-output pairs based on the actual infrastructure in the test server (see the example record below)
- Use another LLM (e.g., Claude) to develop unit tests for each problem (see the example test below)
- Establish a methodology for injecting the LLM-generated code into the server and running the unit tests against it (see the test-runner sketch below)
- Define and capture relevant evaluation metrics, e.g. pass/fail rate
- Document the process for future scaling
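
As a rough illustration of the problem-output pairs mentioned above, each pair could be stored as one JSONL record. The field names, the example task, and the referenced file paths below are hypothetical and would be refined during the POC:

```python
import json

# Hypothetical record format for a single problem-output pair; every field
# name and value below is illustrative, not a fixed schema.
record = {
    "problem_id": "eli5-0001",
    "repository": "ELI5",
    "language": "python",
    # Natural-language task that references real modules in the test server.
    "problem": (
        "Add a helper that loads a stored explanation from the existing "
        "persistence layer and returns its feature importances."
    ),
    # Files in the repository the generated code is expected to integrate with.
    "context_files": ["eli5/base.py", "eli5/formatters/__init__.py"],
    # Reference solution produced while reverse-engineering the problem.
    "reference_output": "def load_feature_importances(path):\n    ...",
    # Unit test that validates any candidate solution for this problem.
    "test_file": "tests/test_eli5_0001.py",
}

with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```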
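
The unit test developed for the example problem above might look roughly like this (the module name `eli5_0001_solution` and the function `load_feature_importances` are placeholders; the real tests would ideally reuse the repository's own fixtures and helpers):

```python
# Hypothetical pytest test for problem eli5-0001; all names are placeholders.
import json

import pytest

# The injected candidate solution is imported like any other module in the repo.
from eli5_0001_solution import load_feature_importances


def test_returns_feature_importances_from_stored_explanation(tmp_path):
    # Arrange: a minimal stand-in for the repository's real persistence layer.
    explanation_path = tmp_path / "explanation.json"
    explanation_path.write_text(json.dumps({"feature_a": 0.7, "feature_b": 0.3}))

    # Act: call the generated code.
    importances = load_feature_importances(str(explanation_path))

    # Assert: functional behaviour, not string-matching against a reference output.
    assert importances == pytest.approx({"feature_a": 0.7, "feature_b": 0.3})
```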
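
One possible injection and test-runner flow for the last steps is sketched below. It assumes the test server exposes the repository as a plain working directory with pytest installed; the paths, helper names, and directory layout are illustrative, not a settled design:

```python
import json
import subprocess
from pathlib import Path

# Illustrative location of the checked-out ELI5 repository on the test server.
REPO_ROOT = Path("/srv/eli5-test-server")


def evaluate_candidate(problem: dict, generated_code: str) -> bool:
    """Inject one LLM-generated solution into the repo and run its unit tests."""
    # 1. Write the candidate where the problem's test expects to import it from.
    module_name = problem["problem_id"].replace("-", "_") + "_solution.py"
    (REPO_ROOT / module_name).write_text(generated_code, encoding="utf-8")

    # 2. Run only this problem's tests inside the repository environment.
    result = subprocess.run(
        ["pytest", problem["test_file"], "-q"],
        cwd=REPO_ROOT,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0  # per-problem pass/fail


def pass_rate(dataset_path: str, outputs: dict) -> float:
    """Aggregate pass/fail across the dataset; `outputs` maps problem_id to generated code."""
    with open(dataset_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
    passed = sum(evaluate_candidate(r, outputs[r["problem_id"]]) for r in records)
    return passed / len(records)
```

The per-problem boolean result rolls up into the pass/fail rate captured as the basic eval metric.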
Key Questions to Address
- What's the most efficient way to inject LLM-generated code snippets into the server?
- If using open-source projects, do we have proper authorization?
- How can we standardize this process for other AI features requiring functional tests?
Success Criteria
- Generated code is properly tested against real-world infrastructure
- Unit tests can validate the functionality of generated code
- Process is documented