# Build Evaluation Datasets for Duo Code Review
## Overview
We need evaluation datasets to reliably measure Duo Code Review quality and track the impact of prompt or model changes. Currently we lack a systematic way to assess improvements or regressions when making changes to the review agent.
## Current state
We currently change Duo Code Review prompts and models without a reliable way to measure the impact on quality. Other AI features already use evaluation datasets to track performance over time.
## Implementation considerations
* Need to define what constitutes a "good" code review for evaluation purposes (accuracy, actionability, false positive rate, etc.)
* Should collect diverse merge request examples that cover different scenarios (bug fixes, new features, refactors, security issues, performance changes, etc.)
* Need to determine the dataset size required for statistically significant comparisons
* Should track multiple quality dimensions (security awareness, style consistency, test coverage feedback, etc.)
* Need infrastructure to run evaluations automatically when prompts or models change (could be a future iteration)
* Should integrate evaluation results into CI/CD pipeline or development workflow
* Need to handle dataset versioning and updates as product evolves
* Should consider privacy implications of storing merge request examples for evaluation
* Should establish baseline metrics before making future changes
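To make the considerations above concrete, here is a minimal sketch of what a dataset entry and a per-example scorer could look like. All names (`EvalExample`, `score_review`, the field names) are hypothetical, and the substring matching is a deliberate simplification; a real scorer would likely use an LLM judge or semantic similarity.

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    """One merge request example in the evaluation dataset (hypothetical schema)."""
    mr_diff: str                  # the diff under review
    scenario: str                 # e.g. "bug_fix", "refactor", "security"
    expected_findings: list[str]  # issues a good review should flag
    dataset_version: str = "v1"   # track dataset revisions as the product evolves


def score_review(example: EvalExample, review_comments: list[str]) -> dict:
    """Compare generated review comments against expected findings.

    A comment counts as a hit if it mentions an expected finding
    (naive substring match, used here only for illustration).
    """
    hits = [f for f in example.expected_findings
            if any(f.lower() in c.lower() for c in review_comments)]
    false_positives = [c for c in review_comments
                       if not any(f.lower() in c.lower()
                                  for f in example.expected_findings)]
    recall = (len(hits) / len(example.expected_findings)
              if example.expected_findings else 1.0)
    fp_rate = (len(false_positives) / len(review_comments)
               if review_comments else 0.0)
    return {"recall": recall, "false_positive_rate": fp_rate}
```

Tracking recall and false positive rate separately matters because they pull in opposite directions: a review agent that comments on everything maximizes recall while flooding authors with noise.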
## Expected outcome
Implementation of evaluation datasets and measurement infrastructure that:
* Provides reliable quality metrics for Duo Code Review
* Enables comparison of prompt and model changes
* Can be run automatically as part of development process (possible future iteration)
* Covers diverse merge request scenarios
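Comparing a prompt or model change against the established baseline could be as simple as aggregating per-example scores from two evaluation runs. This is a sketch under the assumption that each run produces a list of per-example metric dicts; the function name and shape are illustrative, not an existing API.

```python
from statistics import mean


def compare_runs(baseline: list[dict], candidate: list[dict],
                 metric: str = "recall") -> dict:
    """Aggregate per-example scores from two evaluation runs and report
    the mean change, so a prompt/model change can be judged against the
    baseline before it ships."""
    base = mean(scores[metric] for scores in baseline)
    cand = mean(scores[metric] for scores in candidate)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

In a later iteration this comparison could run in CI on every prompt change, failing the pipeline when `delta` drops below an agreed threshold.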