# Build Evaluation Datasets for Duo Code Review
## Overview

We need evaluation datasets to reliably measure Duo Code Review quality and track the impact of prompt or model changes. Currently we lack a systematic way to assess improvements or regressions when making changes to the review agent.

## Current state

We make changes to Duo Code Review prompts and models without a reliable measurement system for quality impact. Other AI features use evaluation datasets to track performance over time.

## Implementation considerations

* Need to define what constitutes a "good" code review for evaluation purposes (accuracy, actionability, false positive rate, etc.)
* Should collect diverse merge request examples that cover different scenarios (bug fixes, new features, refactors, security issues, performance changes, etc.)
* Need to determine dataset size for statistical significance
* Should track multiple quality dimensions (security awareness, style consistency, test coverage feedback, etc.)
* Need infrastructure to run evaluations automatically when prompts or models change (could be a future iteration)
* Should integrate evaluation results into the CI/CD pipeline or development workflow
* Need to handle dataset versioning and updates as the product evolves
* Should consider privacy implications of storing merge request examples for evaluation
* Should establish baseline metrics before making future changes

## Expected outcome

Implementation of evaluation datasets and measurement infrastructure that:

* Provides reliable quality metrics for Duo Code Review
* Enables comparison of prompt and model changes
* Can be run automatically as part of the development process (possible future iteration)
* Covers diverse merge request scenarios
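As a starting point for discussion, the dataset schema and metrics could be sketched roughly as below. This is a minimal illustration, not a proposed implementation: `ReviewExample`, `score_review`, and `evaluate` are hypothetical names, and "findings" are modeled as simple labeled identifiers, whereas a real dataset would need a richer representation of review comments and fuzzier matching against ground truth.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewExample:
    """One merge request in the evaluation dataset (hypothetical schema)."""
    mr_id: str
    scenario: str                 # e.g. "bug_fix", "refactor", "security"
    diff: str                     # the MR diff under review
    expected_findings: set[str] = field(default_factory=set)  # labeled issue IDs

def score_review(expected: set[str], produced: set[str]) -> dict[str, float]:
    """Precision/recall of produced review findings vs. labeled ground truth."""
    true_positives = len(expected & produced)
    precision = true_positives / len(produced) if produced else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return {
        "precision": precision,
        "recall": recall,
        # Share of produced findings with no matching label, i.e. false positives.
        "false_positive_rate": 1.0 - precision if produced else 0.0,
    }

def evaluate(dataset: list[ReviewExample], review_fn) -> dict[str, float]:
    """Average metrics across the dataset; review_fn maps a diff to findings."""
    totals = {"precision": 0.0, "recall": 0.0, "false_positive_rate": 0.0}
    for example in dataset:
        metrics = score_review(example.expected_findings, review_fn(example.diff))
        for key in totals:
            totals[key] += metrics[key]
    n = len(dataset) or 1
    return {key: value / n for key, value in totals.items()}
```

A harness in this shape would let a CI job run `evaluate` against the current prompts and compare the aggregate metrics to a stored baseline, flagging regressions before a change merges.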