Build Evaluation Datasets for Duo Code Review

Overview

We need evaluation datasets to reliably measure Duo Code Review quality and track the impact of prompt or model changes. Currently we lack a systematic way to assess improvements or regressions when making changes to the review agent.

Current state

We make changes to Duo Code Review prompts and models without a reliable measurement system for quality impact. Other AI features use evaluation datasets to track performance over time.

Implementation considerations

  • Need to define what constitutes a "good" code review for evaluation purposes (accuracy, actionability, false positive rate, etc.)
  • Should collect diverse merge request examples that cover different scenarios (bug fixes, new features, refactors, security issues, performance changes, etc.)
  • Need to determine the dataset size required for statistically significant comparisons
  • Should track multiple quality dimensions (security awareness, style consistency, test coverage feedback, etc.)
  • Need infrastructure to run evaluations automatically when prompts or models change (could be a future iteration)
  • Should integrate evaluation results into CI/CD pipeline or development workflow
  • Need to handle dataset versioning and updates as product evolves
  • Should consider privacy implications of storing merge request examples for evaluation
  • Should establish baseline metrics before making future changes
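As a starting point for the scoring discussion above, here is a minimal sketch of what a dataset entry and its quality metrics could look like. All names (`EvalExample`, `scenario`, `expected_findings`, etc.) are hypothetical; the real schema would follow whatever dataset format we settle on, and a real harness would likely use an LLM judge or fuzzy matching rather than exact-match scoring:

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    """One merge request in the evaluation dataset (hypothetical schema)."""
    mr_id: str
    scenario: str                 # e.g. "bug_fix", "refactor", "security"
    diff: str                     # the MR diff the review agent sees
    expected_findings: list[str]  # issues a good review should surface


@dataclass
class EvalResult:
    """Counts for one agent run against one example."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        # High precision == low false positive rate (noisy reviews score poorly).
        found = self.true_positives + self.false_positives
        return self.true_positives / found if found else 0.0

    @property
    def recall(self) -> float:
        # How many of the expected findings the review actually surfaced.
        expected = self.true_positives + self.false_negatives
        return self.true_positives / expected if expected else 0.0


def score(expected: list[str], produced: list[str]) -> EvalResult:
    """Naive exact-match scoring, purely to illustrate the metric shapes."""
    tp = len(set(expected) & set(produced))
    return EvalResult(
        true_positives=tp,
        false_positives=len(produced) - tp,
        false_negatives=len(expected) - tp,
    )
```

Tracking precision and recall separately matters here: a review agent can trivially maximize recall by flagging everything, which is exactly the false-positive noise the considerations above call out.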

Expected outcome

Evaluation datasets and measurement infrastructure that:

  • Provides reliable quality metrics for Duo Code Review
  • Enables comparison of prompt and model changes
  • Can be run automatically as part of development process (possible future iteration)
  • Covers diverse merge request scenarios
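One way to make the "comparison of prompt and model changes" concrete is a paired significance test over per-example scores from the baseline and candidate runs. A sketch using a two-sided sign test (stdlib only; no specific evaluation harness assumed):

```python
from math import comb


def sign_test_p(baseline: list[float], candidate: list[float]) -> float:
    """Two-sided sign test over paired per-example scores.

    Returns the probability of seeing a win/loss split at least this
    lopsided by chance if the candidate were equivalent to the baseline.
    Ties (equal scores on an example) are dropped, as is standard.
    """
    wins = sum(c > b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    n = wins + losses
    if n == 0:
        return 1.0  # all ties: no evidence either way
    k = max(wins, losses)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

This is one reason the dataset-size consideration above matters: with only a handful of examples, even a candidate that wins every comparison cannot reach a convincing p-value.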