Build Evaluation Datasets for Duo Code Review

Overview

We need evaluation datasets to reliably measure Duo Code Review quality and track the impact of prompt or model changes. Currently we lack a systematic way to assess improvements or regressions when making changes to the review agent.

Current state

We make changes to Duo Code Review prompts and models without a reliable measurement system for quality impact. Other AI features use evaluation datasets to track performance over time.

Implementation considerations

  • Need to define what constitutes a "good" code review for evaluation purposes (accuracy, actionability, false positive rate, etc.)
  • Should collect diverse merge request examples that cover different scenarios (bug fixes, new features, refactors, security issues, performance changes, etc.)
  • Need to determine the dataset size required for statistically significant comparisons
  • Should track multiple quality dimensions (security awareness, style consistency, test coverage feedback, etc.)
  • Need infrastructure to run evaluations automatically when prompts or models change (could be a future iteration)
  • Should integrate evaluation results into CI/CD pipeline or development workflow
  • Need to handle dataset versioning and updates as product evolves
  • Should consider privacy implications of storing merge request examples for evaluation
  • Should establish baseline metrics before making future changes
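As a starting point for the scoring discussion above, here is a minimal sketch of what a dataset entry and its quality metrics could look like. All names (`EvalExample`, `scenario`, `expected_findings`, etc.) are hypothetical; the real schema would follow whatever dataset format we settle on, and a real harness would likely use an LLM judge or fuzzy matching rather than exact-match scoring:

```python
from dataclasses import dataclass


@dataclass
class EvalExample:
    """One merge request in the evaluation dataset (hypothetical schema)."""
    mr_id: str
    scenario: str                 # e.g. "bug_fix", "refactor", "security"
    diff: str                     # the MR diff the review agent sees
    expected_findings: list[str]  # issues a good review should surface


@dataclass
class EvalResult:
    """Counts for one agent run against one example."""
    true_positives: int
    false_positives: int
    false_negatives: int

    @property
    def precision(self) -> float:
        # High precision == low false positive rate (noisy reviews score poorly).
        found = self.true_positives + self.false_positives
        return self.true_positives / found if found else 0.0

    @property
    def recall(self) -> float:
        # How many of the expected findings the review actually surfaced.
        expected = self.true_positives + self.false_negatives
        return self.true_positives / expected if expected else 0.0


def score(expected: list[str], produced: list[str]) -> EvalResult:
    """Naive exact-match scoring, purely to illustrate the metric shapes."""
    tp = len(set(expected) & set(produced))
    return EvalResult(
        true_positives=tp,
        false_positives=len(produced) - tp,
        false_negatives=len(expected) - tp,
    )
```

Tracking precision and recall separately matters here: a review agent can trivially maximize recall by flagging everything, which is exactly the false-positive noise the considerations above call out.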

Expected outcome

Evaluation datasets and measurement infrastructure that:

  • Provides reliable quality metrics for Duo Code Review
  • Enables comparison of prompt and model changes
  • Can be run automatically as part of development process (possible future iteration)
  • Covers diverse merge request scenarios
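One way to make the "comparison of prompt and model changes" concrete is a paired significance test over per-example scores from the baseline and candidate runs. A sketch using a two-sided sign test (stdlib only; no specific evaluation harness assumed):

```python
from math import comb


def sign_test_p(baseline: list[float], candidate: list[float]) -> float:
    """Two-sided sign test over paired per-example scores.

    Returns the probability of seeing a win/loss split at least this
    lopsided by chance if the candidate were equivalent to the baseline.
    Ties (equal scores on an example) are dropped, as is standard.
    """
    wins = sum(c > b for b, c in zip(baseline, candidate))
    losses = sum(c < b for b, c in zip(baseline, candidate))
    n = wins + losses
    if n == 0:
        return 1.0  # all ties: no evidence either way
    k = max(wins, losses)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

This is one reason the dataset-size consideration above matters: with only a handful of examples, even a candidate that wins every comparison cannot reach a convincing p-value.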