Consolidate: Evaluation Dataset Recommendation for Duo Code Review
Parent: Survey: External Evaluation Datasets and Benchmarks for Duo Code Review
Related: #588932
This consolidation incorporates findings from 6 individual dataset analyses, a hands-on inspection of every record in the Qodo dataset, and a statistical deep-dive into its distributions and biases.
Goal
Synthesize findings from the survey and recommend a concrete approach for building evaluation datasets for Duo Code Review.
Side-by-Side Comparison
Dataset x Quality Property Coverage
| Dataset | Bug Detection (P/R) | Comment Quality | Cross-File | Security | Languages | Custom Instructions | Scale |
|---|---|---|---|---|---|---|---|
| CodeFuse-CR-Bench | Yes (dual eval) | Yes (model-based) | Yes (full repo) | Partial | Python only | No | 601 |
| Qodo PR-Review-Bench | Yes (injection) | Partial (location) | Partial | Present (unlabeled) | 7 languages | Yes (134 rules) | 580 issues |
| Augment Benchmark | Yes (golden comments) | Implicit | Yes (complex repos) | No | 5 languages | No | 50 |
| CodeReviewer (MSFT) | No labels | No | No (method-level) | No | 9 languages | No | 1M+ |
| Greptile Benchmark | Recall only | Yes (impact req) | Implicit | No | 5 languages | No | 50 |
| CROP | No labels | No | Possible (has data) | No | 5 languages | No | 51K |
Verdict Summary
| Dataset | Verdict | What to take |
|---|---|---|
| CodeFuse-CR-Bench | Adapt + borrow | Repo-level context format, dual evaluation framework, problem domain taxonomy |
| Qodo PR-Review-Bench | Use as primary evaluation dataset | Location-level ground truth, 134 structured rules, dual evaluation (bugs + rules), 9 competitive tool baselines |
| Augment Benchmark | Supplementary (cross-validation) | Severity-weighted scoring methodology, competitive benchmarks, cross-validation signal |
| CodeReviewer (MSFT) | Reference only | 3-task decomposition, data quality lessons ("Too Noisy To Learn") |
| Greptile Benchmark | Borrow methodology | Real-bug-in-reverse technique for future GitLab-native data |
| CROP | Reference only | Codebase-linked review data concept, multi-revision tracking |
Key Findings
1. Methodology is more transferable than data
No external dataset can be dropped into our evaluation pipeline as-is. But the construction methodologies (bug injection, real-bug-in-reverse, golden comments, dual evaluation) are directly applicable to building our own.
2. No dataset tests GitLab-specific quality properties
For 5 of our 8 quality properties, no external dataset provides full coverage:
| Quality Property | External Coverage | Gap? |
|---|---|---|
| Bug detection precision | Yes (Qodo, Augment, CodeFuse) | No |
| Bug detection recall | Yes (Qodo, Greptile, Augment) | No |
| Comment specificity | Yes (Qodo hit def, Greptile line-level, CodeFuse rule-based) | No |
| Comment actionability | Partial (Greptile impact req, CodeFuse model-based) | Partial |
| Cross-file reasoning | Partial (CodeFuse full repo, Augment complex repos) | Partial |
| Security awareness | No | Yes |
| Custom instruction compliance | No | Yes |
| Robustness / injection resistance | No | Yes |
3. Context depth is the differentiator
Augment's key insight ("the defining challenge in AI code review isn't generation, it's context") is validated across the survey. Datasets with full repository context (CodeFuse) or complex repos (Augment, Greptile) produce more meaningful evaluations than method-level datasets (CodeReviewer).
Decision: Qodo PR-Review-Bench as Primary Dataset
After analyzing all 6 datasets hands-on, Qodo PR-Review-Bench is the clear choice for our primary evaluation dataset. No other public dataset comes close on the two dimensions that matter most: location-level ground truth and structured coding rules.
Why Qodo
1. Location-level ground truth (unique among public datasets)
94% of issues (544/580) include exact file_path + start_line + end_line + code_snippet. This enables triple-requirement scoring: file match + line range overlap + semantic description match. No other dataset provides this.
The schema difference is decisive:
Augment ground truth (2 fields, no file/line):
```json
{"comment": "The function modifies config but returns original monitor.config...", "severity": "High"}
```
Qodo ground truth (7 fields with exact location):
```json
{
  "title": "JWT signature validation bypassed",
  "description": "The JWT verification function was modified to skip signature validation...",
  "file_path": "ghost/core/core/boot.js",
  "start_line": 351,
  "end_line": 377,
  "code_snippet": "await Promise.all([...",
  "rule_name": "NONE"
}
```
This means 2 of the 3 scoring dimensions (file match, line overlap) are fully deterministic. Only semantic matching requires an LLM judge. With Augment, all scoring requires LLM judgment.
2. Dual evaluation structure (bugs + rules)
The dataset contains two distinct issue types that test different quality properties in a single pass:
- 309 bug/logic issues (53%): Missing functionality, incorrect logic, race conditions, auth bypasses, null/type errors, memory leaks. Bug descriptions are impact-oriented (mean 557 chars).
- 271 rule violations (47%): Linked to 134 per-repo coding rules, each with explicit `success_criteria` and `failure_criteria`. Maps directly to GitLab's `.gitlab/duo/mr-review-instructions.yaml` custom instructions.
No other dataset tests both bug detection AND instruction-following.
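The bug-vs-rule split can be reproduced mechanically from the record schema shown earlier. A minimal sketch, assuming rule violations carry a non-`NONE` `rule_name` while bug/logic issues use `"NONE"` (as in the sample record above):

```python
def split_issue_types(issues):
    """Partition Qodo ground-truth records into the two evaluation tracks:
    bug/logic issues (rule_name == "NONE") and rule violations (named rule)."""
    bugs = [i for i in issues if i.get("rule_name", "NONE") == "NONE"]
    rules = [i for i in issues if i.get("rule_name", "NONE") != "NONE"]
    return bugs, rules
```

On the full dataset this should yield the 309/271 split reported above, which is an easy sanity check after loading.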
3. Scale and diversity
| Metric | Value |
|---|---|
| Total PRs | 100 |
| Total issues | 580 |
| Repositories | 8 (Ghost, cal.com, dify, firefox-ios, prefect, tauri, aspnetcore, redis) |
| Languages | JavaScript, TypeScript, Python, Swift, Rust, C#, C |
| Issues per PR | 3-15 (mean 5.8, median 5) |
| Unique rules | 134 |
| License | MIT |
4. Competitive benchmarking
9 tools already evaluated on this dataset, including Augment, Cursor, GitHub Copilot, Greptile, Codex, CodeRabbit, and Sentry. Qodo's best F1: 60.1% (exhaustive mode). Running Duo against the same 100 PRs provides immediate competitive positioning.
5. Full reproducibility
All 100 PRs are open on GitHub (agentic-review-benchmarks org). Raw data is on HuggingFace under MIT license. Any result can be independently verified.
Statistical profile (from deep-dive analysis)
| Dimension | Distribution |
|---|---|
| Location precision | 51% narrow (1-5 lines), 36% medium (6-20 lines), 7% broad (21+ lines) |
| Completeness | 94% have full location data, 100% have code snippets, 100% have descriptions |
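The breadth buckets in the profile can be derived directly from `start_line`/`end_line`. A minimal sketch, assuming inclusive line ranges as in the sample record:

```python
def location_breadth(start_line, end_line):
    """Bucket an issue's line span as in the statistical profile:
    narrow (1-5 lines), medium (6-20), broad (21+)."""
    span = end_line - start_line + 1  # inclusive range
    if span <= 5:
        return "narrow"
    if span <= 20:
        return "medium"
    return "broad"
```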
Known biases and mitigations
| Bias | Impact | Mitigation |
|---|---|---|
| cal.com overrepresented (16 PRs, 108 issues vs redis 9 PRs, 44 issues) | Per-repo metrics may be skewed | Report per-repo AND overall metrics; normalize by repo |
| "Biome formatting" rule appears 23x in cal.com | Inflates rule violation recall | Deduplicate to 3 representative instances per high-frequency rule |
| Bug description CV is 0.13 (suspiciously uniform) | May indicate template-generated descriptions | Acceptable for evaluation; note in methodology |
| 36 issues without complete location data | Can't be scored on location accuracy | Use semantic-only scoring track for these |
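The deduplication mitigation can be sketched as a simple per-rule cap (the 3-instance limit comes from the table above; the function name is illustrative):

```python
from collections import defaultdict

def cap_rule_instances(issues, max_per_rule=3):
    """Keep at most `max_per_rule` instances of each named rule violation
    (e.g. the 23x "Biome formatting" rule in cal.com). Bug/logic issues
    (rule_name == "NONE") are never dropped."""
    seen = defaultdict(int)
    kept = []
    for issue in issues:
        rule = issue.get("rule_name", "NONE")
        if rule != "NONE":
            seen[rule] += 1
            if seen[rule] > max_per_rule:
                continue  # skip instances beyond the cap
        kept.append(issue)
    return kept
```

Capping (rather than dropping the rule entirely) preserves the rule in the evaluation while preventing one formatting rule from dominating recall.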
Preprocessing completed
The following preprocessing has been done:
- Evaluation-ready JSONL: 580 records with full schema (`qodo_eval_dataset.jsonl`)
- Flat CSV: same 580 records for quick inspection (`qodo_eval_dataset.csv`)
- 12-sheet xlsx analysis: overview, all issues, per-repo sheets, rules, key insights
- Statistical deep-dive: distribution analysis, bias detection, preprocessing recommendations
For details, see the hands-on deep-dive (08) and statistical analysis (10).
Supplementary Datasets
Augment Golden Comments (cross-validation)
Augment's dataset is not a primary evaluation source, but it provides complementary value for cross-validation:
- 50 PRs across 5 large repos (Sentry, Grafana, Cal.com, Discourse, Keycloak)
- 137 human-curated golden issues with severity labels (HIGH/LOW)
- 7 tools benchmarked (best: Augment 59% F-score)
- Running Duo against both Qodo (100 PRs) and Augment (50 PRs) provides two independent quality signals
Limitation: No file/line locations. All scoring is semantic-only (LLM-as-judge), making results less deterministic than Qodo.
Greptile Methodology (for future GitLab-native data)
Greptile's real-bug-in-reverse technique should be applied to build future evaluation data from GitLab's own repositories:
- Find merged MRs that fix known bugs
- Recreate the pre-fix state as an evaluation MR
- Produces highest-signal ground truth (real bugs, not synthetic)
- Use for expanding beyond Qodo's 100 PRs
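The steps above amount to a record transform over merged bug-fix MRs. A minimal sketch, where field names such as `parent_sha` and `bug_description` are illustrative placeholders, not a GitLab API schema:

```python
def mr_to_eval_case(fix_mr):
    """Turn a merged bug-fix MR record into an evaluation case:
    replay the MR from the pre-fix parent commit, and use the bug
    the fix removed as the golden comment (real bug, not synthetic)."""
    return {
        "base_sha": fix_mr["parent_sha"],             # pre-fix state to review
        "files": fix_mr["files_changed"],             # where the bug lived
        "golden_comment": fix_mr["bug_description"],  # what a reviewer should catch
        "source_mr": fix_mr["mr_id"],                 # provenance for auditing
    }
```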
Scoring Framework
Dual-track evaluation combining strengths of all three datasets:
- Location-based scoring (544 Qodo issues with location): file match + line range overlap + semantic description match
- Semantic-only scoring (36 Qodo issues without location + all 137 Augment issues): LLM-judge semantic matching
- Per-dimension reporting: P/R/F1 by repo, issue type (bug vs rule)
Hit definition
A tool comment counts as a "hit" (true positive) only if it satisfies:
- File match: Comment references the same file as ground truth
- Line range overlap: Comment's line range overlaps with ground truth (at least 1 line)
- Semantic match: LLM judge confirms the comment describes the same underlying issue
For issues without location data, only requirement 3 applies.
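A minimal sketch of this hit predicate, with the LLM judge abstracted as a boolean callable (`semantic_judge` is a placeholder, not a real API):

```python
def line_ranges_overlap(a_start, a_end, b_start, b_end):
    """True if the two inclusive line ranges share at least one line."""
    return a_start <= b_end and b_start <= a_end

def is_hit(comment, ground_truth, semantic_judge):
    """Triple-requirement hit: file match + line overlap + semantic match.
    For ground truth without location data, only the semantic check applies."""
    if ground_truth.get("file_path") is None:
        return semantic_judge(comment, ground_truth)
    if comment["file_path"] != ground_truth["file_path"]:
        return False  # requirement 1: file match
    if not line_ranges_overlap(comment["start_line"], comment["end_line"],
                               ground_truth["start_line"], ground_truth["end_line"]):
        return False  # requirement 2: line range overlap
    return semantic_judge(comment, ground_truth)  # requirement 3: same issue
```

The first two checks are deterministic, so only comments that pass them ever reach the (slower, noisier) LLM judge.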
Gap-Filling Needed
Properties not covered by any external dataset — must build ourselves:
| Gap | Target | Source |
|---|---|---|
| Security awareness | 5-10 MRs with known vulnerabilities (SQLi, XSS, auth bypass) | Build from OWASP patterns |
| Custom instruction compliance | 5-10 MRs with `.gitlab/duo/mr-review-instructions.yaml` | Adapt Qodo's 134 rules |
| Adversarial robustness | 3-5 MRs with prompt injection in code/comments | Build from known attack patterns |
Next Steps
Survey phase (this epic) — complete
- 6 dataset deep-dive analyses posted
- Qodo PR-Review-Bench hands-on deep-dive posted
- This consolidation and recommendation posted
Execution phase (#588932) — starting
- Post statistical deep-dive as issue under #588932
- Post dataset collection plan as issue under #588932
- Run Duo Code Review against 100 Qodo PRs
- Run Duo against 50 Augment PRs (cross-validation)
- Build security, custom instruction, and adversarial test cases
- Upload to LangSmith and integrate into CEF
- Establish baseline P/R/F1 metrics