Consolidate: Evaluation Dataset Recommendation for Duo Code Review

Parent: Survey: External Evaluation Datasets and Benchmarks for Duo Code Review
Related: #588932

This consolidation incorporates findings from 6 individual dataset analyses, a hands-on inspection of every record in the Qodo dataset, and a statistical deep-dive into its distributions and biases.


Goal

Synthesize findings from the survey and recommend a concrete approach for building evaluation datasets for Duo Code Review.


Side-by-Side Comparison

Dataset x Quality Property Coverage

| Dataset | Bug Detection (P/R) | Comment Quality | Cross-File | Security | Languages | Custom Instructions | Scale |
|---|---|---|---|---|---|---|---|
| CodeFuse-CR-Bench | Yes (dual eval) | Yes (model-based) | Yes (full repo) | Partial | Python only | No | 601 |
| Qodo PR-Review-Bench | Yes (injection) | Partial (location) | Partial | Present (unlabeled) | 7 languages | Yes (134 rules) | 580 issues |
| Augment Benchmark | Yes (golden comments) | Implicit | Yes (complex repos) | No | 5 languages | No | 50 |
| CodeReviewer (MSFT) | No labels | No | No (method-level) | No | 9 languages | No | 1M+ |
| Greptile Benchmark | Recall only | Yes (impact req) | Implicit | No | 5 languages | No | 50 |
| CROP | No labels | No | Possible (has data) | No | 5 languages | No | 51K |

Verdict Summary

| Dataset | Verdict | What to take |
|---|---|---|
| CodeFuse-CR-Bench | Adapt + borrow | Repo-level context format, dual evaluation framework, problem domain taxonomy |
| Qodo PR-Review-Bench | Use as primary evaluation dataset | Location-level ground truth, 134 structured rules, dual evaluation (bugs + rules), 9 competitive tool baselines |
| Augment Benchmark | Supplementary (cross-validation) | Severity-weighted scoring methodology, competitive benchmarks, cross-validation signal |
| CodeReviewer (MSFT) | Reference only | 3-task decomposition, data quality lessons ("Too Noisy To Learn") |
| Greptile Benchmark | Borrow methodology | Real-bug-in-reverse technique for future GitLab-native data |
| CROP | Reference only | Codebase-linked review data concept, multi-revision tracking |

Key Findings

1. Methodology is more transferable than data

No external dataset can be dropped into our evaluation pipeline as-is. But the construction methodologies (bug injection, real-bug-in-reverse, golden comments, dual evaluation) are directly applicable to building our own.

2. No dataset tests GitLab-specific quality properties

5 of our 8 quality properties are not fully covered by any external dataset (3 complete gaps, 2 partial):

| Quality Property | External Coverage | Gap? |
|---|---|---|
| Bug detection precision | Yes (Qodo, Augment, CodeFuse) | No |
| Bug detection recall | Yes (Qodo, Greptile, Augment) | No |
| Comment specificity | Yes (Qodo hit def, Greptile line-level, CodeFuse rule-based) | No |
| Comment actionability | Partial (Greptile impact req, CodeFuse model-based) | Partial |
| Cross-file reasoning | Partial (CodeFuse full repo, Augment complex repos) | Partial |
| Security awareness | No | Yes |
| Custom instruction compliance | No | Yes |
| Robustness / injection resistance | No | Yes |

3. Context depth is the differentiator

Augment's key insight ("the defining challenge in AI code review isn't generation, it's context") is validated across the survey. Datasets with full repository context (CodeFuse) or complex repos (Augment, Greptile) produce more meaningful evaluations than method-level datasets (CodeReviewer).


Decision: Qodo PR-Review-Bench as Primary Dataset

After analyzing all 6 datasets hands-on, Qodo PR-Review-Bench is the clear choice for our primary evaluation dataset. No other public dataset comes close on the two dimensions that matter most: location-level ground truth and structured coding rules.

Why Qodo

1. Location-level ground truth (unique among public datasets)

94% of issues (544/580) include exact file_path + start_line + end_line + code_snippet. This enables triple-requirement scoring: file match + line range overlap + semantic description match. No other dataset provides this.

The schema difference is decisive:

Augment ground truth (2 fields, no file/line):

```json
{"comment": "The function modifies config but returns original monitor.config...", "severity": "High"}
```

Qodo ground truth (7 fields with exact location):

```json
{
  "title": "JWT signature validation bypassed",
  "description": "The JWT verification function was modified to skip signature validation...",
  "file_path": "ghost/core/core/boot.js",
  "start_line": 351,
  "end_line": 377,
  "code_snippet": "await Promise.all([...",
  "rule_name": "NONE"
}
```

This means 2 of the 3 scoring dimensions (file match, line overlap) are fully deterministic. Only semantic matching requires an LLM judge. With Augment, all scoring requires LLM judgment.

2. Dual evaluation structure (bugs + rules)

The dataset contains two distinct issue types that test different quality properties in a single pass:

  • 309 bug/logic issues (53%): Missing functionality, incorrect logic, race conditions, auth bypasses, null/type errors, memory leaks. Bug descriptions are impact-oriented (mean 557 chars).
  • 271 rule violations (47%): Linked to 134 per-repo coding rules, each with explicit success_criteria and failure_criteria. Maps directly to GitLab's .gitlab/duo/mr-review-instructions.yaml custom instructions.

No other dataset tests both bug detection AND instruction-following.
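The mapping from Qodo rules to custom instructions can be sketched as follows. This is a minimal illustration only: the input field names (`rule_name`, `success_criteria`, `failure_criteria`) follow the Qodo schema described above, but the output shape is a simplified stand-in for an `.gitlab/duo/mr-review-instructions.yaml` entry, not the real GitLab schema.

```python
def rule_to_instruction(rule: dict) -> dict:
    """Map a Qodo-style rule record to a simplified custom-instruction entry.

    The output keys are illustrative; the actual mr-review-instructions.yaml
    schema should be taken from GitLab documentation.
    """
    return {
        "name": rule["rule_name"],
        "instructions": (
            f"Flag code where: {rule['failure_criteria']} "
            f"Accept code where: {rule['success_criteria']}"
        ),
    }

# Hypothetical rule record, shaped like the per-repo rules described above:
example_rule = {
    "rule_name": "biome-formatting",
    "success_criteria": "code is formatted per the repo's Biome config",
    "failure_criteria": "code deviates from the repo's Biome config",
}
entry = rule_to_instruction(example_rule)
```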

3. Scale and diversity

| Metric | Value |
|---|---|
| Total PRs | 100 |
| Total issues | 580 |
| Repositories | 8 (Ghost, cal.com, dify, firefox-ios, prefect, tauri, aspnetcore, redis) |
| Languages | JavaScript, TypeScript, Python, Swift, Rust, C#, C |
| Issues per PR | 3-15 (mean 5.8, median 5) |
| Unique rules | 134 |
| License | MIT |

4. Competitive benchmarking

9 tools already evaluated on this dataset, including Augment, Cursor, GitHub Copilot, Greptile, Codex, CodeRabbit, and Sentry. Qodo's best F1: 60.1% (exhaustive mode). Running Duo against the same 100 PRs provides immediate competitive positioning.
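The headline metric is standard precision/recall/F1 over hit counts. A minimal sketch for reproducing it (the hit counts below are illustrative, not benchmark results; Qodo's exact aggregation protocol may differ in detail):

```python
def precision_recall_f1(hits: int, tool_comments: int, ground_truth: int):
    """P = hits / tool comments, R = hits / ground-truth issues,
    F1 = harmonic mean of P and R."""
    precision = hits / tool_comments if tool_comments else 0.0
    recall = hits / ground_truth if ground_truth else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# Illustrative numbers only: 300 hits out of 450 comments against 580 issues.
p, r, f1 = precision_recall_f1(hits=300, tool_comments=450, ground_truth=580)
```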

5. Full reproducibility

All 100 PRs are open on GitHub (agentic-review-benchmarks org). Raw data is on HuggingFace under MIT license. Any result can be independently verified.

Statistical profile (from deep-dive analysis)

| Dimension | Distribution |
|---|---|
| Location precision | 51% narrow (1-5 lines), 36% medium (6-20 lines), 7% broad (21+ lines); the remaining 6% lack location data |
| Completeness | 94% have full location data, 100% have code snippets, 100% have descriptions |
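The location-precision bins above can be reproduced from the `start_line`/`end_line` fields. A small sketch; the bin edges (1-5, 6-20, 21+) come from the table, and treating both endpoints as inclusive is an assumption:

```python
def location_breadth(start_line: int, end_line: int) -> str:
    """Bucket an issue's line span into the narrow/medium/broad bins
    used in the statistical profile (inclusive span length)."""
    span = end_line - start_line + 1
    if span <= 5:
        return "narrow"
    if span <= 20:
        return "medium"
    return "broad"
```

For example, the JWT issue shown earlier (lines 351-377, a 27-line span) lands in the "broad" bin.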

Known biases and mitigations

| Bias | Impact | Mitigation |
|---|---|---|
| cal.com overrepresented (16 PRs, 108 issues vs. redis's 9 PRs, 44 issues) | Per-repo metrics may be skewed | Report per-repo AND overall metrics; normalize by repo |
| "Biome formatting" rule appears 23x in cal.com | Inflates rule violation recall | Deduplicate to 3 representative instances per high-frequency rule |
| Bug description CV is 0.13 (suspiciously uniform) | May indicate template-generated descriptions | Acceptable for evaluation; note in methodology |
| 36 issues without file paths | Can't be scored on location accuracy | Use semantic-only scoring track for these |
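The high-frequency-rule mitigation can be sketched as a simple cap per rule. Assumptions: rule violations carry a `rule_name` field and bug issues carry `rule_name == "NONE"` (per the Qodo schema shown earlier); which 3 instances are "representative" is left as first-come here, whereas the real preprocessing may select them differently:

```python
from collections import defaultdict

def dedup_rule_issues(issues: list[dict], cap: int = 3) -> list[dict]:
    """Keep at most `cap` issues per rule_name; bug issues
    (rule_name == 'NONE') pass through untouched."""
    kept, seen = [], defaultdict(int)
    for issue in issues:
        rule = issue.get("rule_name", "NONE")
        if rule == "NONE":
            kept.append(issue)
        elif seen[rule] < cap:
            seen[rule] += 1
            kept.append(issue)
    return kept
```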

Preprocessing completed

The following preprocessing has been done:

  • Evaluation-ready JSONL: 580 records with full schema (qodo_eval_dataset.jsonl)
  • Flat CSV: same 580 records for quick inspection (qodo_eval_dataset.csv)
  • 12-sheet xlsx analysis: overview, all issues, per-repo sheets, rules, key insights
  • Statistical deep-dive: distribution analysis, bias detection, preprocessing recommendations

For details, see the hands-on deep-dive (08) and statistical analysis (10).
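Consuming the evaluation-ready JSONL splits naturally into the two scoring tracks. A minimal sketch; the field names follow the Qodo schema shown earlier, but how missing locations are encoded in qodo_eval_dataset.jsonl (null vs. absent fields) is an assumption:

```python
import json

def split_tracks(jsonl_lines):
    """Split evaluation records into the location-based and semantic-only
    scoring tracks, keyed on whether a record has full location data."""
    located, semantic_only = [], []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("file_path") and rec.get("start_line") is not None:
            located.append(rec)
        else:
            semantic_only.append(rec)
    return located, semantic_only
```

Usage: `located, semantic_only = split_tracks(open("qodo_eval_dataset.jsonl"))`.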


Supplementary Datasets

Augment Golden Comments (cross-validation)

Augment's dataset provides complementary value, not as a primary evaluation source, but for cross-validation:

  • 50 PRs across 5 large repos (Sentry, Grafana, Cal.com, Discourse, Keycloak)
  • 137 human-curated golden issues with severity labels (HIGH/LOW)
  • 7 tools benchmarked (best: Augment 59% F-score)
  • Running Duo against both Qodo (100 PRs) and Augment (50 PRs) provides two independent quality signals

Limitation: No file/line locations. All scoring is semantic-only (LLM-as-judge), making results less deterministic than Qodo.

Greptile Methodology (for future GitLab-native data)

Greptile's real-bug-in-reverse technique should be applied to build future evaluation data from GitLab's own repositories:

  • Find merged MRs that fix known bugs
  • Recreate the pre-fix state as an evaluation MR
  • Produces highest-signal ground truth (real bugs, not synthetic)
  • Use for expanding beyond Qodo's 100 PRs
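The core git operation in the steps above is small: given the SHA of a merged bug-fix commit, branch off its first parent so the known bug is present again. A sketch that builds (but does not run) the command; the branch naming is illustrative, and finding candidate fix commits in the first place is left to repo-specific tooling:

```python
def prefix_checkout_cmd(repo_path: str, fix_commit: str) -> list[str]:
    """Build the `git switch` command that recreates the pre-fix state:
    a new branch at the fix commit's first parent (fix_commit^)."""
    branch = f"eval/pre-fix-{fix_commit[:8]}"  # illustrative naming scheme
    return ["git", "-C", repo_path, "switch", "-c", branch, f"{fix_commit}^"]
```

Running this command (e.g. via `subprocess.run(cmd, check=True)`) yields a branch from which the evaluation MR can be opened.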

Scoring Framework

Dual-track evaluation combining strengths of all three datasets:

  1. Location-based scoring (544 Qodo issues with location): file match + line range overlap + semantic description match
  2. Semantic-only scoring (36 Qodo issues without location + all 137 Augment issues): LLM-judge semantic matching
  3. Per-dimension reporting: P/R/F1 by repo, issue type (bug vs rule)

Hit definition

A tool comment counts as a "hit" (true positive) only if it satisfies:

  1. File match: Comment references the same file as ground truth
  2. Line range overlap: Comment's line range overlaps with ground truth (at least 1 line)
  3. Semantic match: LLM judge confirms the comment describes the same underlying issue

For issues without location data, only requirement 3 applies.
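The hit definition above can be sketched directly. Requirements 1 and 2 are deterministic; requirement 3 is delegated to a judge callable, stubbed here since the LLM-judge prompt is out of scope. Field names follow the Qodo schema shown earlier; comment fields (`body`, `start_line`, `end_line`) are illustrative:

```python
def is_hit(comment: dict, truth: dict, same_issue) -> bool:
    """Triple-requirement hit test. `same_issue` stands in for the LLM
    judge: a callable returning True when two texts describe the same
    underlying issue. Ground-truth records without a file_path fall back
    to the semantic check alone (the semantic-only track)."""
    if truth.get("file_path"):
        if comment.get("file_path") != truth["file_path"]:
            return False  # requirement 1: same file
        ranges_overlap = (comment["start_line"] <= truth["end_line"]
                          and truth["start_line"] <= comment["end_line"])
        if not ranges_overlap:
            return False  # requirement 2: at least 1 shared line
    return same_issue(comment["body"], truth["description"])  # requirement 3
```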


Gap-Filling Needed

Properties not covered by any external dataset — must build ourselves:

| Gap | Target | Source |
|---|---|---|
| Security awareness | 5-10 MRs with known vulnerabilities (SQLi, XSS, auth bypass) | Build from OWASP patterns |
| Custom instruction compliance | 5-10 MRs with `.gitlab/duo/mr-review-instructions.yaml` | Adapt Qodo's 134 rules |
| Adversarial robustness | 3-5 MRs with prompt injection in code/comments | Build from known attack patterns |
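To make the security-awareness target concrete, here is the kind of change such an MR could contain: a hypothetical diff replacing a parameterized query with string interpolation, a classic SQL injection (OWASP A03) that the reviewer should flag. Function and table names are invented for illustration:

```python
import sqlite3

# Pre-change state: parameterized query (safe).
def get_user_safe(conn, username):
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchone()

# Change the evaluation MR would introduce: string interpolation opens
# a SQL injection that the code review is expected to catch.
def get_user_vulnerable(conn, username):
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchone()
```

A payload like `x' OR '1'='1` returns nothing through the safe version but matches every row through the vulnerable one, which is exactly the signal a security-aware review comment should surface.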

Next Steps

Survey phase (this epic) — complete

  • 6 dataset deep-dive analyses posted
  • Qodo PR-Review-Bench hands-on deep-dive posted
  • This consolidation and recommendation posted

Execution phase (#588932) — starting

  • Post statistical deep-dive as issue under #588932
  • Post dataset collection plan as issue under #588932
  • Run Duo Code Review against 100 Qodo PRs
  • Run Duo against 50 Augment PRs (cross-validation)
  • Build security, custom instruction, and adversarial test cases
  • Upload to LangSmith and integrate into CEF
  • Establish baseline P/R/F1 metrics