Consolidate: Evaluation Dataset Recommendation for Duo Code Review

Parent: Survey: External Evaluation Datasets and Benchmarks for Duo Code Review
Related: #588932

This consolidation incorporates findings from 6 individual dataset analyses, a hands-on inspection of every record in the Qodo dataset, and a statistical deep-dive into its distributions and biases.


Goal

Synthesize findings from the survey and recommend a concrete approach for building evaluation datasets for Duo Code Review.


Side-by-Side Comparison

Dataset x Quality Property Coverage

| Dataset | Bug Detection (P/R) | Comment Quality | Cross-File | Security | Languages | Custom Instructions | Scale |
|---|---|---|---|---|---|---|---|
| CodeFuse-CR-Bench | Yes (dual eval) | Yes (model-based) | Yes (full repo) | Partial | Python only | No | 601 |
| Qodo PR-Review-Bench | Yes (injection) | Partial (location) | Partial | Present (unlabeled) | 7 languages | Yes (134 rules) | 580 issues |
| Augment Benchmark | Yes (golden comments) | Implicit | Yes (complex repos) | No | 5 languages | No | 50 |
| CodeReviewer (MSFT) | No labels | No | No (method-level) | No | 9 languages | No | 1M+ |
| Greptile Benchmark | Recall only | Yes (impact req) | Implicit | No | 5 languages | No | 50 |
| CROP | No labels | No | Possible (has data) | No | 5 languages | No | 51K |

Verdict Summary

| Dataset | Verdict | What to take |
|---|---|---|
| CodeFuse-CR-Bench | Adapt + borrow | Repo-level context format, dual evaluation framework, problem domain taxonomy |
| Qodo PR-Review-Bench | Use as primary evaluation dataset | Location-level ground truth, 134 structured rules, dual evaluation (bugs + rules), 9 competitive tool baselines |
| Augment Benchmark | Supplementary (cross-validation) | Severity-weighted scoring methodology, competitive benchmarks, cross-validation signal |
| CodeReviewer (MSFT) | Reference only | 3-task decomposition, data quality lessons ("Too Noisy To Learn") |
| Greptile Benchmark | Borrow methodology | Real-bug-in-reverse technique for future GitLab-native data |
| CROP | Reference only | Codebase-linked review data concept, multi-revision tracking |

Key Findings

1. Methodology is more transferable than data

No external dataset can be dropped into our evaluation pipeline as-is. But the construction methodologies (bug injection, real-bug-in-reverse, golden comments, dual evaluation) are directly applicable to building our own.

2. No dataset tests GitLab-specific quality properties

5 of our 8 quality properties are not fully covered by any external dataset (3 complete gaps, 2 partial):

| Quality Property | External Coverage | Gap? |
|---|---|---|
| Bug detection precision | Yes (Qodo, Augment, CodeFuse) | No |
| Bug detection recall | Yes (Qodo, Greptile, Augment) | No |
| Comment specificity | Yes (Qodo hit def, Greptile line-level, CodeFuse rule-based) | No |
| Comment actionability | Partial (Greptile impact req, CodeFuse model-based) | Partial |
| Cross-file reasoning | Partial (CodeFuse full repo, Augment complex repos) | Partial |
| Security awareness | No | Yes |
| Custom instruction compliance | No | Yes |
| Robustness / injection resistance | No | Yes |

3. Context depth is the differentiator

Augment's key insight ("the defining challenge in AI code review isn't generation, it's context") is validated across the survey. Datasets with full repository context (CodeFuse) or complex repos (Augment, Greptile) produce more meaningful evaluations than method-level datasets (CodeReviewer).


Decision: Qodo PR-Review-Bench as Primary Dataset

After analyzing all 6 datasets hands-on, Qodo PR-Review-Bench is the clear choice for our primary evaluation dataset. No other public dataset comes close on the two dimensions that matter most: location-level ground truth and structured coding rules.

Why Qodo

1. Location-level ground truth (unique among public datasets)

94% of issues (544/580) include exact file_path + start_line + end_line + code_snippet. This enables triple-requirement scoring: file match + line range overlap + semantic description match. No other dataset provides this.

The schema difference is decisive:

Augment ground truth (2 fields, no file/line):

```json
{"comment": "The function modifies config but returns original monitor.config...", "severity": "High"}
```

Qodo ground truth (7 fields with exact location):

```json
{
  "title": "JWT signature validation bypassed",
  "description": "The JWT verification function was modified to skip signature validation...",
  "file_path": "ghost/core/core/boot.js",
  "start_line": 351,
  "end_line": 377,
  "code_snippet": "await Promise.all([...",
  "rule_name": "NONE"
}
```

This means 2 of the 3 scoring dimensions (file match, line overlap) are fully deterministic. Only semantic matching requires an LLM judge. With Augment, all scoring requires LLM judgment.

2. Dual evaluation structure (bugs + rules)

The dataset contains two distinct issue types that test different quality properties in a single pass:

  • 309 bug/logic issues (53%): Missing functionality, incorrect logic, race conditions, auth bypasses, null/type errors, memory leaks. Bug descriptions are impact-oriented (mean 557 chars).
  • 271 rule violations (47%): Linked to 134 per-repo coding rules, each with explicit success_criteria and failure_criteria. Maps directly to GitLab's .gitlab/duo/mr-review-instructions.yaml custom instructions.

No other dataset tests both bug detection AND instruction-following.
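The mapping from Qodo rules to custom instructions can be sketched as follows. This is a minimal illustration only: the input field names (`rule_name`, `success_criteria`, `failure_criteria`) follow the Qodo schema described above, but the output shape is a simplified stand-in for an `.gitlab/duo/mr-review-instructions.yaml` entry, not the real GitLab schema.

```python
def rule_to_instruction(rule: dict) -> dict:
    """Map a Qodo-style rule record to a simplified custom-instruction entry.

    The output keys are illustrative; the actual mr-review-instructions.yaml
    schema should be taken from GitLab documentation.
    """
    return {
        "name": rule["rule_name"],
        "instructions": (
            f"Flag code where: {rule['failure_criteria']} "
            f"Accept code where: {rule['success_criteria']}"
        ),
    }

# Hypothetical rule record, shaped like the per-repo rules described above:
example_rule = {
    "rule_name": "biome-formatting",
    "success_criteria": "code is formatted per the repo's Biome config",
    "failure_criteria": "code deviates from the repo's Biome config",
}
entry = rule_to_instruction(example_rule)
```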

3. Scale and diversity

| Metric | Value |
|---|---|
| Total PRs | 100 |
| Total issues | 580 |
| Repositories | 8 (Ghost, cal.com, dify, firefox-ios, prefect, tauri, aspnetcore, redis) |
| Languages | JavaScript, TypeScript, Python, Swift, Rust, C#, C |
| Issues per PR | 3-15 (mean 5.8, median 5) |
| Unique rules | 134 |
| License | MIT |

4. Competitive benchmarking

9 tools already evaluated on this dataset, including Augment, Cursor, GitHub Copilot, Greptile, Codex, CodeRabbit, and Sentry. Qodo's best F1: 60.1% (exhaustive mode). Running Duo against the same 100 PRs provides immediate competitive positioning.
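The headline metric is standard precision/recall/F1 over hit counts. A minimal sketch for reproducing it (the hit counts below are illustrative, not benchmark results; Qodo's exact aggregation protocol may differ in detail):

```python
def precision_recall_f1(hits: int, tool_comments: int, ground_truth: int):
    """P = hits / tool comments, R = hits / ground-truth issues,
    F1 = harmonic mean of P and R."""
    precision = hits / tool_comments if tool_comments else 0.0
    recall = hits / ground_truth if ground_truth else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# Illustrative numbers only: 300 hits out of 450 comments against 580 issues.
p, r, f1 = precision_recall_f1(hits=300, tool_comments=450, ground_truth=580)
```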

5. Full reproducibility

All 100 PRs are open on GitHub (agentic-review-benchmarks org). Raw data is on HuggingFace under MIT license. Any result can be independently verified.

Statistical profile (from deep-dive analysis)

| Dimension | Distribution |
|---|---|
| Location precision | 51% narrow (1-5 lines), 36% medium (6-20 lines), 7% broad (21+ lines); the remaining 6% lack location data |
| Completeness | 94% have full location data, 100% have code snippets, 100% have descriptions |
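The location-precision bins above can be reproduced from the `start_line`/`end_line` fields. A small sketch; the bin edges (1-5, 6-20, 21+) come from the table, and treating both endpoints as inclusive is an assumption:

```python
def location_breadth(start_line: int, end_line: int) -> str:
    """Bucket an issue's line span into the narrow/medium/broad bins
    used in the statistical profile (inclusive span length)."""
    span = end_line - start_line + 1
    if span <= 5:
        return "narrow"
    if span <= 20:
        return "medium"
    return "broad"
```

For example, the JWT issue shown earlier (lines 351-377, a 27-line span) lands in the "broad" bin.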

Known biases and mitigations

| Bias | Impact | Mitigation |
|---|---|---|
| cal.com overrepresented (16 PRs, 108 issues vs. redis's 9 PRs, 44 issues) | Per-repo metrics may be skewed | Report per-repo AND overall metrics; normalize by repo |
| "Biome formatting" rule appears 23x in cal.com | Inflates rule violation recall | Deduplicate to 3 representative instances per high-frequency rule |
| Bug description CV is 0.13 (suspiciously uniform) | May indicate template-generated descriptions | Acceptable for evaluation; note in methodology |
| 36 issues without file paths | Can't be scored on location accuracy | Use semantic-only scoring track for these |
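The high-frequency-rule mitigation can be sketched as a simple cap per rule. Assumptions: rule violations carry a `rule_name` field and bug issues carry `rule_name == "NONE"` (per the Qodo schema shown earlier); which 3 instances are "representative" is left as first-come here, whereas the real preprocessing may select them differently:

```python
from collections import defaultdict

def dedup_rule_issues(issues: list[dict], cap: int = 3) -> list[dict]:
    """Keep at most `cap` issues per rule_name; bug issues
    (rule_name == 'NONE') pass through untouched."""
    kept, seen = [], defaultdict(int)
    for issue in issues:
        rule = issue.get("rule_name", "NONE")
        if rule == "NONE":
            kept.append(issue)
        elif seen[rule] < cap:
            seen[rule] += 1
            kept.append(issue)
    return kept
```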

Preprocessing completed

The following preprocessing has been done:

  • Evaluation-ready JSONL: 580 records with full schema (qodo_eval_dataset.jsonl)
  • Flat CSV: same 580 records for quick inspection (qodo_eval_dataset.csv)
  • 12-sheet xlsx analysis: overview, all issues, per-repo sheets, rules, key insights
  • Statistical deep-dive: distribution analysis, bias detection, preprocessing recommendations

For details, see the hands-on deep-dive (08) and statistical analysis (10).
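Consuming the evaluation-ready JSONL splits naturally into the two scoring tracks. A minimal sketch; the field names follow the Qodo schema shown earlier, but how missing locations are encoded in qodo_eval_dataset.jsonl (null vs. absent fields) is an assumption:

```python
import json

def split_tracks(jsonl_lines):
    """Split evaluation records into the location-based and semantic-only
    scoring tracks, keyed on whether a record has full location data."""
    located, semantic_only = [], []
    for line in jsonl_lines:
        rec = json.loads(line)
        if rec.get("file_path") and rec.get("start_line") is not None:
            located.append(rec)
        else:
            semantic_only.append(rec)
    return located, semantic_only
```

Usage: `located, semantic_only = split_tracks(open("qodo_eval_dataset.jsonl"))`.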


Supplementary Datasets

Augment Golden Comments (cross-validation)

Augment's dataset provides complementary value, not as a primary evaluation source, but for cross-validation:

  • 50 PRs across 5 large repos (Sentry, Grafana, Cal.com, Discourse, Keycloak)
  • 137 human-curated golden issues with severity labels (HIGH/LOW)
  • 7 tools benchmarked (best: Augment 59% F-score)
  • Running Duo against both Qodo (100 PRs) and Augment (50 PRs) provides two independent quality signals

Limitation: No file/line locations. All scoring is semantic-only (LLM-as-judge), making results less deterministic than Qodo.

Greptile Methodology (for future GitLab-native data)

Greptile's real-bug-in-reverse technique should be applied to build future evaluation data from GitLab's own repositories:

  • Find merged MRs that fix known bugs
  • Recreate the pre-fix state as an evaluation MR
  • Produces highest-signal ground truth (real bugs, not synthetic)
  • Use for expanding beyond Qodo's 100 PRs
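The core git operation in the steps above is small: given the SHA of a merged bug-fix commit, branch off its first parent so the known bug is present again. A sketch that builds (but does not run) the command; the branch naming is illustrative, and finding candidate fix commits in the first place is left to repo-specific tooling:

```python
def prefix_checkout_cmd(repo_path: str, fix_commit: str) -> list[str]:
    """Build the `git switch` command that recreates the pre-fix state:
    a new branch at the fix commit's first parent (fix_commit^)."""
    branch = f"eval/pre-fix-{fix_commit[:8]}"  # illustrative naming scheme
    return ["git", "-C", repo_path, "switch", "-c", branch, f"{fix_commit}^"]
```

Running this command (e.g. via `subprocess.run(cmd, check=True)`) yields a branch from which the evaluation MR can be opened.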

Scoring Framework

Dual-track evaluation combining strengths of all three datasets:

  1. Location-based scoring (544 Qodo issues with location): file match + line range overlap + semantic description match
  2. Semantic-only scoring (36 Qodo issues without location + all 137 Augment issues): LLM-judge semantic matching
  3. Per-dimension reporting: P/R/F1 by repo, issue type (bug vs rule)

Hit definition

A tool comment counts as a "hit" (true positive) only if it satisfies:

  1. File match: Comment references the same file as ground truth
  2. Line range overlap: Comment's line range overlaps with ground truth (at least 1 line)
  3. Semantic match: LLM judge confirms the comment describes the same underlying issue

For issues without location data, only requirement 3 applies.
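The hit definition above can be sketched directly. Requirements 1 and 2 are deterministic; requirement 3 is delegated to a judge callable, stubbed here since the LLM-judge prompt is out of scope. Field names follow the Qodo schema shown earlier; comment fields (`body`, `start_line`, `end_line`) are illustrative:

```python
def is_hit(comment: dict, truth: dict, same_issue) -> bool:
    """Triple-requirement hit test. `same_issue` stands in for the LLM
    judge: a callable returning True when two texts describe the same
    underlying issue. Ground-truth records without a file_path fall back
    to the semantic check alone (the semantic-only track)."""
    if truth.get("file_path"):
        if comment.get("file_path") != truth["file_path"]:
            return False  # requirement 1: same file
        ranges_overlap = (comment["start_line"] <= truth["end_line"]
                          and truth["start_line"] <= comment["end_line"])
        if not ranges_overlap:
            return False  # requirement 2: at least 1 shared line
    return same_issue(comment["body"], truth["description"])  # requirement 3
```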


Gap-Filling Needed

Properties not covered by any external dataset — must build ourselves:

| Gap | Target | Source |
|---|---|---|
| Security awareness | 5-10 MRs with known vulnerabilities (SQLi, XSS, auth bypass) | Build from OWASP patterns |
| Custom instruction compliance | 5-10 MRs with `.gitlab/duo/mr-review-instructions.yaml` | Adapt Qodo's 134 rules |
| Adversarial robustness | 3-5 MRs with prompt injection in code/comments | Build from known attack patterns |
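To make the security-awareness target concrete, here is the kind of change such an MR could contain: a hypothetical diff replacing a parameterized query with string interpolation, a classic SQL injection (OWASP A03) that the reviewer should flag. Function and table names are invented for illustration:

```python
import sqlite3

# Pre-change state: parameterized query (safe).
def get_user_safe(conn, username):
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchone()

# Change the evaluation MR would introduce: string interpolation opens
# a SQL injection that the code review is expected to catch.
def get_user_vulnerable(conn, username):
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{username}'"
    ).fetchone()
```

A payload like `x' OR '1'='1` returns nothing through the safe version but matches every row through the vulnerable one, which is exactly the signal a security-aware review comment should surface.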

Next Steps

Survey phase (this epic) — complete

  • 6 dataset deep-dive analyses posted
  • Qodo PR-Review-Bench hands-on deep-dive posted
  • This consolidation and recommendation posted

Execution phase (#588932) — starting

  • Post statistical deep-dive as issue under #588932
  • Post dataset collection plan as issue under #588932
  • Run Duo Code Review against 100 Qodo PRs
  • Run Duo against 50 Augment PRs (cross-validation)
  • Build security, custom instruction, and adversarial test cases
  • Upload to LangSmith and integrate into CEF
  • Establish baseline P/R/F1 metrics