
Creation of Code Review Benchmark Dataset (Rudimentary)

Problem to solve

Currently, the Code Review AI feature needs a preliminary benchmark. The initial work involves creating a rudimentary dataset to better understand:

  • which foundational model to use
  • how to tweak the prompts

Proposal

🚀 Code Review Dataset Creation with GitLab API

To ensure swift access to a robust validation dataset, we're taking a two-phase, iterative approach to dataset creation, following the flow below:

(Flow diagram: Screenshot_2024-07-01_at_3.31.19_PM)

🌟 Iteration One: Rudimentary Benchmark Dataset

In the initial phase, we'll leverage historical data from 14 GitLab projects to create a synthetic dataset. 🗃️ For each file, we'll extract the historic comments and hunks (with hunk IDs) and ultimately group them by MR ID. This approach allows for rapid development of a validation dataset, providing a benchmark to assess how Code Review performs and how to further refine the prompts, enabling testing of specific scenarios and quick initial results.
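
For illustration, a single record in this dataset might look like the following (the field names and values are assumptions, not a finalized schema):

```python
# Hypothetical shape of one record in the rudimentary dataset; field names
# are illustrative assumptions, not a finalized schema.
example_record = {
    "project_id": 123,                    # GitLab project ID
    "mr_id": 1234,                        # merge request the file is grouped under
    "file_path": "app/models/user.rb",
    "hunks": [
        {"hunk_id": "a1b2c3", "diff": "@@ -10,6 +10,8 @@ ..."},
    ],
    "comments": [
        {"author": "reviewer_1", "body": "Consider extracting this into a helper."},
    ],
}
```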

Progress:

The dataset has been created. It has the following language distribution:

| file_extension | count |
| --- | --- |
| .md | 289 |
| .yml | 399 |
| .js | 179 |
| .pot | 1 |
| .json | 235 |
| .lock | 138 |
| .rb | 462 |
| .sql | 18 |
| .rake | 20 |
| .vue | 79 |
| .go | 1770 |
| .mod | 102 |
| .sum | 80 |
| .tool-versions | 23 |
| .gitlab/CODEOWNERS | 3 |
| .mk | 9 |
| .crt | 2 |
| .key | 1 |
| .sh | 5 |
| .txt | 68 |
| .erb | 61 |
| .gemspec | 1 |
| .haml | 5 |
| .scss | 67 |
| .ts | 128 |
| .yaml | 53 |
| .mjs | 13 |
| .html | 5 |
| .rebuild | 6 |
| .ps1 | 2 |
| .deb | 1 |
| .rpm | 1 |
| .Dockerfile | 6 |
| .tmpl | 1 |
| .svg | 3 |
| .nvmrc | 1 |
| .css | 14 |
| .cjs | 3 |
| .gitignore | 2 |
| .snap | 8 |
| .onbuild | 1 |
| .3 | 1 |
| .toml | 8 |
| .proto | 6 |
| .template | 5 |
| .py | 53 |
| .ruby-version | 1 |
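
For reference, a distribution like the one above can be recomputed from the collected file paths; here's a sketch assuming the records were exported as JSON Lines with a file_path column (the file name is a hypothetical placeholder):

```python
import os

import pandas as pd

# A sketch: assumes one row per changed file, exported as JSON Lines.
df = pd.read_json("code_review_dataset.jsonl", lines=True)

# Map each path to its extension, falling back to the basename for
# extension-less entries (e.g. .tool-versions, CODEOWNERS).
df["file_extension"] = df["file_path"].map(
    lambda p: os.path.splitext(p)[1] or os.path.basename(p)
)
print(df["file_extension"].value_counts())
```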

🌈 Iteration Two: Production Benchmark Dataset

The second phase transitions to using exclusively historical production data, creating a more comprehensive and realistic dataset. 📊 We'll extract commits 📝, comments 💬, and pipeline information 🔄 to build a multi-step, ground-truth validation dataset. This phase ensures real-world applicability across various reviewer personas, languages, and methodologies, providing a robust testing and validation environment as a proxy for production. 🏗
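
As an illustration of the additional signals this phase pulls in, the public GitLab REST API already exposes commits and pipelines per merge request; a sketch follows (the endpoint paths are GitLab's, the function names are our own):

```python
import requests

GITLAB_API = "https://gitlab.com/api/v4"

def get_merge_request_commits(project_id, mr_iid, token):
    """Commits in a merge request (GET .../merge_requests/:iid/commits)."""
    url = f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/commits"
    resp = requests.get(url, headers={"PRIVATE-TOKEN": token})
    resp.raise_for_status()
    return resp.json()

def get_merge_request_pipelines(project_id, mr_iid, token):
    """Pipelines run for a merge request (GET .../merge_requests/:iid/pipelines)."""
    url = f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/pipelines"
    resp = requests.get(url, headers={"PRIVATE-TOKEN": token})
    resp.raise_for_status()
    return resp.json()
```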


💡 Pro Tip: This two-phase approach allows us to start with a quick, synthetic dataset for initial validation, then transition to a more comprehensive, real-world dataset for robust testing and validation.

Technical Implementation Details:

🔍 Data Pipeline for Code Review Dataset

We're expanding our Data Pipeline to harness the power of the GitLab API, creating a robust code review dataset from 14 open source projects. Here's how our data-gathering process works:

🛠️ API Interaction Functions

We've crafted these functions to seamlessly interact with GitLab's API endpoints (a sketch of their implementation follows the table):

| Function | Description |
| --- | --- |
| `get_projects()` | 🔎 Fetches public projects matching a search query |
| `get_merge_requests()` | 📥 Retrieves merge requests for a project |
| `get_merge_request_changes()` | 📝 Gets the changes (diffs) for a specific merge request |
| `get_merge_request_comments()` | 💬 Fetches comments on a merge request |
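
Here's a minimal sketch of how these helpers might wrap the public GitLab REST API (the endpoint paths come from GitLab's API docs; pagination, retries, and error handling are elided):

```python
import requests

GITLAB_API = "https://gitlab.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": "<token>"}  # placeholder; read from env in practice

def get_projects(search_query):
    """Fetch public projects matching a search query."""
    resp = requests.get(
        f"{GITLAB_API}/projects",
        params={"search": search_query, "visibility": "public"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def get_merge_requests(project_id, updated_after=None):
    """Retrieve merge requests for a project, optionally filtered by date."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/merge_requests",
        params={"state": "merged", "updated_after": updated_after},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def get_merge_request_changes(project_id, mr_iid):
    """Get the changes (diffs) for a specific merge request."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/changes",
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def get_merge_request_comments(project_id, mr_iid):
    """Fetch comments (notes) on a merge request."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/notes",
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()
```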

🏗️ Dataset Creation

Our create_dataset() function is the heart of the operation (sketched after this list):

  1. 🔄 Iterates through the provided projects
  2. For each project, fetches merge requests from the last 30 days (configurable)
  3. 📊 For each merge request, it collects:
    • 📌 Project and MR metadata
    • 🔀 Changes (code hunks)
    • 💬 Comments
    • Review state (approved or needs work)
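
A minimal sketch of how create_dataset() might tie the helpers above together; the record fields and the 30-day default are assumptions based on the steps listed:

```python
from datetime import datetime, timedelta, timezone

def create_dataset(projects, days=30):
    """Join project/MR metadata, code hunks, and comments into records."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    dataset = []
    for project in projects:
        for mr in get_merge_requests(project["id"], updated_after=cutoff):
            changes = get_merge_request_changes(project["id"], mr["iid"])
            notes = get_merge_request_comments(project["id"], mr["iid"])
            dataset.append({
                "project": {"id": project["id"], "name": project.get("name")},
                "mr": {"iid": mr["iid"], "title": mr["title"]},
                "hunks": [c["diff"] for c in changes.get("changes", [])],
                "comments": [n["body"] for n in notes],
                # Review state (approved / needs work) would come from the MR
                # approvals endpoint (GET .../merge_requests/:iid/approvals).
            })
    return dataset
```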

🚀 Execution Process via Dataflow Pipeline

  1. 🔍 Searches for and extracts open source projects
  2. 🎯 Creates a dataset from the first 5 projects (POC), then extends to all 14 projects (see the sketch below)
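
A minimal Apache Beam sketch of this flow, assuming the pipeline runs on Dataflow and reuses the helpers above (the search query, step names, and bucket path are placeholders):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(search_query="<query>", num_projects=5):
    # Dataflow runner/project/region options would be passed in here.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Projects" >> beam.Create(get_projects(search_query)[:num_projects])
            | "BuildRecords" >> beam.FlatMap(lambda proj: create_dataset([proj]))
            | "ToJSON" >> beam.Map(json.dumps)
            | "Write" >> beam.io.WriteToText(
                "gs://<bucket>/code_review_dataset", file_name_suffix=".jsonl"
            )
        )
```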

Further details

This initial dataset serves as a starting point. We will expand in various areas to refine the feature further based on insights gained from this preliminary benchmark.

Links / references

Meeting Syncs: https://docs.google.com/document/d/1ilWXf-DjsJTbe0hVby7rDxb4NugziURN9o1FiUEfgCI/edit
