Creation of Code Review Benchmark Dataset (Rudimentary)
Problem to solve
The Code Review AI feature currently needs a preliminary benchmark. The initial work involves creating a rudimentary dataset to better understand:
- which foundational model to use
- how to tweak the prompts
Proposal
🚀 Code Review Dataset Creation with GitLab API
To ensure swift access to a robust validation dataset, we're creating the dataset in two iterative phases, with the following flow:
🌟 Iteration One: Rudimentary Benchmark Dataset
In the initial phase, we'll leverage historical data from 14 GitLab projects to create a synthetic dataset.
Progress:
The dataset has been created. It has the following language distribution:
| file_extension | count |
|---|---|
| .md | 289 |
| .yml | 399 |
| .js | 179 |
| .pot | 1 |
| .json | 235 |
| .lock | 138 |
| .rb | 462 |
| .sql | 18 |
| .rake | 20 |
| .vue | 79 |
| .go | 1770 |
| .mod | 102 |
| .sum | 80 |
| .tool-versions | 23 |
| .gitlab/CODEOWNERS | 3 |
| .mk | 9 |
| .crt | 2 |
| .key | 1 |
| .sh | 5 |
| .txt | 68 |
| .erb | 61 |
| .gemspec | 1 |
| .haml | 5 |
| .scss | 67 |
| .ts | 128 |
| .yaml | 53 |
| .mjs | 13 |
| .html | 5 |
| .rebuild | 6 |
| .ps1 | 2 |
| .deb | 1 |
| .rpm | 1 |
| .Dockerfile | 6 |
| .tmpl | 1 |
| .svg | 3 |
| .nvmrc | 1 |
| .css | 14 |
| .cjs | 3 |
| .gitignore | 2 |
| .snap | 8 |
| .onbuild | 1 |
| .3 | 1 |
| .toml | 8 |
| .proto | 6 |
| .template | 5 |
| .py | 53 |
| .ruby-version | 1 |
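As a rough sketch, the distribution above can be computed from the changed-file paths in the dataset. The `extension_counts` helper and the example paths below are illustrative, not the actual pipeline code; note that dotfiles such as `.gitignore` have no suffix, so we fall back to the file name.

```python
from collections import Counter
from pathlib import PurePosixPath

def file_extension(path):
    """Return the file's extension, falling back to the name for dotfiles like .gitignore."""
    p = PurePosixPath(path)
    return p.suffix or p.name

def extension_counts(paths):
    """Count changed files by extension, mirroring the distribution table above."""
    return Counter(file_extension(p) for p in paths)
```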
🌈 Iteration Two: Production Benchmark Dataset
The second phase transitions to using exclusively historical production data, creating a more comprehensive and realistic dataset.
💡 Pro Tip: This two-phase approach allows us to start with a quick, synthetic dataset for initial validation, then transition to a more comprehensive, real-world dataset for robust testing and validation.
Technical Implementation Details:
🔍 Data Pipeline for Code Review Dataset
We're expanding our Data Pipeline to harness the power of the GitLab API, creating a robust code review dataset from 14 open source projects. Here's how our data-gathering process works:
🛠️ API Interaction Functions
We've crafted these functions to seamlessly interact with GitLab's API endpoints:
| Function | Description |
|---|---|
| `get_projects()` | Fetches the list of GitLab projects |
| `get_merge_requests()` | Fetches merge requests for a given project |
| `get_merge_request_changes()` | Fetches the changes (code hunks) of a merge request |
| `get_merge_request_comments()` | Fetches the comments on a merge request |
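The helpers above map onto GitLab's REST endpoints (`/projects`, `/projects/:id/merge_requests`, and the per-MR `changes` and `notes` resources). The sketch below is a minimal, hypothetical stdlib-only version; `BASE_URL` (the public gitlab.com instance) and the token handling are assumptions, not the actual pipeline code.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://gitlab.com/api/v4"  # assumption: the public gitlab.com instance

# Endpoint paths backing the four helpers in the table above.
def projects_path():
    return "/projects"

def merge_requests_path(project_id):
    return f"/projects/{project_id}/merge_requests"

def merge_request_changes_path(project_id, mr_iid):
    return f"/projects/{project_id}/merge_requests/{mr_iid}/changes"

def merge_request_comments_path(project_id, mr_iid):
    # Review comments are exposed as "notes" in the GitLab API.
    return f"/projects/{project_id}/merge_requests/{mr_iid}/notes"

def gitlab_get(path, token=None, **params):
    """GET a GitLab API path and return the parsed JSON body."""
    url = BASE_URL + path
    query = {k: v for k, v in params.items() if v is not None}
    if query:
        url += "?" + urllib.parse.urlencode(query)
    req = urllib.request.Request(url)
    if token:
        req.add_header("PRIVATE-TOKEN", token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def get_merge_request_changes(project_id, mr_iid, token=None):
    return gitlab_get(merge_request_changes_path(project_id, mr_iid), token=token)
```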
🏗️ Dataset Creation
Our `create_dataset()` function is the heart of the operation:
- 🔄 Iterates through the provided projects
- ⏳ For each project, fetches merge requests from the last 30 days (configurable)
- 📊 For each merge request, it collects:
  - 📌 Project and MR metadata
  - 🔀 Changes (code hunks)
  - 💬 Comments
  - ✅ Review state (approved or needs work)
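The steps above can be sketched as follows. This is a minimal, hypothetical outline, not the production implementation: the `fetch_*` callables stand in for the API helpers so the logic stays testable without network access, and treating merged MRs as "approved" is a simplifying assumption.

```python
from datetime import datetime, timedelta, timezone

def create_dataset(projects, fetch_mrs, fetch_changes, fetch_comments, days=30):
    """Build one record per merge request updated within the last `days` days."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    records = []
    # 🔄 iterate through the provided projects
    for project in projects:
        # ⏳ merge requests from the last `days` days (configurable)
        for mr in fetch_mrs(project["id"], updated_after=cutoff):
            records.append({
                # 📌 project and MR metadata
                "project_id": project["id"],
                "project_name": project.get("name"),
                "mr_iid": mr["iid"],
                "mr_title": mr.get("title"),
                # 🔀 changes (code hunks) and 💬 comments
                "changes": fetch_changes(project["id"], mr["iid"]),
                "comments": fetch_comments(project["id"], mr["iid"]),
                # ✅ review state: merged MRs count as approved in this sketch
                "review_state": "approved" if mr.get("state") == "merged" else "needs work",
            })
    return records
```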
🚀 Execution Process via Dataflow Pipeline
- 🔍 Searches for and extracts open source projects
- 🎯 Creates a dataset from the first 5 projects (POC), then extends to 14 projects
Further details
This initial dataset serves as a starting point. We will expand in various areas to refine the feature further based on insights gained from this preliminary benchmark.
Links / references
Meeting Syncs: https://docs.google.com/document/d/1ilWXf-DjsJTbe0hVby7rDxb4NugziURN9o1FiUEfgCI/edit
