
Creation of Code Review Benchmark Dataset (Rudimentary)

Problem to solve

Currently, the Code Review AI feature needs a preliminary benchmark. The initial work involves creating a rudimentary dataset to better understand:

  • which foundational model to use
  • how to tweak the prompts

Proposal

🚀 Code Review Dataset Creation with GitLab API

To ensure swift access to a robust validation dataset, we're taking a two-phase, iterative approach to dataset creation, following the flow below:

(Flow diagram: Screenshot_2024-07-01_at_3.31.19_PM)

🌟 Iteration One: Rudimentary Benchmark Dataset

In the initial phase, we'll leverage historical data from 14 GitLab projects to create a synthetic dataset. 🗃️ For each file, we'll extract the historic comments and hunks (with hunk IDs) and ultimately group them by MR ID. This approach allows for rapid development of a validation dataset, providing a benchmark to assess how Code Review performs and how to further refine the prompts, enabling testing of specific scenarios and quick initial results.
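
For illustration, a single record in this dataset might look like the following (the field names and values are assumptions, not a finalized schema):

```python
# Hypothetical shape of one record in the rudimentary dataset; field names
# are illustrative assumptions, not a finalized schema.
example_record = {
    "project_id": 123,                    # GitLab project ID
    "mr_id": 1234,                        # merge request the file is grouped under
    "file_path": "app/models/user.rb",
    "hunks": [
        {"hunk_id": "a1b2c3", "diff": "@@ -10,6 +10,8 @@ ..."},
    ],
    "comments": [
        {"author": "reviewer_1", "body": "Consider extracting this into a helper."},
    ],
}
```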

Progress:

The dataset has been created. It has the following language distribution:

| file_extension | count |
| --- | --- |
| .md | 289 |
| .yml | 399 |
| .js | 179 |
| .pot | 1 |
| .json | 235 |
| .lock | 138 |
| .rb | 462 |
| .sql | 18 |
| .rake | 20 |
| .vue | 79 |
| .go | 1770 |
| .mod | 102 |
| .sum | 80 |
| .tool-versions | 23 |
| .gitlab/CODEOWNERS | 3 |
| .mk | 9 |
| .crt | 2 |
| .key | 1 |
| .sh | 5 |
| .txt | 68 |
| .erb | 61 |
| .gemspec | 1 |
| .haml | 5 |
| .scss | 67 |
| .ts | 128 |
| .yaml | 53 |
| .mjs | 13 |
| .html | 5 |
| .rebuild | 6 |
| .ps1 | 2 |
| .deb | 1 |
| .rpm | 1 |
| .Dockerfile | 6 |
| .tmpl | 1 |
| .svg | 3 |
| .nvmrc | 1 |
| .css | 14 |
| .cjs | 3 |
| .gitignore | 2 |
| .snap | 8 |
| .onbuild | 1 |
| .3 | 1 |
| .toml | 8 |
| .proto | 6 |
| .template | 5 |
| .py | 53 |
| .ruby-version | 1 |
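
For reference, a distribution like the one above can be recomputed from the collected file paths; here's a sketch assuming the records were exported as JSON Lines with a file_path column (the file name is a hypothetical placeholder):

```python
import os

import pandas as pd

# A sketch: assumes one row per changed file, exported as JSON Lines.
df = pd.read_json("code_review_dataset.jsonl", lines=True)

# Map each path to its extension, falling back to the basename for
# extension-less entries (e.g. .tool-versions, CODEOWNERS).
df["file_extension"] = df["file_path"].map(
    lambda p: os.path.splitext(p)[1] or os.path.basename(p)
)
print(df["file_extension"].value_counts())
```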

🌈 Iteration Two: Production Benchmark Dataset

The second phase transitions to using exclusively historical production data, creating a more comprehensive and realistic dataset. 📊 We'll extract commits 📝, comments 💬, and pipeline information 🔄 to build a multi-step, ground-truth validation dataset. This phase ensures real-world applicability across various reviewer personas, languages, and methodologies, providing a robust testing and validation environment as a proxy for production. 🏗
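
As an illustration of the additional signals this phase pulls in, the public GitLab REST API already exposes commits and pipelines per merge request; a sketch follows (the endpoint paths are GitLab's, the function names are our own):

```python
import requests

GITLAB_API = "https://gitlab.com/api/v4"

def get_merge_request_commits(project_id, mr_iid, token):
    """Commits in a merge request (GET .../merge_requests/:iid/commits)."""
    url = f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/commits"
    resp = requests.get(url, headers={"PRIVATE-TOKEN": token})
    resp.raise_for_status()
    return resp.json()

def get_merge_request_pipelines(project_id, mr_iid, token):
    """Pipelines run for a merge request (GET .../merge_requests/:iid/pipelines)."""
    url = f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/pipelines"
    resp = requests.get(url, headers={"PRIVATE-TOKEN": token})
    resp.raise_for_status()
    return resp.json()
```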


💡 Pro Tip: This two-phase approach allows us to start with a quick, synthetic dataset for initial validation, then transition to a more comprehensive, real-world dataset for robust testing and validation.

Technical Implementation Details:

🔍 Data Pipeline for Code Review Dataset

We're expanding our Data Pipeline to harness the power of the GitLab API, creating a robust code review dataset from 14 open source projects. Here's how our data-gathering process works:

🛠️ API Interaction Functions

We've crafted these functions to seamlessly interact with GitLab's API endpoints (a sketch of their implementation follows the table):

| Function | Description |
| --- | --- |
| `get_projects()` | 🔎 Fetches public projects matching a search query |
| `get_merge_requests()` | 📥 Retrieves merge requests for a project |
| `get_merge_request_changes()` | 📝 Gets the changes (diffs) for a specific merge request |
| `get_merge_request_comments()` | 💬 Fetches comments on a merge request |
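
Here's a minimal sketch of how these helpers might wrap the public GitLab REST API (the endpoint paths come from GitLab's API docs; pagination, retries, and error handling are elided):

```python
import requests

GITLAB_API = "https://gitlab.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": "<token>"}  # placeholder; read from env in practice

def get_projects(search_query):
    """Fetch public projects matching a search query."""
    resp = requests.get(
        f"{GITLAB_API}/projects",
        params={"search": search_query, "visibility": "public"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def get_merge_requests(project_id, updated_after=None):
    """Retrieve merge requests for a project, optionally filtered by date."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/merge_requests",
        params={"state": "merged", "updated_after": updated_after},
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def get_merge_request_changes(project_id, mr_iid):
    """Get the changes (diffs) for a specific merge request."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/changes",
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()

def get_merge_request_comments(project_id, mr_iid):
    """Fetch comments (notes) on a merge request."""
    resp = requests.get(
        f"{GITLAB_API}/projects/{project_id}/merge_requests/{mr_iid}/notes",
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()
```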

🏗️ Dataset Creation

Our create_dataset() function is the heart of the operation (sketched after this list):

  1. 🔄 Iterates through the provided projects
  2. For each project, fetches merge requests from the last 30 days (configurable)
  3. 📊 For each merge request, it collects:
    • 📌 Project and MR metadata
    • 🔀 Changes (code hunks)
    • 💬 Comments
    • Review state (approved or needs work)
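
A minimal sketch of how create_dataset() might tie the helpers above together; the record fields and the 30-day default are assumptions based on the steps listed:

```python
from datetime import datetime, timedelta, timezone

def create_dataset(projects, days=30):
    """Join project/MR metadata, code hunks, and comments into records."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
    dataset = []
    for project in projects:
        for mr in get_merge_requests(project["id"], updated_after=cutoff):
            changes = get_merge_request_changes(project["id"], mr["iid"])
            notes = get_merge_request_comments(project["id"], mr["iid"])
            dataset.append({
                "project": {"id": project["id"], "name": project.get("name")},
                "mr": {"iid": mr["iid"], "title": mr["title"]},
                "hunks": [c["diff"] for c in changes.get("changes", [])],
                "comments": [n["body"] for n in notes],
                # Review state (approved / needs work) would come from the MR
                # approvals endpoint (GET .../merge_requests/:iid/approvals).
            })
    return dataset
```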

🚀 Execution Process via Dataflow Pipeline

  1. 🔍 Searches for and extracts open source projects
  2. 🎯 Creates a dataset from the first 5 projects (POC), then extends to all 14 projects (see the sketch below)
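
A minimal Apache Beam sketch of this flow, assuming the pipeline runs on Dataflow and reuses the helpers above (the search query, step names, and bucket path are placeholders):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(search_query="<query>", num_projects=5):
    # Dataflow runner/project/region options would be passed in here.
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Projects" >> beam.Create(get_projects(search_query)[:num_projects])
            | "BuildRecords" >> beam.FlatMap(lambda proj: create_dataset([proj]))
            | "ToJSON" >> beam.Map(json.dumps)
            | "Write" >> beam.io.WriteToText(
                "gs://<bucket>/code_review_dataset", file_name_suffix=".jsonl"
            )
        )
```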

Further details

This initial dataset serves as a starting point. We will expand in various areas to refine the feature further based on insights gained from this preliminary benchmark.

Links / references

Meeting Syncs: https://docs.google.com/document/d/1ilWXf-DjsJTbe0hVby7rDxb4NugziURN9o1FiUEfgCI/edit
