Set up model validation for Code Review Summary


Problem

Before updating the prompt (#485502) and going GA (&10771), we should set up model evaluation so we are better equipped to assess the quality of the model's responses (beyond just direct user feedback).

We want to set up a validation process similar to the one we have for Duo Code Review: https://gitlab.com/gitlab-com/create-stage/code-review-be/-/wikis/Duo-Code-Review-Human-Evaluation-Process

Proposal

  1. Define and build an initial dataset
    1. The dataset will be hosted in LangSmith
    2. We should start with a small dataset (handbook), e.g. pick 1-2 MRs and collect their code review comments
    3. Example: #490991 (comment 2124897461)
  2. Set up LangSmith to perform manual model evaluations using that dataset
    1. Define evaluation criteria in LangSmith, e.g. conciseness and correctness
    2. Run an experiment in LangSmith to validate the setup
    3. Example: #490991 (comment 2124960188)

Out of scope

We will not create evaluators in ELI5 yet; that will be a future iteration.
