Set up model validation for Code Review Summary
## Problem
Before updating the prompt (#485502) and going GA (&10771), we should set up model evaluation so we are better equipped to assess the quality of the model's responses (beyond just direct user feedback).
We want to set up a validation process similar to the one we have for Duo Code Review: https://gitlab.com/gitlab-com/create-stage/code-review-be/-/wikis/Duo-Code-Review-Human-Evaluation-Process
## Proposal
- Define and build an initial dataset
  - The dataset will be hosted in LangSmith
  - We should start with a small dataset (handbook), e.g. pick 1-2 MRs and collect their code review comments
    - Example: #490991 (comment 2124897461)
- Set up LangSmith to perform manual model evaluations using that dataset
  - Define evaluation criteria in LangSmith, e.g. conciseness and correctness
  - Run an experiment in LangSmith to validate the setup
    - Example: #490991 (comment 2124960188)
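Before wiring the criteria into LangSmith, we could prototype them locally to agree on what "conciseness" and "correctness" should mean. A minimal sketch follows; the dataset shape, evaluator signatures, and scoring heuristics are illustrative assumptions, not LangSmith's actual SDK or our final rubric:

```python
# Illustrative local stand-in for the evaluation setup described above.
# Dataset entries pair an MR diff hunk (input) with the code review
# comment the model produced (the output we want to evaluate).
dataset = [
    {
        "inputs": {"diff": "-    return a - b\n+    return a + b"},
        "outputs": {"comment": "The original implementation subtracted instead of adding."},
    },
]


def conciseness(outputs: dict, max_words: int = 50) -> dict:
    """Assumed heuristic: score 1 if the comment stays under a word budget."""
    word_count = len(outputs["comment"].split())
    return {"key": "conciseness", "score": int(word_count <= max_words)}


def correctness(outputs: dict, expected_keywords: list[str]) -> dict:
    """Assumed heuristic: fraction of expected issue keywords the comment mentions."""
    comment = outputs["comment"].lower()
    hits = sum(kw in comment for kw in expected_keywords)
    return {"key": "correctness", "score": hits / len(expected_keywords)}


# Run both evaluators over every example, mirroring what a LangSmith
# experiment would report per criterion.
results = [
    [conciseness(ex["outputs"]), correctness(ex["outputs"], ["subtract"])]
    for ex in dataset
]
```

In LangSmith itself, the same criteria would be attached as evaluators to an experiment over the hosted dataset rather than run in a loop like this.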
## Out of scope
We will not create evaluators in ELI5 yet; that will be a future iteration.