Validate prompt improvements to Duo Code Review using model evaluation
Problem
Before making too many changes to the prompt (see other issues in &14143), we should set up model evaluation so that we are better equipped to assess the quality of the model's responses (beyond direct user feedback alone).
Proposal
- Build an initial dataset of code review examples, to establish a first benchmark
  - The first version of this dataset would be created by group::ai model validation: Creation of Code Review Benchmark Dataset (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#348 - closed)
- Leverage LangSmith to perform a model evaluation using that dataset (a rough sketch of this workflow follows this list):
  - Docs: https://gitlab.com/gitlab-org/ai-powered/eli5/-/tree/main
  - Example for Duo Code Review: https://gitlab.com/gitlab-org/ai-powered/daily-updates/-/issues/7#note_1966668226
- Make prompt improvements and verify their impact using the evaluation results
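A minimal sketch of what this could look like with the LangSmith Python SDK is below. The dataset name, example fields, placeholder review generator, and evaluator are all illustrative assumptions, not the actual Duo Code Review integration or the benchmark dataset from #348.

```python
# Sketch only: build a small benchmark dataset in LangSmith and run an evaluation
# against it. Assumes the `langsmith` Python package and a LangSmith API key in the
# environment. Dataset/field names and helper functions below are illustrative.
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Create a draft benchmark dataset of code review examples (diff -> reference review).
dataset = client.create_dataset(
    dataset_name="duo-code-review-benchmark-draft",
    description="Draft benchmark of merge request diffs and reference reviews",
)
client.create_example(
    inputs={"diff": "@@ -1,2 +1,2 @@\n-def add(a, b):\n+def add(a, b, c):\n     return a + b"},
    outputs={"reference_review": "The new parameter `c` is never used in the return value."},
    dataset_id=dataset.id,
)

# 2. Target under evaluation: this placeholder stands in for a call to the
#    Duo Code Review prompt/model.
def generate_code_review(inputs: dict) -> dict:
    review = f"Automated review for diff:\n{inputs['diff']}"  # replace with the real call
    return {"review": review}

# 3. Custom evaluator comparing the generated review to the reference review.
#    Placeholder string-match scoring; in practice this could be an LLM-as-judge evaluator.
def review_quality(run, example) -> dict:
    generated = run.outputs["review"]
    reference = example.outputs["reference_review"]
    score = 1.0 if reference.lower() in generated.lower() else 0.0
    return {"key": "mentions_reference_issue", "score": score}

# 4. Run the experiment; results appear in the LangSmith UI for comparison.
evaluate(
    generate_code_review,
    data="duo-code-review-benchmark-draft",
    evaluators=[review_quality],
    experiment_prefix="duo-code-review-prompt-baseline",
)
```

After a baseline run is recorded, the same `evaluate()` call could be re-run with a different `experiment_prefix` for each prompt variant, and LangSmith's experiment comparison view used to check whether a prompt change is actually an improvement before shipping it.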
Links
- Epic for support work by group::ai model validation: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&15
- Discussion on CEF vs LangSmith for this feature: https://gitlab.com/gitlab-org/ai-powered/daily-updates/-/issues/7#note_1965978061
- Discussion on CEF vs LangSmith in general: &13952