Validate prompt improvements to Duo Code Review using model evaluation
Problem
Before making too many changes to the prompt (see other issues in &14143), we should set up model evaluation so that we are better equipped to assess the quality of the model's responses (beyond direct user feedback alone).
Proposal
- Build an initial dataset of code review examples, to establish a first benchmark
  - The first version of this dataset would be created by group::ai model validation: Creation of Code Review Benchmark Dataset (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#348 - closed)
- Leverage LangSmith to perform a model evaluation using that dataset (a rough sketch of this workflow follows this list):
  - Docs: https://gitlab.com/gitlab-org/ai-powered/eli5/-/tree/main
  - Example for Duo Code Review: https://gitlab.com/gitlab-org/ai-powered/daily-updates/-/issues/7#note_1966668226
- Make prompt improvements and verify their impact using the evaluation results
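A minimal sketch of what this could look like with the LangSmith Python SDK is below. The dataset name, example fields, placeholder review generator, and evaluator are all illustrative assumptions, not the actual Duo Code Review integration or the benchmark dataset from #348.

```python
# Sketch only: build a small benchmark dataset in LangSmith and run an evaluation
# against it. Assumes the `langsmith` Python package and a LangSmith API key in the
# environment. Dataset/field names and helper functions below are illustrative.
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Create a draft benchmark dataset of code review examples (diff -> reference review).
dataset = client.create_dataset(
    dataset_name="duo-code-review-benchmark-draft",
    description="Draft benchmark of merge request diffs and reference reviews",
)
client.create_example(
    inputs={"diff": "@@ -1,2 +1,2 @@\n-def add(a, b):\n+def add(a, b, c):\n     return a + b"},
    outputs={"reference_review": "The new parameter `c` is never used in the return value."},
    dataset_id=dataset.id,
)

# 2. Target under evaluation: this placeholder stands in for a call to the
#    Duo Code Review prompt/model.
def generate_code_review(inputs: dict) -> dict:
    review = f"Automated review for diff:\n{inputs['diff']}"  # replace with the real call
    return {"review": review}

# 3. Custom evaluator comparing the generated review to the reference review.
#    Placeholder string-match scoring; in practice this could be an LLM-as-judge evaluator.
def review_quality(run, example) -> dict:
    generated = run.outputs["review"]
    reference = example.outputs["reference_review"]
    score = 1.0 if reference.lower() in generated.lower() else 0.0
    return {"key": "mentions_reference_issue", "score": score}

# 4. Run the experiment; results appear in the LangSmith UI for comparison.
evaluate(
    generate_code_review,
    data="duo-code-review-benchmark-draft",
    evaluators=[review_quality],
    experiment_prefix="duo-code-review-prompt-baseline",
)
```

After a baseline run is recorded, the same `evaluate()` call could be re-run with a different `experiment_prefix` for each prompt variant, and LangSmith's experiment comparison view used to check whether a prompt change is actually an improvement before shipping it.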
Links
- Epic for support work by group::ai model validation: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&15
- Discussion on CEF vs LangSmith for this feature: https://gitlab.com/gitlab-org/ai-powered/daily-updates/-/issues/7#note_1965978061
- Discussion on CEF vs LangSmith in general: &13952