Validate prompt improvements to Duo Code Review using model evaluation

Problem

Before making further changes to the prompt (see other issues in &14143), we should set up model evaluation so we are better equipped to assess the quality of the model's responses (beyond direct user feedback alone).

Proposal

  1. Build an initial dataset of code review examples to establish a first benchmark
    1. The first version of this dataset would be created by the AI Model Validation group: Creation of Code Review Benchmark Dataset (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#348 - closed)
  2. Leverage LangSmith to perform a model evaluation using that dataset (see the sketch after this list):
    1. Docs: https://gitlab.com/gitlab-org/ai-powered/eli5/-/tree/main
    2. Example for Duo Code Review: https://gitlab.com/gitlab-org/ai-powered/daily-updates/-/issues/7#note_1966668226
  3. Make prompt improvements and verify their impact using the results
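
To make step 2 concrete, here is a minimal sketch of what such an evaluation could look like with the LangSmith Python SDK. The dataset name, the `generate_review` target, and the `mentions_expected_fix` evaluator are illustrative placeholders and not the actual Duo Code Review integration; the real setup would call the production prompt and use evaluators agreed on with the AI Model Validation group.

```python
# Minimal sketch, assuming the LangSmith Python SDK (`pip install langsmith`)
# and a LANGCHAIN_API_KEY in the environment. Names below are hypothetical.
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# 1. Register the benchmark examples as a LangSmith dataset.
dataset = client.create_dataset(dataset_name="duo-code-review-benchmark")
client.create_examples(
    inputs=[{"diff": "-    return a - b\n+    return a + b"}],
    outputs=[{"expected_comment": "Fixes the subtraction bug by using addition."}],
    dataset_id=dataset.id,
)

# 2. The system under test: a placeholder standing in for the Duo Code Review prompt.
def generate_review(inputs: dict) -> dict:
    return {"comment": "Looks good to me."}

# 3. A crude custom evaluator: does the generated review mention any keyword
#    from the reference comment? Real evaluators would be more rigorous.
def mentions_expected_fix(run, example) -> dict:
    produced = (run.outputs or {}).get("comment", "").lower()
    expected = (example.outputs or {}).get("expected_comment", "").lower()
    keywords = [word for word in expected.split() if len(word) > 4]
    score = float(any(word in produced for word in keywords))
    return {"key": "mentions_expected_fix", "score": score}

# 4. Run the evaluation; results show up as an experiment in the LangSmith UI,
#    so prompt variants can be compared against the same benchmark.
evaluate(
    generate_review,
    data="duo-code-review-benchmark",
    evaluators=[mentions_expected_fix],
    experiment_prefix="code-review-prompt-v1",
)
```

Rerunning the same evaluation with a different `experiment_prefix` after each prompt change would give the comparison needed for step 3.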

Links
