Curate sample MRs/diffs to validate Duo Code Review prompt
Goal
Discussed in a sync; notes in https://docs.google.com/document/d/1VNNJypItfRUAoSZrCnCrdWyBvM-pzr73bmvv3Gdrkeg/edit (internal).
To iterate faster on the prompt, it would be useful to have a consistent set of MRs/file diffs that we can run the prompt against in the Anthropic Workbench. So far, everyone has tested on different MRs; consolidating this testing data into a shared set would make trends in improvements/regressions much clearer.
This will also help us build a shared vision of the desired end state before opening this up more widely as an Experiment.
Proposal
- Pick a few MRs that have already been reviewed (Ruby? Go?)
- Bundle the diffs in a CSV file that can be loaded in Anthropic Workbench and run against the prompt
- (nice to have) Ensure the CSV is also compatible with ELI5, as we'll likely want to use it there in the future as well
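A minimal sketch of the bundling step is below. The column names (`mr_url`, `file_path`, `diff`), output filename, and sample row are illustrative assumptions, not a confirmed Workbench or ELI5 schema; the only real constraint shown is one row per file diff so each row can map to a prompt run.

```python
import csv

# Hypothetical samples gathered manually from already-reviewed MRs.
# The URL, path, and diff below are placeholders, not real data.
samples = [
    (
        "https://gitlab.com/group/project/-/merge_requests/1",
        "app/models/user.rb",
        "@@ -1,3 +1,3 @@\n def name\n-  @name\n+  @name.to_s\n end\n",
    ),
]

def write_samples_csv(rows, path):
    """Write one row per file diff; newline='' keeps embedded \\n in diffs intact."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["mr_url", "file_path", "diff"])  # assumed header
        writer.writerows(rows)

write_samples_csv(samples, "code_review_samples.csv")
```

Keeping the diffs as plain CSV cells (rather than a bespoke format) is what would make the same file loadable in both the Anthropic Workbench and, later, ELI5.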