Run Duo Chat Evaluation (Prompt Library) on CI
Problem
We're currently using Prompt Library as an evaluation tool for GitLab Duo features. It works well for local runs and for production scores. However, one pain point is that the developer's local environment is blocked while an evaluation is running locally. To illustrate:
- A developer has an idea to improve a Duo Chat prompt.
- The developer creates a merge request for it.
- The developer runs an evaluation locally. It can take a long time (e.g. 20 minutes).
- While the evaluation runs, the developer can't change GDK, and hence can't work on another MR.
This problem becomes more prominent if the developer has multiple ideas, because they can't work on them in parallel.
Also, we should run the evaluation automatically by default whenever a new Duo Chat related MR is opened, so that developers and reviewers won't forget to run it to compare the before/after scores.
Proposal
- Spin up a review app in the MR. It should be possible to specify the AI Gateway version as well.
- Run Prompt Library, which sends requests to the review app as a black-box test.
- Report the score in the MR, comparing the MR's score with Production's score (example).
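The reporting step could look roughly like the sketch below, which pairs up the metrics from a production run and an MR run and renders them as a table suitable for an MR comment. This is only an illustration: the score format (a flat mapping of metric name to number), the metric names, and the `compare_scores` and `render_report` helpers are all hypothetical and are not taken from Prompt Library.

```python
from dataclasses import dataclass


@dataclass
class ScoreComparison:
    """One metric's score in production and in the merge request."""
    metric: str
    production: float
    merge_request: float

    @property
    def delta(self) -> float:
        # Positive means the MR scored higher than production.
        return self.merge_request - self.production


def compare_scores(
    production: dict[str, float], merge_request: dict[str, float]
) -> list[ScoreComparison]:
    """Pair up metrics present in both runs, sorted by metric name."""
    shared = sorted(production.keys() & merge_request.keys())
    return [ScoreComparison(m, production[m], merge_request[m]) for m in shared]


def render_report(comparisons: list[ScoreComparison]) -> str:
    """Render the comparison as a Markdown table for an MR comment."""
    lines = [
        "| Metric | Production | MR | Delta |",
        "| --- | --- | --- | --- |",
    ]
    for c in comparisons:
        lines.append(
            f"| {c.metric} | {c.production:.2f} | {c.merge_request:.2f} | {c.delta:+.2f} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical scores; real values would come from the evaluation output.
    production_scores = {"correctness": 3.5, "readability": 4.0}
    mr_scores = {"correctness": 3.75, "readability": 4.0}
    print(render_report(compare_scores(production_scores, mr_scores)))
```

Metrics that appear in only one of the two runs are skipped here; whether to surface them as missing is a design choice left open.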
Related
Edited by Shinya Maeda