Run Duo Chat Evaluation (Prompt Library) on CI

Problem

We currently use Prompt Library as the evaluation tool for GitLab Duo features. It works well for local runs and for production scores. However, one pain point is that a developer can't do anything else while an evaluation is running locally. To illustrate:

A developer has an idea for improving a Duo Chat prompt.
The developer creates a merge request for it.
The developer runs an evaluation locally. It can take a long time (e.g. 20 minutes).
While the evaluation runs, the developer can't change their GDK, and hence can't work on another MR.

This problem becomes more prominent if the developer has multiple ideas, because they can't work on them in parallel.

Also, we should run the evaluation automatically by default whenever a new Duo Chat related MR is opened, so that developers and reviewers don't forget to run it to compare the before/after scores.

Proposal

Spin up a review app in the MR. It should be possible to specify the AI Gateway version as well.
Run Prompt Library, which sends requests to the review app as a black-box test.
Report the score in the MR, comparing the MR's score with the production score (example).
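
The reporting step could be sketched roughly as follows. This is only an illustration of the comparison, not Prompt Library's actual output format or API: the function names, the metric names (`correctness`, `readability`), and the numbers are all hypothetical placeholders, and it assumes each evaluation run can be reduced to a mapping of metric name to score.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScoreComparison:
    """One metric scored on both production and the MR's review app."""

    metric: str
    production: float
    merge_request: float

    @property
    def delta(self) -> float:
        # Positive means the MR scored higher than production.
        return self.merge_request - self.production


def compare_scores(
    production: dict[str, float], merge_request: dict[str, float]
) -> list[ScoreComparison]:
    """Pair up the metrics present in both runs, sorted by metric name."""
    shared = sorted(production.keys() & merge_request.keys())
    return [ScoreComparison(m, production[m], merge_request[m]) for m in shared]


def render_report(comparisons: list[ScoreComparison]) -> str:
    """Render a Markdown table suitable for posting as an MR comment."""
    lines = [
        "| Metric | Production | MR | Delta |",
        "| --- | --- | --- | --- |",
    ]
    for c in comparisons:
        lines.append(
            f"| {c.metric} | {c.production:.2f} | {c.merge_request:.2f} | {c.delta:+.2f} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    production_scores = {"correctness": 3.5, "readability": 4.0}
    mr_scores = {"correctness": 3.75, "readability": 4.0}
    print(render_report(compare_scores(production_scores, mr_scores)))
```

A CI job could post the rendered table as a comment on the MR, so reviewers see the before/after scores without running the evaluation themselves.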

Edited Apr 04, 2024 by Shinya Maeda
