One-click Duo Chat evaluation in MRs

Problem

We currently run evaluations manually in the Prompt Library or ELI5 projects. This is not productive because developers have to set up and run an evaluation pipeline in their local environment every time they change a Duo feature. It's a cumbersome process.

Proposal

We let developers run an evaluation with one click via a manual job in merge request pipelines.

  • Introduce a new manual job duo-chat-evaluation in merge request pipelines. This job can be triggered in both the GitLab-Rails and AI Gateway merge request pipelines (see the sketch after this list).
  • After the evaluation is done, report the result in the MR.
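
A minimal sketch of what this manual job could look like, assuming the evaluation runs as a downstream pipeline in a separate project (the stage, project path, and variable name below are assumptions, not decisions):

  # Hypothetical .gitlab-ci.yml snippet for the merge request pipeline
  duo-chat-evaluation:
    stage: test
    rules:
      - if: $CI_PIPELINE_SOURCE == "merge_request_event"
        when: manual
        allow_failure: true
    variables:
      # SHA of the MR's branch, forwarded to the downstream evaluation pipeline
      TARGET_SHA: $CI_COMMIT_SHA
    trigger:
      project: gitlab-org/duo-chat-evaluation   # placeholder project path
      strategy: depend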

Workflow

  1. A developer creates an MR to improve/fix a prompt.
  2. The developer clicks the evaluation job on the MR pipeline, which runs the following steps:
    1. Generate base reference
      1. The CI pipeline spins up a GDK instance on the master branch.
      2. The evaluator gets the Duo Chat output from the GDK instance.
    2. Generate target reference
      1. The CI pipeline spins up a GDK instance on the MR's feature branch.
      2. The evaluator gets the Duo Chat output from the GDK instance.
    3. Run pair-wise evaluation and generate a report.
  3. The developer/reviewer/maintainer reviews the report.
  4. If they find a quality degradation in a specific test case, it's likely that their improvement/fix caused a regression somewhere else. The failed test case should be debuggable in the developer's local GDK. Alternatively, they can look up the corresponding logs of the remote GDK through the correlation ID of the Duo Chat requests.

Downstream pipeline

Trigger a downstream pipeline (a sketch of the pipeline definition follows the outline below):

  1. Base inference
    1. Provisioning
      1. Configuration
      2. Check out the default branch
    2. Collect inference
      1. Collect input dataset
      2. Execute Duo Chat request
      3. Store the result
  2. Target inference
    1. Same as above, but check out the feature branch.
  3. Evaluation
    1. Pair-wise evaluation
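
A hedged sketch of how this downstream pipeline could be laid out. The stage and job names mirror the outline above; the ./eval commands are placeholders for whatever CLI or scripts end up provisioning GDK, collecting inferences, and running the pair-wise evaluation:

  # Hypothetical downstream pipeline definition
  stages:
    - base-inference
    - target-inference
    - evaluation

  base-inference:
    stage: base-inference
    script:
      # Provision a GDK instance on the default branch and collect Duo Chat outputs
      - ./eval provision --gitlab-ref master
      - ./eval collect --output base-results.json
    artifacts:
      paths: [base-results.json]

  target-inference:
    stage: target-inference
    script:
      # Same as above, but on the MR's feature branch (TARGET_SHA passed from the trigger job)
      - ./eval provision --gitlab-ref "$TARGET_SHA"
      - ./eval collect --output target-results.json
    artifacts:
      paths: [target-results.json]

  evaluation:
    stage: evaluation
    script:
      # Pair-wise evaluation of base vs. target outputs, then report back to the MR
      - ./eval compare base-results.json target-results.json --report report.csv
    artifacts:
      paths: [report.csv]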

A few notes:

  • For collecting inference results, fixtures and input datasets can be managed in GitLab-Rails RSpec, similar to the gitlab-duo-chat-qa job (see the purely illustrative dataset entry after this list). Advantages of this approach:
    • Resource IDs (e.g. Issue ID) can be tightly coupled with input datasets.
    • Developers can run a specific test locally to debug a problematic output.
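
Purely illustrative and not an existing format: if a test case in such a dataset were written out declaratively, coupling a question to the fixture it targets might look like this (all field names are assumptions):

  # Hypothetical dataset entry; the resource reference points at a seeded fixture,
  # so the same case can be replayed against a local GDK for debugging
  - name: issue_summarization_basic
    resource:
      type: Issue
      fixture: seeded_issue_with_discussion   # ID resolved from the fixture at runtime
    question: "Summarize the discussion on this issue."
    expectations:
      - covers the main points of the discussion
      - does not invent participants or dates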

Technical details

First, we need a CLI to do the following things (a provisioning configuration sketch follows this list):

  • Deploy a containerized GDK instance on GCP Compute Engine or Cloud Run.
    • The GDK instance is customizable in the following ways:
      • Specify the version (SHA/Tag) of GitLab-Rails.
      • Specify the version (SHA/Tag) of AI Gateway.
      • Specify the fixtures of the GitLab instance e.g. Project, Repository, Issues, etc.
      • Additional configuration for GitLab-Rails (config settings, feature flag state, etc.).
      • Additional configuration for AI Gateway (config settings, feature flag state, etc.).
      • Enable tracing by default. Submit the trace to the "GitLab" organization, with a separate project per run or user.
  • Run an evaluation against the GDK instance.
    • It should be evaluator-agnostic so that it can be integrated with any evaluator (e.g. Prompt Library, ELI5, etc.).
    • The evaluation implementation, such as methods and datasets, is not a concern of this project; it's the evaluator's responsibility.
  • Collect the results from the evaluators and post a report as a comment on the target MR. Report artifacts could be a CSV file, a link to BigQuery or LangSmith, etc.
  • Clean up the instance.
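
A sketch of the provisioning configuration such a CLI might accept, reflecting the customization points above; every key and value below is an assumption for illustration only:

  # Hypothetical provisioning configuration for the evaluation CLI
  deploy:
    platform: cloud-run              # or: compute-engine
  gitlab_rails:
    ref: "<SHA or tag>"
    config: {}                       # extra GitLab-Rails configuration
    feature_flags:
      some_feature_flag: true        # placeholder flag name
  ai_gateway:
    ref: "<SHA or tag>"
    config: {}
  fixtures:
    - projects_with_issues_seed      # placeholder fixture set: projects, repositories, issues, etc.
  tracing:
    enabled: true                    # on by default
    organization: GitLab
    project: duo-chat-eval-<run-id>  # separate project per run or user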

Make sure these things work in a local environment first. Don't build them directly on CI pipelines, as that would be hard to debug.

Links
