Shinya Maeda changed title from "Run evaluation in merge request pipelines when prompt is changed" to "Run evaluation in merge request pipelines when Duo feature is changed"
Shinya Maeda changed the description
@tle_gitlab I definitely support a streamlined validation process -- but I do have some questions that have come up previously when considering this:
Would each feature change be run against only the available datasets for that feature? (So a change to Chat wouldn't trigger an eval against Code Completion, for example)? I can see this as a potential cost factor.
Would we run the pipeline against the entire dataset for each use case, or a representative subset of that dataset? The entire dataset in some cases would be, as you know, quite time-consuming and costly.
If using a subset, would we do random dynamic sampling? Using the same static subset over time could have a degrading impact on the feature.
How would we handle disparate metrics? For example, what if the LLM judge score for correctness goes up, but readability goes down? Or maybe latency goes down, but correctness goes down as well?
How would users explore their results? (We had started to map out some ideas on using the MR as a jump-off point for results here.)
Can users override a failed validation pipeline, for all of the potential reasons above (so not just in an emergency scenario)?
This thread raises a very interesting problem that would be great to address while working on this issue. We have implemented different evaluation approaches; however, we are still on our way to providing a solid CEF (easy to use, mature, flexible, and non-controversial when outputting eval results) to fully support developers with AI feature implementation.
We probably need to classify our evaluation into several subgroups. Please find the rough idea below:
Runtime evaluation/monitoring: This is what we call a daily evaluation run. The goal is to monitor that the feature works as expected in terms of ML logic.
Pre-release evaluation: Before merging into the main branch, we need to make sure that we don't break the current feature quality. This type of evaluation can be extensive and, at this point, is the same as our daily runs.
Dev-support evaluation as part of CI: We need to support our developers when updating/developing AI features and pushing changes to an MR. Developers shouldn't need to set up and run an evaluation pipeline in their local environment every time they change a Duo feature, and they can merge a change without running evaluations. This type of evaluation can range from extensive runs to smoke tests. The extensive part always needs to be covered by a manual CI job, as developers may or may not require running complex evaluations similar to pre-release ones. However, we always need to run smoke tests on every MR to evaluate the trajectory or reveal potential pitfalls. Every feature needs its own smoke tests depending on its logic.
Manual evaluation: We need to be able to repeat any evaluation approach mentioned above manually and locally for debug purposes or if we need advanced support.
I'd suggest focusing this issue on supporting Duo Chat MRs with smoke tests, such as evaluating the agent trajectory (i.e., used tools). This task seems to be very important given the attached context - gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#664 (closed). At this moment, I see running the ELI5/PL evaluation for every MR on the full dataset as suboptimal. We can make it a second step by providing a manual CI job.
So Tan's plan will look like the following:
Start with AI Gateway Duo Chat MRs.
Only trigger on Duo Chat code changes related to the agent.
Run the ELI5 Docker image with smoke tests to evaluate the agent trajectory, i.e., the tools the agent used vs. the expected (ground-truth) tools.
Report a summary on the MR and provide a deep-dive overview in LangSmith.
P.S. This idea works well with our plans in gitlab-com/content-sites/handbook!8216 (comment 2127562347). ELI5 evaluators need to be compatible with the PL as an engine. We don't need complex Beam dependencies or LLM judges to evaluate the Duo Chat agent trajectory - we can simply compare the tools the agent used with the expected ones.
By following this plan, we can support the Duo Chat team in updating their agent and parsers as early as %17.5.
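To make the trajectory smoke test more concrete, here is a minimal sketch of the used-vs-expected tool comparison, assuming a labelled dataset of questions with expected tool sequences; the sample structure and the `run_agent` callable are hypothetical placeholders rather than actual ELI5 interfaces:

```python
from dataclasses import dataclass


@dataclass
class TrajectorySample:
    """One labelled example: the question plus the tool sequence we expect the agent to take."""
    question: str
    expected_tools: list[str]


def compare_trajectory(used_tools: list[str], expected_tools: list[str]) -> dict:
    """Compare the tools the agent invoked against the expected sequence.

    Returns a strict exact-match flag plus a softer overlap ratio, so the MR
    report can show partial regressions instead of a bare pass/fail.
    """
    exact_match = used_tools == expected_tools
    overlap = len(set(used_tools) & set(expected_tools)) / max(len(set(expected_tools)), 1)
    return {"exact_match": exact_match, "tool_overlap": overlap}


def run_smoke_test(samples: list[TrajectorySample], run_agent) -> dict:
    """Run the agent on each sample and aggregate trajectory metrics for the MR summary.

    `run_agent` stands in for whatever invokes Duo Chat and returns the list of
    tools it used; the real integration would go through ELI5 / the AI Gateway.
    """
    results = [
        compare_trajectory(run_agent(sample.question), sample.expected_tools)
        for sample in samples
    ]
    return {
        "exact_match_rate": sum(r["exact_match"] for r in results) / len(results),
        "mean_tool_overlap": sum(r["tool_overlap"] for r in results) / len(results),
    }
```

The exact-match rate could gate the smoke-test job, while the softer overlap ratio goes into the MR summary for context.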
Would each feature change be run against only the available datasets for that feature? (So a change to Chat wouldn't trigger an eval against Code Completion, for example)? I can see this as a potential cost factor.
I think that is doable but might not make the first iteration due to the complexity of mapping code changes to a particular feature. The API layer can be quite obvious, but shared abstractions might be less so.
Would we run the pipeline against the entire dataset for each use case, or a representative subset of that dataset? The entire dataset in some cases would be, as you know, quite time-consuming and costly.
I would advocate for a representative subset of the dataset. The random dynamic sampling could derive from the daily run; we already made this available for all Duo Chat tasks (ref). The sampling method can be improved later, for example with Thompson sampling.
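As a rough illustration of the dynamic sampling idea, a stratified random draw from the daily-run dataset might look like the sketch below (the `use_case` field name is an assumption, and the Thompson-style weighting is left out because it needs per-example reward history not defined here):

```python
import random
from collections import defaultdict


def sample_subset(dataset: list[dict], per_use_case: int, seed: int | None = None) -> list[dict]:
    """Draw a fresh random subset for each MR pipeline, stratified by use case.

    Each dataset item is assumed to carry a "use_case" key (hypothetical field).
    Leaving `seed` unset gives dynamic sampling on every run; pinning it makes a
    failed pipeline reproducible while debugging.
    """
    rng = random.Random(seed)
    by_use_case: dict[str, list[dict]] = defaultdict(list)
    for example in dataset:
        by_use_case[example["use_case"]].append(example)

    subset: list[dict] = []
    for examples in by_use_case.values():
        subset.extend(rng.sample(examples, min(per_use_case, len(examples))))
    return subset
```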
How would we handle disparate metrics? For example, what if the LLM judge score for correctness goes up, but readability goes down? Or maybe latency goes down, but correctness goes down as well?
For simplicity, we can start with a weighted score that combines all metrics. For Duo Chat, we use the following formula.
It's not perfect, but it can signal the need for further investigation when the trend shifts.
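The actual Duo Chat formula isn't reproduced in this thread, so purely as an illustration of the weighted-score idea (the weights and metric names below are placeholders, not the real formula):

```python
def combined_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric scores, all assumed to be normalised to 0..1."""
    total_weight = sum(weights.values())
    return sum(metrics[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical weighting: correctness counts twice as much as the other metrics.
score = combined_score(
    {"correctness": 0.82, "readability": 0.74, "comprehensiveness": 0.69},
    {"correctness": 2.0, "readability": 1.0, "comprehensiveness": 1.0},
)
```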
How would users explore their results? (We had started to map out some ideas on using the MR as a jump-off point for results here.)
It would be superb to integrate with the GitLab UI. I guess we can lean on existing patterns. For the first iteration, I think we can adopt @achueshev's suggestion, e.g. report the immediate metrics in the MR and track historical trends via external tools, BQ <-> Looker or LangSmith Dataset <-> Evaluation.
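For the "report the immediate metrics in the MR" part, one low-effort option is to have the CI job post the summary as an MR note through the GitLab REST API; the endpoint and the predefined CI variables are standard, while `EVAL_BOT_TOKEN` and the message format are assumptions for this sketch:

```python
import os

import requests


def post_mr_summary(summary: dict[str, float]) -> None:
    """Post the evaluation summary as a note on the MR that triggered the pipeline.

    Relies on predefined CI variables (CI_API_V4_URL, CI_PROJECT_ID,
    CI_MERGE_REQUEST_IID) and a project access token passed in as
    EVAL_BOT_TOKEN (hypothetical variable name).
    """
    body = "### Duo Chat smoke-test results\n" + "\n".join(
        f"- **{name}**: {value:.2f}" for name, value in summary.items()
    )
    url = (
        f"{os.environ['CI_API_V4_URL']}/projects/{os.environ['CI_PROJECT_ID']}"
        f"/merge_requests/{os.environ['CI_MERGE_REQUEST_IID']}/notes"
    )
    response = requests.post(
        url,
        headers={"PRIVATE-TOKEN": os.environ["EVAL_BOT_TOKEN"]},
        data={"body": body},
    )
    response.raise_for_status()
```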
Can users override a failed validation pipeline, for all of the potential reasons above (so not just in an emergency scenario)?
Yep, certainly. I think that's what @shinya.maeda's proposal of introducing a new label to skip the job in an emergency would cover.
I'm afraid that collecting data on correctness, readability, and comprehensiveness can be time-consuming and less accurate on a subset.
To my understanding, the formula you mentioned is not used by Product. This could potentially misalign our developers and Product, as they might use different metrics to build the feature. (To my knowledge, our daily runs don't calculate this average value; please correct me if I'm wrong.)
Following https://gitlab.com/gitlab-com/content-sites/internal-handbook/-/merge_requests/5388/diffs, what do you think about starting with deterministic metrics that can provide good heuristics? For example, we could begin by comparing the next tool Duo Chat selects with the expected one (I can help with generating the dataset). This metric is faster to calculate than correctness and similar measures, and it can be a good starting point for supporting MRs with evaluation. @shinya.maeda, what do you think? Would it be helpful to understand how accurate Duo Chat is in selecting the next tool (AIGW logic), given your context gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#664 (closed)?