To evaluate Duo Workflow, we want extensive datasets so we can ensure quality for our end users. The Duo Workflow team and Verify will work together to curate this dataset.
Details
Our first workflow is "Fixing a pipeline", so we are looking to seed a GitLab instance with many repositories and failed pipelines that the workflow can hopefully fix. This needs a wide range of failing conditions.
@cheryl.li I'm not sure. It wouldn't hurt to at least give them a heads up that we may need some of their time, but we might be able to get all the necessary context without pulling them in completely.
No explicit ask of your team at the moment, but since we're moving away from a model where the Model Validation team collects data for specific AI features to one where the feature teams themselves own it (Verify, in this case), your team may be asked to support here depending on what we need for Duo Workflow. (The first MVP the AI Frameworks team is looking to build is "fix a broken pipeline" with Workflow.)
@oregand @pwietchner We have not prioritized this issue within Pipeline Execution yet, and I understand you're starting Dogfooding efforts with a number of teams (including ours) shortly. Just confirming that aligns with your expectation, and that your current efforts on Workflow are not dependent on this issue?
Adding @mfanGitLab as an Assignee as he is the one who will work on this. He may have already started as a continuation of his work with RCA and Duo Chat.
Hi @pwietchner @oregand, getting around to this so we can properly weight and refine this issue.
What's the current ask for the team? To get some datasets that'll help evaluate Duo Workflow? Would this be used for CEF daily runs, similar to what we do for RCA?
This issue also looks like a pre-requisite for #474216 (closed) ?
1. Collecting examples from merge requests that can be used for testing scenarios.
2. Focusing on the initial scenario "Can you fix merge request X on project Y?" but expanding to other scenarios for the alpha release.
3. Applying heuristics to identify appropriate samples, including:
   - Patches where the fix commit contains the same files as the breaking commit.
   - Avoiding examples with unrelated errors or `allow_failure: true` jobs.
4. Summarizing logs and retrieving key tags for each example in the final subset.
5. Planning to select useful examples from the collected dataset for evaluation using the recently merged pipeline in the ELI5 repo.
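The filtering heuristics in step 3 could be sketched roughly as below. This is a minimal, hypothetical illustration: the dict fields (`files`, `allow_failure`) and helper names are assumptions for the sketch, not the actual collection pipeline.

```python
def touches_same_files(breaking_commit, fix_commit):
    """Heuristic: keep pairs where the fix commit modifies at least one
    of the same files as the breaking commit."""
    return bool(set(breaking_commit["files"]) & set(fix_commit["files"]))


def has_allowed_failure(jobs):
    """Heuristic: drop examples where any failing job is marked
    allow_failure: true, since those pipelines aren't really 'broken'."""
    return any(job.get("allow_failure") for job in jobs)


def select_examples(candidates):
    """Filter candidate examples (breaking commit, fix commit, failed jobs)
    using the heuristics above."""
    selected = []
    for example in candidates:
        if not touches_same_files(example["breaking"], example["fix"]):
            continue  # fix likely unrelated to the breakage
        if has_allowed_failure(example["jobs"]):
            continue  # pipeline failure was tolerated, not a real break
        selected.append(example)
    return selected
```

The "unrelated errors" check would likely need log summarization rather than metadata alone, which is what step 4 covers.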
ah got it, thanks for the quick reply @oregand. I'll spend some dedicated time to find some examples.
But I guess the main difference from gitlab-org/duo-workflow/duo-workflow-service#74 (closed) would be that in this issue, there'll be more of an emphasis on finding pipeline failures Duo Workflow can fix instead of merge request errors? (Although they're the same most of the time.)
we shifted focus from the single scenario of fixing a failing pipeline to also cover other scenarios. So instead of testing/evaluating only the "fix pipeline" scenario, we prioritized testing/evaluating SWE-bench
Thank you for the follow-up here. Given your outline, if we don't currently need more examples, we can pause the extra work of finding more "fix pipeline" examples.