To evaluate Duo Workflow, we want extensive datasets so we can ensure quality for our end users. The Duo Workflow team and Verify will work together to curate this dataset.
Details
Our first workflow is "Fixing a pipeline", so we are looking to seed a GitLab instance with many repositories and failed pipelines that the workflow can hopefully fix. This needs a wide range of failing conditions.
@cheryl.li I'm not sure. It wouldn't hurt to at least give them a heads up that we may need some of their time, but we might be able to get all the necessary context without pulling them in completely.
No explicit ask of your team at the moment, but since we're moving away from a model where the Model Validation team collects data for specific AI features to one where the feature teams themselves own it (Verify, in this case), your team may be asked to support here depending on what we need for Duo Workflow. (The first MVP the AI Frameworks team is looking to build is "fix a broken pipeline" with Workflow.)
@oregand @pwietchner We have not prioritized this issue within Pipeline Execution yet, and I understand you're starting Dogfooding efforts with a number of teams (including ours) shortly. Just confirming that aligns with your expectation, and that your current efforts on Workflow are not dependent on this issue?
Adding @mfanGitLab as an Assignee as he is the one who will work on this. He may have already started as a continuation of his work with RCA and Duo Chat.
Hi @pwietchner @oregand, getting around to this so we can properly weight and refine this issue.
What's the current ask for the team? To get some datasets that'll help evaluate Duo Workflow? Would this be used for CEF daily runs, similar to what we do for RCA?
This issue also looks like a pre-requisite for #474216 (closed) ?
1. Collecting examples from merge requests that can be used for testing scenarios.
2. Focusing on the initial scenario "Can you fix merge request X on project Y?" but expanding to other scenarios for the alpha release.
3. Applying heuristics to identify appropriate samples, including:
   - Patches where the fix commit contains the same files as the breaking commit.
   - Avoiding examples with unrelated errors or `allow_failure: true` jobs.
4. Summarizing logs and retrieving key tags for each example in the final subset.
5. Planning to select useful examples from the collected dataset for evaluation using the recently merged pipeline in the ELI5 repo.
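The filtering heuristics in step 3 could be sketched roughly as below. This is a minimal, hypothetical illustration: the dict fields (`files`, `allow_failure`) and helper names are assumptions for the sketch, not the actual collection pipeline.

```python
def touches_same_files(breaking_commit, fix_commit):
    """Heuristic: keep pairs where the fix commit modifies at least one
    of the same files as the breaking commit."""
    return bool(set(breaking_commit["files"]) & set(fix_commit["files"]))


def has_allowed_failure(jobs):
    """Heuristic: drop examples where any failing job is marked
    allow_failure: true, since those pipelines aren't really 'broken'."""
    return any(job.get("allow_failure") for job in jobs)


def select_examples(candidates):
    """Filter candidate examples (breaking commit, fix commit, failed jobs)
    using the heuristics above."""
    selected = []
    for example in candidates:
        if not touches_same_files(example["breaking"], example["fix"]):
            continue  # fix likely unrelated to the breakage
        if has_allowed_failure(example["jobs"]):
            continue  # pipeline failure was tolerated, not a real break
        selected.append(example)
    return selected
```

The "unrelated errors" check would likely need log summarization rather than metadata alone, which is what step 4 covers.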
ah got it, thanks for the quick reply @oregand. I'll spend some dedicated time to find some examples.
But I guess the main difference from gitlab-org/duo-workflow/duo-workflow-service#74 (closed) would be that in this issue, there'll be more of an emphasis on finding pipeline failures Duo Workflow can fix instead of merge request errors? (Although they're the same most of the time.)
we shifted focus from the single scenario of fixing a failing pipeline to also cover other scenarios. So instead of testing/evaluating only the "fix pipeline" scenario, we prioritized testing/evaluating SWE-bench
Thank you for the follow-up here. Given your outline, if we don't currently need more examples, we can pause the extra work of finding more "fix pipeline" examples.