Ensure Duo Workflow does not encounter more errors in Sonnet 3.7 vs 3.5

Problem

Duo Workflow has more workflow failures with Claude Sonnet 3.7 than with Claude Sonnet 3.5.

Desired Outcome

Duo Workflow does not have more workflow failures with Claude Sonnet 3.7 than with Claude Sonnet 3.5.

Implementation Plan

1. Create a split of SWE-bench instances that generated no model_patch after the Claude 3.7 upgrade
2. Refine tool argument specifications to improve clarity for the LLM and ensure robust error handling
3. Create an MR with Claude 3.7 integration and run SWE-bench evaluation using the test cases identified in step 1
4. Run at least 100 SWE-bench instances to calculate a SWE-bench score
5. Ensure there are:
   - No workflow failures
   - No tool calls with incorrect arguments
   - Equivalent or improved score compared to Claude 3.5
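
The split from step 1 and the final checks could be sketched roughly as below. This is only an illustration, not the project's actual tooling: it assumes predictions are stored as JSON Lines records with `instance_id` and `model_patch` keys (the layout commonly used for SWE-bench predictions), and the function names and the way failure counts are obtained are hypothetical.

```python
import json
from pathlib import Path


def load_predictions(path):
    """Load predictions from a JSON Lines file (assumed: one JSON object per line)."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def empty_patch_instance_ids(predictions):
    """Return the sorted IDs of instances whose model_patch is missing or blank."""
    return sorted(
        prediction["instance_id"]
        for prediction in predictions
        if not (prediction.get("model_patch") or "").strip()
    )


def meets_exit_criteria(workflow_failures, bad_tool_calls, score_37, score_35):
    """Check the three conditions listed in the plan (hypothetical helper).

    workflow_failures: number of failed workflows in the Claude 3.7 run
    bad_tool_calls:    number of tool calls made with incorrect arguments
    score_37/score_35: SWE-bench scores for Claude 3.7 and Claude 3.5
    """
    return workflow_failures == 0 and bad_tool_calls == 0 and score_37 >= score_35
```

For example, `empty_patch_instance_ids(load_predictions("predictions_claude_3_7.jsonl"))` would yield the instance IDs to re-run in step 3; the file name here is a placeholder.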

Note: When this issue is completed, do not upgrade production to Claude Sonnet 3.7 yet. Instead, unblock https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-service/-/work_items/321

Edited Mar 24, 2025 by Halil Coban