Ensure Duo Workflow does not encounter more errors in Sonnet 3.7 vs 3.5
Problem
Duo Workflow has more workflow failures with Claude Sonnet 3.7 than with Claude Sonnet 3.5.
Desired Outcome
Duo Workflow does not have more workflow failures in Claude Sonnet 3.7 compared to 3.5.
Implementation Plan
- Create a split of SWE-bench instances that generated no model_patch after the Claude 3.7 upgrade
- Refine tool argument specifications to improve clarity for the LLM and ensure robust error handling
- Create an MR with Claude 3.7 integration and run SWE-bench evaluation using the test cases identified in step 1
- Run at least 100 SWE-bench instances to calculate a SWE-bench score
- Ensure the following:
  - No workflow failures
  - No tool calls with incorrect arguments
  - Equivalent or improved score compared to Claude 3.5
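
As a rough sketch of the first and last steps of the plan, the split could be derived from a predictions file and the exit criteria checked with a small helper. This assumes predictions are stored as JSON Lines with `instance_id` and `model_patch` fields; the function names, the file layout, and the way failure counts are collected are hypothetical and not taken from the Duo Workflow codebase.

```python
import json
from typing import Iterable


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-blank line (assumed predictions layout)."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def empty_patch_ids(predictions: Iterable[dict]) -> list[str]:
    """Return sorted instance IDs whose model_patch is missing, None, or blank."""
    return sorted(
        p["instance_id"]
        for p in predictions
        if not (p.get("model_patch") or "").strip()
    )


def meets_exit_criteria(
    instances_run: int,
    workflow_failures: int,
    bad_tool_calls: int,
    score_37: float,
    score_35: float,
) -> bool:
    """Check the plan's criteria: at least 100 instances, no workflow failures,
    no tool calls with incorrect arguments, and a score no worse than Claude 3.5."""
    return (
        instances_run >= 100
        and workflow_failures == 0
        and bad_tool_calls == 0
        and score_37 >= score_35
    )


if __name__ == "__main__":
    # Hypothetical in-memory predictions standing in for a real run.
    sample = [
        {"instance_id": "repo__proj-1", "model_patch": "diff --git a/x b/x"},
        {"instance_id": "repo__proj-2", "model_patch": ""},
        {"instance_id": "repo__proj-3", "model_patch": None},
    ]
    print(empty_patch_ids(sample))
```

The IDs returned by `empty_patch_ids` would form the split used for the Claude 3.7 evaluation run, and `meets_exit_criteria` only restates the checklist above as a single boolean.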
Note: When this issue is completed, do not upgrade production to Claude Sonnet 3.7 yet. Instead, unblock https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-service/-/work_items/321
Edited by Halil Coban