Ensure Duo Workflow does not encounter more errors in Sonnet 3.7 vs 3.5

Problem

Duo Workflow has more workflow failures with Claude Sonnet 3.7 than with Claude Sonnet 3.5.

Desired Outcome

Duo Workflow does not have more workflow failures in Claude Sonnet 3.7 compared to 3.5.

Implementation Plan

  1. Create a split of SWE-bench instances that generated no model_patch after the Claude 3.7 upgrade
  2. Refine tool argument specifications to improve clarity for the LLM and ensure robust error handling
  3. Create an MR with the Claude 3.7 integration and run the SWE-bench evaluation on the split created in step 1
  4. Run at least 100 SWE-bench instances to calculate a SWE-bench score
  5. Verify that the evaluation run shows:
     • No workflow failures
     • No tool calls with incorrect arguments
     • A score equivalent to or better than the Claude 3.5 score
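
Steps 1 and 5 can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: it assumes predictions are stored as JSONL records with `instance_id` and `model_patch` fields (the usual SWE-bench prediction layout), and the function names, the `predictions.jsonl` path, and the counters passed to `meets_acceptance` are hypothetical.

```python
import json
from typing import Iterable


def load_predictions(path: str) -> list[dict]:
    """Read one JSON object per line (JSONL), skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def instances_without_patch(predictions: Iterable[dict]) -> list[str]:
    """Return sorted instance IDs whose model_patch is missing, null, or blank."""
    return sorted(
        p["instance_id"]
        for p in predictions
        if not (p.get("model_patch") or "").strip()
    )


def meets_acceptance(
    workflow_failures: int,
    bad_tool_calls: int,
    score_new: float,
    score_baseline: float,
) -> bool:
    """Check the three criteria from step 5 (counters are assumed inputs)."""
    return (
        workflow_failures == 0
        and bad_tool_calls == 0
        and score_new >= score_baseline
    )


if __name__ == "__main__":
    # Hypothetical file name; replace with the real predictions output.
    split = instances_without_patch(load_predictions("predictions.jsonl"))
    print(f"{len(split)} instances produced no model_patch")
```

The resulting list of instance IDs would serve as the split for step 3, and `meets_acceptance` would be applied to the counts and scores collected from the run in step 4.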

Note: When this issue is completed, do not upgrade production to Claude Sonnet 3.7 yet. Instead, unblock https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-service/-/work_items/321

Edited by Halil Coban