Add evaluators to monitor tool trajectory
Problem
Our existing metrics capture overall Duo Workflow quality but don't provide enough data about the intermediate steps. Evaluating those intermediate steps would help us better understand the Duo Workflow trajectory. For example, Duo Workflow can generate the right solution but spoil it in the final step. We could also estimate how often Duo Workflow gets stuck, goes in circles, or moves in the wrong direction.
Implementation
Consider implementing an evaluator that assesses the tool trajectory by comparing tool arguments with a pre-defined context. The comparison functions are written as pure Python functions. The pre-defined context is the expected patch plus additional information coming from the LangSmith golden dataset.
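A minimal sketch of what such an evaluator could look like, assuming the expected context is stored in the example's outputs and that the workflow run exposes its tool calls in its outputs; the key names (`expected_files`, `tool_calls`, `file_path`) and the scoring logic are hypothetical and would need to match the actual trace structure:

```python
from langsmith.schemas import Example, Run


def tool_trajectory_evaluator(run: Run, example: Example) -> dict:
    """Compare the tool calls recorded in a run against the expected context."""
    # Expected context from the golden dataset (keys are assumptions).
    expected = example.outputs or {}
    expected_files = set(expected.get("expected_files", []))

    # Tool calls the workflow actually made; the exact structure of
    # run.outputs depends on how the workflow traces its steps.
    tool_calls = (run.outputs or {}).get("tool_calls", [])

    # Pure-Python comparison: did the tool arguments touch the expected files?
    touched_files = {
        call.get("args", {}).get("file_path")
        for call in tool_calls
        if call.get("args", {}).get("file_path")
    }
    overlap = touched_files & expected_files
    score = len(overlap) / len(expected_files) if expected_files else 0.0

    return {"key": "tool_trajectory", "score": score}
```

Such a function could then be passed in the list of evaluators when running the evaluation against the LangSmith golden dataset, alongside the existing end-to-end metrics.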