Extract tools from checkpoint for flows that don't mock requests

What does this merge request do and why?

This MR improves how tools are collected after a flow runs. Previously, tools were always read from the log file. That works for SWEbench, where all requests are mocked, but gives incorrect results for flows like Duo Chat. With this MR, flows that don't mock requests (most flows other than SWEbench) have their tools extracted from the checkpoint instead.
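
The selection logic described above can be sketched roughly as follows. This is an illustration only, not the code in this MR: the function names, the `mocks_requests` flag, the `TOOL_CALL <name>` log format, and the checkpoint layout (a dict with a `messages` list whose entries may carry `tool_calls`) are all assumptions made for the example.

```python
from typing import Any


def tools_from_log(log_lines: list[str]) -> list[str]:
    """Parse tool names from a hypothetical log format: 'TOOL_CALL <name>'."""
    prefix = "TOOL_CALL "
    return [
        line[len(prefix):].strip()
        for line in log_lines
        if line.startswith(prefix)
    ]


def tools_from_checkpoint(checkpoint: dict[str, Any]) -> list[str]:
    """Collect tool names, in call order, from a hypothetical checkpoint dict."""
    names: list[str] = []
    for message in checkpoint.get("messages", []):
        # Messages without tool calls (plain text replies) are skipped.
        for call in message.get("tool_calls", []):
            names.append(call["name"])
    return names


def collect_tools(
    mocks_requests: bool,
    log_lines: list[str],
    checkpoint: dict[str, Any],
) -> list[str]:
    """Pick the tool source based on whether the flow mocks its requests."""
    if mocks_requests:
        # Mocked flows (SWEbench in this MR's description): read the log.
        return tools_from_log(log_lines)
    # Flows with real requests (e.g. Duo Chat): read the checkpoint.
    return tools_from_checkpoint(checkpoint)
```

In this sketch the choice is made once per flow, so both sources can be passed in and only the relevant one is parsed.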

How to set up and validate locally

poetry run cef agent-platform evaluate .gitlab/agent_platform_templates/duo_chat.yaml --existing-experiment=5f0b6ec8-9b69-4e79-8bbc-bb3349d4e65c

Ref: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/main/doc/eval_scenarios/agent_platform/agentic_duo_chat.md?ref_type=heads

Ref: https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/cdfc8f44-b166-47fa-96bd-87dcd61f242a/compare?selectedSessions=5f0b6ec8-9b69-4e79-8bbc-bb3349d4e65c&baseline=undefined

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Alexander Chueshev
