[PoC] "Unit/integration" test agents/components/prompts/tools with the langsmith/pytest integration
Problem
End-to-end evaluations using SWE-bench (and other datasets) provide a lot of value in judging Duo Workflow performance. However, as end-to-end tests they don't provide the rapid, specific feedback that makes unit tests such an important part of any software test suite.
Proposal
Use a test framework that allows "units" of duo-workflow-service code to be evaluated without requiring any other system components apart from the AI provider.
This would allow us to test changes to prompts, agents, tools, etc. that involve LLM calls much more quickly, resulting in far shorter feedback cycles and a greater ability to identify the cause of problems/regressions.
We'd also have the option of mocking or caching LLM calls, so tests like these could be true unit tests, while keeping the option of making real requests to a third party (which unit tests would normally not do).
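As a hedged sketch of what this could look like with the langsmith pytest plugin (the toy `get_plan` tool, the prompt, and the Anthropic model choice below are assumptions for illustration, not the real duo-workflow-service code):

```python
import pytest
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langsmith import testing as t


@tool
def get_plan() -> str:
    """Return the current task plan (stub standing in for the real tool)."""
    return "1. Reproduce the bug. 2. Fix the failing spec."


# The langsmith marker records this test as a run in a LangSmith test suite.
@pytest.mark.langsmith
def test_model_fetches_plan_before_working():
    model = ChatAnthropic(model="claude-3-5-sonnet-latest").bind_tools([get_plan])
    prompt = "Always call get_plan before doing anything else. Task: fix the failing spec."

    t.log_inputs({"prompt": prompt})
    response = model.invoke(prompt)  # real LLM call; can be cached (see below)
    t.log_outputs({"tool_calls": response.tool_calls})

    assert response.tool_calls and response.tool_calls[0]["name"] == "get_plan"
```

Setting `LANGSMITH_TEST_CACHE` to a directory should let the plugin cache the outbound LLM requests, so repeat runs behave like ordinary unit tests without hitting the provider.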
Clarifying levels of testing
Typically, testing involving 3rd parties would be part of an end-to-end test suite (if included at all). This issue proposes integrating AI providers at a lower level of testing because there are units of code that involve LLM calls that can be tested without the entire end-to-end system.
In that sense the proposal isn't really about unit tests. It's more like integration tests where units of Duo Workflow service (agents/components/prompts/tools) are integrated with an AI provider.
Desired Outcome
Create several example integration tests using the langsmith/pytest integration to demonstrate how this works and to provide a basis for deciding how widely we want to employ this testing capability.
Results
A few tests of Executor agent tool use
The example tests in that MR focus on Executor agent tool use; a sketch of the kind of assertion they make follows the list below. They test that the executor agent follows its prompt instructions to:
- get the plan first using the `get_plan` tool
- set the task status to completed after completing a task
- use the `handover_tool` after completing all tasks
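The assertions in those tests essentially check the order of tool calls in the Executor's message history. A self-contained sketch of that kind of check is below; the `AIMessage` list is hand-built here, and `run_command`/`set_task_status` are assumed tool names, whereas the real tests obtain the messages by running the agent against a live model:

```python
from langchain_core.messages import AIMessage


def ai_tool_call(name: str, **args) -> AIMessage:
    """Build an AI message containing a single tool call (stand-in for real agent output)."""
    return AIMessage(content="", tool_calls=[{"name": name, "args": args, "id": name}])


def tool_call_names(messages) -> list[str]:
    """Flatten the tool-call names out of a sequence of AI messages, in order."""
    return [call["name"] for msg in messages for call in msg.tool_calls]


def test_executor_tool_call_order():
    # Hand-built history; the real tests get this by running the Executor agent.
    messages = [
        ai_tool_call("get_plan"),
        ai_tool_call("run_command", command="ls"),
        ai_tool_call("set_task_status", status="Completed"),
        ai_tool_call("handover_tool"),
    ]
    names = tool_call_names(messages)

    assert names[0] == "get_plan"        # plan is fetched before any other tool
    assert "set_task_status" in names    # task status updated after finishing a task
    assert names[-1] == "handover_tool"  # handover happens after all tasks
```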
A few tests of `get_issue` tool selection and URL input use
- `tests/duo_workflow_service/tools/test_issue_integration_simple.py` - Uses a simple system prompt rather than the real one, to focus the test on the tool prompts. Just checks 2 URLs
  - LangSmith results
- `tests/duo_workflow_service/tools/test_issue_integration_description_variants.py` - Compares different description variants for the `get_issue` tool. This gives an example of how we could quickly compare a few different prompt variants to see which performs best. We could rerun the test several times to check for reliability. A hedged sketch of this kind of parametrised comparison follows below.
  - LangSmith results
  - The failures show that some forms of URL input still can't be handled properly
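A sketch of how such a comparison could be parametrised; the `DESCRIPTION_VARIANTS` strings and the stub tool body are assumptions, while the real test uses the actual `get_issue` schema from duo-workflow-service:

```python
import pytest
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import StructuredTool
from langsmith import testing as t

# Hypothetical description variants to compare; not the real tool descriptions.
DESCRIPTION_VARIANTS = {
    "baseline": "Fetch a GitLab issue by project path and issue IID.",
    "with_url_hint": (
        "Fetch a GitLab issue. Accepts either project_path + issue_iid, "
        "or a full issue URL such as https://gitlab.com/group/project/-/issues/123."
    ),
}


def _get_issue(project_path: str = "", issue_iid: int = 0, url: str = "") -> str:
    return "stubbed issue"  # the tool body is irrelevant; only tool selection matters


@pytest.mark.langsmith
@pytest.mark.parametrize("variant", DESCRIPTION_VARIANTS)
def test_get_issue_selected_for_url_input(variant):
    tool = StructuredTool.from_function(
        func=_get_issue, name="get_issue", description=DESCRIPTION_VARIANTS[variant]
    )
    model = ChatAnthropic(model="claude-3-5-sonnet-latest").bind_tools([tool])

    prompt = "Summarise https://gitlab.com/gitlab-org/gitlab/-/issues/1"
    t.log_inputs({"variant": variant, "prompt": prompt})
    response = model.invoke(prompt)
    t.log_outputs({"tool_calls": response.tool_calls})

    assert any(call["name"] == "get_issue" for call in response.tool_calls)
```

Each parametrised case shows up as its own run in LangSmith, so the variants can be compared side by side and re-run to gauge reliability.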
Not included
- CI changes to run the tests with appropriate environment variables (e.g., API keys (see gitlab-org/duo-workflow/duo-workflow-service#305 (moved)), and `LANGSMITH_TEST_SUITE`, `LANGCHAIN_PROJECT`, and `LANGSMITH_TEST_CACHE` to cache LLM calls)
- Dataset management. These tests are written with the test data in the tests themselves, but there could be cases where we want the tests to use existing or new datasets that we add to LangSmith. That is possible with the langsmith/pytest integration, just not demonstrated here (see the sketch below).
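For reference, a hedged sketch of the dataset-driven variant; the dataset name `get_issue-url-inputs`, its example schema, and the model/tool setup are all hypothetical:

```python
import pytest
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langsmith import Client, testing as t


@tool
def get_issue(url: str) -> str:
    """Fetch a GitLab issue from a URL (stub; only tool selection is tested)."""
    return "stubbed issue"


# Pulled at collection time; hypothetical dataset with an "url" input per example.
_examples = list(Client().list_examples(dataset_name="get_issue-url-inputs"))


@pytest.mark.langsmith
@pytest.mark.parametrize("example", _examples, ids=lambda e: str(e.id))
def test_get_issue_selected_for_dataset_urls(example):
    model = ChatAnthropic(model="claude-3-5-sonnet-latest").bind_tools([get_issue])

    t.log_inputs(example.inputs)
    response = model.invoke(f"Summarise {example.inputs['url']}")
    t.log_outputs({"tool_calls": response.tool_calls})

    assert any(call["name"] == "get_issue" for call in response.tool_calls)
```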