[PoC] "Unit/integration" test agents/components/prompts/tools with the langsmith/pytest integration
Problem
End-to-end evaluations using SWE-bench (and other datasets) provide a lot of value in judging Duo Workflow performance. However, as end-to-end tests they don't provide the rapid, specific feedback that makes unit tests such an important part of any software test suite.
Proposal
Use a test framework that allows "units" of duo-workflow-service code to be evaluated without requiring any other system components apart from the AI provider.
This would allow us to test changes to prompts, agents, tools, etc. that involve LLM calls much more quickly, resulting in far shorter feedback cycles and a greater ability to identify the cause of problems/regressions.
We'd also have the option of mocking or caching LLM calls, so tests like these could be true unit tests, while keeping the option of making real requests to a third party (which unit tests would normally not do).
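As a hedged sketch of what this could look like with the langsmith pytest plugin (the toy `get_plan` tool, the prompt, and the Anthropic model choice below are assumptions for illustration, not the real duo-workflow-service code):

```python
import pytest
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langsmith import testing as t


@tool
def get_plan() -> str:
    """Return the current task plan (stub standing in for the real tool)."""
    return "1. Reproduce the bug. 2. Fix the failing spec."


# The langsmith marker records this test as a run in a LangSmith test suite.
@pytest.mark.langsmith
def test_model_fetches_plan_before_working():
    model = ChatAnthropic(model="claude-3-5-sonnet-latest").bind_tools([get_plan])
    prompt = "Always call get_plan before doing anything else. Task: fix the failing spec."

    t.log_inputs({"prompt": prompt})
    response = model.invoke(prompt)  # real LLM call; can be cached (see below)
    t.log_outputs({"tool_calls": response.tool_calls})

    assert response.tool_calls and response.tool_calls[0]["name"] == "get_plan"
```

Setting `LANGSMITH_TEST_CACHE` to a directory should let the plugin cache the outbound LLM requests, so repeat runs behave like ordinary unit tests without hitting the provider.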
Clarifying levels of testing
Typically, testing involving 3rd parties would be part of an end-to-end test suite (if included at all). This issue proposes integrating AI providers at a lower level of testing because there are units of code that involve LLM calls that can be tested without the entire end-to-end system.
In that sense the proposal isn't really about unit tests. It's more like integration tests where units of Duo Workflow service (agents/components/prompts/tools) are integrated with an AI provider.
Desired Outcome
Create several example integration tests using the langsmith/pytest integration to demonstrate how this works and to provide a basis for deciding how widely we want to employ this testing capability.
Results
A few tests of Executor agent tool use
The example tests in that MR focus on Executor agent tool use; a sketch of the kind of assertion they make follows the list below. They test that the executor agent follows its prompt instructions to:
- get the plan first using the `get_plan` tool
- set the task status to completed after completing a task
- use the `handover_tool` after completing all tasks
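The assertions in those tests essentially check the order of tool calls in the Executor's message history. A self-contained sketch of that kind of check is below; the `AIMessage` list is hand-built here, and `run_command`/`set_task_status` are assumed tool names, whereas the real tests obtain the messages by running the agent against a live model:

```python
from langchain_core.messages import AIMessage


def ai_tool_call(name: str, **args) -> AIMessage:
    """Build an AI message containing a single tool call (stand-in for real agent output)."""
    return AIMessage(content="", tool_calls=[{"name": name, "args": args, "id": name}])


def tool_call_names(messages) -> list[str]:
    """Flatten the tool-call names out of a sequence of AI messages, in order."""
    return [call["name"] for msg in messages for call in msg.tool_calls]


def test_executor_tool_call_order():
    # Hand-built history; the real tests get this by running the Executor agent.
    messages = [
        ai_tool_call("get_plan"),
        ai_tool_call("run_command", command="ls"),
        ai_tool_call("set_task_status", status="Completed"),
        ai_tool_call("handover_tool"),
    ]
    names = tool_call_names(messages)

    assert names[0] == "get_plan"        # plan is fetched before any other tool
    assert "set_task_status" in names    # task status updated after finishing a task
    assert names[-1] == "handover_tool"  # handover happens after all tasks
```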
A few tests of `get_issue` tool selection and URL input use
- `tests/duo_workflow_service/tools/test_issue_integration_simple.py` - Uses a simple system prompt rather than the real one, to focus the test on the tool prompts. Just checks 2 URLs
  - LangSmith results
- `tests/duo_workflow_service/tools/test_issue_integration_description_variants.py` - Compares different description variants for the `get_issue` tool. This gives an example of how we could quickly compare a few different prompt variants to see which performs best. We could rerun the test several times to check for reliability. A hedged sketch of this kind of parametrised comparison follows below.
  - LangSmith results
  - The failures show that some forms of URL input still can't be handled properly
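A sketch of how such a comparison could be parametrised; the `DESCRIPTION_VARIANTS` strings and the stub tool body are assumptions, while the real test uses the actual `get_issue` schema from duo-workflow-service:

```python
import pytest
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import StructuredTool
from langsmith import testing as t

# Hypothetical description variants to compare; not the real tool descriptions.
DESCRIPTION_VARIANTS = {
    "baseline": "Fetch a GitLab issue by project path and issue IID.",
    "with_url_hint": (
        "Fetch a GitLab issue. Accepts either project_path + issue_iid, "
        "or a full issue URL such as https://gitlab.com/group/project/-/issues/123."
    ),
}


def _get_issue(project_path: str = "", issue_iid: int = 0, url: str = "") -> str:
    return "stubbed issue"  # the tool body is irrelevant; only tool selection matters


@pytest.mark.langsmith
@pytest.mark.parametrize("variant", DESCRIPTION_VARIANTS)
def test_get_issue_selected_for_url_input(variant):
    tool = StructuredTool.from_function(
        func=_get_issue, name="get_issue", description=DESCRIPTION_VARIANTS[variant]
    )
    model = ChatAnthropic(model="claude-3-5-sonnet-latest").bind_tools([tool])

    prompt = "Summarise https://gitlab.com/gitlab-org/gitlab/-/issues/1"
    t.log_inputs({"variant": variant, "prompt": prompt})
    response = model.invoke(prompt)
    t.log_outputs({"tool_calls": response.tool_calls})

    assert any(call["name"] == "get_issue" for call in response.tool_calls)
```

Each parametrised case shows up as its own run in LangSmith, so the variants can be compared side by side and re-run to gauge reliability.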
Not included
- CI changes to run the tests with appropriate environment variables (e.g., API keys (see gitlab-org/duo-workflow/duo-workflow-service#305 (moved)), and `LANGSMITH_TEST_SUITE`, `LANGCHAIN_PROJECT`, and `LANGSMITH_TEST_CACHE` to cache LLM calls)
- Dataset management. These tests are written with the test data in the tests themselves, but there could be cases where we want the tests to use existing or new datasets that we add to LangSmith. That is possible with the langsmith/pytest integration, just not demonstrated here (see the sketch below).
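For reference, a hedged sketch of the dataset-driven variant; the dataset name `get_issue-url-inputs`, its example schema, and the model/tool setup are all hypothetical:

```python
import pytest
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langsmith import Client, testing as t


@tool
def get_issue(url: str) -> str:
    """Fetch a GitLab issue from a URL (stub; only tool selection is tested)."""
    return "stubbed issue"


# Pulled at collection time; hypothetical dataset with an "url" input per example.
_examples = list(Client().list_examples(dataset_name="get_issue-url-inputs"))


@pytest.mark.langsmith
@pytest.mark.parametrize("example", _examples, ids=lambda e: str(e.id))
def test_get_issue_selected_for_dataset_urls(example):
    model = ChatAnthropic(model="claude-3-5-sonnet-latest").bind_tools([get_issue])

    t.log_inputs(example.inputs)
    response = model.invoke(f"Summarise {example.inputs['url']}")
    t.log_outputs({"tool_calls": response.tool_calls})

    assert any(call["name"] == "get_issue" for call in response.tool_calls)
```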