Implement test architecture for agentic system with LLM API mocking
Problem
Currently, the agentic system has no automated testing infrastructure. Testing agent behaviors requires real LLM API calls, which are:
- Expensive (costs money per call)
- Slow (network latency)
- Non-deterministic (LLM responses vary)
- Unavailable in CI/CD environments without API keys
This makes it impossible to validate agent behaviors, test handoffs, or ensure system reliability.
Research Findings
Testing Approaches for AutoGen
- Configuration-based Mocking: AutoGen config lists can point to a mock, OpenAI-compatible endpoint (see the sketch after this list)
- Custom Model Clients: AutoGen supports custom model clients, which can be replaced with mocks
- Tool-based Testing: since agents act through tools, tool responses can be mocked directly
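For the configuration-based approach, a minimal sketch of what that could look like, assuming the project configures agents through AutoGen's `config_list` format; the model name, URL, and key are placeholders, not project settings:

```python
# Point the agent's LLM config at a local stub that speaks the OpenAI wire format
# instead of the real API. Everything below is illustrative.
config_list = [
    {
        "model": "gpt-4o",
        "api_key": "test-key-unused",
        "base_url": "http://localhost:8000/v1",  # OpenAI-compatible mock server
    }
]

llm_config = {"config_list": config_list, "temperature": 0}
```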
Best Practices for Mocking LLM APIs
1. pytest-monkeypatch Approach
```python
import pytest
from openai.resources.chat.completions import AsyncCompletions

@pytest.fixture
def mock_openai_chatcompletion(monkeypatch, mock_response):
    # Replace the async OpenAI chat completion call with a canned response.
    async def mock_acreate(*args, **kwargs):
        return mock_response
    monkeypatch.setattr(AsyncCompletions, "create", mock_acreate)
```
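Any test that requests `mock_openai_chatcompletion` then exercises the real agent code path without touching the network; `mock_response` is assumed here to be another fixture that builds a canned `ChatCompletion` object.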
2. openai-responses Plugin
```python
import openai_responses

@openai_responses.mock()
def test_agent_behavior():
    # The decorator intercepts calls to the OpenAI API for the duration of this test.
    ...
```
3. VCR Pattern (Record/Replay)
- Record actual API responses during development
- Replay recorded responses during tests
- Useful for realistic test scenarios
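A possible setup with `vcrpy` (the cassette directory matches the layout proposed below; the cassette name and test are illustrative):

```python
import vcr

# Record real responses on the first run, replay them afterwards,
# and keep the API key out of the stored cassette.
llm_vcr = vcr.VCR(
    cassette_library_dir="tests/fixtures/vcr_cassettes",
    filter_headers=["authorization"],
    record_mode="once",
)

@llm_vcr.use_cassette("todo_agent_basic.yaml")
def test_todo_agent_against_recorded_responses():
    ...
```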
Proposed Test Architecture
Core Components
- Mock LLM Client: Custom ChatCompletionClient that returns predetermined responses
- Agent Test Fixtures: Reusable fixtures for each agent type
- Scenario-based Tests: Test specific agent behaviors and handoffs
- Integration Tests: Test multi-agent workflows
Test Structure
```
tests/
├── conftest.py                  # Shared fixtures and mocks
├── mocks/
│   ├── llm_client.py            # Mock ChatCompletionClient
│   ├── gitlab_client.py         # Mock GitLab API responses
│   └── responses/               # Predefined LLM responses
├── unit/
│   ├── test_todo_agent.py
│   ├── test_issue_discussion_agent.py
│   ├── test_issue_management_agent.py
│   ├── test_mr_discussion_agent.py
│   ├── test_mr_management_agent.py
│   └── test_repository_agent.py
├── integration/
│   ├── test_handoffs.py         # Test agent handoff scenarios
│   ├── test_workflows.py        # Test complete workflows
│   └── test_error_handling.py
└── fixtures/
    └── vcr_cassettes/           # Recorded API responses
```
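A sketch of how `conftest.py` could wire these pieces together; the JSON layout under `mocks/responses/`, the import paths, and the `MockGitLabClient` class are assumptions:

```python
# tests/conftest.py
import json
from pathlib import Path

import pytest

from tests.mocks.llm_client import MockChatCompletionClient
from tests.mocks.gitlab_client import MockGitLabClient

RESPONSES_DIR = Path(__file__).parent / "mocks" / "responses"

@pytest.fixture
def canned_responses():
    # One JSON file per scenario, mapping input messages to the reply the mock should give.
    return {path.stem: json.loads(path.read_text()) for path in RESPONSES_DIR.glob("*.json")}

@pytest.fixture
def mock_llm_client(canned_responses):
    # Falls back to an empty mapping if no default scenario file exists.
    return MockChatCompletionClient(canned_responses.get("default", {}))

@pytest.fixture
def mock_gitlab_client():
    return MockGitLabClient()
```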
Implementation Task List
Phase 1: Research and Planning
- Study AutoGen's test suite for patterns
- Document all agent behaviors to test
- Create test scenarios for each agent
- Define integration test workflows
- Choose a mocking strategy (monkeypatch vs. plugin)
Phase 2: Basic Infrastructure
- Set up the pytest framework
- Create MockChatCompletionClient
- Implement a response builder for different message types
- Create a mock GitLab client for agent tools (see the sketch after this list)
- Set up the test configuration system
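One possible shape for the mock GitLab client; the method names and signatures are placeholders that should mirror whatever client the agent tools actually call:

```python
# tests/mocks/gitlab_client.py
class MockGitLabClient:
    """In-memory stand-in for the GitLab API used by agent tools."""

    def __init__(self):
        self.issues = {}
        self.calls = []  # every call is recorded so tests can assert on it

    def get_issue(self, project_id, issue_iid):
        self.calls.append(("get_issue", project_id, issue_iid))
        return self.issues.get((project_id, issue_iid), {"iid": issue_iid, "state": "opened"})

    def update_issue(self, project_id, issue_iid, **fields):
        self.calls.append(("update_issue", project_id, issue_iid, fields))
        issue = self.issues.setdefault((project_id, issue_iid), {"iid": issue_iid})
        issue.update(fields)
        return issue
```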
Phase 3: Agent-Specific Tests
- Todo Agent Tests (see the example test after this list)
  - Test mark_todo_as_done tool usage
  - Test handoff decisions
- Issue Discussion Agent Tests
  - Test handoff triggers for management actions
  - Test discussion analysis
  - Test response generation
- Issue Management Agent Tests
  - Test create_issue scenarios
  - Test update_issue_content scenarios
  - Test error handling
- MR Discussion Agent Tests
  - Test handoff triggers for MR management
  - Test code review analysis
  - Test discussion summarization
- MR Management Agent Tests
  - Test create_merge_request scenarios
  - Test update_merge_request scenarios
  - Test workflow operations
- Repository Management Agent Tests
  - Test file operations
  - Test search capabilities
  - Test analysis functions
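To make the intent of these items concrete, here is a sketch of the first Todo Agent test, using the fixtures from the `conftest.py` sketch above and the `assert_tool_called` helper sketched under "Tool Call Verification" below. The `TodoAgent` import path, constructor arguments, and `process` method are assumptions about the codebase, not its actual API:

```python
import pytest

from agents.todo_agent import TodoAgent  # placeholder import path

@pytest.mark.asyncio
async def test_todo_agent_marks_todo_done(mock_llm_client, mock_gitlab_client):
    # mock_llm_client is primed so that this prompt yields a mark_todo_as_done tool call.
    agent = TodoAgent(model_client=mock_llm_client, gitlab=mock_gitlab_client)
    result = await agent.process("Resolve todo 42: the linked issue is already fixed")
    assert_tool_called(result, "mark_todo_as_done")
```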
Phase 4: Integration Tests
- Handoff Tests
  - Test the "update description" handoff flow
  - Test the "close issue" handoff flow
  - Test the "add label" handoff flow
  - Test failed handoff scenarios
- Workflow Tests
  - Test the complete todo resolution workflow
  - Test issue creation from a discussion
  - Test the MR review workflow
- Error Handling Tests
  - Test API failures
  - Test invalid requests
  - Test permission errors
Phase 5: Documentation and CI/CD
- Document the test architecture
- Create a test writing guide
- Add example tests for new features
- Set up the CI/CD test pipeline
- Create coverage reports
Mock Response Strategy
1. Deterministic Responses
```python
from typing import Dict

class MockChatCompletionClient:
    def __init__(self, responses: Dict[str, str]):
        self.responses = responses

    async def create(self, messages, tools=None, **kwargs) -> str:
        # Return a predetermined response keyed on the last message's content.
        key = messages[-1]["content"] if messages else ""
        return self.responses.get(key, "")
```
2. Scenario-based Responses
```python
@pytest.fixture
def issue_update_scenario():
    return {
        "system": "You are an issue discussion agent...",
        "user": "update the description",
        "expected_tool_call": "transfer_to_issue_management_agent",
    }
```
3. Tool Call Verification
```python
@pytest.mark.asyncio
async def test_handoff_triggered(mock_client, issue_update_scenario):
    agent = IssueDiscussionAgent(mock_client)
    result = await agent.process(issue_update_scenario["user"])
    assert_tool_called(result, issue_update_scenario["expected_tool_call"])
```
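`assert_tool_called` does not exist yet; one possible implementation, assuming the agent result exposes the tool calls it issued (the `tool_calls` attribute is a guess about that result object's shape):

```python
def assert_tool_called(result, tool_name: str) -> None:
    # `tool_calls` is an assumption about the agent result object.
    called = [call.name for call in getattr(result, "tool_calls", [])]
    assert tool_name in called, f"expected tool call {tool_name!r}, got {called!r}"
```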
Success Criteria
- All agents have comprehensive unit tests
- Handoff scenarios are fully tested
- Integration tests cover main workflows
- Tests run without LLM API calls
- Test execution time < 1 minute
- Code coverage > 80%
- CI/CD pipeline runs tests automatically
Benefits
- Reliability: Catch regressions before deployment
- Speed: Fast feedback during development
- Cost: No API costs for testing
- Documentation: Tests serve as behavior documentation
- Confidence: Ensure changes don't break existing functionality
Related Issues
- #105 (closed) - Issue discussion agent handoff bug (needs testing)
- #104 (closed) - Handoff message improvements (needs testing)
- #103 (closed) - Missing update_issue_content tool (needs testing)