
Implement test architecture for agentic system with LLM API mocking

Problem

Currently, the agentic system has no automated testing infrastructure. Testing agent behaviors requires actual LLM API calls, which are:

  • Expensive (costs money per call)
  • Slow (network latency)
  • Non-deterministic (LLM responses vary)
  • Unavailable in CI/CD environments without API keys

This makes it impossible to validate agent behaviors, test handoffs, or ensure system reliability.

Research Findings

Testing Approaches for Autogen

  1. Configuration-based Mocking: Autogen uses config lists that can point to mock endpoints (see the sketch after this list)
  2. Custom Model Clients: Autogen supports custom model clients that can be mocked
  3. Tool-based Testing: Since agents use tools, we can mock tool responses
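
For approach 1, a minimal sketch of what such a configuration could look like, assuming an OpenAI-compatible mock server is running locally (all values below are illustrative):

# Point autogen at a local OpenAI-compatible mock endpoint instead of the real API
config_list = [
    {
        "model": "gpt-4",
        "api_key": "test-key",                    # dummy key; the mock server ignores it
        "base_url": "http://localhost:8000/v1",   # OpenAI-compatible mock endpoint
    }
]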

Best Practices for Mocking LLM APIs

1. pytest-monkeypatch Approach

import pytest
from openai.resources.chat.completions import AsyncCompletions  # openai >= 1.x

@pytest.fixture
def mock_openai_chatcompletion(monkeypatch, mock_response):
    # mock_response is assumed to be a fixture providing a canned ChatCompletion object
    async def mock_acreate(*args, **kwargs):
        return mock_response

    monkeypatch.setattr(AsyncCompletions, "create", mock_acreate)

2. openai-responses Plugin

import openai_responses

@openai_responses.mock()
def test_agent_behavior():
    ...  # Test code here; the decorator intercepts the OpenAI client's HTTP calls

3. VCR Pattern (Record/Replay)

  • Record actual API responses during development
  • Replay recorded responses during tests
  • Useful for realistic test scenarios
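
A minimal sketch of the record/replay pattern using the vcrpy library (the cassette path and test name below are illustrative):

import vcr

@vcr.use_cassette(
    "tests/fixtures/vcr_cassettes/issue_discussion.yaml",
    filter_headers=["authorization"],  # keep API keys out of the recorded cassette
)
def test_agent_with_recorded_responses():
    # First run records the real HTTP exchange; later runs replay it offline
    ...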

Proposed Test Architecture

Core Components

  1. Mock LLM Client: Custom ChatCompletionClient that returns predetermined responses
  2. Agent Test Fixtures: Reusable fixtures for each agent type
  3. Scenario-based Tests: Test specific agent behaviors and handoffs
  4. Integration Tests: Test multi-agent workflows

Test Structure

tests/
├── conftest.py              # Shared fixtures and mocks
├── mocks/
│   ├── llm_client.py       # Mock ChatCompletionClient
│   ├── gitlab_client.py    # Mock GitLab API responses
│   └── responses/          # Predefined LLM responses
├── unit/
│   ├── test_todo_agent.py
│   ├── test_issue_discussion_agent.py
│   ├── test_issue_management_agent.py
│   ├── test_mr_discussion_agent.py
│   ├── test_mr_management_agent.py
│   └── test_repository_agent.py
├── integration/
│   ├── test_handoffs.py    # Test agent handoff scenarios
│   ├── test_workflows.py   # Test complete workflows
│   └── test_error_handling.py
└── fixtures/
    └── vcr_cassettes/      # Recorded API responses
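
To illustrate how conftest.py could wire these pieces together, here is a sketch of two shared fixtures (the module paths, fixture names, and the MockGitLabClient class are assumptions based on the layout above, not existing code):

import pytest

from tests.mocks.llm_client import MockChatCompletionClient
from tests.mocks.gitlab_client import MockGitLabClient

@pytest.fixture
def mock_client():
    # Canned replies keyed by the latest user message; real tests would
    # load richer, structured responses from tests/mocks/responses/
    return MockChatCompletionClient(responses={
        "update the description": "Handing off to the issue management agent.",
    })

@pytest.fixture
def mock_gitlab_client():
    return MockGitLabClient()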

Implementation Task List

Phase 1: Research and Planning

  • Study autogen's test suite for patterns
  • Document all agent behaviors to test
  • Create test scenarios for each agent
  • Define integration test workflows
  • Choose mocking strategy (monkeypatch vs plugin)

Phase 2: Basic Infrastructure

  • Set up pytest framework
  • Create MockChatCompletionClient
  • Implement response builder for different message types
  • Create mock GitLab client for agent tools (see the sketch after this list)
  • Set up test configuration system
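
A minimal sketch of the mock GitLab client (the method shown and its signature are assumptions, since the real tool interface is not documented here; the key idea is recording calls so tests can assert on them):

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MockGitLabClient:
    # Every call is recorded so tests can assert which operations an agent performed
    calls: List[Tuple[str, int]] = field(default_factory=list)
    issues: Dict[int, dict] = field(default_factory=dict)

    def update_issue(self, issue_iid: int, description: str) -> dict:
        self.calls.append(("update_issue", issue_iid))
        issue = self.issues.setdefault(issue_iid, {"iid": issue_iid})
        issue["description"] = description
        return issue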

Phase 3: Agent-Specific Tests

  • Todo Agent Tests

    • Test mark_todo_as_done tool usage
    • Test handoff decisions
  • Issue Discussion Agent Tests

    • Test handoff triggers for management actions
    • Test discussion analysis
    • Test response generation
  • Issue Management Agent Tests

    • Test create_issue scenarios
    • Test update_issue_content scenarios
    • Test error handling
  • MR Discussion Agent Tests

    • Test handoff triggers for MR management
    • Test code review analysis
    • Test discussion summarization
  • MR Management Agent Tests

    • Test create_merge_request scenarios
    • Test update_merge_request scenarios
    • Test workflow operations
  • Repository Management Agent Tests

    • Test file operations
    • Test search capabilities
    • Test analysis functions

Phase 4: Integration Tests

  • Handoff Tests

    • Test "update description" handoff flow
    • Test "close issue" handoff flow
    • Test "add label" handoff flow
    • Test failed handoff scenarios
  • Workflow Tests

    • Test complete todo resolution workflow
    • Test issue creation from discussion
    • Test MR review workflow
  • Error Handling Tests

    • Test API failures (see the sketch after this list)
    • Test invalid requests
    • Test permission errors
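
For example, an API-failure test could force the mock GitLab client to raise and check how the agent reacts. The import path, agent constructor, process method, and the assumption that errors propagate are all illustrative, not confirmed behavior:

import pytest

from agents.issue_management import IssueManagementAgent  # illustrative import path

@pytest.mark.asyncio
async def test_update_surfaces_gitlab_failure(mock_client, mock_gitlab_client, monkeypatch):
    def fail_update(*args, **kwargs):
        raise RuntimeError("GitLab API returned 500")

    # Make every issue update fail, regardless of arguments
    monkeypatch.setattr(mock_gitlab_client, "update_issue", fail_update)

    agent = IssueManagementAgent(mock_client, gitlab=mock_gitlab_client)  # hypothetical signature
    with pytest.raises(RuntimeError):
        await agent.process("update the description of issue 42")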

Phase 5: Documentation and CI/CD

  • Document test architecture
  • Create test writing guide
  • Add example tests for new features
  • Set up CI/CD test pipeline
  • Create coverage reports

Mock Response Strategy

1. Deterministic Responses

from typing import Dict

class MockChatCompletionClient:
    def __init__(self, responses: Dict[str, str]):
        self.responses = responses

    async def create(self, messages, tools=None, **kwargs):
        # Return a predetermined response keyed on the latest message's content
        key = getattr(messages[-1], "content", messages[-1]) if messages else ""
        return self.responses.get(key, "")

2. Scenario-based Responses

@pytest.fixture
def issue_update_scenario():
    return {
        "system": "You are an issue discussion agent...",
        "user": "update the description",
        "expected_tool_call": "transfer_to_issue_management_agent"
    }

3. Tool Call Verification

import pytest

@pytest.mark.asyncio
async def test_handoff_triggered(mock_client, issue_update_scenario):
    agent = IssueDiscussionAgent(mock_client)
    result = await agent.process(issue_update_scenario["user"])
    assert_tool_called(result, "transfer_to_issue_management_agent")
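
assert_tool_called is not defined anywhere above; one possible sketch, assuming the agent result exposes the tool calls it made as objects with a name attribute:

def assert_tool_called(result, tool_name: str) -> None:
    # Collect the names of all tool calls recorded on the result (assumed attribute)
    called = [call.name for call in getattr(result, "tool_calls", [])]
    assert tool_name in called, f"expected tool call {tool_name!r}, got {called!r}"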

Success Criteria

  • All agents have comprehensive unit tests
  • Handoff scenarios are fully tested
  • Integration tests cover main workflows
  • Tests run without LLM API calls
  • Test execution time < 1 minute
  • Code coverage > 80%
  • CI/CD pipeline runs tests automatically

Benefits

  1. Reliability: Catch regressions before deployment
  2. Speed: Fast feedback during development
  3. Cost: No API costs for testing
  4. Documentation: Tests serve as behavior documentation
  5. Confidence: Ensure changes don't break existing functionality

Related Issues

  • #105 (closed) - Issue discussion agent handoff bug (needs testing)
  • #104 (closed) - Handoff message improvements (needs testing)
  • #103 (closed) - Missing update_issue_content tool (needs testing)