
Implement test architecture for agentic system with LLM API mocking

Problem

Currently, the agentic system has no automated testing infrastructure. Testing agent behaviors requires actual LLM API calls, which are:

  • Expensive (costs money per call)
  • Slow (network latency)
  • Non-deterministic (LLM responses vary)
  • Unavailable in CI/CD environments without API keys

This makes it impossible to validate agent behaviors, test handoffs, or ensure system reliability.

Research Findings

Testing Approaches for Autogen

  1. Configuration-based Mocking: Autogen uses config lists that can point to mock endpoints (see the sketch after this list)
  2. Custom Model Clients: Autogen supports custom model clients that can be mocked
  3. Tool-based Testing: Since agents use tools, we can mock tool responses
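
For approach 1, a minimal sketch of what such a configuration could look like, assuming an OpenAI-compatible mock server is running locally (all values below are illustrative):

# Point autogen at a local OpenAI-compatible mock endpoint instead of the real API
config_list = [
    {
        "model": "gpt-4",
        "api_key": "test-key",                    # dummy key; the mock server ignores it
        "base_url": "http://localhost:8000/v1",   # OpenAI-compatible mock endpoint
    }
]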

Best Practices for Mocking LLM APIs

1. pytest-monkeypatch Approach

import pytest
from openai.resources.chat.completions import AsyncCompletions  # openai >= 1.x

@pytest.fixture
def mock_openai_chatcompletion(monkeypatch, mock_response):
    # mock_response is assumed to be a fixture providing a canned ChatCompletion object
    async def mock_acreate(*args, **kwargs):
        return mock_response

    monkeypatch.setattr(AsyncCompletions, "create", mock_acreate)

2. openai-responses Plugin

import openai_responses

@openai_responses.mock()
def test_agent_behavior():
    ...  # Test code here; the decorator intercepts the OpenAI client's HTTP calls

3. VCR Pattern (Record/Replay)

  • Record actual API responses during development
  • Replay recorded responses during tests
  • Useful for realistic test scenarios
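
A minimal sketch of the record/replay pattern using the vcrpy library (the cassette path and test name below are illustrative):

import vcr

@vcr.use_cassette(
    "tests/fixtures/vcr_cassettes/issue_discussion.yaml",
    filter_headers=["authorization"],  # keep API keys out of the recorded cassette
)
def test_agent_with_recorded_responses():
    # First run records the real HTTP exchange; later runs replay it offline
    ...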

Proposed Test Architecture

Core Components

  1. Mock LLM Client: Custom ChatCompletionClient that returns predetermined responses
  2. Agent Test Fixtures: Reusable fixtures for each agent type
  3. Scenario-based Tests: Test specific agent behaviors and handoffs
  4. Integration Tests: Test multi-agent workflows

Test Structure

tests/
├── conftest.py              # Shared fixtures and mocks
├── mocks/
│   ├── llm_client.py       # Mock ChatCompletionClient
│   ├── gitlab_client.py    # Mock GitLab API responses
│   └── responses/          # Predefined LLM responses
├── unit/
│   ├── test_todo_agent.py
│   ├── test_issue_discussion_agent.py
│   ├── test_issue_management_agent.py
│   ├── test_mr_discussion_agent.py
│   ├── test_mr_management_agent.py
│   └── test_repository_agent.py
├── integration/
│   ├── test_handoffs.py    # Test agent handoff scenarios
│   ├── test_workflows.py   # Test complete workflows
│   └── test_error_handling.py
└── fixtures/
    └── vcr_cassettes/      # Recorded API responses
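
To illustrate how conftest.py could wire these pieces together, here is a sketch of two shared fixtures (the module paths, fixture names, and the MockGitLabClient class are assumptions based on the layout above, not existing code):

import pytest

from tests.mocks.llm_client import MockChatCompletionClient
from tests.mocks.gitlab_client import MockGitLabClient

@pytest.fixture
def mock_client():
    # Canned replies keyed by the latest user message; real tests would
    # load richer, structured responses from tests/mocks/responses/
    return MockChatCompletionClient(responses={
        "update the description": "Handing off to the issue management agent.",
    })

@pytest.fixture
def mock_gitlab_client():
    return MockGitLabClient()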

Implementation Task List

Phase 1: Research and Planning

  • Study autogen's test suite for patterns
  • Document all agent behaviors to test
  • Create test scenarios for each agent
  • Define integration test workflows
  • Choose mocking strategy (monkeypatch vs plugin)

Phase 2: Basic Infrastructure

  • Set up pytest framework
  • Create MockChatCompletionClient
  • Implement response builder for different message types
  • Create mock GitLab client for agent tools (see the sketch after this list)
  • Set up test configuration system
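
A minimal sketch of the mock GitLab client (the method shown and its signature are assumptions, since the real tool interface is not documented here; the key idea is recording calls so tests can assert on them):

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MockGitLabClient:
    # Every call is recorded so tests can assert which operations an agent performed
    calls: List[Tuple[str, int]] = field(default_factory=list)
    issues: Dict[int, dict] = field(default_factory=dict)

    def update_issue(self, issue_iid: int, description: str) -> dict:
        self.calls.append(("update_issue", issue_iid))
        issue = self.issues.setdefault(issue_iid, {"iid": issue_iid})
        issue["description"] = description
        return issue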

Phase 3: Agent-Specific Tests

  • Todo Agent Tests

    • Test mark_todo_as_done tool usage
    • Test handoff decisions
  • Issue Discussion Agent Tests

    • Test handoff triggers for management actions
    • Test discussion analysis
    • Test response generation
  • Issue Management Agent Tests

    • Test create_issue scenarios
    • Test update_issue_content scenarios
    • Test error handling
  • MR Discussion Agent Tests

    • Test handoff triggers for MR management
    • Test code review analysis
    • Test discussion summarization
  • MR Management Agent Tests

    • Test create_merge_request scenarios
    • Test update_merge_request scenarios
    • Test workflow operations
  • Repository Management Agent Tests

    • Test file operations
    • Test search capabilities
    • Test analysis functions

Phase 4: Integration Tests

  • Handoff Tests

    • Test "update description" handoff flow
    • Test "close issue" handoff flow
    • Test "add label" handoff flow
    • Test failed handoff scenarios
  • Workflow Tests

    • Test complete todo resolution workflow
    • Test issue creation from discussion
    • Test MR review workflow
  • Error Handling Tests

    • Test API failures (see the sketch after this list)
    • Test invalid requests
    • Test permission errors
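
For example, an API-failure test could force the mock GitLab client to raise and check how the agent reacts. The import path, agent constructor, process method, and the assumption that errors propagate are all illustrative, not confirmed behavior:

import pytest

from agents.issue_management import IssueManagementAgent  # illustrative import path

@pytest.mark.asyncio
async def test_update_surfaces_gitlab_failure(mock_client, mock_gitlab_client, monkeypatch):
    def fail_update(*args, **kwargs):
        raise RuntimeError("GitLab API returned 500")

    # Make every issue update fail, regardless of arguments
    monkeypatch.setattr(mock_gitlab_client, "update_issue", fail_update)

    agent = IssueManagementAgent(mock_client, gitlab=mock_gitlab_client)  # hypothetical signature
    with pytest.raises(RuntimeError):
        await agent.process("update the description of issue 42")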

Phase 5: Documentation and CI/CD

  • Document test architecture
  • Create test writing guide
  • Add example tests for new features
  • Set up CI/CD test pipeline
  • Create coverage reports

Mock Response Strategy

1. Deterministic Responses

from typing import Dict

class MockChatCompletionClient:
    def __init__(self, responses: Dict[str, str]):
        self.responses = responses

    async def create(self, messages, tools=None, **kwargs):
        # Return a predetermined response keyed on the latest message's content
        key = getattr(messages[-1], "content", messages[-1]) if messages else ""
        return self.responses.get(key, "")

2. Scenario-based Responses

@pytest.fixture
def issue_update_scenario():
    return {
        "system": "You are an issue discussion agent...",
        "user": "update the description",
        "expected_tool_call": "transfer_to_issue_management_agent"
    }

3. Tool Call Verification

import pytest

@pytest.mark.asyncio
async def test_handoff_triggered(mock_client, issue_update_scenario):
    agent = IssueDiscussionAgent(mock_client)
    result = await agent.process(issue_update_scenario["user"])
    assert_tool_called(result, "transfer_to_issue_management_agent")
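
assert_tool_called is not defined anywhere above; one possible sketch, assuming the agent result exposes the tool calls it made as objects with a name attribute:

def assert_tool_called(result, tool_name: str) -> None:
    # Collect the names of all tool calls recorded on the result (assumed attribute)
    called = [call.name for call in getattr(result, "tool_calls", [])]
    assert tool_name in called, f"expected tool call {tool_name!r}, got {called!r}"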

Success Criteria

  • All agents have comprehensive unit tests
  • Handoff scenarios are fully tested
  • Integration tests cover main workflows
  • Tests run without LLM API calls
  • Test execution time < 1 minute
  • Code coverage > 80%
  • CI/CD pipeline runs tests automatically

Benefits

  1. Reliability: Catch regressions before deployment
  2. Speed: Fast feedback during development
  3. Cost: No API costs for testing
  4. Documentation: Tests serve as behavior documentation
  5. Confidence: Ensure changes don't break existing functionality

Related Issues

  • #105 (closed) - Issue discussion agent handoff bug (needs testing)
  • #104 (closed) - Handoff message improvements (needs testing)
  • #103 (closed) - Missing update_issue_content tool (needs testing)