feat(evals): local evaluation framework for gkg part 3 - code agent

What does this MR do and why?

The primary goal of this MR is to establish a robust mechanism for running an external agent (opencode) against SWE-Bench fixtures, capturing its generated code patches and detailed session telemetry. This provides a baseline for comparative analysis against the GKG system. The core of this update is the introduction of a new, self-contained Python module (packages/gkg-evals/pipeline/src/opencode/) that orchestrates the execution of the previously mentioned opencode-ai agent.

Architecture

The new opencode module encapsulates all logic required to manage the lifecycle of an external agent process for each evaluation fixture.

  • Process Management (packages/gkg-evals/pipeline/src/opencode/opencode.py): The central Opencode class manages the execution of the opencode-ai agent. For each fixture, it spawns a sandboxed agent instance using subprocess.Popen. This class is responsible for managing timeouts, streaming stdout/stderr, and capturing the session ID from the agent's log output.
  • Configuration and State Management: The orchestrator standardizes agent configuration by dynamically generating an opencode.json file from the pipeline's primary TOML configuration. After each run, it captures the resulting code changes by executing git diff within the temporary worktree and subsequently rolls back all changes to ensure a clean state for subsequent runs.
  • Dependency Pinning: The setup_opencode_executable method ensures a reproducible environment by using npx to fetch and run a pinned version of the agent (OPENCODE_VERSION in constants.py), isolating the pipeline from upstream changes in the agent.

Architecture Details

1. Structured Data Modeling for Agent Telemetry (packages/gkg-evals/pipeline/src/opencode/models.py)

To enable detailed analysis of the agent's behavior, a structured representation of its session logs was required.

  • A new module, models.py, defines a set of Pydantic models that serve as a Python representation of the agent's internal TypeScript message and message part format (e.g., ToolPart, ReasoningPart, AssistantMessage).
  • The session_messages method in the Opencode class uses these models to parse the raw JSON log files produced by the agent into a strongly-typed SessionData object graph. This structured data is critical for granular analysis in later stages of the evaluation pipeline.
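To illustrate the shape of this model layer, here is a minimal sketch using stdlib dataclasses for brevity; the actual `models.py` uses Pydantic, and the field names below are assumptions rather than the agent's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPart:
    """A message part recording a tool invocation (illustrative fields)."""
    tool: str
    input: dict = field(default_factory=dict)
    output: str = ""


@dataclass
class ReasoningPart:
    """A message part carrying the agent's reasoning text."""
    text: str = ""


@dataclass
class AssistantMessage:
    """An assistant turn composed of typed parts."""
    role: str = "assistant"
    parts: list = field(default_factory=list)  # ToolPart / ReasoningPart


def parse_part(raw: dict):
    """Dispatch a raw JSON log entry to the matching typed part."""
    if raw.get("type") == "tool":
        return ToolPart(tool=raw["tool"], input=raw.get("input", {}),
                        output=raw.get("output", ""))
    return ReasoningPart(text=raw.get("text", ""))
```

Parsing the raw logs into typed objects up front means downstream analysis code can rely on attribute access and type checks instead of defensive dictionary probing.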

2. Pipeline Integration and Parallelization (packages/gkg-evals/pipeline/src/steps/agent.py)

The previously stubbed run_agent function is now implemented to integrate the new orchestration module into the pipeline's workflow.

  • The function now initializes the Opencode orchestrator and loads the set of SWE-Bench fixtures.
  • To improve throughput, it processes fixtures in parallel batches using a ThreadPoolExecutor, with batch size being configurable.
  • For each completed fixture, the resulting OpencodeRunSessionData object—containing the generated patch, fixture metadata, and parsed session messages—is serialized to session_data.jsonl, while the raw patches are collected in patches.jsonl for consumption by the evaluation step.
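The batched parallelism works along these lines; `run_batches` and `process_fixture` are illustrative names, not the actual `agent.py` API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batches(fixtures: list, process_fixture, batch_size: int = 4) -> list:
    """Process fixtures in parallel batches of a configurable size,
    preserving input order in the results."""
    results = []
    for i in range(0, len(fixtures), batch_size):
        batch = fixtures[i:i + batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            # map() yields results in submission order, so the output
            # lines up with the fixture list for later serialization.
            results.extend(pool.map(process_fixture, batch))
    return results
```

Each result would then be appended as one JSON line to `session_data.jsonl` (and its patch to `patches.jsonl`).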

3. Centralized Constants (packages/gkg-evals/pipeline/src/opencode/constants.py)

A dedicated constants.py file was added to centralize agent-specific configuration. This includes the maximum runtime (OPENCODE_MAX_TIME), definitions of which agent tools are considered mutating (OPENCODE_MUTATING_TOOLS), and the pinned agent version, ensuring that key parameters for reproducibility are explicitly defined.
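A sketch of the constants file's shape, with placeholder values (the concrete version pin, timeout, and tool set live in the real `constants.py` and are not reproduced here):

```python
# Illustrative shape of opencode/constants.py; all values are placeholders.
OPENCODE_VERSION = "x.y.z"   # pinned opencode-ai version fetched via npx
OPENCODE_MAX_TIME = 900      # maximum agent runtime per fixture, in seconds
OPENCODE_MUTATING_TOOLS = frozenset({"edit", "write", "bash"})  # assumed set
```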

Related Issues

#171 (closed), #224 (closed)

Testing

As this code sits outside the main code paths and is part of a series of MRs, I don't think it's appropriate yet to introduce testing beyond the simple validation the framework already performs.

Performance Analysis

  • This merge request does not introduce any performance regression.
Edited by Michael Usachenko
