feat(evals): local evaluation framework for gkg part 3 - code agent
What does this MR do and why?
The primary goal of this MR is to establish a robust mechanism for running an external agent (opencode) against SWE-Bench fixtures, capturing its generated code patches and detailed session telemetry. This provides a baseline for comparative analysis against the GKG system. The core of this update is a new, self-contained Python module (`packages/gkg-evals/pipeline/src/opencode/`) that orchestrates the execution of the `opencode-ai` agent.
Architecture
The new opencode module encapsulates all logic required to manage the lifecycle of an external agent process for each evaluation fixture.
- Process Management (`packages/gkg-evals/pipeline/src/opencode/opencode.py`): The central `Opencode` class manages the execution of the `opencode-ai` agent. For each fixture, it spawns a sandboxed agent instance using `subprocess.Popen`. This class is responsible for managing timeouts, streaming stdout/stderr, and capturing the session ID from the agent's log output.
- Configuration and State Management: The orchestrator standardizes agent configuration by dynamically generating an `opencode.json` file from the pipeline's primary TOML configuration. After each run, it captures the resulting code changes by executing `git diff` within the temporary worktree and then rolls back all changes to ensure a clean state for subsequent runs.
- Dependency Pinning: The `setup_opencode_executable` method ensures a reproducible environment by using `npx` to fetch and run a pinned version of the agent (`OPENCODE_VERSION` in `constants.py`), isolating the pipeline from upstream changes in the agent.
Architecture Details
1. Structured Data Modeling for Agent Telemetry (`packages/gkg-evals/pipeline/src/opencode/models.py`)
To enable detailed analysis of the agent's behavior, a structured representation of its session logs was required.
- A new module, `models.py`, defines a set of Pydantic models that serve as a Python representation of the agent's internal TypeScript message and message-part format (e.g., `ToolPart`, `ReasoningPart`, `AssistantMessage`).
- The `session_messages` method in the `Opencode` class uses these models to parse the raw JSON log files produced by the agent into a strongly typed `SessionData` object graph. This structured data is critical for granular analysis in later stages of the evaluation pipeline.
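A minimal sketch of what such models might look like, assuming Pydantic v2. The field names here are illustrative stand-ins for the agent's actual message schema, which the real `models.py` mirrors in full:

```python
from typing import Literal, Union

from pydantic import BaseModel, Field


class ToolPart(BaseModel):
    # The literal "type" field lets Pydantic pick the right variant
    # when validating a union of part types.
    type: Literal["tool"] = "tool"
    tool: str  # e.g. the name of the tool the agent invoked


class ReasoningPart(BaseModel):
    type: Literal["reasoning"] = "reasoning"
    text: str


Part = Union[ToolPart, ReasoningPart]


class AssistantMessage(BaseModel):
    role: Literal["assistant"] = "assistant"
    parts: list[Part] = Field(default_factory=list)


# Parse a raw JSON-shaped dict (as read from the agent's log files)
# into a typed object graph.
msg = AssistantMessage.model_validate(
    {
        "role": "assistant",
        "parts": [
            {"type": "reasoning", "text": "inspect the failing test"},
            {"type": "tool", "tool": "bash"},
        ],
    }
)
```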
2. Pipeline Integration and Parallelization (`packages/gkg-evals/pipeline/src/steps/agent.py`)
The existing `run_agent` function now integrates the new orchestration module into the pipeline's workflow.
- The function now initializes the `Opencode` orchestrator and loads the set of SWE-Bench fixtures.
- To improve throughput, it processes fixtures in parallel batches using a `ThreadPoolExecutor`, with a configurable batch size.
- For each completed fixture, the resulting `OpencodeRunSessionData` object (containing the generated patch, fixture metadata, and parsed session messages) is serialized to `session_data.jsonl`, while the raw patches are collected in `patches.jsonl` for consumption by the evaluation step.
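The batched fan-out and JSONL serialization above can be sketched like this. `run_fixture` is a hypothetical stand-in for the real per-fixture agent run, which returns an `OpencodeRunSessionData` object rather than a plain dict:

```python
import json
from concurrent.futures import ThreadPoolExecutor


def run_fixture(fixture_id: str) -> dict:
    """Stand-in for running the agent on one fixture (illustrative only)."""
    return {"instance_id": fixture_id, "patch": f"diff for {fixture_id}"}


def run_batches(fixtures: list[str], batch_size: int = 4) -> list[dict]:
    """Process fixtures in parallel batches of a configurable size."""
    results: list[dict] = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(fixtures), batch_size):
            batch = fixtures[start:start + batch_size]
            # pool.map preserves input order within each batch.
            results.extend(pool.map(run_fixture, batch))
    return results


def write_jsonl(path: str, rows: list[dict]) -> None:
    """Append-friendly one-object-per-line serialization (patches.jsonl,
    session_data.jsonl)."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Batching (rather than submitting everything at once) bounds the number of concurrently running agent processes, which matters here because each fixture spawns its own subprocess and worktree.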
3. Centralized Constants (`packages/gkg-evals/pipeline/src/opencode/constants.py`)
A dedicated `constants.py` file was added to centralize agent-specific configuration. This includes the maximum runtime (`OPENCODE_MAX_TIME`), definitions of which agent tools are considered mutating (`OPENCODE_MUTATING_TOOLS`), and the pinned agent version, ensuring that the key parameters for reproducibility are explicitly defined.
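For illustration, such a constants module might look like the following. The values and tool names below are assumptions, not the actual pins from this MR:

```python
# Illustrative only; the real pins live in
# packages/gkg-evals/pipeline/src/opencode/constants.py.

# Pinned agent version fetched via npx (assumed value).
OPENCODE_VERSION = "0.1.0"

# Maximum wall-clock time per fixture run, in seconds (assumed value).
OPENCODE_MAX_TIME = 15 * 60

# Agent tools treated as mutating the worktree (assumed tool names).
OPENCODE_MUTATING_TOOLS = frozenset({"write", "edit", "bash"})
```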
Related Issues
Testing
As this code sits outside the main code paths, and this is one MR in a series, I don't think it's suitable yet to introduce testing beyond the simple validation the framework already includes.
Performance Analysis
This merge request does not introduce any performance regression.