Preliminary Eval Results for GKG against SWE-Bench Lite
Michael Usachenko, Jean-Gabriel Doyon, Bohdan Parkhomchuk, Jonathan Shobrook, Michael Angelo Rivera
The GitLab Knowledge Graph (GKG) team has decided to evaluate our framework with respect to improving agentic outcomes on various programming tasks. Specifically, we are interested in evaluating the initial MCP tools GKG exposes against the industry-standard SWE-Bench Lite benchmark.
SWE-Bench Lite is a collection of real GitHub issues from popular repositories, along with their solutions. These issues include bugs, feature requests, and other tasks. The coding agent receives:
- The problem statement (the original GitHub issue text)
- A clean copy of the repository from right before the issue was fixed
This tests whether the agent can solve real-world problems that developers have already solved.
We are excited to share some of our early findings with you, so let’s jump right in.
Evaluation Methodology - Overview
To increase iteration speed and maximize configurability with minimal rework, we opted to use the CLI variant of the Opencode coding agent, pinned at v0.7.9, to replicate agentic developer environments like GitLab Duo Agentic Chat and Claude Code. We opted against more expressive frameworks like LangGraph and DSPy, as the primary purpose of this exercise was the preliminary evaluation of GKG, not building a bespoke agent or attempting to match SOTA results. The extent to which Opencode can be configured is explained in more detail here. We chose to scaffold our agent on top of the “Build” agent provided by Opencode, with custom prompts for the `agent_description` and `agent_prompt`, designed in accordance with Anthropic’s best practices and with some inspiration from the context-engineering discipline.
Our model of choice was Claude Sonnet 4, specifically `anthropic/claude-sonnet-4-20250514`, chosen for its reputation as a strong general-purpose coding model. Notably, Anthropic does not let you set a "random seed" for each inference call, unlike OpenAI. Therefore, our results are not guaranteed to be exactly reproducible, although they are highly likely to be. For this reason, we ran each test multiple times (n=11) to get a preliminary estimate of the results and their variance. The design of these tests is explained in the following section.
We also constrained the `max_tokens` parameter in Opencode to 8192, retained Opencode’s ability to automatically compact the context window, and disabled the `task` tool for all agents. We chose not to override Opencode’s default system prompt for Anthropic models, as time did not permit forking the framework and modifying its source.
As for the problem set, we opted to use the `dev` split of SWE-Bench Lite, which comprises 23 different GitHub issues from 6 different repositories, chosen for its cost-effectiveness and friendliness to rapid iteration. As general availability approaches for GKG, we will continue testing the framework against both larger codebases and larger problem sets like the `test` split of SWE-Bench Lite.
Evaluation Methodology - Agent Details
Not all SWE-Bench results are equal, despite the problem sets being identical. We intentionally decided to heavily constrain what our agents could do, so as to highlight the differences between the “default” tools used by coding agents and a coding agent powered by GKG, while being careful to remove as many sources of noise or workarounds as possible.
This constrained environment includes no shell access, no LSP access, no access to embeddings-based code retrieval, no subagent capabilities, and a 6-minute time limit. In other words, both the reference agent and our GKG agent are operating with limited capabilities compared to the top-performing agents on the SWE-Bench leaderboards. Our agents are required to reason about and edit codebases in purely static terms, i.e., with no runtime context or other niceties usually provided by an IDE.
Please note that we have 3 configuration files, representing 3 different types of agents, each with its own evaluation pipeline: one baseline, one GKG-only, and one with access to both baseline and GKG tools. Please refer to the config files for fine-grained details on how our prompts differ.
More specifically, our baseline agent has access to the following Opencode-provided tools: `edit`, `read`, `grep`, `glob`, `todowrite`, and `todoread`. Most need no explanation, but for clarity, `todowrite` and `todoread` function as agent memory, practically identical to Cursor’s approach. Our GKG agent has access to the following tools: `edit`, `read`, and `knowledge_graph_*`.
`knowledge_graph_*` represents the full list of MCP tools GKG offers out of the box. You can read more about the tools used here, but one notable shoutout is the repomap tool, which is directly inspired by Aider’s repomap. The only MCP tool that is disabled by default is the `index_project` tool, as Sonnet 4 tended to call it too often. You can find the exact set of tools we tested against in the 0.16.0 release commit.
We left the `edit` and `read` tools as is, since their implementations were out of scope for this study. Note that the `read` tool was required for the GKG agent because Opencode requires it for the `edit` tool.
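To make the three configurations concrete, here is a minimal sketch of the tool allow-lists as plain Python data. This is not the actual Opencode config schema; the dictionary layout and session names are illustrative, and only the tool names come from the description above.

```python
# Illustrative only: which tools each of the three agents may call.
# This mirrors the config files described above, not Opencode's real schema.

BASELINE_TOOLS = ["edit", "read", "grep", "glob", "todowrite", "todoread"]

# "knowledge_graph_*" stands for the full GKG MCP tool set referenced in this
# post (repo_map, search_codebase_definitions, read_definitions, import_usage,
# get_references), with index_project disabled because Sonnet 4 over-called it.
GKG_MCP_TOOLS = ["knowledge_graph_*"]

AGENT_TOOLSETS = {
    "baseline": BASELINE_TOOLS,
    "gkg_only": ["edit", "read", *GKG_MCP_TOOLS],
    "baseline_plus_gkg": [*BASELINE_TOOLS, *GKG_MCP_TOOLS],
}
```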
Evaluation Methodology - Eval Pipeline Design
Our evaluation pipeline is designed for modularity and flexibility, allowing us to declaratively define, evaluate, and analyze agent performance across different config files. The entire process is orchestrated through a central script, from setting up a development environment to final analysis. Each pipeline run operates within a dedicated subdirectory so that results from different configurations are isolated.
Download and Preparation Phase
The pipeline begins by preparing the evaluation environment. This step ensures that every agent, regardless of its toolset, starts from an identical, clean slate for each problem.
A session directory is created under `data/runs/<session_name>/`, where `<session_name>` corresponds to the configuration being run (e.g., `baseline`, `gkg_only`). Within this directory, the pipeline clones the necessary repositories from SWE-Bench Lite into a `repos/` subdirectory. For each specific problem (or fixture), a dedicated `git worktree` is created. This provides a clean, isolated filesystem checked out to the exact `base_commit` defined by the benchmark, preventing interference between different problem-solving attempts. Finally, a `fixtures_metadata.json` file is generated, cataloging all problem instances and their corresponding worktree paths for later steps.
We leverage the `datasets` library to fetch the SWE-Bench Lite problem set. The process of cloning repositories and creating worktrees is parallelized to minimize setup time. This phase includes a caching mechanism; if the required repositories are already present from a previous run, the download step is skipped, and worktrees are created from the local cache.
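As a rough sketch of this phase (serial rather than parallel for brevity), the snippet below pulls the dev split with the `datasets` library and creates one detached worktree per fixture. The dataset fields (`repo`, `base_commit`, `instance_id`) follow the public SWE-Bench Lite schema; the helper name and exact directory layout are illustrative.

```python
import json
import subprocess
from pathlib import Path

from datasets import load_dataset  # pip install datasets

def prepare_session(session_name: str, root: Path = Path("data/runs")) -> Path:
    """Clone each SWE-Bench Lite dev repo once, then add one worktree per fixture."""
    session_dir = root / session_name
    repos_dir = session_dir / "repos"
    worktrees_dir = session_dir / "worktrees"
    repos_dir.mkdir(parents=True, exist_ok=True)
    worktrees_dir.mkdir(parents=True, exist_ok=True)

    fixtures = load_dataset("princeton-nlp/SWE-bench_Lite", split="dev")
    metadata = []
    for fx in fixtures:
        repo_dir = repos_dir / fx["repo"].replace("/", "__")
        if not repo_dir.exists():  # cache hit: skip cloning on subsequent runs
            subprocess.run(
                ["git", "clone", f"https://github.com/{fx['repo']}.git", str(repo_dir)],
                check=True,
            )
        # A detached worktree pins the fixture to the benchmark's base_commit.
        worktree = worktrees_dir / fx["instance_id"]
        subprocess.run(
            ["git", "-C", str(repo_dir), "worktree", "add", "--detach",
             str(worktree), fx["base_commit"]],
            check=True,
        )
        metadata.append({"instance_id": fx["instance_id"], "worktree": str(worktree)})

    (session_dir / "fixtures_metadata.json").write_text(json.dumps(metadata, indent=2))
    return session_dir
```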
GKG Indexing Phase
For pipelines involving the GKG-enabled agent, the index phase mutates GKG's internal graph database by indexing the source code within each of the previously created worktrees.
Before indexing, any existing GKG data is purged to ensure a fresh start. The pipeline then invokes the `gkg index` command-line tool for each unique worktree. This tool parses the source code, identifies definitions, references, and structural relationships, and builds the knowledge graph that the agent will later query through its specialized tools.
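A minimal sketch of the indexing loop, assuming `gkg index <path>` is the invocation used (the real pipeline may pass additional flags, and the purge of existing GKG data is omitted here):

```python
import json
import subprocess
from pathlib import Path

def index_worktrees(session_dir: Path) -> None:
    """Run `gkg index` against every worktree recorded in fixtures_metadata.json."""
    metadata = json.loads((session_dir / "fixtures_metadata.json").read_text())
    for entry in metadata:
        # Each worktree is indexed separately so the agent's MCP queries resolve
        # against exactly the code state of that fixture.
        subprocess.run(["gkg", "index", entry["worktree"]], check=True)
```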
Agent Phase
This is the execution step where the configured Opencode agent attempts to solve each problem.
As the agent works on each fixture, its logs are streamed to `agent_logs/<instance_id>/opencode_logs.txt`. Upon completion or timeout, two files are updated. First, the modified code is captured as a git diff and appended as a JSON object to `swebench_patches.jsonl`. Second, a log of the agent’s session, including every tool call, is serialized and appended to `session_data.jsonl`.
The GKG server is started in a background process for configurations that require it. This allows the agent to communicate with GKG via a local MCP endpoint. The pipeline processes the fixtures in parallel batches. For each fixture, the `opencode run` CLI command is executed from within the corresponding worktree, with the problem statement passed as the initial prompt. After the agent concludes its attempt, its generated patch is extracted using `git diff`. The worktree is then completely reset to its original `base_commit` to ensure the next phase starts with a clean environment. This phase incorporates a timeout and a retry mechanism to handle instances where an agent fails unexpectedly due to a 529 error raised by Opencode, or where an agent submits an empty patch.
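The per-fixture agent loop can be sketched roughly as follows. The `opencode run <prompt>` invocation with no extra flags is an assumption, the patch record follows the standard SWE-Bench predictions format, and retries, parallel batching, and log streaming are omitted for brevity.

```python
import json
import subprocess
from pathlib import Path

AGENT_TIMEOUT_SECONDS = 6 * 60  # hard 6-minute budget per fixture

def run_fixture(worktree: Path, instance_id: str, problem_statement: str,
                base_commit: str, session_dir: Path) -> None:
    """Run the Opencode agent inside a worktree, then capture and record its patch."""
    # Exact CLI flags may differ; here the problem statement is the initial prompt.
    subprocess.run(["opencode", "run", problem_statement],
                   cwd=worktree, timeout=AGENT_TIMEOUT_SECONDS)

    # Whatever the agent edited becomes the candidate patch.
    patch = subprocess.run(["git", "diff"], cwd=worktree,
                           capture_output=True, text=True, check=True).stdout
    record = {"instance_id": instance_id,
              "model_patch": patch,
              "model_name_or_path": "opencode-agent"}
    with open(session_dir / "swebench_patches.jsonl", "a") as fh:
        fh.write(json.dumps(record) + "\n")

    # Reset the worktree so later phases start from a clean base_commit.
    subprocess.run(["git", "reset", "--hard", base_commit], cwd=worktree, check=True)
    subprocess.run(["git", "clean", "-fd"], cwd=worktree, check=True)
```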
Evaluation Phase
Once the agent has generated patches for all fixtures, the evals phase measures their correctness using the official SWE-Bench testing harness.
The evaluation harness runs in its own directory, but the final output is a detailed JSON report containing pass/fail results for each instance. This report is temporarily created in the harness's directory and is later consolidated into our final report.
We invoke the standard SWE-Bench evaluation tooling, which is responsible for applying each agent-generated patch within the correct environment and executing the corresponding test suite. The harness first prepares the necessary Docker images to ensure a consistent testing environment for each repository. It then evaluates each entry in the `swebench_patches.jsonl` file. After this step is complete, all git worktrees are removed to free up disk space, as they are no longer needed.
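For reference, here is a sketch of how the open-source SWE-bench harness can be driven against the generated patch file; the module path and flags follow the public `swebench` package, though exact options may differ by version.

```python
import subprocess

def evaluate_patches(predictions_path: str, run_id: str) -> None:
    """Apply each patch in Docker and run the benchmark's test suite via the harness."""
    subprocess.run(
        ["python", "-m", "swebench.harness.run_evaluation",
         "--dataset_name", "princeton-nlp/SWE-bench_Lite",
         "--split", "dev",
         "--predictions_path", predictions_path,
         "--run_id", run_id,
         "--max_workers", "4"],
        check=True,
    )
```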
Reporting Phase
With the evaluation complete, the report phase consolidates all logs, metrics, and evaluation results into a single report for the pipeline run. A final `swebench_report.json` file is created in the session directory.
This step involves parsing the raw `session_data.jsonl` to compute aggregate statistics for the run, such as average tool usage, token consumption, and agent session duration. These calculated metrics are then combined with the pass/fail results from the SWE-Bench harness report into a single JSON artifact that summarizes the performance of the agent configuration for that specific run.
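A simplified illustration of how the per-run aggregates might be computed; the actual `session_data.jsonl` schema is richer, and the per-session field names used below are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean

def summarize_session(session_dir: Path) -> dict:
    """Compute simple aggregate statistics from session_data.jsonl for one run."""
    lines = (session_dir / "session_data.jsonl").read_text().splitlines()
    sessions = [json.loads(line) for line in lines if line.strip()]
    # Hypothetical per-session fields: "tool_calls" (list), "total_tokens", "duration_s".
    return {
        "fixtures": len(sessions),
        "avg_tool_calls": mean(len(s["tool_calls"]) for s in sessions),
        "avg_tokens": mean(s["total_tokens"] for s in sessions),
        "avg_duration_s": mean(s["duration_s"] for s in sessions),
    }
```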
Archiving and Analysis Phase
The `archive` phase copies the artifacts from the session directory into a timestamped folder within our `run_artifacts/` directory. This creates a permanent record of the run’s results and its corresponding configuration file. The `analysis` phase reads directly from these archived directories. The `analyze_cross_run` script aggregates the results from multiple archived runs (e.g., all 11 runs for the `gkg_only` pipeline) to calculate average performance metrics. This aggregated data is then passed to our plotting scripts, which generate the comparative charts and visualizations discussed in our findings.
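And a minimal sketch of the cross-run aggregation, assuming each archived run directory holds the `swebench_report.json` described above; the archive naming pattern and report field names are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean

def aggregate_runs(archive_root: Path, pipeline: str) -> dict:
    """Average key metrics across every archived run of a given pipeline."""
    reports = [
        json.loads((run_dir / "swebench_report.json").read_text())
        for run_dir in sorted(archive_root.glob(f"*-{pipeline}"))  # hypothetical naming
    ]
    # Hypothetical report fields: "resolved", "avg_tool_calls", "avg_duration_s".
    return {
        "runs": len(reports),
        "avg_resolved": mean(r["resolved"] for r in reports),
        "avg_tool_calls": mean(r["avg_tool_calls"] for r in reports),
        "avg_duration_s": mean(r["avg_duration_s"] for r in reports),
    }
```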
Preliminary Eval Outcomes
18.4 Launch Figure - Highest Accuracy
The highest-performing accuracy figures were picked and used in the GitLab 18.4 launch videos, in comparison to the average accuracy figures shown in the Average Accuracy subsection. The data from the highest-accuracy run is retained here. We achieved a 7/23 pass rate for the GKG-only agent 3/11 times, a 6/23 pass rate 6/11 times, and a 5/23 pass rate 2/11 times. We anticipate that the skew towards a 6/23 pass rate is due to the lack of determinism with respect to Sonnet 4 mentioned earlier, as well as other sources of non-determinism like Opencode's context-window compaction feature. In the Fixture Passes By Run section below, we outline that the GKG-only agent always achieves a >1/23 pass rate on the set of all passed fixtures across runs, making it unlikely that the GKG agent is "getting lucky" and pointing to a subtle source of noise. We will continue to investigate what is causing the seemingly random skew towards a lower score.
Average Accuracy
To start with the simplest result first: across n=11 runs, GKG outperforms both the baseline agent and, surprisingly, the baseline + GKG agent, while taking approximately 20% less time than the baseline to resolve the average issue.
Token utilization is ~5% greater for GKG-only. In general, we aimed for accuracy over minimizing token utilization for the purposes of this report, and we expect that token utilization will outperform baseline as the MCP tools mature, and that the rate of token utilization will grow more slowly than baseline on larger repositories (the test split of SWE-Bench Lite). That said, the discrepancy in token utilization for GKG vs. baseline has a few likely causes. First, we believe our MCP tools may be overly permissive in their current form, with our current defaults returning around 50 items per tool call. The use of XML (over the “plain text” data returned by the default tools) may also be a contributing factor, but the permissive database queries via MCP are the likely root cause. Second, the use of tools like `repo_map`, which provide broader, coarse-grained repository information with no analogue among the Opencode tools, likely contributes to higher token utilization.
In the `session_data.jsonl` files attached to each run, you can find the entire context window, including the inputs and outputs of each tool call.
Some more general observations about the behavior of GKG and GKG + baseline versus baseline:
- Sonnet prefers to read the full file, possibly because it understands that the KG is an index of the codebase and therefore might not be up to date; this can happen after it has made an edit. We did not enable re-indexing for this study due to time constraints and Sonnet’s tendency to call it too often. We expect this behavior to improve once the agent is able to successfully leverage re-indexing.
- Sonnet shows a strong preference for using the `search_codebase_definitions` tool, and also for calling the `read` tool over `read_definitions`. There is no apparent cause for this, but we speculate it could be a result of ambiguous prompting or contradictory instructions in Opencode’s system prompt.
- The `import_usage` and `get_references` tools are called sparingly, which is likely due to the average size and complexity of the repos in the dev split of SWE-Bench Lite. We estimate that as repo size/complexity increases, their use will increase too.
On the relative underperformance of Baseline + GKG, our general intuition was wrong: we believed that access to very fine-grained tools like `grep`, as well as tools like `todowrite` (approximating agent memory), would help the Baseline + GKG agent outperform the GKG-only runs. We will investigate this in more depth in the following sections to figure out why this wasn't the case.
Tool Usage Distribution
Here’s a breakdown of the number of tool calls, on average, per problem statement, per run. On average, GKG makes about 21% fewer tool calls than baseline, which we believe to be the primary reason why the average resolution time for GKG is ~20% lower.
We will not exhaustively analyze each tool call, but here are some highlights:
- The usage of the `repo_map` tool is mostly in line with our instructions in the GKG agent prompts: we instruct the agent to call it at the beginning of the fixture-resolution process, but it is only called 0.7 times per fixture, indicating that the model doesn’t always conform to our instructions. It might be better to inline an initial `repo_map` call into the `user_prompt` instead.
- The import-usage and references tools are practically never used, at 0.08 and 0.21 calls per fixture respectively, but this could be an indication of problem-tool fit, in that the particular problem statements in the dev split of SWE-Bench Lite might be too simple to necessitate the use of import/references tools.
Fixture Passes By Run
The most interesting comparisons one can make during evals are not where agents behave the same, but where they behave differently, and more specifically, where they fail. Seeing where a control agent fails against a competing agent can indicate that the latter has fundamentally new capabilities in code generation. The set of all fixtures passed by at least one of our agents is 8/23. The fixtures `marshmallow-1343`, `pydicom-1256`, `pydicom-1694`, and `sqlfluff-2419` don’t give us much information to work with, as they carry an identical pass rate across all 3 agents.
However, successful fixture resolutions for `astroid-1196` and `astroid-1866` are entirely or mostly dominated by agents using GKG, suggesting that GKG gives agents the ability to solve problems in domains where baseline tools fail or produce highly inconsistent results.
An analysis of the execution logs for these two fixtures revealed a pattern that explains the performance difference: the distinction lies in the depth of root-cause analysis conducted by the agents. The GKG agent adopted a top-down approach, using tools like `search_codebase_definitions` and `repo_map` to first build a high-level understanding of the codebase and of the context surrounding the stack traces for `astroid-1196` and `astroid-1866`, effectively allowing it to avoid shallow “rabbit holes”.
Comparatively, the baseline agent engaged in reactive debugging, starting from the error trace and applying local fixes that addressed the immediate symptom rather than the underlying cause, like adding exception handling at the error site and manipulating irrelevant variables downstream of the error trace.
We look forward to automating this process in the near future, where we can automatically identify discrepancies in fixture passes, and generate LLM-based summaries of the agent trajectories to inform future agent development and improvements.
Precision Across Runs
Across all our runs, GKG was the most permissive and unfocused with respect to file access, but this behavior is well understood: the MCP item limits for definition search are far too high. Do note that these are not unique file accesses, just an average of the total number of files accessed per run. The patch-accesses metric counts how many times the file containing the gold-set solution in the SWE-Bench Lite dev split is accessed. One metric that was also interesting, but not included in the figure, is the proportion of patch-access success, i.e., on what proportion of runs the gold-set file was accessed at all. Not surprisingly, the GKG agent stood at 100%, the GKG + Baseline agent at 96.2%, and the Baseline agent at 99.2%, suggesting that including and relying on the default tools from Opencode encourages going down unproductive rabbit holes, as discussed in the preceding section.
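As an illustration of how the patch-access-success metric can be computed: the gold files are derived from the benchmark's `patch` field, while the per-fixture set of accessed files is a hypothetical stand-in for what we extract from the session data.

```python
import re

def gold_files(gold_patch: str) -> set[str]:
    """Paths touched by the benchmark's gold patch, read from its diff headers."""
    return set(re.findall(r"^\+\+\+ b/(.+)$", gold_patch, flags=re.MULTILINE))

def patch_access_success(accessed_files: list[set[str]], gold_patches: list[str]) -> float:
    """Fraction of fixtures where the agent touched at least one gold-patch file."""
    hits = sum(bool(gold_files(p) & files)
               for files, p in zip(accessed_files, gold_patches))
    return hits / len(gold_patches)
```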
Appendix
Data Retention
Interested parties can find our full data dump for all 3 pipelines, across 11 runs each, at the following link: https://gitlab.com/gitlab-org/rust/knowledge-graph-evals. This includes the logs for every single agent, all tool calls, the test results from the SWE-Bench evaluation harness, and basic statistics regarding each run.
Some of our run data was initially stored in the `gkg-evals-6` branch, and was later moved to a temporary `gkg-evals-7` branch, which consolidated the run data from the `gkg-evals-6` branch linked above and the `runs-angelo` branch. The data from `gkg-evals-7` was then moved to https://gitlab.com/gitlab-org/rust/knowledge-graph-evals.
Future Work
We soon intend to stabilize the architecture of our local evaluation framework and properly containerize it, so it can be run as a CI job, a ChatOps command, or an issue slash command.
After that, we will fully document how to run the evaluator locally and in CI. We also intend to add support for more benchmarks, like Multi-SWE-Bench, as well as to capture more rigorous statistics about agent performance, like `pass@k`, which were not in scope for this report.
As mentioned before, we’d like to automate the process of identifying discrepancies in fixture passes by generating LLM-based summaries of the agent trajectories to inform future agent development and improvements, as it is currently cumbersome to manually parse the trajectories.