[gkg] Agent Quality/Cost Evaluation Research

Problem to Solve

We are building a suite of Knowledge Graph (KG) tools (references tool, index tool, RepoMap tool, etc.) designed to give AI agents a deep, structural understanding of a codebase. While we intuitively believe these tools will lead to more accurate, efficient, and cost-effective agent performance compared to standard RAG techniques (like keyword search and file grepping), we currently lack a formal, data-driven framework to prove this hypothesis.

Without a systematic evaluation process, we face several challenges:

  • Lack of Objective Data: We cannot quantitatively measure whether KG-powered agents are superior in terms of solution quality, token consumption, or speed.
  • Unguided Tool Development: Decisions on which new tools to build, what features to prioritize (e.g., linking external dependencies), and how to design their APIs are based on intuition rather than empirical evidence of what an agent actually needs.
  • Ineffective Agent Prompting: We don't have a clear methodology for discovering the optimal prompting strategies that guide an agent to use our specialized tools effectively.
  • Value Demonstration Challenges: The Knowledge Graph provides real benefits, but to demonstrate them to leadership and customers we need concrete metrics: token cost savings, improved developer productivity, and the ability to solve more complex tasks.

This research effort will serve as our guiding star to make our tools as effective as possible and to steer the future development of all KG-powered AI consumption.

Proposed Solution

We propose to establish a reproducible, automated evaluation framework to rigorously measure and compare the performance of AI agents using different sets of codebase-interaction tools. This framework will be built on established industry benchmarks and will allow us to iterate on tool design and prompting strategies with high confidence.

The research will be structured around the following components:

1. Evaluation Harness & Automation

  • Agent: We will use a standardized, off-the-shelf FOSS agent like OpenCode as our testbed. This ensures reproducibility and isolates the performance of the tools from the variables introduced by proprietary, black-box agent systems.
  • Automation: We will leverage the opencode.ai SDK to create an automated evaluation pipeline. This pipeline will be runnable via CI/CD, allowing us to consistently execute the benchmark against new versions of our tools and agent configurations.
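The pipeline described above can be sketched as a simple runner that executes every benchmark task through an agent and aggregates the results. The `run_agent` callable and `TaskResult` shape are assumptions standing in for the real opencode.ai SDK adapter, which is not specified here:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskResult:
    task_id: str
    passed: bool
    input_tokens: int
    output_tokens: int

def run_benchmark(tasks: list, run_agent: Callable[[dict], TaskResult]) -> dict:
    """Run every benchmark task through the agent and aggregate results.

    `run_agent` is a hypothetical adapter around the agent SDK that
    executes a single task end-to-end and returns a TaskResult.
    """
    results = [run_agent(task) for task in tasks]
    passed = sum(r.passed for r in results)
    return {
        "pass_rate": passed / len(results) if results else 0.0,
        "total_tokens": sum(r.input_tokens + r.output_tokens for r in results),
        "results": results,
    }
```

Because the runner takes the agent as a plain callable, the same loop can be invoked from CI/CD against any tool set or agent configuration.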

2. Evaluation Dataset (Eval Set)

  • We will use the SWE-Bench benchmark, specifically the Java subset from Multi-SWE-bench (https://github.com/multi-swe-bench).
  • Rationale: This dataset is ideal because it consists of real-world software engineering problems from large, complex Java repositories (e.g., Alibaba, Elasticsearch, Logstash). The success metric is objective and clear: does the agent's generated code diff cause the repository's unit tests to pass?
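The pass/fail check reduces to: apply the agent's diff, then run the repository's test suite. A minimal sketch follows; the `mvn -q test` command is an assumption (Multi-SWE-bench ships per-repo run scripts that would replace it), and the injectable `run` parameter exists only to keep the function testable:

```python
import subprocess

def evaluate_patch(repo_dir: str, diff_path: str,
                   run=subprocess.run) -> bool:
    """Apply the agent's generated diff and report whether the
    repository's unit tests pass.

    The Maven invocation stands in for the benchmark's per-repo
    test command.
    """
    # A diff that does not apply cleanly counts as a failure.
    apply = run(["git", "-C", repo_dir, "apply", diff_path])
    if apply.returncode != 0:
        return False
    tests = run(["mvn", "-q", "test"], cwd=repo_dir)
    return tests.returncode == 0
```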

3. Experimental Design & Tool Sets

To isolate the impact of the Knowledge Graph, we will define several experimental groups, each with a different set of available tools:

  1. Control Group (Standard RAG): The agent will only have access to baseline tools that mimic traditional RAG approaches.

    • search_files (keyword search)
    • grep_file
    • edit_file
  2. Test Group 1 (KG Purist): The agent will only have access to Knowledge Graph tools, forcing it to rely on structural understanding. Standard search and grep tools will be disabled.

    • search_definitions
    • get_references
    • get_imports
    • repo_map (future)
    • edit_file
  3. Test Group 2 (Hybrid): The agent has access to both the standard RAG tools and the KG tools, allowing it to choose the best tool for the job.

  4. Test Group 3 (Hybrid + Prompt Enhancement): The hybrid agent, but with an enhanced system prompt that provides strategic guidance on when to use specific KG tools. Examples:

    • "To understand the structure of a directory, use the repo_map tool."
    • "Start with a broad search using search_files, then use get_references recursively to trace dependencies and understand the impact of a change."
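The four groups above can be expressed as a small configuration mapping group names to allowed tool lists, which the harness would pass to the agent at startup. The dictionary keys and structure are illustrative assumptions, not a finalized schema; the prompt-enhanced group shares the hybrid tool set and differs only in its system prompt:

```python
BASELINE_TOOLS = ["search_files", "grep_file", "edit_file"]
KG_TOOLS = ["search_definitions", "get_references", "get_imports", "edit_file"]

# Hypothetical harness config: one entry per experimental group.
EXPERIMENT_GROUPS = {
    "control_rag": BASELINE_TOOLS,
    "kg_purist": KG_TOOLS,
    "hybrid": sorted(set(BASELINE_TOOLS) | set(KG_TOOLS)),
    # Same tools as "hybrid"; a strategic system prompt is added separately.
    "hybrid_prompted": sorted(set(BASELINE_TOOLS) | set(KG_TOOLS)),
}
```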

4. Data Collection & Metrics

For each experimental run, we will capture a comprehensive set of metrics to evaluate both quality and cost:

  • Quality Metrics:
    • Pass Rate: The percentage of SWE-Bench tasks successfully solved.
  • Cost & Efficiency Metrics:
    • Token Utilization: Total input and output tokens consumed. This is a direct measure of cost.
    • End-to-End Latency: Total time from prompt to final answer.
    • Time To First Token (TTFT).
  • Behavioral Metrics:
    • Tool Usage: The sequence, frequency, and type of tools used.
    • Tool Depth: The number of tool calls required before generating a solution.
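One record per run covering all three metric categories might look like the sketch below. Field names are assumptions; note that tool depth is derived directly from the ordered tool-call log rather than stored separately:

```python
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    task_id: str
    group: str             # experimental group name
    passed: bool           # quality: did the tests pass?
    input_tokens: int
    output_tokens: int
    latency_s: float       # end-to-end wall-clock time
    ttft_s: float          # time to first token
    tool_calls: list[str] = field(default_factory=list)  # ordered tool names

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def tool_depth(self) -> int:
        # Number of tool calls made before the final solution.
        return len(self.tool_calls)
```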

5. Reproducibility Principles

To ensure our results are reliable and scientifically sound, we will adhere to strict reproducibility guidelines:

  • Fixed Seeds: Use fixed seeds and zero temperature for all LLM inference to minimize output variance (full determinism is not guaranteed by every provider, so we will also report variance across repeated runs).
  • Context Window Discipline: All code snippets passed to the LLM context must be processed consistently. This includes:
    • Canonicalization: Sorting chunks by a composite key (file path, line number).
    • Deduplication: Ensuring overlapping or identical code ranges are not sent multiple times.
  • Comprehensive Documentation: The entire experimental setup, including system prompts, tool schemas, configurations, and results, will be documented.
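The canonicalization and deduplication steps above can be sketched as one pass over the retrieved chunks: sort by the composite key, then merge overlapping or duplicate line ranges within the same file. The `(path, start, end)` tuple representation is an assumption for illustration:

```python
def canonicalize_chunks(chunks: list[tuple[str, int, int]]) -> list[tuple[str, int, int]]:
    """Sort chunks by (file path, start line) and merge overlapping or
    identical line ranges in the same file, so no code region is sent
    to the LLM context more than once."""
    merged: list[tuple[str, int, int]] = []
    for path, start, end in sorted(chunks):
        if merged and merged[-1][0] == path and start <= merged[-1][2]:
            # Overlaps the previous chunk in the same file: extend it.
            _, prev_start, prev_end = merged[-1]
            merged[-1] = (path, prev_start, max(prev_end, end))
        else:
            merged.append((path, start, end))
    return merged
```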

Expected Outcomes

The successful completion of this research will provide:

  1. Quantitative Proof of Value: Hard data comparing the quality and cost-effectiveness of KG-powered agents against standard RAG.
  2. Data-Driven Tool Design: Insights from agent behavior will directly inform the design of new tools and the refinement of existing ones (e.g., the need to link functions to external dependency imports to solve vulnerability-related tasks like the Log4J example).
  3. Optimized Prompting Strategies: A "best-practice" guide for how to prompt agents to best leverage the Knowledge Graph.
  4. A Foundational Report: A detailed report, potentially in a scientific paper format, summarizing our methodology and findings. This will be used to demonstrate the business value of our work to stakeholders and to guide future investment.
Edited by Michael Angelo Rivera