Draft: Add eval pipeline for Agentic Vulnerability Resolution: inference (1/2)

What does this merge request do and why?

This MR creates a new evaluation pipeline for agentic vulnerability resolution. The Sec AI team has released a new experimental feature that automatically resolves vulnerabilities using an agentic approach, but currently there is no evaluation pipeline to assess its performance. This MR provides the infrastructure to capture predictions and make them available for further evaluation.

Scope clarification

This MR is part 1 of 2:

  • Part 1 (this MR): Collect predictions (inference) from the feature.
  • Part 2 (upcoming): Evaluate and assess the quality of those predictions.

How to set up and validate locally

  1. Perform these setup steps.

  2. Set up the executor on your own laptop. Specify the path to your executor in .gitlab/agent_platform_templates/vulnerability_resolution.yaml.

  3. Open your .env file and set:

    • GITLAB_BASE_URL to http://gdk-for-eval-dbecaae2.gitlab-evaluation-runner.com:3000
    • GITLAB_PRIVATE_TOKEN to the GitLab private token defined in this line

    Note: please verify that the following variables are also set: LANGCHAIN_API_KEY, LANGCHAIN_PROJECT, ANTHROPIC_API_KEY.

  4. Run the evaluation command as follows:

    poetry run cef agent-platform evaluate .gitlab/agent_platform_templates/vulnerability_resolution.yaml

    The command outputs a LangSmith experiment link where you can view the predictions (code patches) for each vulnerability in the dataset. See this experiment's result as an example.

    If you open an example in the LangSmith experiment, you should observe the code patch (i.e. the agent's prediction) as follows:

    [Screenshot: Screenshot_2025-09-26_at_19.49.48]
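
Before running the command in step 4, it may help to confirm that the variables from step 3 are actually set. The following is a minimal sketch and not part of this MR: the helper name missing_vars is hypothetical, and it assumes the values from .env have already been loaded into the process environment.

```python
import os

# Variable names are taken from step 3 of this description.
REQUIRED_VARS = [
    "GITLAB_BASE_URL",
    "GITLAB_PRIVATE_TOKEN",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
    "ANTHROPIC_API_KEY",
]


def missing_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: checks the given mapping, or os.environ by default.
    It does not read the .env file itself.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_vars() with no arguments inspects the current environment; an empty list means all five variables are present and non-empty.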

Important note 1: The workflow may be unable to generate code patches for certain vulnerabilities. When patch generation fails, you'll see this specific log message:

No fix branch created for vuln_id=XXX, workflow_id=XX (likely no patch generated).
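
To see which vulnerabilities produced no patch in a run, the log output can be scanned for the message above. This is a rough sketch only: the function name find_unpatched is hypothetical, and the pattern assumes the IDs consist of letters, digits, underscores, or hyphens, which this description does not confirm.

```python
import re

# Pattern mirrors the log message quoted above.
# Assumption: IDs contain only word characters or hyphens.
NO_PATCH_RE = re.compile(
    r"No fix branch created for vuln_id=(?P<vuln_id>[\w-]+), "
    r"workflow_id=(?P<workflow_id>[\w-]+)"
)


def find_unpatched(log_text):
    """Return (vuln_id, workflow_id) pairs for which no fix branch was created."""
    return [
        (match["vuln_id"], match["workflow_id"])
        for match in NO_PATCH_RE.finditer(log_text)
    ]
```

Passing the captured output of the evaluation command to find_unpatched gives a list of the cases to exclude or inspect before Part 2.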

Important note 2: This MR uses the LangSmith dataset vulnerability.resolution.3.subset, which is configured in the .yaml input file.

References

  • Parent issue: #791

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Fabrizio J. Piva