Create evaluation pipeline for SAST false positive detection using local executor (!1698)

What does this merge request do and why?

This MR creates a new evaluation pipeline for SAST false positive detection. The Sec AI team has released a new experimental feature that checks whether a SAST vulnerability is a false positive using an agentic approach, but currently there is no evaluation pipeline to assess its performance.

This MR provides an end-to-end pipeline that captures the feature's predictions and assesses their validity with an LLM judge.

How to set up and validate locally

Deploy remote GDK.

Set Up Evaluation Environment Using Evaluation Runner:

Clone the evaluation-runner and cd into it:

```
$ git clone git@gitlab.com:gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner.git
$ cd evaluation-runner
```

Clear the GitLab host variable:
```
$ unset GITLAB_HOST
```
Generate your env.list file:
```
$ make create-env-list
```

Set the environment variables required for your GDK:

```
$ export GOOGLE_APPLICATION_CREDENTIALS_BASE64=${HOME}/.config/gcloud/application_default_credentials.json
$ export ANTHROPIC_API_KEY=<your-anthropic-token>
$ export LANGCHAIN_API_KEY=<your-langchain-api-token>
$ export LANGCHAIN_PROJECT=<your-langchain-project>
```

Populate your env.list with the previous environment variables:
```
$ make update-env-list
```

Open your env.list and set:

```
CEF_EVALUATOR=vulnerability-resolution
GITLAB_LICENSE_FILE_BASE64=<your GitLab License File in BASE64 format>
```
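
One way to produce the Base64 value for `GITLAB_LICENSE_FILE_BASE64` is sketched below. The helper name `encode_single_line` and the license path are hypothetical. The sketch also assumes env.list expects the value on a single line, which this MR does not state.

```shell
# Hypothetical helper: print a file's contents as single-line Base64.
# GNU base64 wraps its output at 76 columns, so newlines are stripped explicitly.
encode_single_line() {
  base64 < "$1" | tr -d '\n'
}

# Usage (placeholder path, replace with your own license file):
# echo "GITLAB_LICENSE_FILE_BASE64=$(encode_single_line path/to/your.gitlab-license)"
```

Reading the file from standard input keeps the call portable between GNU and macOS `base64`, whose file-argument flags differ.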

Deploy the remote GDK:
```
$ GDK_IMAGE_VERSION=1 make deploy
```

Note: this will use the GDK image that has been tested to work with the evaluation environment.

Manually enable the Duo Workflow service in the deployed GDK instance. (This is a temporary step that will be addressed in https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/merge_requests/191.)
1. SSH into the VM where GDK is deployed:
```
$ cd path/to/evaluation-runner
$ INSTANCE_NAME=$(make get-gitlab-base-url | sed -E 's#http://([^\.]+)\..*#\1#')
$ gcloud compute ssh $INSTANCE_NAME \
--zone=us-central1-a \
--project=dev-ai-research-0e2f8974
```
2. Inside the VM, run `docker ps` to get the <container_id>, then run `docker exec -ti <container_id> /bin/bash` to open a shell in the running GDK container. Then:
  1. Set your Google Cloud Platform (GCP) project:
```
$ export GOOGLE_CLOUD_PROJECT=dev-ai-research-0e2f8974
```
  2. Enable Duo Workflow:
```
$ gdk config set duo_workflow.enabled true
$ gdk reconfigure
$ gdk restart duo-workflow-service rails
```
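
The `sed` expression in step 1 derives the VM instance name from the GDK base URL by keeping only the first hostname label. A minimal sketch of what it does, using a made-up URL (the real value comes from `make get-gitlab-base-url`):

```shell
# Hypothetical base URL, for illustration only.
base_url="http://my-gdk-instance.example.internal:3000"

# Keep the text between "http://" and the first dot; drop everything else.
instance_name=$(printf '%s\n' "$base_url" | sed -E 's#http://([^\.]+)\..*#\1#')

echo "$instance_name"   # prints "my-gdk-instance"
```

The pattern only matches URLs that start with `http://`. For any other scheme, `sed` passes the input through unchanged, so check `$INSTANCE_NAME` before running `gcloud compute ssh`.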

Set up the executor on your own laptop.
Set the following environment variables in .env:
- GITLAB_BASE_URL to the URL of the remote GDK deployed in step 1
- GITLAB_PRIVATE_TOKEN to the GitLab private token defined in this line
Note: verify that the following variables are also set: LANGCHAIN_API_KEY, LANGCHAIN_PROJECT, ANTHROPIC_API_KEY.
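
A quick pre-flight check for the variables listed above could look like the sketch below. The function name `check_env_file` is hypothetical, and the sketch assumes .env uses plain `KEY=value` lines with no `export` prefix.

```shell
# Hypothetical helper: report required keys that are missing or empty in a .env file.
check_env_file() {
  env_file="$1"
  missing=""
  for key in GITLAB_BASE_URL GITLAB_PRIVATE_TOKEN LANGCHAIN_API_KEY LANGCHAIN_PROJECT ANTHROPIC_API_KEY; do
    # A key counts as set only if the line is KEY=<at least one character>.
    if ! grep -Eq "^${key}=.+" "$env_file"; then
      missing="$missing $key"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing or empty in $env_file:$missing" >&2
    return 1
  fi
  echo "All required variables are set in $env_file"
}

# Usage: check_env_file .env
```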
Run the evaluation command:
```
poetry run cef security-testing evaluate false-positive-detection --dataset vulnerability.resolution.2.subset --local-executor-path <path/to/your/executor/binary> --limit 2
```
The dataset vulnerability.resolution.2.subset is a small subset containing projects with ground-truth data. The command prints a link to a LangSmith experiment where you can view the evaluation results for each vulnerability in the dataset. The results are written to the false_positive_quality column.
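
Because the command depends on a local executor binary, it can help to confirm the path is executable before starting a run. The wrapper below is a sketch: the function name `run_fp_eval` is hypothetical, and the `cef` invocation is copied unchanged from the command above.

```shell
# Hypothetical wrapper: fail fast if the executor path is not an executable file.
run_fp_eval() {
  executor="$1"
  if [ ! -x "$executor" ]; then
    echo "Executor not found or not executable: $executor" >&2
    return 1
  fi
  poetry run cef security-testing evaluate false-positive-detection \
    --dataset vulnerability.resolution.2.subset \
    --local-executor-path "$executor" \
    --limit 2
}

# Usage: run_fp_eval <path/to/your/executor/binary>
```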

If you want to see the results of this evaluation pipeline over 20 examples, please check this experiment.

References

Merge request checklist

I've run the affected pipeline(s) to validate that nothing is broken.
Tests added for new functionality. If not, please raise an issue to follow up.
Documentation added/updated, if needed.

Edited Oct 02, 2025 by Fabrizio J. Piva
