Create evaluation pipeline for SAST false positive detection using local executor
What does this merge request do and why?
This MR creates a new evaluation pipeline for SAST false positive detection. The Sec AI team has released a new experimental feature that checks whether a SAST vulnerability is a false positive using an agentic approach, but currently there is no evaluation pipeline to assess its performance.
This MR provides a comprehensive pipeline that captures predictions and assesses their validity via an LLM judge.
How to set up and validate locally
-
Deploy remote GDK.
-
Set Up Evaluation Environment Using Evaluation Runner:
-
Clone the evaluation-runner and cd into it:
$ git clone git@gitlab.com:gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner.git
$ cd evaluation-runner
-
Clear the GitLab host variable:
$ unset GITLAB_HOST
-
Generate your env.list file:
$ make create-env-list
-
Set the environment variables required for your GDK:
$ export GOOGLE_APPLICATION_CREDENTIALS_BASE64=${HOME}/.config/gcloud/application_default_credentials.json
$ export ANTHROPIC_API_KEY=<your-anthropic-token>
$ export LANGCHAIN_API_KEY=<your-langchain-api-token>
$ export LANGCHAIN_PROJECT=<your-langchain-project>
-
Populate your env.list with the previous environment variables:
$ make update-env-list
-
Open your env.list and set:
CEF_EVALUATOR=vulnerability-resolution
GITLAB_LICENSE_FILE_BASE64=<your GitLab License File in BASE64 format>
-
Deploy the remote GDK:
$ GDK_IMAGE_VERSION=1 make deploy
Note: this will use the GDK image that has been tested to work with the evaluation environment.
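The env.list step above expects the GitLab license in Base64 form. A minimal sketch of one way to produce that value, assuming env.list wants it on a single line; the helper name encode_b64 and the license path are hypothetical and not part of the evaluation-runner:

```shell
# Encode a file as single-line Base64.
# Reading from stdin works with both GNU and BSD/macOS base64;
# tr removes the line wrapping that GNU base64 adds by default.
encode_b64() {
    base64 < "$1" | tr -d '\n'
}

# Example usage (hypothetical license path):
#   encode_b64 path/to/your.gitlab-license
```

The printed string can then be pasted as the value of GITLAB_LICENSE_FILE_BASE64.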
-
(Temporary step, will be addressed in https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/merge_requests/191) Manually enable Duo Workflow service in the deployed GDK instance.
-
SSH into the VM where GDK is deployed:
$ cd path/to/evaluation-runner
$ INSTANCE_NAME=$(make get-gitlab-base-url | sed -E 's#http://([^\.]+)\..*#\1#')
$ gcloud compute ssh $INSTANCE_NAME \
    --zone=us-central1-a \
    --project=dev-ai-research-0e2f8974
-
Inside the VM, run docker ps to get the <container_id>, then run docker exec -ti <container_id> /bin/bash to access the running GDK container. Then:
-
Set your Google Cloud Platform (GCP) project:
$ export GOOGLE_CLOUD_PROJECT=dev-ai-research-0e2f8974
-
Enable Duo Workflow:
$ gdk config set duo_workflow.enabled true
$ gdk reconfigure
$ gdk restart duo-workflow-service rails
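The sed expression in the SSH step above derives the VM instance name from the GDK base URL by keeping only the first host label. A small illustration with a hypothetical URL (the real value comes from make get-gitlab-base-url):

```shell
# Hypothetical base URL, for illustration only; the real one is printed by
# `make get-gitlab-base-url`.
base_url='http://my-gdk-instance.example.com'

# Keep everything between "http://" and the first dot.
instance_name=$(printf '%s\n' "$base_url" | sed -E 's#http://([^\.]+)\..*#\1#')

echo "$instance_name"   # prints: my-gdk-instance
```

The pattern only matches http:// URLs; a URL with a different scheme would pass through unchanged, so it is worth checking $INSTANCE_NAME before calling gcloud compute ssh.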
-
Set up the executor on your own laptop.
-
Set the following environment variables in .env:
-
GITLAB_BASE_URL to the remote GDK URL deployed in step 1
-
GITLAB_PRIVATE_TOKEN to the GitLab private token defined in this line
Note: please verify that the following variables are also set: LANGCHAIN_API_KEY, LANGCHAIN_PROJECT, ANTHROPIC_API_KEY.
-
Run the evaluation command:
poetry run cef security-testing evaluate false-positive-detection --dataset vulnerability.resolution.2.subset --local-executor-path <path/to/your/executor/binary> --limit 2
The dataset vulnerability.resolution.2.subset is a small subset containing projects with ground truth data. The command outputs a LangSmith experiment link where you can view the evaluation results for each vulnerability in the dataset. The evaluation results are written to the false_positive_quality column.
If you want to see the results of this evaluation pipeline over 20 examples, please check this experiment.
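Before running the evaluation command, it can help to confirm that the variables listed in the setup steps are actually present. A minimal sketch, assuming the values from .env have been exported into the current shell (for example with set -a; . ./.env; set +a); the helper name check_required_vars is hypothetical and not part of cef:

```shell
# Report every named variable that is unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_required_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the variables named in the setup steps:
#   check_required_vars GITLAB_BASE_URL GITLAB_PRIVATE_TOKEN \
#       LANGCHAIN_API_KEY LANGCHAIN_PROJECT ANTHROPIC_API_KEY
```

The function only checks presence; it does not validate that a token or URL is correct.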
References
- Closes: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/issues/58 and #799 (closed)
- Partly addresses: https://gitlab.com/gitlab-org/gitlab/-/issues/553303#note_2783531808
Merge request checklist
-
I've run the affected pipeline(s) to validate that nothing is broken.
-
Tests added for new functionality. If not, please raise an issue to follow up.
-
Documentation added/updated, if needed.