Create evaluation pipeline for SAST false positive detection using local executor

What does this merge request do and why?

This MR creates a new evaluation pipeline for SAST false positive detection. The Sec AI team has released a new experimental feature that checks whether a SAST vulnerability is a false positive using an agentic approach, but currently there is no evaluation pipeline to assess its performance.

This MR provides a pipeline that captures the feature's predictions and assesses their validity via an LLM judge.

How to set up and validate locally

  1. Deploy remote GDK.

    • Set Up Evaluation Environment Using Evaluation Runner:

      1. Clone the evaluation-runner and cd into it:

        $ git clone git@gitlab.com:gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner.git
        $ cd evaluation-runner
      2. Clear the GitLab host variable:

        $ unset GITLAB_HOST
      3. Generate your env.list file:

        $ make create-env-list
      4. Set the environment variables required for your GDK:

        $ export GOOGLE_APPLICATION_CREDENTIALS_BASE64=${HOME}/.config/gcloud/application_default_credentials.json
        $ export ANTHROPIC_API_KEY=<your-anthropic-token>
        $ export LANGCHAIN_API_KEY=<your-langchain-api-token>
        $ export LANGCHAIN_PROJECT=<your-langchain-project>
      5. Populate your env.list with the previous environment variables:

        $ make update-env-list
      6. Open your env.list and set:

        CEF_EVALUATOR=vulnerability-resolution
        GITLAB_LICENSE_FILE_BASE64=<your GitLab License File in BASE64 format>
      7. Deploy the remote GDK:

        $ GDK_IMAGE_VERSION=1 make deploy

      Note: this will use the GDK image that has been tested to work with the evaluation environment.

    • (Temporary step, will be addressed in https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/merge_requests/191) Manually enable the Duo Workflow service in the deployed GDK instance.

      1. SSH into the VM where GDK is deployed:

        $ cd path/to/evaluation-runner
        $ INSTANCE_NAME=$(make get-gitlab-base-url | sed -E 's#http://([^\.]+)\..*#\1#')
        $ gcloud compute ssh $INSTANCE_NAME \
        --zone=us-central1-a \
        --project=dev-ai-research-0e2f8974
      2. Inside the VM, run docker ps to get the <container_id>, then run docker exec -ti <container_id> /bin/bash to access the running GDK container. Then:

        1. Set your Google Cloud Platform (GCP) project:

          $ export GOOGLE_CLOUD_PROJECT=dev-ai-research-0e2f8974
        2. Enable Duo Workflow:

          $ gdk config set duo_workflow.enabled true
          $ gdk reconfigure
          $ gdk restart duo-workflow-service rails
  2. Set up the executor on your own laptop.

  3. Set env variables in .env:

    • GITLAB_BASE_URL to the URL of the remote GDK deployed in step 1
    • GITLAB_PRIVATE_TOKEN to the GitLab private token defined in this line

    Note: please verify that the following variables are also set: LANGCHAIN_API_KEY, LANGCHAIN_PROJECT, ANTHROPIC_API_KEY.

  4. Run the evaluation command:

    poetry run cef security-testing evaluate false-positive-detection --dataset vulnerability.resolution.2.subset --local-executor-path <path/to/your/executor/binary> --limit 2

    The vulnerability.resolution.2.subset dataset is a small subset containing projects with ground-truth data. The command outputs a LangSmith experiment link where you can view the evaluation results for each vulnerability in the dataset. The evaluation results are written to the false_positive_quality column.

    If you want to see the results of this evaluation pipeline over 20 examples, please check this experiment.
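
Before running the command in step 4, it can help to confirm that the variables listed in step 3 are actually set. The following is a minimal sketch, not part of this MR: check_env is a hypothetical helper, and it assumes the variables have already been exported into the current shell (for example by sourcing .env). How cef itself reads .env is not covered here.

```shell
#!/bin/sh
# Pre-flight check for the variables named in step 3.
# Assumption: the variables are exported in the current shell.

required_vars="GITLAB_BASE_URL GITLAB_PRIVATE_TOKEN LANGCHAIN_API_KEY LANGCHAIN_PROJECT ANTHROPIC_API_KEY"

# check_env (hypothetical helper): prints one "missing: NAME" line per
# unset or empty variable and returns 1 if any are missing, 0 otherwise.
check_env() {
  missing=0
  for name in $required_vars; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

With the function defined, running check_env before the evaluation command lists any variable that still needs a value, so a missing token shows up before the pipeline starts rather than partway through a run.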

References

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Fabrizio J. Piva
