New metrics for false positive detection using ground truth

Problem to solve

We have ground-truth data for vulnerabilities on a few projects.

We need to update our metrics to compute precision and recall for false positive detection.

The dataset with ground truth is vulnerability.resolution.2.subset.

Proposal

To introduce a new metric, we need to implement a new evaluator. In this case, we can take the existing evaluator as an example and modify it to compute the proposed metric.

  • An evaluator has 3 components (sketched in the first example after this list):

    • Input: we usually use a pydantic BaseModel to ensure type safety
    • Output: the results we push to LangSmith. It is a TypedDict, as required by LangSmith
    • A main function to compute the metric

    We can take the existing evaluator for FP detection as an example and a starting point.

  • We then need to add the new evaluator to the top-level evaluate function.

  • Because FP detection outputs a probability (0–1), we can calculate a precision-recall curve and its AUC (area under the curve), which is more robust than precision and recall at a single fixed threshold. The final output of the new evaluator should be the AUC (see the second sketch after this list).
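
A minimal sketch of the three evaluator components, assuming hypothetical names and fields (FPDetectionExample, FPDetectionResult, evaluate_fp_detection) and that scikit-learn is available; the existing FP detection evaluator in the repo is the real starting point:

```python
# Minimal sketch of the three evaluator components; names and fields
# are hypothetical, not the actual ones in the repo.
from typing import TypedDict

from pydantic import BaseModel
from sklearn.metrics import auc, precision_recall_curve


class FPDetectionExample(BaseModel):
    """Input: a pydantic BaseModel to ensure type safety."""

    predicted_probability: float  # FP probability from the model, in [0, 1]
    is_false_positive: bool       # ground-truth label from the dataset


class FPDetectionResult(TypedDict):
    """Output: the result pushed to LangSmith (a TypedDict, as it requires)."""

    key: str      # metric name
    score: float  # metric value


def evaluate_fp_detection(examples: list[FPDetectionExample]) -> FPDetectionResult:
    """Main function: computes PR-AUC over the ground-truth examples."""
    y_true = [int(ex.is_false_positive) for ex in examples]
    y_score = [ex.predicted_probability for ex in examples]
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return FPDetectionResult(key="fp_detection_pr_auc", score=auc(recall, precision))
```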
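
And a small, self-contained illustration of the PR-AUC computation itself (the labels and scores below are made up):

```python
# Illustration of PR-AUC with made-up data; assumes scikit-learn.
from sklearn.metrics import auc, precision_recall_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # ground-truth FP labels
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]  # predicted FP probabilities

# precision_recall_curve sweeps every threshold over the scores,
# so no single cutoff has to be chosen up front.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# auc integrates precision over recall, collapsing the curve into one
# threshold-independent number in [0, 1] (higher is better).
pr_auc = auc(recall, precision)
print(f"PR-AUC: {pr_auc:.3f}")  # 1.000 here, since the toy data is perfectly separable
```

Reporting the AUC rather than precision/recall at a fixed cutoff means the metric does not depend on an arbitrary probability threshold.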
