New metrics for false positive detection using ground truth
Problem to solve
We have ground truth data for vulnerabilities on a few projects.
We need to update our metrics to compute precision and recall for false positive detection.
The dataset with ground truth is `vulnerability.resolution.2.subset`.
Proposal
To introduce a new metric we need to implement a new evaluator. In this case, we can take the existing evaluator as an example and modify it to compute the proposed metric.
- An evaluator has 3 components (see the sketches after this list):
  - Input: we usually use a pydantic `BaseModel` to ensure type safety
  - Output: this is the result we push to LangSmith; it is a `TypedDict`, as required by LangSmith
  - A main function to compute the metrics
- We can take the existing evaluator for FP detection as an example and a starting point
- We then need to add the new evaluator to the top-level `evaluate` function (see the wiring sketch below)
- Because FP detection outputs a probability (0-1), we can calculate a precision-recall curve and its AUC (area under the curve) as a more robust metric than a single precision/recall pair. The final output of the new evaluator should be the AUC (a scikit-learn sketch follows below).
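
As a concrete illustration of the last bullet, here is a minimal sketch of the PR-AUC computation using scikit-learn. The function name `compute_pr_auc` and the exact input format are assumptions for illustration, not existing code:

```python
from sklearn.metrics import auc, precision_recall_curve


def compute_pr_auc(ground_truth: list[int], probabilities: list[float]) -> float:
    """Area under the precision-recall curve for FP detection.

    ground_truth: 1 if the finding really is a false positive, 0 otherwise.
    probabilities: the FP probability (0-1) produced by the detector.
    """
    precision, recall, _thresholds = precision_recall_curve(ground_truth, probabilities)
    return auc(recall, precision)


# Tiny usage example with made-up data (perfect ranking -> PR-AUC of 1.0):
# compute_pr_auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

scikit-learn's `average_precision_score` is a closely related alternative that summarizes the same curve without trapezoidal interpolation; either could serve as the final metric.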
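A sketch of the three evaluator components, building on `compute_pr_auc` above. The class and field names here are hypothetical; the actual shapes should follow the existing FP detection evaluator and the result format LangSmith expects:

```python
from typing import TypedDict

from pydantic import BaseModel


class FPDetectionEvalInput(BaseModel):
    """Input component: a pydantic BaseModel for type safety (hypothetical fields)."""

    fp_probabilities: list[float]  # per-finding probability of being a false positive
    ground_truth: list[int]        # per-finding ground-truth label (1 = false positive)


class FPDetectionEvalResult(TypedDict):
    """Output component: the result pushed to LangSmith, as a TypedDict."""

    key: str      # metric name displayed in LangSmith
    score: float  # metric value, here the PR-AUC


def evaluate_fp_detection(eval_input: FPDetectionEvalInput) -> FPDetectionEvalResult:
    """Main function component: compute the metric and return the LangSmith result."""
    # compute_pr_auc is defined in the previous sketch.
    score = compute_pr_auc(eval_input.ground_truth, eval_input.fp_probabilities)
    return FPDetectionEvalResult(key="fp_detection_pr_auc", score=score)
```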
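Finally, a purely hypothetical sketch of the wiring step. The real top-level `evaluate` function already exists in the codebase and its signature may differ; only the idea of registering the new evaluator alongside the existing ones is shown:

```python
from collections.abc import Callable

# Hypothetical registry; in the real code this corresponds to wherever the
# top-level evaluate() collects its evaluators.
EVALUATORS: dict[str, Callable] = {
    # "existing_metric": existing_evaluator,  # ...existing evaluators...
    "fp_detection_pr_auc": evaluate_fp_detection,  # new evaluator from the sketch above
}


def evaluate(dataset_name: str = "vulnerability.resolution.2.subset") -> None:
    """Hypothetical top-level entry point: run every registered evaluator."""
    for name, evaluator in EVALUATORS.items():
        # In the real implementation: load the ground-truth dataset from LangSmith,
        # build the evaluator input, call `evaluator(...)`, and push the result.
        print(f"Would run evaluator {name!r} on dataset {dataset_name!r}")
```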