New metrics for false positive detection using ground truth
Problem to solve
We have ground truth data for vulnerabilities on a few projects.
We need to update our metrics to compute precision and recall for false positive detection.
The dataset with ground truth is `vulnerability.resolution.2.subset`.
Proposal
To introduce a new metric we need to implement a new evaluator. In this case, we can take the existing evaluator as an example and modify it to compute the proposed metric.
- An evaluator has 3 components (see the sketches after this list):
  - Input: we usually use a pydantic `BaseModel` to ensure type safety
  - Output: this is the result we push to LangSmith; it is a `TypedDict`, as required by LangSmith
  - A main function to compute the metrics
- We can take the existing evaluator for FP detection as an example and a starting point
- We then need to add the new evaluator to the top-level `evaluate` function (see the wiring sketch below)
- Because FP detection outputs a probability (0-1), we can calculate a precision-recall curve and its AUC (area under the curve) as a more robust metric than a single precision/recall pair. The final output of the new evaluator should be the AUC (a scikit-learn sketch follows below).
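
As a concrete illustration of the last bullet, here is a minimal sketch of the PR-AUC computation using scikit-learn. The function name `compute_pr_auc` and the exact input format are assumptions for illustration, not existing code:

```python
from sklearn.metrics import auc, precision_recall_curve


def compute_pr_auc(ground_truth: list[int], probabilities: list[float]) -> float:
    """Area under the precision-recall curve for FP detection.

    ground_truth: 1 if the finding really is a false positive, 0 otherwise.
    probabilities: the FP probability (0-1) produced by the detector.
    """
    precision, recall, _thresholds = precision_recall_curve(ground_truth, probabilities)
    return auc(recall, precision)


# Tiny usage example with made-up data (perfect ranking -> PR-AUC of 1.0):
# compute_pr_auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
```

scikit-learn's `average_precision_score` is a closely related alternative that summarizes the same curve without trapezoidal interpolation; either could serve as the final metric.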
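A sketch of the three evaluator components, building on `compute_pr_auc` above. The class and field names here are hypothetical; the actual shapes should follow the existing FP detection evaluator and the result format LangSmith expects:

```python
from typing import TypedDict

from pydantic import BaseModel


class FPDetectionEvalInput(BaseModel):
    """Input component: a pydantic BaseModel for type safety (hypothetical fields)."""

    fp_probabilities: list[float]  # per-finding probability of being a false positive
    ground_truth: list[int]        # per-finding ground-truth label (1 = false positive)


class FPDetectionEvalResult(TypedDict):
    """Output component: the result pushed to LangSmith, as a TypedDict."""

    key: str      # metric name displayed in LangSmith
    score: float  # metric value, here the PR-AUC


def evaluate_fp_detection(eval_input: FPDetectionEvalInput) -> FPDetectionEvalResult:
    """Main function component: compute the metric and return the LangSmith result."""
    # compute_pr_auc is defined in the previous sketch.
    score = compute_pr_auc(eval_input.ground_truth, eval_input.fp_probabilities)
    return FPDetectionEvalResult(key="fp_detection_pr_auc", score=score)
```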
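Finally, a purely hypothetical sketch of the wiring step. The real top-level `evaluate` function already exists in the codebase and its signature may differ; only the idea of registering the new evaluator alongside the existing ones is shown:

```python
from collections.abc import Callable

# Hypothetical registry; in the real code this corresponds to wherever the
# top-level evaluate() collects its evaluators.
EVALUATORS: dict[str, Callable] = {
    # "existing_metric": existing_evaluator,  # ...existing evaluators...
    "fp_detection_pr_auc": evaluate_fp_detection,  # new evaluator from the sketch above
}


def evaluate(dataset_name: str = "vulnerability.resolution.2.subset") -> None:
    """Hypothetical top-level entry point: run every registered evaluator."""
    for name, evaluator in EVALUATORS.items():
        # In the real implementation: load the ground-truth dataset from LangSmith,
        # build the evaluator input, call `evaluator(...)`, and push the result.
        print(f"Would run evaluator {name!r} on dataset {dataset_name!r}")
```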