Vulnerability Resolution - Iterate on the Prompt - Short-Loop Evaluation

Preferred option - Running prompt-library on GDK

Implementation plan:

Original Description

We are close to having CEF in place for Vulnerability Resolution, which allows us to assess the feature's quality.
This assessment provides detailed insights into the feature's performance.

Our next goal is to enhance the feature's quality by improving the CEF indicators.
These improvements will primarily involve modifications to the prompt.

When adjusting the prompt, we must ensure that no significant regressions are introduced.
Ideally, we would like to run the evaluation on the branch before merging.
However, running the evaluation on the whole dataset takes too much time (~48 hours).

The goal of this issue is to set up a process for evaluating changes to the prompt before merging.

Several options have been considered:

  1. Simulation-based LLM Judge in prompt-library
  2. Evaluation in LangSmith
  3. Running prompt-library on GDK

Also, the CEF will be able to pinpoint data points that need particular attention.
We need a way to run a local evaluation on those specific data points.
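As a rough illustration of that targeted local run, the sketch below filters an in-memory dataset down to the CEF-flagged data points before handing them to the evaluation. Everything here is hypothetical: the function name, the `id` field, and the dataset shape are assumptions, not the actual prompt-library API.

```python
def select_flagged_points(dataset, flagged_ids):
    """Keep only the data points whose IDs the CEF flagged for attention.

    `dataset` is assumed to be a list of dicts with an "id" key; the real
    prompt-library dataset format may differ.
    """
    wanted = set(flagged_ids)
    return [dp for dp in dataset if dp["id"] in wanted]


# Example: a small in-memory stand-in for the full evaluation dataset.
dataset = [
    {"id": "vuln-001", "cwe": "CWE-79"},
    {"id": "vuln-002", "cwe": "CWE-89"},
    {"id": "vuln-003", "cwe": "CWE-22"},
]

# Only the flagged subset would then be passed to the local evaluation run,
# keeping the loop short instead of re-running all ~48 hours of data points.
subset = select_flagged_points(dataset, ["vuln-002"])
```

The same idea could be exposed as a CLI flag (e.g. a list of IDs) so that a developer iterating on the prompt only re-evaluates the points the CEF singled out.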

Note: a particularity of the VR LLM Judge is that it does not judge the LLM's output directly; it judges the output of VR as a whole, including the creation of the MR (applying the LLM's suggestion to the actual code).

Edited by Meir Benayoun