Vulnerability Resolution - Iterate on the Prompt - Short-Loop Evaluation
Preferred option - Running prompt-library on GDK
Implementation plan:
- Seed the GDK with projects including vulnerabilities (extend GitLab Direct Transfer)
- Export GDK vulnerabilities to JSONL format
- Wait for feat: move vulnerability extraction to Prompt L... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!827 - merged) • Andras Herczeg • 17.6 to be merged (export from GitLab to v4)
- Wait for additional extraction steps (from v4 to v7)
- See https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/work_items/378#note_2185995609
- Filter JSONL (to select only a small subset: by CWE, by language, ...)
- Run Prompt Library (input = JSONL file, output = JSONL file)
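The filtering step above could be sketched as follows. This is a minimal illustration, not the actual tooling: the field names `cwe` and `language` are assumptions about the exported JSONL schema, and the helper names are hypothetical. Adjust them to whatever the extraction steps actually produce.

```python
import json
from typing import Dict, Iterable, Iterator, Optional, Set


def filter_records(
    lines: Iterable[str],
    cwes: Optional[Set[str]] = None,
    languages: Optional[Set[str]] = None,
) -> Iterator[Dict]:
    """Yield the JSONL records matching the requested CWEs and languages.

    A criterion left as None is not applied. The "cwe" and "language"
    keys are assumed field names, not a confirmed export schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        if cwes is not None and record.get("cwe") not in cwes:
            continue
        if languages is not None and record.get("language") not in languages:
            continue
        yield record


def write_jsonl(records: Iterable[Dict], path: str) -> int:
    """Write records to path, one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

For example, `write_jsonl(filter_records(open("vulnerabilities.jsonl", encoding="utf-8"), cwes={"CWE-89"}, languages={"ruby"}), "subset.jsonl")` would produce a smaller input file for the Prompt Library run (the file names here are placeholders).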
Original Description
We are close to having CEF in place for Vulnerability Resolution,
which allows us to assess the feature's quality.
This assessment provides detailed insights into the feature's performance.
Our next goal is to enhance the feature's quality by improving the CEF indicators.
These improvements will primarily involve modifications to the prompt.
When adjusting the prompt, we must ensure that no significant regressions are introduced.
Ideally, we would like to run the evaluation on the branch before merging.
However, running the evaluation on the whole dataset takes too much time (~48 hours).
The goal of this issue is to set up a process for evaluating changes to the prompt before merging.
Several options have been considered:
- Simulation-based LLM Judge in prompt-library
- Evaluation in LangSmith
- Running prompt-library on GDK
Also, the CEF will be able to pinpoint data points that need particular attention.
We need a way to run a local evaluation on those particular data points.
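One possible way to build such a targeted input is to pick records out of the full JSONL export by identifier. The sketch below assumes each record carries a unique identifier under an `id` key; that key name and the function name are hypothetical, since the issue does not specify how the CEF refers to data points.

```python
import json
from typing import Dict, Iterable, List, Sequence, Tuple


def select_by_id(
    lines: Iterable[str],
    wanted_ids: Sequence,
    id_field: str = "id",  # assumed key name, not a confirmed schema
) -> Tuple[List[Dict], List]:
    """Return (records, missing_ids) for the flagged data points.

    Records come back in the order of wanted_ids, so the local run
    follows the order in which the data points were flagged.
    """
    wanted = set(wanted_ids)
    found = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        record_id = record.get(id_field)
        if record_id in wanted:
            found[record_id] = record
    records = [found[i] for i in wanted_ids if i in found]
    missing = [i for i in wanted_ids if i not in found]
    return records, missing
```

Returning the identifiers that were not found makes it easier to notice when a flagged data point is absent from the local GDK export.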
Note: a particularity of the VR LLM Judge is that it does not judge the output of the LLM directly. It judges the output of VR, including the creation of the MR (applying the LLM's suggestion to the actual code).