Explain the Vulnerability Evaluation Methodology
Problem to solve
As we build our curated dataset as part of gitlab-org/gitlab#427253 (closed) for chat, we would like to start validating whether similarity and/or cross-similarity (as currently implemented in the Prompt Library algorithm) is a good proxy for evaluating Chat.
We would also like to understand whether Consensus Filtering with an LLM-based evaluator is a good proxy for correctness for Chat.
There are two elements of the validation process to consider:
- confirming that the vulnerability code and its corresponding description are related, as part of our correctness assessment
- evaluating how well the information is synthesized by the foundational LLM.
Proposal
Metric 1 - Similarity Score as a comparison across LLMs
For Explain the Vulnerability, as we curate the dataset and identify evaluation methods that serve as good proxies for quality, here is the rough methodology we would follow.
Echoing the method from the proposal in the epic:
**Ground Truth:** Known good answers from base LLMs, matched against the Detailed Description from CWE.
We would compare the answers and their embeddings with the embeddings of the detailed extended description of the code.
- Populate all the outputs for the curated dataset using text-bison, claude, code-bison, code-llama, and llama-2
- Run the Prompt Library pipeline with similarity and cross-similarity, and manually spot-check whether the hypothesis still holds for the curated dataset; test again against the manual dataset
- Test whether the responses come close to these answers using matching scores / matching algorithms, e.g. by comparing cosine similarity or cross-similarity
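The similarity check above can be sketched as follows. This is a minimal illustration assuming embeddings are already available as plain float vectors (the toy vectors below are placeholders, not real model embeddings):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the embedding of an LLM answer and the
# embedding of the CWE extended description (hypothetical values).
answer_vec = [0.1, 0.3, 0.5]
reference_vec = [0.2, 0.1, 0.4]
score = cosine_similarity(answer_vec, reference_vec)
```

In practice the vectors would come from whichever embedding model the Prompt Library pipeline uses; the acceptance threshold on `score` is a tuning decision informed by the manual spot checks.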
Metric 2 - Consensus Filtering with LLM-Based Evaluation
Ground Truth: Output of a chosen LLM evaluator based on instructions and specific criteria.
We use an LLM to evaluate the output against specific criteria. With the same setup as above, this is done by prompting the LLM to provide a verdict on a generated answer against the reference (https://arxiv.org/pdf/1609.08097.pdf).
We use a different LLM, claude-2, for Explain the Vulnerability (the "evaluation LLM") to evaluate the responses generated by the target LLM. The evaluation LLM is typically chosen for its strong language-understanding capabilities and is used to assess the quality of responses, for example: which of the outputs would a developer consider correct or prefer?
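The evaluator setup above amounts to a judge prompt sent to the evaluation LLM. A minimal sketch of such a prompt builder (the wording and 1-10 scale are illustrative assumptions, not the final prompt):

```python
JUDGE_PROMPT = """You are reviewing an explanation of a code vulnerability.

Reference description (from CWE):
{reference}

Candidate answer:
{candidate}

On a scale of 1-10, how correct and useful is the candidate answer
for a developer, judged against the reference? Reply with the number only."""

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Fill the judge template with the CWE reference and an LLM answer."""
    return JUDGE_PROMPT.format(reference=reference, candidate=candidate)
```

The resulting string would then be sent to the evaluation LLM (claude-2 here); parsing its numeric reply back out is a separate step and needs its own validation.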
Metric 3 - Word-Level Metrics
Another evaluation approach compares the reference and generated output at the word/token (or word/token group) level. Several evaluation metrics are available, such as BLEU, ROUGE, Perplexity, and BERTScore.
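As a concrete instance of a word-level metric, here is a minimal pure-Python ROUGE-1 F1 (unigram overlap); production use would rely on an established library rather than this sketch:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_tokens = Counter(reference.lower().split())
    cand_tokens = Counter(candidate.lower().split())
    overlap = sum((ref_tokens & cand_tokens).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

BLEU, Perplexity, and BERTScore follow the same pattern of comparing reference and generated text, but at different granularities (n-grams, token likelihoods, contextual embeddings).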
Final Score
- The final score will be a combination of Metric 1 and Metric 2
- We incorporate the additional metric (Metric 3) where useful
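One simple way to combine the metrics is an equal-weight average, with the word-level metric folded in when present. The weighting here is purely illustrative; the actual combination is still to be decided:

```python
from typing import Optional

def final_score(similarity_score: float,
                evaluator_score: float,
                word_level_score: Optional[float] = None) -> float:
    """Illustrative combination: equal-weight mean of the available
    metrics, each assumed normalized to [0, 1]."""
    scores = [similarity_score, evaluator_score]
    if word_level_score is not None:
        scores.append(word_level_score)
    return sum(scores) / len(scores)
```

Note the LLM evaluator's 1-10 verdict would need to be rescaled to [0, 1] before being combined with the similarity metrics.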
The fields in the table:
- Context
- Question
- Identifier and Description
- LLM 1 output
- LLM 2 output
- Similarity of Vulnerability Detailed Description and LLM 1 output
- Similarity of Vulnerability Detailed Description and LLM 2 output
- Similarity of LLM1 and LLM2
- Based on similarity comparison, is the answer accepted (Y/N)
- Human labelling (if any)
- Vulnerability LLM Evaluator Score (criteria e.g. developer preference; range 1-10)
- Vulnerability LLM Evaluator Score (criteria e.g. developer preference; range 1-10)
- Vulnerability LLM Evaluator Score (criteria e.g. developer preference; range 1-10)
- Winning LLM
- Vulnerability Score / Winning LLM Index
- Final Score Index
- Final Acceptance Proxy ( Y/N)
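The table schema above could be represented per-row roughly as follows. All field names here are hypothetical renderings of the columns listed above, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvaluationRow:
    """One row of the curated evaluation dataset (illustrative schema)."""
    context: str
    question: str
    identifier_and_description: str
    llm1_output: str
    llm2_output: str
    sim_description_llm1: float      # description vs LLM 1 output
    sim_description_llm2: float      # description vs LLM 2 output
    sim_llm1_llm2: float             # LLM 1 vs LLM 2
    accepted_by_similarity: bool     # Y/N from similarity comparison
    human_label: Optional[str] = None
    evaluator_scores: dict = field(default_factory=dict)  # e.g. {"llm1": 8}, range 1-10
    winning_llm: Optional[str] = None
    final_score_index: Optional[float] = None
    final_acceptance: Optional[bool] = None
```

Keeping the row as a dataclass makes it straightforward to export to CSV/JSON for the spot-checking step.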
Further details
TBD