Explain the Vulnerability Evaluation Methodology
Problem to solve
As we build our curated dataset as part of gitlab-org/gitlab#427253 (closed) for chat, we would like to start validating whether similarity and/or cross-similarity (as currently implemented in the Prompt Library algorithm) is a good proxy for evaluating Chat.
We would also like to understand whether Consensus Filtering with an LLM-based evaluator is a good proxy for correctness for Chat.
There are two elements of the validation process to consider:
- confirming that the vulnerability code and its corresponding description are related, as part of our correctness assessment
- evaluating how well the information is synthesized by the foundational LLM.
Proposal
Metric 1 - Similarity Score as a comparison across LLMs
For Explain the Vulnerability, as we curate the dataset and identify evaluation methods that serve as good proxies for quality, here is the rough methodology we would follow.
Echoing the method from the proposal in the epic:
**Ground Truth:** Known good answers from base LLMs, matched against the Detailed Description from CWE.
We would compare the answers and their embeddings with the embeddings of the detailed extended description of the code.
- Populate all the outputs for the curated dataset using text-bison, claude, code-bison, code-llama, and llama-2
- Run the Prompt Library pipeline with similarity and cross-similarity, and manually spot-check whether the hypothesis still holds for the curated dataset; test again against the manual dataset
- Test whether the responses come close to these answers using matching scores / matching algorithms, e.g. by comparing cosine similarity or cross-similarity
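The similarity check above can be sketched as follows. This is a minimal illustration assuming embeddings are already available as plain float vectors (the toy vectors below are placeholders, not real model embeddings):

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for the embedding of an LLM answer and the
# embedding of the CWE extended description (hypothetical values).
answer_vec = [0.1, 0.3, 0.5]
reference_vec = [0.2, 0.1, 0.4]
score = cosine_similarity(answer_vec, reference_vec)
```

In practice the vectors would come from whichever embedding model the Prompt Library pipeline uses; the acceptance threshold on `score` is a tuning decision informed by the manual spot checks.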
Metric 2 - Consensus Filtering with LLM-Based Evaluation
Ground Truth: Output of a chosen LLM evaluator based on instructions and specific criteria.
We use an LLM to evaluate the output against specific criteria. With the same setup as above, this is done by prompting the LLM to provide a verdict on a generated answer against the reference (https://arxiv.org/pdf/1609.08097.pdf).
We use a different LLM, claude-2, for Explain the Vulnerability (the "evaluation LLM") to evaluate the responses generated by the target LLM. The evaluation LLM is typically chosen for its strong language-understanding capabilities and is used to assess the quality of responses, for example: which of the outputs would a developer consider correct or prefer?
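The evaluator setup above amounts to a judge prompt sent to the evaluation LLM. A minimal sketch of such a prompt builder (the wording and 1-10 scale are illustrative assumptions, not the final prompt):

```python
JUDGE_PROMPT = """You are reviewing an explanation of a code vulnerability.

Reference description (from CWE):
{reference}

Candidate answer:
{candidate}

On a scale of 1-10, how correct and useful is the candidate answer
for a developer, judged against the reference? Reply with the number only."""

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Fill the judge template with the CWE reference and an LLM answer."""
    return JUDGE_PROMPT.format(reference=reference, candidate=candidate)
```

The resulting string would then be sent to the evaluation LLM (claude-2 here); parsing its numeric reply back out is a separate step and needs its own validation.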
Metric 3 - Word-Level Metrics
Another evaluation approach compares the reference and generated output at the word/token (or word/token group) level. Several evaluation metrics are available, such as BLEU, ROUGE, Perplexity, and BERTScore.
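As a concrete instance of a word-level metric, here is a minimal pure-Python ROUGE-1 F1 (unigram overlap); production use would rely on an established library rather than this sketch:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    ref_tokens = Counter(reference.lower().split())
    cand_tokens = Counter(candidate.lower().split())
    overlap = sum((ref_tokens & cand_tokens).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)
```

BLEU, Perplexity, and BERTScore follow the same pattern of comparing reference and generated text, but at different granularities (n-grams, token likelihoods, contextual embeddings).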
Final Score
- The final score will be a combination of Metric 1 and Metric 2
- We incorporate the additional metric (Metric 3) where useful
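One simple way to combine the metrics is an equal-weight average, with the word-level metric folded in when present. The weighting here is purely illustrative; the actual combination is still to be decided:

```python
from typing import Optional

def final_score(similarity_score: float,
                evaluator_score: float,
                word_level_score: Optional[float] = None) -> float:
    """Illustrative combination: equal-weight mean of the available
    metrics, each assumed normalized to [0, 1]."""
    scores = [similarity_score, evaluator_score]
    if word_level_score is not None:
        scores.append(word_level_score)
    return sum(scores) / len(scores)
```

Note the LLM evaluator's 1-10 verdict would need to be rescaled to [0, 1] before being combined with the similarity metrics.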
The fields in the table:
- Context
- Question
- Identifier and Description
- LLM 1 output
- LLM 2 output
- Similarity of Vulnerability Detailed Description and LLM 1 output
- Similarity of Vulnerability Detailed Description and LLM 2 output
- Similarity of LLM1 and LLM2
- Based on similarity comparison, is the answer accepted (Y/N)
- Human labelling (if any)
- Vulnerability LLM Evaluator Score (criteria e.g. developer preference; range 1-10)
- Vulnerability LLM Evaluator Score (criteria e.g. developer preference; range 1-10)
- Vulnerability LLM Evaluator Score (criteria e.g. developer preference; range 1-10)
- Winning LLM
- Vulnerability Score / Winning LLM Index
- Final Score Index
- Final Acceptance Proxy ( Y/N)
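The table schema above could be represented per-row roughly as follows. All field names here are hypothetical renderings of the columns listed above, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvaluationRow:
    """One row of the curated evaluation dataset (illustrative schema)."""
    context: str
    question: str
    identifier_and_description: str
    llm1_output: str
    llm2_output: str
    sim_description_llm1: float      # description vs LLM 1 output
    sim_description_llm2: float      # description vs LLM 2 output
    sim_llm1_llm2: float             # LLM 1 vs LLM 2
    accepted_by_similarity: bool     # Y/N from similarity comparison
    human_label: Optional[str] = None
    evaluator_scores: dict = field(default_factory=dict)  # e.g. {"llm1": 8}, range 1-10
    winning_llm: Optional[str] = None
    final_score_index: Optional[float] = None
    final_acceptance: Optional[bool] = None
```

Keeping the row as a dataclass makes it straightforward to export to CSV/JSON for the spot-checking step.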
Further details
TBD