
Explain the Vulnerability Evaluation Methodology

Problem to solve

As we build our curated dataset as part of gitlab-org/gitlab#427253 (closed) for Chat, we would like to start validating whether similarity and/or cross similarity (as currently implemented in the Prompt Library algorithm) is a good proxy for evaluating Chat.

We also want to understand whether Consensus Filtering with an LLM-based evaluator is a good proxy for correctness for Chat.

There are two elements of the validation process to consider:

  1. confirming that the vulnerability code and corresponding description are related as part of our correctness assessment
  2. evaluating how well the information is being synthesized by the foundational LLM.

Proposal

Metric 1 - Similarity Score as a Comparison Across LLMs

For Explain the Vulnerability, as we curate the dataset and assess which evaluation methods are good proxies for quality, here is the rough methodology we will iterate on.

This echoes the method from the proposal in the epic.

**Ground Truth:** Known good answers from the base LLMs, matched with the detailed description from CWE.

We compare the embeddings of the generated answers with the embeddings of the detailed extended description of the code; a minimal sketch of this comparison follows the list below.

  • Populate all the outputs for the curated dataset using text-bison, claude, code-bison, code-llama, and llama-2
  • Run the Prompt Library pipeline with cross similarity and similarity, and manually spot check whether the hypothesis still holds for the curated dataset; test again against the manual dataset
  • Test whether the responses come close to these answers using matching scores / matching algorithms, e.g. by comparing cosine similarity or cross similarity
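
A minimal sketch of the Metric 1 comparison, assuming a generic sentence-embedding model; the `all-MiniLM-L6-v2` model name and the function names are illustrative placeholders, not the embeddings used in the Prompt Library:

```python
# Sketch only: compares LLM answers to the CWE detailed description via cosine
# similarity, and two LLM answers to each other via cross similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def similarity_to_ground_truth(llm_answer: str, cwe_detailed_description: str) -> float:
    """Cosine similarity between an LLM answer and the CWE detailed description."""
    answer_emb = model.encode(llm_answer, convert_to_tensor=True)
    reference_emb = model.encode(cwe_detailed_description, convert_to_tensor=True)
    return util.cos_sim(answer_emb, reference_emb).item()

def cross_similarity(answer_a: str, answer_b: str) -> float:
    """Cosine similarity between the answers of two different LLMs."""
    emb_a = model.encode(answer_a, convert_to_tensor=True)
    emb_b = model.encode(answer_b, convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).item()
```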

Metric 2 - Consensus Filtering with LLM-Based Evaluation

**Ground Truth:** Output of a chosen LLM evaluator based on instructions and specific criteria.

We use an LLM to evaluate the output against specific criteria. With the same setup as above, this is done via a prompt instructing the LLM to provide a verdict on a generated answer against the reference: https://arxiv.org/pdf/1609.08097.pdf

We use a different LLM, claude-2, for Explain the Vulnerability (the "evaluation LLM") to evaluate the responses generated by the target LLM. The evaluation LLM is typically chosen for its strong language understanding capabilities and is used to assess the quality of responses, for example: which of the outputs is considered correct or preferred by a developer?

As an example: Chat-eval (screenshot).
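
A minimal sketch of this consensus-filtering step, assuming a hypothetical `complete(model, prompt)` helper that wraps whichever claude-2 client is used; the prompt wording and the 1-10 developer-preference scale are illustrative only:

```python
# Sketch only: asks the evaluation LLM for a 1-10 developer-preference verdict
# on a candidate answer, judged against the CWE detailed description.
import re

EVAL_PROMPT = """You are reviewing explanations of a security vulnerability.

Reference (CWE detailed description):
{reference}

Candidate answer:
{candidate}

On a scale of 1-10, how likely is a developer to prefer this answer as a
correct and useful explanation? Reply with the number only."""

def complete(model: str, prompt: str) -> str:
    raise NotImplementedError("placeholder for the actual claude-2 API call")

def llm_evaluator_score(reference: str, candidate: str, model: str = "claude-2") -> int:
    """Parse the 1-10 verdict out of the evaluation LLM's reply."""
    reply = complete(model, EVAL_PROMPT.format(reference=reference, candidate=candidate))
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else 0
```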

Metric 3 - Word-Level Metrics

Another evaluation approach compares the reference and generated output at the word/token (or word/token group) level. Several evaluation metrics are available, such as BLEU, ROUGE, Perplexity, and BERTScore.
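
A minimal sketch of Metric 3, assuming the `rouge-score` and `bert-score` packages; which word-level metrics to include is still an open question:

```python
# Sketch only: ROUGE-L and BERTScore F1 between the reference description
# and a generated answer.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def word_level_metrics(reference: str, candidate: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {"rougeL": rouge_l, "bertscore_f1": f1.item()}
```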

Final Score

16.6: The final score will be a combination of Metric 1 and Metric 2.

16.7: We incorporate the additional metric (Metric 3).
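
A minimal sketch of how the combination could work; the equal 0.5/0.5 weights and the 0.7 acceptance threshold are placeholder assumptions, not agreed values:

```python
# Sketch only: combine Metric 1 (0-1 cosine similarity) and Metric 2
# (1-10 evaluator verdict, rescaled to 0-1) into a final score and an
# acceptance proxy.
def final_score(similarity: float, evaluator_score: int,
                w_similarity: float = 0.5, w_evaluator: float = 0.5) -> float:
    return w_similarity * similarity + w_evaluator * (evaluator_score / 10.0)

def acceptance_proxy(score: float, threshold: float = 0.7) -> bool:
    """Final Acceptance Proxy (Y/N) derived from the combined score."""
    return score >= threshold
```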

The fields in the table (a sketch of the row schema follows the list):

  1. Context
  2. Question
  3. Identifiers and Description
  4. LLM 1 output
  5. LLM 2 output
  6. Similarity of Vulnerability Detailed Description and LLM 1 output
  7. Similarity of Vulnerability Detailed Description and LLM 2 output
  8. Similarity of LLM1 and LLM2
  9. Based on the similarity comparison, is the answer accepted (Y/N)
  10. Human Labelling (if any)
  11. Vulnerability LLM Evaluator Score (criteria e.g. developer preference, range 1-10)
  12. Vulnerability LLM Evaluator Score (criteria e.g. developer preference, range 1-10)
  13. Vulnerability LLM Evaluator Score (criteria e.g. developer preference, range 1-10)
  14. Winning LLM
  15. Vulnerability Score / Winning LLM Index
  16. Final Score Index
  17. Final Acceptance Proxy (Y/N)
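
A minimal sketch of one row in the curated dataset, mirroring the field list above; the names and types are illustrative, not a finalized schema:

```python
# Sketch only: one evaluation row per context/question pair.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationRow:
    context: str                      # 1. Context (vulnerable code)
    question: str                     # 2. Question
    identifiers_and_description: str  # 3. Identifiers and Description
    llm1_output: str                  # 4.
    llm2_output: str                  # 5.
    sim_description_llm1: float       # 6. Similarity vs. detailed description
    sim_description_llm2: float       # 7.
    sim_llm1_llm2: float              # 8. Cross similarity between LLM outputs
    accepted_by_similarity: bool      # 9. Y/N
    human_label: Optional[str]        # 10. Human labelling, if any
    evaluator_score_1: int            # 11. 1-10 developer-preference verdict
    evaluator_score_2: int            # 12.
    evaluator_score_3: int            # 13.
    winning_llm: str                  # 14.
    winning_llm_index: float          # 15. Vulnerability Score / Winning LLM Index
    final_score_index: float          # 16.
    final_acceptance_proxy: bool      # 17. Y/N
```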

Further details

TBD
