AI Model Validation: Methodology for Comparing Known Good Answers with Chat Answers
Problem to solve
As we build our curated dataset for Chat as part of gitlab-org/gitlab#427253 (closed), we would like to start validating whether similarity and/or cross similarity (as currently implemented in the Prompt Library's algorithm) is a good proxy for evaluating Chat.
We also want to understand whether Consensus Filtering with an LLM-based evaluator is a good proxy for correctness for Chat.
Proposal
Metric 1: Similarity Score as a Comparison Across LLMs
For Chat, we ran the similarity score on a small dataset (5 prompts) and saw that it is a good proxy: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/64. The next step is to run it on the larger curated dataset to see whether the same assumption holds. We will also test the cross similarity score.
Echoing the method from the proposal in the epic:
Ground Truth: known good answers from the base LLMs (text-bison, claude, llama).
We would compare the chat answers and their embeddings with the embeddings of the known good answers, treating the other LLMs as the ground truth.
- Populate all the known good answers for the curated dataset through text-bison, claude, and code-bison(?).
- Run the Prompt Library pipeline with cross similarity and similarity, and manually spot-check whether the hypothesis still holds across the 553 questions.
- Test whether Chat's responses come close to these answers using matching scores / matching algorithms, e.g. by comparing cosine similarity or cross similarity (see the sketch after this list).
- Go through the results to identify patterns in how Chat fails, and improve the prompts accordingly.
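A minimal sketch of how such a comparison could be computed, assuming an off-the-shelf bi-encoder for embedding similarity and a cross-encoder for cross similarity; the model names below are illustrative stand-ins, not the ones the Prompt Library pipeline uses:

```python
# Sketch only: compare a chat answer against known good answers via
# (a) cosine similarity over embeddings and (b) a cross-encoder score.
# Model names are illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chat_answer = "You can revert a commit with `git revert <sha>`."
known_good_answers = [
    "Use `git revert <sha>` to create a commit that undoes the change.",
    "Run `git revert` followed by the SHA of the commit you want to undo.",
]

chat_embedding = bi_encoder.encode(chat_answer)
for reference in known_good_answers:
    similarity = cosine_similarity(chat_embedding, bi_encoder.encode(reference))
    cross_similarity = cross_encoder.predict([(chat_answer, reference)])[0]
    print(f"similarity={similarity:.3f}  cross_similarity={cross_similarity:.3f}")
```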
Metric 2: Consensus Filtering with LLM-Based Evaluation
Ground Truth: output of a chosen LLM evaluator based on instructions and specific criteria.
We use an LLM to evaluate the output against specific criteria. With the same setup as above, this is done with a prompt instructing the LLM to give a verdict on a generated answer against the reference (https://arxiv.org/pdf/1609.08097.pdf).
We use a different LLM, text-bison (the "evaluation LLM"), to evaluate the responses generated by the target LLM. The evaluation LLM is typically chosen for its strong language understanding capabilities and is used to assess the quality of responses, for example: which of all the outputs is considered correct or preferred by a developer?
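A rough sketch of that judge step, assuming a simple score-and-justify prompt; the prompt wording and the `call_evaluation_llm` helper are hypothetical placeholders for whatever prompt and text-bison client the pipeline ends up using:

```python
# Sketch only: LLM-as-judge scoring of a chat answer against a reference.
# The prompt text and the model client are hypothetical placeholders.
EVAL_PROMPT = """You are evaluating an AI chat assistant.

Question:
{question}

Reference answer (known good):
{reference}

Candidate answer (chat):
{candidate}

Rate the candidate answer for correctness against the reference on a
scale of 1-10. Respond exactly as: SCORE: <n> | REASON: <one sentence>
"""

def call_evaluation_llm(prompt: str) -> str:
    """Placeholder for the real evaluation-LLM client (e.g. text-bison)."""
    raise NotImplementedError("wire up the actual model client here")

def judge_correctness(question: str, reference: str, candidate: str) -> int:
    prompt = EVAL_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    raw = call_evaluation_llm(prompt)
    # Parse "SCORE: 8 | REASON: ..." defensively.
    score_field = raw.split("|")[0]
    return int(score_field.split(":", 1)[1].strip())
```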
Metric 3: Word-Level Metrics
Another evaluation approach compares the reference and generated output at the word/token (or word/token group) level. Several evaluation metrics are available, such as BLEU, ROUGE, Perplexity, and BERTScore.
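For illustration, here is how two of these metrics could be computed with common open-source packages (nltk for BLEU, rouge-score for ROUGE); which metrics we actually adopt is still open:

```python
# Sketch only: two word-level metrics on a reference/candidate pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Use git revert followed by the commit SHA to undo a commit."
candidate = "You can undo a commit with git revert and the commit SHA."

# BLEU operates on token lists; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures longest-common-subsequence overlap on the raw strings.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)

print(f"BLEU={bleu:.3f}  ROUGE-L F1={rouge_l['rougeL'].fmeasure:.3f}")
```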
Final Score
- 16.6: The final score will be a combination of Metric 1 and Metric 2.
- 16.7: We incorporate the additional metric (Metric 3).
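As one illustration of what the combination could look like; the weights and thresholds below are invented for the sketch, not decided:

```python
# Sketch only: combining Metric 1 (similarity, 0-1) and Metric 2
# (LLM evaluator score, 1-10) into the final acceptance proxy.
# Weights and thresholds are placeholder assumptions.
def final_acceptance_proxy(similarity: float, evaluator_score: int,
                           w_sim: float = 0.5, w_eval: float = 0.5) -> str:
    combined = w_sim * similarity + w_eval * (evaluator_score / 10.0)
    if combined >= 0.75:
        return "High"
    if combined >= 0.50:
        return "Medium"
    return "Low"
```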
The fields in the table:
- Context
- Task
- Question
- Chat Output
- LLM 2 (Known Good Answer 1)
- LLM 3 (Known Good Answer 2)
- Similarity of Known Good Answer 1 and Chat Output
- Similarity of Known Good Answer 2 and Chat Output
- Similarity of the two Known Good Answers
- Based on similarity comparison, is the chat answer accepted (Y/N)
- Human Labelling (if any)
- Chat LLM Evaluator Score (criteria e.g. Correctness; range 1-10)
- LLM 2 LLM Evaluator Score (criteria e.g. Readability; range 1-10)
- LLM 3 LLM Evaluator Score (criteria e.g. Comprehensiveness; range 1-10)
- Winning LLM
- Chat Score / Winning LLM Index
- Final Score Index
- Final Acceptance Proxy (High/Medium/Low)
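For reference, one possible in-code representation of a row in this table; the field names paraphrase the list above and are hypothetical:

```python
# Sketch only: one row of the results table as a typed record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationRecord:
    context: str
    task: str
    question: str
    chat_output: str
    known_good_answer_1: str          # LLM 2
    known_good_answer_2: str          # LLM 3
    sim_chat_vs_answer_1: float
    sim_chat_vs_answer_2: float
    sim_answer_1_vs_answer_2: float
    accepted_by_similarity: bool      # Y/N
    human_label: Optional[str]        # if any
    chat_evaluator_score: int         # e.g. Correctness, 1-10
    llm2_evaluator_score: int         # e.g. Readability, 1-10
    llm3_evaluator_score: int         # e.g. Comprehensiveness, 1-10
    winning_llm: str
    chat_score_vs_winner: float
    final_score_index: float
    final_acceptance_proxy: str       # High / Medium / Low
```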
Further details
TBD
Links / references
- Here is the reference on how the similarity and cross similarity scores are calculated: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/35+
- LLM-Based Evaluation: https://arxiv.org/pdf/1609.08097.pdf
- Best Practices for LLM Evaluation: https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG