AI Model Validation: Methodology for Comparing Known Good Answers with Chat Answers
Problem to solve
As we build our curated dataset for Chat as part of gitlab-org/gitlab#427253 (closed), we would like to start validating whether similarity and/or cross similarity (as currently implemented in the Prompt Library's algorithm) is a good proxy for evaluating Chat.
We also want to understand whether Consensus Filtering with an LLM-based evaluator is a good proxy for correctness for Chat.
Proposal
Metric 1: Similarity Score as a Comparison Across LLMs
For Chat, we ran the similarity score on a small dataset (5 prompts) and saw that it is a good proxy: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/64. The next step is to run it on the larger curated dataset to see whether the same assumption holds. We will also test the cross similarity score.
Echoing the method from the proposal in the epic:
Ground Truth: known good answers from the base LLMs (text-bison, claude, llama).
We would compare the chat answers and their embeddings with the embeddings of the known good answers, treating the other LLMs as the ground truth.
- Populate all the known good answers for the curated dataset through text-bison, claude, and code-bison(?).
- Run the Prompt Library pipeline with cross similarity and similarity, and manually spot-check whether the hypothesis still holds across the 553 questions.
- Test whether Chat's responses come close to these answers using matching scores / matching algorithms, e.g. by comparing cosine similarity or cross similarity (see the sketch after this list).
- Go through the results to identify patterns in how Chat fails, and improve the prompts accordingly.
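A minimal sketch of how such a comparison could be computed, assuming an off-the-shelf bi-encoder for embedding similarity and a cross-encoder for cross similarity; the model names below are illustrative stand-ins, not the ones the Prompt Library pipeline uses:

```python
# Sketch only: compare a chat answer against known good answers via
# (a) cosine similarity over embeddings and (b) a cross-encoder score.
# Model names are illustrative stand-ins.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chat_answer = "You can revert a commit with `git revert <sha>`."
known_good_answers = [
    "Use `git revert <sha>` to create a commit that undoes the change.",
    "Run `git revert` followed by the SHA of the commit you want to undo.",
]

chat_embedding = bi_encoder.encode(chat_answer)
for reference in known_good_answers:
    similarity = cosine_similarity(chat_embedding, bi_encoder.encode(reference))
    cross_similarity = cross_encoder.predict([(chat_answer, reference)])[0]
    print(f"similarity={similarity:.3f}  cross_similarity={cross_similarity:.3f}")
```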
Metric 2: Consensus Filtering with LLM-Based Evaluation
Ground Truth: output of a chosen LLM evaluator based on instructions and specific criteria.
We use an LLM to evaluate the output against specific criteria. With the same setup as above, this is done with a prompt instructing the LLM to give a verdict on a generated answer against the reference (https://arxiv.org/pdf/1609.08097.pdf).
We use a different LLM, text-bison (the "evaluation LLM"), to evaluate the responses generated by the target LLM. The evaluation LLM is typically chosen for its strong language understanding capabilities and is used to assess the quality of responses, for example: which of all the outputs is considered correct or preferred by a developer?
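A rough sketch of that judge step, assuming a simple score-and-justify prompt; the prompt wording and the `call_evaluation_llm` helper are hypothetical placeholders for whatever prompt and text-bison client the pipeline ends up using:

```python
# Sketch only: LLM-as-judge scoring of a chat answer against a reference.
# The prompt text and the model client are hypothetical placeholders.
EVAL_PROMPT = """You are evaluating an AI chat assistant.

Question:
{question}

Reference answer (known good):
{reference}

Candidate answer (chat):
{candidate}

Rate the candidate answer for correctness against the reference on a
scale of 1-10. Respond exactly as: SCORE: <n> | REASON: <one sentence>
"""

def call_evaluation_llm(prompt: str) -> str:
    """Placeholder for the real evaluation-LLM client (e.g. text-bison)."""
    raise NotImplementedError("wire up the actual model client here")

def judge_correctness(question: str, reference: str, candidate: str) -> int:
    prompt = EVAL_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    raw = call_evaluation_llm(prompt)
    # Parse "SCORE: 8 | REASON: ..." defensively.
    score_field = raw.split("|")[0]
    return int(score_field.split(":", 1)[1].strip())
```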
Metric 3: Word-Level Metrics
Another evaluation approach compares the reference and generated output at the word/token (or word/token group) level. Several evaluation metrics are available, such as BLEU, ROUGE, Perplexity, and BERTScore.
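For illustration, here is how two of these metrics could be computed with common open-source packages (nltk for BLEU, rouge-score for ROUGE); which metrics we actually adopt is still open:

```python
# Sketch only: two word-level metrics on a reference/candidate pair.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Use git revert followed by the commit SHA to undo a commit."
candidate = "You can undo a commit with git revert and the commit SHA."

# BLEU operates on token lists; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures longest-common-subsequence overlap on the raw strings.
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)

print(f"BLEU={bleu:.3f}  ROUGE-L F1={rouge_l['rougeL'].fmeasure:.3f}")
```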
Final Score
- 16.6: The final score will be a combination of Metric 1 and Metric 2.
- 16.7: We incorporate the additional metric (Metric 3).
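As one illustration of what the combination could look like; the weights and thresholds below are invented for the sketch, not decided:

```python
# Sketch only: combining Metric 1 (similarity, 0-1) and Metric 2
# (LLM evaluator score, 1-10) into the final acceptance proxy.
# Weights and thresholds are placeholder assumptions.
def final_acceptance_proxy(similarity: float, evaluator_score: int,
                           w_sim: float = 0.5, w_eval: float = 0.5) -> str:
    combined = w_sim * similarity + w_eval * (evaluator_score / 10.0)
    if combined >= 0.75:
        return "High"
    if combined >= 0.50:
        return "Medium"
    return "Low"
```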
The fields in the table:
- Context
- Task
- Question
- Chat Output
- LLM 2 (Known Good Answer 1)
- LLM 3 (Known Good Answer 2)
- Similarity of Known Good Answer 1 and Chat Output
- Similarity of Known Good Answer 2 and Chat Output
- Similarity of the two Known Good Answers
- Based on similarity comparison, is the chat answer accepted (Y/N)
- Human Labelling (if any)
- Chat LLM Evaluator Score (criteria e.g. Correctness; range 1-10)
- LLM 2 LLM Evaluator Score (criteria e.g. Readability; range 1-10)
- LLM 3 LLM Evaluator Score (criteria e.g. Comprehensiveness; range 1-10)
- Winning LLM
- Chat Score / Winning LLM Index
- Final Score Index
- Final Acceptance Proxy (High/Medium/Low)
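For reference, one possible in-code representation of a row in this table; the field names paraphrase the list above and are hypothetical:

```python
# Sketch only: one row of the results table as a typed record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationRecord:
    context: str
    task: str
    question: str
    chat_output: str
    known_good_answer_1: str          # LLM 2
    known_good_answer_2: str          # LLM 3
    sim_chat_vs_answer_1: float
    sim_chat_vs_answer_2: float
    sim_answer_1_vs_answer_2: float
    accepted_by_similarity: bool      # Y/N
    human_label: Optional[str]        # if any
    chat_evaluator_score: int         # e.g. Correctness, 1-10
    llm2_evaluator_score: int         # e.g. Readability, 1-10
    llm3_evaluator_score: int         # e.g. Comprehensiveness, 1-10
    winning_llm: str
    chat_score_vs_winner: float
    final_score_index: float
    final_acceptance_proxy: str       # High / Medium / Low
```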
Further details
TBD
Links / references
- Here is the reference on how the similarity and cross similarity scores are calculated: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/35+
- LLM-Based Evaluation: https://arxiv.org/pdf/1609.08097.pdf
- Best Practices for LLM Evaluation: https://www.databricks.com/blog/LLM-auto-eval-best-practices-RAG