Iterating on Chat Eval Metrics
Problem to solve
We have made good progress adding LLM-based evaluation (consensus filtering), but we want to keep iterating on the metrics so that they get as close as possible to human judgment. This issue is to explore various approaches for doing that.
Proposal
- Put all the answers from different LLMs together and ask another LLM judge to compare and grade them (task); see the sketch below.
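
A minimal sketch of this proposal: collect candidate answers produced by several LLMs, then send them side by side to a separate judge model that compares and grades them. It assumes the OpenAI Python client (>= 1.0); the judge model name, prompt wording, 1-10 scale, and JSON output format are placeholders rather than a settled design.

```python
# Sketch: side-by-side grading of multiple LLM answers by a judge model.
# Assumptions: OpenAI Python client; judge model, prompt, and scale are illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading answers to the same question.
Question:
{question}

Candidate answers:
{answers}

For each candidate, give a score from 1 to 10 and a one-sentence reason.
Respond only with a JSON object mapping candidate id to {{"score": ..., "reason": ...}}."""


def judge_answers(question: str, answers: dict[str, str], judge_model: str = "gpt-4o") -> dict:
    """Ask a judge model to compare and grade answers from different LLMs."""
    formatted = "\n".join(f"[{name}]\n{text}\n" for name, text in answers.items())
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answers=formatted),
        }],
        temperature=0,  # deterministic grading
    )
    # Assumes the judge returns plain JSON; real usage would need more robust parsing.
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    grades = judge_answers(
        "How do I revert a merge commit in git?",
        {
            "model_a": "Use `git revert -m 1 <merge-sha>` to revert the merge commit.",
            "model_b": "Delete the branch and force-push.",
        },
    )
    print(grades)
```

Grading all candidates in a single judge call (rather than scoring each answer in isolation) lets the judge rank them relative to each other, which is closer to how a human reviewer would compare answers; whether this reduces score variance compared to the current consensus filtering is part of what this issue should evaluate.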
Further details
Links / references