Iterating on Chat Eval Metrics

Problem to solve

We have made good progress adding LLM-based evaluation (consensus filtering), but we want to iterate on the metrics so that their judgments match human evaluation as closely as possible. This issue is to explore various approaches for doing that.

Proposal

  • Put the answers from the different LLMs together and ask another LLM judge to compare and grade them (task); see the sketch below.
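
A minimal sketch of what the comparative-grading step could look like, assuming an OpenAI-compatible judge model. The model name, prompt wording, and 1-10 scale are assumptions for illustration, not decisions made in this issue.

```python
# Hypothetical sketch: ask a judge LLM to compare answers from several LLMs
# and grade each one. Model name, prompt, and scale are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading answers to the same question.
Question:
{question}

Candidate answers:
{answers}

Compare the answers against each other and grade each one from 1 (worst)
to 10 (best). Reply with a JSON object mapping answer label to
{{"score": <int>, "reason": "<short justification>"}}."""


def judge_answers(question: str, answers: dict[str, str],
                  judge_model: str = "gpt-4o") -> dict:
    """Compare and grade answers; `answers` maps a label (e.g. producing
    model's name) to that model's answer text."""
    formatted = "\n".join(f"[{label}]\n{text}\n" for label, text in answers.items())
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answers=formatted),
        }],
        temperature=0,  # keep grading as deterministic as possible
    )
    return json.loads(response.choices[0].message.content)


# Example usage (labels and answers are placeholders):
# scores = judge_answers(
#     "How do I revert a merge commit?",
#     {"model_a": "...answer A...", "model_b": "...answer B..."},
# )
```

One design question to settle during the iteration is whether the judge grades each answer on an absolute scale (as above) or only ranks them relative to each other, since relative ranking tends to be less sensitive to prompt wording.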

Further details

Links / references
