Iterating on Chat Eval Metrics
Problem to solve
We have made good progress adding LLM-based evaluation (consensus filtering), but we want to keep iterating on the metrics so that they get as close as possible to human judgment. This issue is to explore various approaches for doing that.
Proposal
- Put all the answers from different LLMs together and ask another LLM judge to compare and grade them (task); see the sketch below.
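
A minimal sketch of this proposal: collect candidate answers produced by several LLMs, then send them side by side to a separate judge model that compares and grades them. It assumes the OpenAI Python client (>= 1.0); the judge model name, prompt wording, 1-10 scale, and JSON output format are placeholders rather than a settled design.

```python
# Sketch: side-by-side grading of multiple LLM answers by a judge model.
# Assumptions: OpenAI Python client; judge model, prompt, and scale are illustrative.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading answers to the same question.
Question:
{question}

Candidate answers:
{answers}

For each candidate, give a score from 1 to 10 and a one-sentence reason.
Respond only with a JSON object mapping candidate id to {{"score": ..., "reason": ...}}."""


def judge_answers(question: str, answers: dict[str, str], judge_model: str = "gpt-4o") -> dict:
    """Ask a judge model to compare and grade answers from different LLMs."""
    formatted = "\n".join(f"[{name}]\n{text}\n" for name, text in answers.items())
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answers=formatted),
        }],
        temperature=0,  # deterministic grading
    )
    # Assumes the judge returns plain JSON; real usage would need more robust parsing.
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    grades = judge_answers(
        "How do I revert a merge commit in git?",
        {
            "model_a": "Use `git revert -m 1 <merge-sha>` to revert the merge commit.",
            "model_b": "Delete the branch and force-push.",
        },
    )
    print(grades)
```

Grading all candidates in a single judge call (rather than scoring each answer in isolation) lets the judge rank them relative to each other, which is closer to how a human reviewer would compare answers; whether this reduces score variance compared to the current consensus filtering is part of what this issue should evaluate.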
Further details
Links / references