Iterating on Independent Metric Judge for VM
Problem to solve
While reviewing the independent LLM judge for VM, we found instances where the judge is not performing well.
The chosen judge is text-bison-32k.
There are about 3,520 records in the ETV daily run table, and 902 of them (~26%) are evaluated incorrectly. That is, the answer contains the word "sorry" or is empty, yet it is scored 3 or higher. This was picked up on the July 4th run.
Proposal
We will migrate to GPT-4o as the better judge and tweak the prompts to cover the cases where correctness should be 1, so that the judge stays factually honest.
Links / References
Scores where the judge fails: https://docs.google.com/spreadsheets/d/1uD7Flm1HuAcxAA5m3chD2NdruJ4OFZ6YX6ruDR8LCM4/edit?gid=0#gid=0&fvid=767494884
The query below picks up the incorrect evaluations:
SELECT
  *
FROM
  `dev-ai-research-0e2f8974.duo_chat_daily_runs.etv_daily_results_llm_judge`
WHERE
  correctness >= 3
  AND (answer LIKE "%sorry%"
    OR TRIM(answer) = "");
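For checking a new judge's output outside the warehouse, the same filter can be sketched in Python. This is a minimal sketch, not the pipeline's actual code: the function names `is_suspect` and `suspect_rate` are hypothetical, and records are assumed to be `(answer, correctness)` pairs. It mirrors the query's behavior as written, including the case-sensitive match on "sorry" and the fact that a NULL answer is not flagged.

```python
from typing import Iterable, Optional, Tuple


def is_suspect(answer: Optional[str], correctness: int) -> bool:
    """Mirror the query's WHERE clause for one record (hypothetical helper)."""
    if answer is None:
        # In SQL, NULL LIKE ... and TRIM(NULL) = "" both evaluate to NULL,
        # so a NULL answer is not returned by the query.
        return False
    apologetic = "sorry" in answer      # case-sensitive, like LIKE "%sorry%"
    empty = answer.strip() == ""        # approximates TRIM(answer) = ""
    return correctness >= 3 and (apologetic or empty)


def suspect_rate(records: Iterable[Tuple[Optional[str], int]]) -> float:
    """Fraction of records flagged as incorrectly evaluated."""
    rows = list(records)
    if not rows:
        return 0.0
    flagged = sum(is_suspect(answer, score) for answer, score in rows)
    return flagged / len(rows)


if __name__ == "__main__":
    # Illustrative records only, not data from the ETV table.
    sample = [
        ("I'm sorry, I cannot help with that.", 4),  # flagged
        ("   ", 3),                                  # flagged (empty)
        ("Use `git rebase -i` to squash commits.", 4),
        ("sorry, no answer available", 1),           # scored low, not flagged
    ]
    print(f"suspect rate: {suspect_rate(sample):.2f}")
```

Because the match is case-sensitive, an answer that starts with "Sorry" is not flagged by either the query or this sketch, so the 902 count may understate the problem. Running the same check against the new judge's scores gives a like-for-like comparison with the ~26% baseline.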
Edited by Tan Le