Iterating on Independent Metric Judge for VM

Problem to solve

While reviewing the independent LLM judge for VM, we found instances where the judge does not perform well. The current judge is text-bison-32k.

The ETV daily run table contains about 3,520 records, and 902 of them (~26%) are evaluated incorrectly. That is, the answer contains the word "sorry" or is empty, yet it was given a correctness score of 3 or higher. This was picked up in the July 4th run.
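The check described above can be sketched in Python for spot-checking individual records outside the warehouse. This is a minimal sketch, not the team's actual tooling: the function names `is_suspect` and `suspect_rate` are hypothetical, and the match on "sorry" is assumed to be case-sensitive, as it is in the query under Links / References.

```python
def is_suspect(answer: str, correctness: int) -> bool:
    """Flag a record whose answer is an apology or empty but scored 3 or higher.

    Hypothetical helper: the "sorry" match is case-sensitive by assumption.
    """
    looks_unanswered = "sorry" in answer or answer.strip() == ""
    return correctness >= 3 and looks_unanswered


def suspect_rate(records: list[tuple[str, int]]) -> float:
    """Return the share of (answer, correctness) records that are flagged."""
    if not records:
        return 0.0
    flagged = sum(is_suspect(answer, score) for answer, score in records)
    return flagged / len(records)


# Counts reported for the July 4th run: 902 flagged out of about 3,520 records.
reported_rate = 902 / 3520
print(round(reported_rate * 100))  # 26
```

A case-sensitive match misses answers that begin with "Sorry", so the reported count may understate the problem if the judge's inputs are capitalized that way.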

Proposal

We will migrate to GPT-4o as the judge and tweak the prompts to cover the cases where correctness should be 1, so that the judge scores such answers factually and honestly.

Links / References

Records that the judge scores incorrectly: https://docs.google.com/spreadsheets/d/1uD7Flm1HuAcxAA5m3chD2NdruJ4OFZ6YX6ruDR8LCM4/edit?gid=0#gid=0&fvid=767494884

The query below picks up the incorrect evaluations.

SELECT
  *
FROM
  `dev-ai-research-0e2f8974.duo_chat_daily_runs.etv_daily_results_llm_judge`
WHERE
  correctness >= 3
  AND (answer LIKE "%sorry%"
    OR TRIM(answer) = "");

Edited Jul 05, 2024 by Tan Le