Use LLM as judge to evaluate code suggestions
## Context

We introduced the Code Suggestions evaluator in Add code suggestions evaluate script (!8 - merged). That evaluation uses the `exact_match` evaluator, which means the suggested code has to be exactly the same as the expected suggestion.
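As a rough illustration (not the actual MR code, which is wired through LangSmith), an exact-match evaluator boils down to a string comparison:

```python
# Illustrative sketch only; the real evaluator in the MR script may
# differ in shape. Score is 1 only when the suggestion is
# byte-for-byte identical to the expected answer.
def exact_match(suggestion: str, expected: str) -> dict:
    return {"key": "exact_match", "score": int(suggestion == expected)}

print(exact_match('puts "Hi"', 'puts "Hi"')["score"])   # 1
print(exact_match('puts "Hi"', 'puts "Hi!"')["score"])  # 0
```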
With this MR, we're adding a `qa` evaluator that uses an LLM to judge the AI model's suggestion. We also retain the `exact_match` evaluator, since that check is still useful.
For Use LLM-as-judge to evaluate the code suggestio... (#5 - closed)
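Conceptually, the `qa` evaluator sends the query, the model's answer, and the reference answer to a grader LLM and parses its verdict. A hand-rolled sketch of that idea (the prompt wording and parse format here are assumptions for illustration; LangChain's `qa` evaluator implements this internally):

```python
# Hypothetical sketch of the LLM-as-judge flow; the prompt text and
# CORRECTNESS format are assumptions, not LangChain's actual internals.
def build_judge_prompt(query: str, suggestion: str, expected: str) -> str:
    return (
        "You are grading a code suggestion.\n"
        f"QUESTION: {query}\n"
        f"STUDENT ANSWER: {suggestion}\n"
        f"TRUE ANSWER: {expected}\n"
        "Answer CORRECTNESS=1 if the student answer is correct, "
        "otherwise CORRECTNESS=0."
    )

def parse_verdict(judge_reply: str) -> int:
    # Defaults to 0 when the judge's reply has no clear verdict.
    return 1 if "CORRECTNESS=1" in judge_reply else 0

print(parse_verdict("The answers match. CORRECTNESS=1"))  # 1
```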
## References

- https://docs.smith.langchain.com/tutorials/Developers/evaluation
- https://docs.smith.langchain.com/old/evaluation/faq/evaluator-implementations - this lists three types of QA (question & answer) evaluators, but based on the description of each type, we should use `qa` because we have a query, the model's answer, and the expected result
## Considerations

- We may still need to refine the prompt to account for other code completion examples and edge cases
- We also need to make sure that the prompt does not get too big, since we have set `max_tokens`
- I am assuming that requests to the LLM (in this case Anthropic) cost money. To keep costs low, or at least make this less resource-intensive, I think we can take these steps:
  - make the llm-as-judge evaluator optional in each run (and disabled by default)
  - do not run llm-as-judge evaluation on large datasets; we need a dataset with a few hand-picked examples and edge cases
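The opt-in switch could look something like this (the flag name and wiring are assumptions for illustration, not the MR's actual CLI):

```python
import argparse

# Sketch: the LLM-as-judge evaluator is opt-in; only exact_match runs
# by default, so routine runs make no paid LLM requests.
parser = argparse.ArgumentParser(description="Run code-suggestion evaluation")
parser.add_argument("--llm-judge", action="store_true",
                    help="also run the (paid) LLM-as-judge qa evaluator")

args = parser.parse_args([])  # default invocation: no flag passed
evaluators = ["exact_match"] + (["qa"] if args.llm_judge else [])
print(evaluators)  # ['exact_match']
```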
## Examples and Screenshots

Prompt: `# Write a function that says hello`

Code Completion suggestion: `def say_hello\n puts "Hello, World!"\nend\n`
When the AI code suggestion exactly matches the expected answer, we get `CORRECTNESS=1`.
When the AI code suggestion is logically similar to the expected answer, with no other differences, we get `CORRECTNESS=1`.
When the AI code suggestion is logically similar but has some trivial differences from the expected answer, we get variable results across runs. In this example, the expected answer has `puts "Hello!"`, but the model's suggestion is `puts "Hello World!"`. We sometimes get `CORRECTNESS=1`, and sometimes `CORRECTNESS=0`.
When the AI code suggestion is completely different from the expected answer, we get `CORRECTNESS=0`.
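For comparison, `exact_match` scores the `Hello!` vs `Hello World!` case as 0 on every run, which is precisely the gap the LLM judge is meant to cover:

```python
# The two answers from the trivial-differences example above, with
# real newlines instead of the escaped \n shown in the screenshots.
expected = 'def say_hello\n  puts "Hello!"\nend\n'
suggested = 'def say_hello\n  puts "Hello World!"\nend\n'
print(int(expected == suggested))  # 0: exact match always fails here
```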