Use LLM as judge to evaluate code suggestions
## Context

We introduced the Code Suggestions evaluator in Add code suggestions evaluate script (!8 - merged). That evaluation uses the `exact_match` evaluator, which means the suggested code has to be exactly the same as the expected suggestion.
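As a rough illustration (not the actual MR code, which is wired through LangSmith), an exact-match evaluator boils down to a string comparison:

```python
# Illustrative sketch only; the real evaluator in the MR script may
# differ in shape. Score is 1 only when the suggestion is
# byte-for-byte identical to the expected answer.
def exact_match(suggestion: str, expected: str) -> dict:
    return {"key": "exact_match", "score": int(suggestion == expected)}

print(exact_match('puts "Hi"', 'puts "Hi"')["score"])   # 1
print(exact_match('puts "Hi"', 'puts "Hi!"')["score"])  # 0
```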
With this MR, we're adding a `qa` evaluator that uses an LLM to judge the AI model's suggestion. We also retain the `exact_match` evaluator, since that check is still useful.
For Use LLM-as-judge to evaluate the code suggestio... (#5 - closed)
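Conceptually, the `qa` evaluator sends the query, the model's answer, and the reference answer to a grader LLM and parses its verdict. A hand-rolled sketch of that idea (the prompt wording and parse format here are assumptions for illustration; LangChain's `qa` evaluator implements this internally):

```python
# Hypothetical sketch of the LLM-as-judge flow; the prompt text and
# CORRECTNESS format are assumptions, not LangChain's actual internals.
def build_judge_prompt(query: str, suggestion: str, expected: str) -> str:
    return (
        "You are grading a code suggestion.\n"
        f"QUESTION: {query}\n"
        f"STUDENT ANSWER: {suggestion}\n"
        f"TRUE ANSWER: {expected}\n"
        "Answer CORRECTNESS=1 if the student answer is correct, "
        "otherwise CORRECTNESS=0."
    )

def parse_verdict(judge_reply: str) -> int:
    # Defaults to 0 when the judge's reply has no clear verdict.
    return 1 if "CORRECTNESS=1" in judge_reply else 0

print(parse_verdict("The answers match. CORRECTNESS=1"))  # 1
```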
## References

- https://docs.smith.langchain.com/tutorials/Developers/evaluation
- https://docs.smith.langchain.com/old/evaluation/faq/evaluator-implementations - this lists three types of QA (question & answer) evaluators, but based on the description of each type, we should use `qa` because we have a query, the model's answer, and the expected result
## Considerations

- We may still need to refine the prompt to account for other code completion examples and edge cases
- We also need to make sure that the prompt does not get too big, since we have set `max_tokens`
- I am assuming that requests to the LLM (in this case Anthropic) cost money. To keep costs low, or at least make this less resource-intensive, I think we can take these steps:
  - make the llm-as-judge evaluator optional in each run (and disabled by default)
  - do not run llm-as-judge evaluation on large datasets; we need a dataset with a few hand-picked examples and edge cases
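The opt-in switch could look something like this (the flag name and wiring are assumptions for illustration, not the MR's actual CLI):

```python
import argparse

# Sketch: the LLM-as-judge evaluator is opt-in; only exact_match runs
# by default, so routine runs make no paid LLM requests.
parser = argparse.ArgumentParser(description="Run code-suggestion evaluation")
parser.add_argument("--llm-judge", action="store_true",
                    help="also run the (paid) LLM-as-judge qa evaluator")

args = parser.parse_args([])  # default invocation: no flag passed
evaluators = ["exact_match"] + (["qa"] if args.llm_judge else [])
print(evaluators)  # ['exact_match']
```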
## Examples and Screenshots

Prompt: `# Write a function that says hello`

Code Completion suggestion: `def say_hello\n puts "Hello, World!"\nend\n`
When the AI code suggestion exactly matches the expected answer, we get `CORRECTNESS=1`.
When the AI code suggestion is logically similar to the expected answer, with no other differences, we get `CORRECTNESS=1`.
When the AI code suggestion is logically similar but has some trivial differences from the expected answer, we get variable results across runs. In this example, the expected answer has `puts "Hello!"`, but the model's suggestion is `puts "Hello World!"`. We sometimes get `CORRECTNESS=1`, and sometimes `CORRECTNESS=0`.
When the AI code suggestion is completely different from the expected answer, we get `CORRECTNESS=0`.
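For comparison, `exact_match` scores the `Hello!` vs `Hello World!` case as 0 on every run, which is precisely the gap the LLM judge is meant to cover:

```python
# The two answers from the trivial-differences example above, with
# real newlines instead of the escaped \n shown in the screenshots.
expected = 'def say_hello\n  puts "Hello!"\nend\n'
suggested = 'def say_hello\n  puts "Hello World!"\nend\n'
print(int(expected == suggested))  # 0: exact match always fails here
```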