
Use LLM as judge to evaluate code suggestions

Pam Artiaga requested to merge 465551-code-suggestions-use-llm-as-judge into main

Context

We introduced the Code Suggestions evaluator in Add code suggestions evaluate script (!8 - merged). That evaluation uses the exact_match evaluator, which means that the suggested code has to match the expected suggestion exactly.

With this MR, we're adding a qa evaluator that uses an LLM to judge the AI model's suggestion. We also retain the exact_match evaluator, since it's still a useful check.

For Use LLM-as-judge to evaluate the code suggestio... (#5 - closed)
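For illustration, here's a minimal sketch of what an LLM-as-judge correctness check can look like. The model name, prompt wording, and `judge_correctness` helper are assumptions made for this sketch; the MR itself wires the check up through the evaluation framework's qa evaluator.

```python
# Hypothetical sketch of an LLM-as-judge correctness check; the model
# name, prompt wording, and helper name are illustrative assumptions,
# not this MR's actual code.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are grading a code completion.

Code context:
{prompt}

Expected answer:
{expected}

Model's suggestion:
{suggestion}

Reply with CORRECTNESS=1 if the suggestion is logically equivalent to the
expected answer, and CORRECTNESS=0 otherwise."""


def judge_correctness(prompt: str, expected: str, suggestion: str) -> int:
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=16,  # the judge only needs to emit the verdict
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    prompt=prompt, expected=expected, suggestion=suggestion
                ),
            }
        ],
    )
    return 1 if "CORRECTNESS=1" in response.content[0].text else 0
```

Keeping max_tokens small for the judge request also keeps each evaluation cheap, which ties into the cost considerations below.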

References

Considerations

  • We may still need to refine the prompt to account for other code completion examples and edge cases
  • We also need to make sure that the prompt does not grow too large, since we have set max_tokens
  • I am assuming that requests to the LLM (in this case Anthropic) cost money. To keep costs low, or at least make the evaluation less resource-intensive, I think we can take these steps:
    • make the llm-as-judge evaluator optional in each run (and disabled by default); see the sketch after this list
    • do not run llm-as-judge evaluation on large datasets; we need a dataset with a few hand-picked examples and edge cases
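As a rough sketch of the opt-in behaviour (the `--with-llm-judge` flag name and evaluator identifiers are assumptions for illustration, not the script's actual interface):

```python
# Hypothetical sketch: run exact_match by default, and only add the
# paid LLM-as-judge qa evaluator when explicitly requested.
import argparse

parser = argparse.ArgumentParser(description="Evaluate code suggestions")
parser.add_argument(
    "--with-llm-judge",
    action="store_true",  # defaults to False, so the judge is off by default
    help="Also run the LLM-as-judge qa evaluator (incurs LLM requests)",
)
args = parser.parse_args()

evaluators = ["exact_match"]
if args.with_llm_judge:
    evaluators.append("qa")
```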

Examples and Screenshots

Prompt: "# Write a function that says hello"

Code Completion suggestion: "def say_hello\n puts "Hello, World!"\nend\n"
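For a hand-picked dataset of the kind described in the considerations above, an entry for this example could look roughly like the following (the field names are an assumed schema, not the dataset's actual one):

```python
# Hypothetical dataset entry; field names are an assumption.
example = {
    "prompt": "# Write a function that says hello",
    "expected_suggestion": 'def say_hello\n puts "Hello, World!"\nend\n',
}
```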

When the AI code suggestion exactly matches the expected answer, we get CORRECTNESS=1


Screenshot_2024-06-19_at_16.24.36

When the AI code suggestion is logically equivalent to the expected answer, with no other differences, we get CORRECTNESS=1


Screenshot_2024-06-20_at_11.10.36

When the AI code suggestion is logically similar to the expected answer but has some trivial differences, we get variable results across runs

In this example, the expected answer contains puts "Hello!", but the model's suggestion contains puts "Hello World!". We sometimes get CORRECTNESS=1 and sometimes CORRECTNESS=0.


CORRECTNESS=1

Screenshot_2024-06-19_at_16.24.53

CORRECTNESS=0

Screenshot_2024-06-20_at_09.56.05

When the AI code suggestion is completely different from the expected answer, we get CORRECTNESS=0


Screenshot_2024-06-19_at_16.24.46

