Evaluate Duo Chat on an issue/epic-related QA dataset
## What does this merge request do and why?
This MR adds an evaluator that judges the accuracy of a prediction against a reference for issue/epic-related QA datasets, using an LLM as the judge. The LLM assesses the prediction's correctness, readability, and comprehensiveness based on the provided context (reference) and question (input), and assigns a score from 1 to 4, where 1 is fully inaccurate and 4 is fully accurate.
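For illustration, here is a minimal sketch of that judging step. The prompt wording and helper names are hypothetical (the actual prompt and model wiring live in the evaluator code); `llm` stands for any callable that maps a prompt string to a completion string:

```python
import re

# Hypothetical judge prompt mirroring the criteria described above.
JUDGE_PROMPT = """\
You are grading an answer to a question about a GitLab issue or epic.

Question: {question}
Reference context: {reference}
Predicted answer: {prediction}

Assess the prediction's correctness, readability, and comprehensiveness
against the reference. Reply with a single integer from 1 to 4, where
1 is fully inaccurate and 4 is fully accurate.
Score:"""


def judge_completion(llm, question: str, reference: str, prediction: str) -> int:
    """Ask the judge LLM for a 1-4 score."""
    response = llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    ))
    match = re.search(r"[1-4]", response)
    if match is None:
        raise ValueError(f"Judge returned no 1-4 score: {response!r}")
    return int(match.group())
```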
## How to set up and validate locally
- Check out this merge request's branch.
- Update the `.env` file with the required variables.
- Install dependencies:

  ```shell
  mise install # or asdf
  poetry install
  ```
- Collect Duo Chat completions by running the Rake task as described in #25. Optionally, feel free to use the file I collected to save time: `895c66862d37c4d9351ab6030f7f7bc5.jsonl` (a sketch for inspecting it follows this list).
- Check the existing commands ELI5 provides:

  ```shell
  poetry run eli5 --help
  poetry run eli5 duo-chat --help
  ```
- Run the evaluation:

  ```shell
  poetry run eli5 duo-chat evaluate qa-resources --help
  poetry run eli5 duo-chat evaluate qa-resources <PATH to the Rake output, e.g., 895c66862d37c4d9351ab6030f7f7bc5.jsonl> --dataset=duo_chat.cot_qa_resources.1
  ```
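A note on the input file: the Rake output is JSON Lines, one completion per line. A minimal sketch for inspecting it (the fields inside each record are defined by the Rake task from #25, so none are assumed here):

```python
import json

# Each line of the Rake output is one JSON-encoded Duo Chat completion.
with open("895c66862d37c4d9351ab6030f7f7bc5.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record))  # list each record's keys to see the schema
```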
Here is the completed experiment for the uploaded completions: https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/f0f7c18a-a282-465b-8f16-d5b763365ec4
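If you want to inspect the dataset behind that experiment outside the LangSmith UI, the `langsmith` SDK can read it back. A sketch, assuming a LangSmith API key is available in the environment (e.g., via the `.env` file mentioned above):

```python
from langsmith import Client

client = Client()  # picks up the LangSmith API key from the environment
dataset = client.read_dataset(dataset_name="duo_chat.cot_qa_resources.1")
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs, example.outputs)
```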
## Merge request checklist
- [ ] Tests added for new functionality. If not, please raise an issue to follow up.
- [ ] Documentation added/updated, if needed.
Closes #25