Evaluate Duo Chat on an issue/epic-related QA dataset
## What does this merge request do and why?
This MR adds an evaluator that judges the accuracy of a prediction against a reference for issue/epic-related QA datasets, using an LLM as the judge. The LLM assesses the prediction's correctness, readability, and comprehensiveness based on the provided context (reference) and question (input), and assigns a score from 1 to 4, where 1 is fully inaccurate and 4 is fully accurate.
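For illustration, here is a minimal sketch of that judging step. The prompt wording and helper names are hypothetical (the actual prompt and model wiring live in the evaluator code); `llm` stands for any callable that maps a prompt string to a completion string:

```python
import re

# Hypothetical judge prompt mirroring the criteria described above.
JUDGE_PROMPT = """\
You are grading an answer to a question about a GitLab issue or epic.

Question: {question}
Reference context: {reference}
Predicted answer: {prediction}

Assess the prediction's correctness, readability, and comprehensiveness
against the reference. Reply with a single integer from 1 to 4, where
1 is fully inaccurate and 4 is fully accurate.
Score:"""


def judge_completion(llm, question: str, reference: str, prediction: str) -> int:
    """Ask the judge LLM for a 1-4 score."""
    response = llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    ))
    match = re.search(r"[1-4]", response)
    if match is None:
        raise ValueError(f"Judge returned no 1-4 score: {response!r}")
    return int(match.group())
```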
## How to set up and validate locally
- Check out this merge request's branch.
- Update the `.env` file with the required variables.
- Install dependencies:

  ```shell
  mise install # or asdf
  poetry install
  ```
- Collect Duo Chat completions by running the Rake task as described in #25. Optionally, feel free to use the file I collected to save time: `895c66862d37c4d9351ab6030f7f7bc5.jsonl` (a sketch for inspecting it follows this list).
- Check the existing commands ELI5 provides:

  ```shell
  poetry run eli5 --help
  poetry run eli5 duo-chat --help
  ```
- Run the evaluation:

  ```shell
  poetry run eli5 duo-chat evaluate qa-resources --help
  poetry run eli5 duo-chat evaluate qa-resources <PATH to the Rake output, e.g., 895c66862d37c4d9351ab6030f7f7bc5.jsonl> --dataset=duo_chat.cot_qa_resources.1
  ```
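A note on the input file: the Rake output is JSON Lines, one completion per line. A minimal sketch for inspecting it (the fields inside each record are defined by the Rake task from #25, so none are assumed here):

```python
import json

# Each line of the Rake output is one JSON-encoded Duo Chat completion.
with open("895c66862d37c4d9351ab6030f7f7bc5.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(sorted(record))  # list each record's keys to see the schema
```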
Here is the completed experiment for the uploaded completions: https://smith.langchain.com/o/477de7ad-583e-47b6-a1c4-c4a0300e7aca/datasets/f0f7c18a-a282-465b-8f16-d5b763365ec4
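If you want to inspect the dataset behind that experiment outside the LangSmith UI, the `langsmith` SDK can read it back. A sketch, assuming a LangSmith API key is available in the environment (e.g., via the `.env` file mentioned above):

```python
from langsmith import Client

client = Client()  # picks up the LangSmith API key from the environment
dataset = client.read_dataset(dataset_name="duo_chat.cot_qa_resources.1")
for example in client.list_examples(dataset_id=dataset.id):
    print(example.inputs, example.outputs)
```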
## Merge request checklist
- [ ] Tests added for new functionality. If not, please raise an issue to follow up.
- [ ] Documentation added/updated, if needed.
Closes #25