
Add QA evaluation RSpec for Duo Chat

euko requested to merge 427251-duo-chat-qa-evaluation-rspec into master

What does this MR do and why?

Related to #427251 (closed) and #427252 (closed).

This MR introduces a new evaluation test that uses LLMs to grade Duo Chat responses using real production data saved as fixtures, a report generation script, and CI configuration changes to run the script.

  • Adds the fixtures generated from real production data (public issues and epics)

  • Adds a new spec helper (chat_qa_evaluation_helpers.rb) that houses the "test prompt" and evaluation helpers.

  • Adds a shared context (duo_chat_evaluation_shared_context.rb) that loads the fixtures and recreates the data using FactoryBot (a sketch follows this list).

  • Adds specs (qa_issue_spec.rb, qa_epic_spec.rb) that run a subset of "the golden questions" using the fixtures.

  • Updates CI configurations to parallelize test runs that require making LLM requests

  • Adds a Ruby script (scripts/duo_chat/reporter.rb) that post-processes the test evaluation outputs and automates posting a report to a merge request
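For illustration, here is a minimal sketch of what the shared context could look like. The fixture path, JSON keys, and factory attributes below are assumptions for illustration, not the actual implementation:

# Sketch only: the fixture path, JSON keys, and attributes are assumptions.
RSpec.shared_context 'with duo chat QA fixtures' do
  # Recreate the saved production records locally so Duo Chat can resolve them.
  # before(:all) is used because rebuilding the data per example would be slow.
  before(:all) do
    Dir.glob('path/to/fixtures/*.json').each do |path| # placeholder path
      data = JSON.parse(File.read(path))
      create(:issue, title: data['title'], description: data['description'])
    end
  end
end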

Overview of the evaluation method

See #427251 (comment 1610124229)

tl;dr:

  1. Ask a question (e.g., What is the status of https://gitlab.com/gitlab-org/gitlab/-/issues/17800) to Duo Chat and save its answer.
  2. Ask an LLM to grade the answer, given the correct context for the question.
    • The context serves as a ground truth of sorts. In our implementation, the context is simply a JSON serialization of the issue or epic.
    • Since an LLM may hallucinate, we ask two or more LLMs to grade/evaluate. In our implementation, we use Claude and Vertex AI's text-bison. #427251 (comment 1614548760)
  3. Match against the evaluations (a regex match for CORRECT). A minimal sketch of this flow is shown below.
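For concreteness, a minimal sketch of steps 1-3, assuming hypothetical helpers ask_duo_chat and ask_llm (the real helpers live in chat_qa_evaluation_helpers.rb and may be shaped differently; the grading prompt text is also illustrative):

# Sketch only: ask_duo_chat and ask_llm are hypothetical stand-ins.
def evaluate(question, context)
  answer = ask_duo_chat(question)

  grading_prompt = <<~PROMPT
    You are a teacher grading a quiz.
    CONTEXT: #{context.to_json}
    QUESTION: #{question}
    STUDENT ANSWER: #{answer}
    Grade the student answer as CORRECT or INCORRECT and explain why.
  PROMPT

  # Two independent graders reduce the impact of a single hallucinated grade.
  [ask_llm(:anthropic_claude, grading_prompt),
   ask_llm(:vertex_text_bison, grading_prompt)]
end

# The example passes only when every grader marks the answer CORRECT:
evaluate(question, issue_attributes).each do |evaluation|
  expect(evaluation).to match(/Grade: CORRECT/i)
end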

Whenever a test is run, a summary of the run is printed to stdout:

Question: Summarize the comments into bullet points?

Resource: gitlab.org/project-30#412831

Chat answer: Unfortunately I do not have access to the full context to summarize comments into bullet points. As an AI assistant without access to the original comments, I do not have enough information to accurately summarize them. However, if provided with the full comments, I could analyze and summarize them into concise bullet points. Please feel free to provide additional context so I can assist further. In the meantime, I apologize that I cannot fully answer your question at this time.

-------------------- Evaluation --------------------

 Grade: INCORRECT

Explanation: The question asks to summarize the comments in the given context into bullet points. However, the student states that they do not have access to the full context to summarize the comments. Since the context is provided in the problem statement, this indicates the student answer is incorrect. To summarize comments into bullet points, the student would need to extract the key points from the given context description and format them into a bulleted list. However, the student states they cannot do this without the full context, despite the context being provided. Therefore, the student answer is incorrect.

-------------------- Evaluation --------------------

Grade: INCORRECT

Explanation: The student correctly states that they do not have access to the context and therefore cannot answer the question.

Note that the question actually fails in production:

(screenshot: the same question failing in Duo Chat in production)

If the test fails, the result is shown as:

expected " Grade: INCORRECT\n\nExplanation: The question asks to summarize the comments in the given context i...t the full context, despite the context being provided. Therefore, the student answer is incorrect." to match /Grade: CORRECT/i
Diff:
@@ -1,3 +1,5 @@
-/Grade: CORRECT/i
+ Grade: INCORRECT
+
+Explanation: The question asks to summarize the comments in the given context into bullet points. However, the student states that they do not have access to the full context to summarize the comments. Since the context is provided in the problem statement, this indicates the student answer is incorrect. To summarize comments into bullet points, the student would need to extract the key points from the given context description and format them into a bulleted list. However, the student states they cannot do this without the full context, despite the context being provided. Therefore, the student answer is incorrect.
The GitLab project's CI has been configured to generate a report in Markdown and post it to the MR automatically.

How does it work?

  1. Each evaluation run is recorded and saved as a JSON file.

  2. scripts/duo_chat/reporter.rb processes the JSON files and posts a report (it's also made available as a CI artifact). If the generated report is too long to be posted as a note, it is only available as an artifact. A rough sketch of the reporter follows this list.
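A rough sketch of what such a reporter could look like, assuming a simple JSON schema and a hypothetical post_mr_note helper (neither is confirmed by this MR):

require 'json'

# Sketch only: the glob path, JSON keys, and post_mr_note are assumptions.
runs = Dir.glob('tmp/duo_chat_evaluations/*.json').map do |path|
  JSON.parse(File.read(path))
end

rows = runs.map do |run|
  "| #{run['question']} | #{run['resource']} | #{run['grades'].join(', ')} |"
end

report = <<~MARKDOWN
  | Question | Resource | Grades |
  | -------- | -------- | ------ |
  #{rows.join("\n")}
MARKDOWN

File.write('duo_chat_report.md', report) # exposed as a CI artifact

# GitLab notes have a maximum length (assumed here to be 1,000,000
# characters), so fall back to the artifact when the report is too long.
post_mr_note(report) if report.length < 1_000_000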

Limitations and known problems

  1. Slow test run time

The fixtures are restored in a before(:all) block, and the requests to LLMs are not currently parallelized. When I last timed the test, it took 20-30 minutes to run all the examples.

Running evaluations through the RSpec framework may not be ideal in the medium to long term. We can iterate on the setup as we go; one possible iteration is sketched below.
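For example (an illustrative sketch, not part of this MR), the two grading requests could run concurrently instead of back to back; ask_llm is the same hypothetical helper as in the sketch above:

# Illustrative only: issue both grading requests concurrently.
threads = [:anthropic_claude, :vertex_text_bison].map do |model|
  Thread.new { ask_llm(model, grading_prompt) }
end

evaluations = threads.map(&:value) # waits for and collects both grades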

How to set up and validate locally

You can either run the added specs locally or trigger the rspec-ee unit gitlab-duo-chat pg14 CI job in a pipeline (for example, https://gitlab.com/gitlab-org/gitlab/-/jobs/5362531867).

To run locally:

# required
export REAL_AI_REQUEST=1
export ANTHROPIC_API_KEY='<key>'  # can use dev value of Gitlab::CurrentSettings.anthropic_api_key
export VERTEX_AI_PROJECT='<vertex-project-name>' # can use dev value of Gitlab::CurrentSettings.vertex_ai_project
export VERTEX_AI_CREDENTIALS='<vertex-ai-credentials>' # can use dev value of Gitlab::CurrentSettings.vertex_ai_credentials

bundle exec rspec ee/spec/lib/gitlab/llm/chain/agents/zero_shot/qa_evaluator_spec.rb -fdoc

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by euko
