Create a test framework that includes testing the answers
Problem
Our current RSpec test framework only checks whether the chat picks the right tool to answer a question. When this check fails, we know the answer can't be good; but when it passes, we may still get a wrong or useless answer.
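To make the gap concrete, here is a purely hypothetical sketch of a tool-selection check in the current style; the `ask_chat` helper, the `selected_tool` accessor, and the tool name are invented for illustration and are not our actual test code:

```ruby
# Hypothetical example in the current style: it asserts only which tool the
# chat picked, not whether the final answer is correct or helpful.
RSpec.describe "chat tool selection" do
  it "picks the documentation tool for a docs question" do
    response = ask_chat("How do I create a merge request?") # placeholder helper
    expect(response.selected_tool).to eq(:documentation_search)
    # Nothing here inspects response.answer.
  end
end
```

A passing example like this says nothing about the content of the answer.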
Goal
The goal is to have a test framework
- that lets team members and contributors make changes to prompts and other parts of the chat and see how these changes affect answer quality. (I use the word quality here because I don't have a better word that covers both accuracy and helpfulness.)
- that informs prompt engineering about how to change the prompts to improve the chat.
- that lets us monitor the chat, so that we know when things go south.
Proposal (seriously only a proposal!)
- Define a set of known good answers for given questions, be it based on our golden question list or on user questions from the chat bashes.
- Use an LLM to create different ways of asking the same question.
- Randomly choose contexts to apply the user question to.
- Use an LLM (or multiple LLMs) to generate a set of known good answers and their embeddings (see the generation sketch after this list).
  - Here is an example of how that could look, based on a hand-crafted set of three questions.
- Run the questions and contexts through the chat and compare the embeddings of its answers with the embeddings of the known good answers (see the comparison sketch after this list).
- Check whether the chat's responses come close to these answers.
- Go through the results to identify patterns in how the chat fails and improve the prompts accordingly.
- Run a subset of these potentially thousands of tests before each merge to see whether the change leads to a degradation.
- Also run this subset of questions in a regular monitoring job.
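As a rough illustration of the generation steps above, here is a minimal sketch. It assumes the ruby-openai gem purely as a placeholder; the model names, prompt wording, and the `known_good.json` fixture file are assumptions, not a decided design:

```ruby
# Sketch only: generate paraphrases, a known good answer, and its embedding
# for one golden question. Model names and prompts are placeholders.
require "openai"
require "json"

client = OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))

# Small helper around the chat completion call.
def chat_completion(client, prompt)
  response = client.chat(
    parameters: {
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }],
      temperature: 0.7
    }
  )
  response.dig("choices", 0, "message", "content")
end

question = "How do I revert a merge request?"

# Different ways of asking the same question.
paraphrases = chat_completion(
  client,
  "Rephrase the following question in 5 different ways, one per line:\n#{question}"
).lines.map(&:strip).reject(&:empty?)

# A known good answer (in practice this would be reviewed by a human or
# cross-checked across multiple LLMs).
good_answer = chat_completion(client, question)

# Embedding of the known good answer, stored alongside it as a fixture.
embedding = client.embeddings(
  parameters: { model: "text-embedding-3-small", input: good_answer }
).dig("data", 0, "embedding")

File.write("known_good.json", JSON.pretty_generate(
  "question"    => question,
  "paraphrases" => paraphrases,
  "answer"      => good_answer,
  "embedding"   => embedding
))
```

The same loop could be run against multiple LLMs to collect several known good answers per question.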
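And a sketch of the comparison step as an RSpec example, again with assumptions: `ask_chat` is a placeholder for however the suite invokes the chat, the embedding model mirrors the one used for the fixture, and the 0.85 similarity threshold is an arbitrary starting point we would need to tune:

```ruby
# Sketch of the comparison step. `ask_chat` is a placeholder for however the
# suite actually invokes the chat; the threshold is an arbitrary starting point.
require "openai"
require "json"

# Plain cosine similarity between two embedding vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

RSpec.describe "chat answer quality", :answer_quality do
  let(:fixture) { JSON.parse(File.read("known_good.json")) }
  let(:client)  { OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY")) }

  it "answers the golden question close to the known good answer" do
    answer = ask_chat(fixture["question"]) # placeholder helper

    answer_embedding = client.embeddings(
      parameters: { model: "text-embedding-3-small", input: answer }
    ).dig("data", 0, "embedding")

    similarity = cosine_similarity(answer_embedding, fixture["embedding"])
    expect(similarity).to be > 0.85
  end
end
```

Tagging these examples (the `:answer_quality` metadata above) would let the pre-merge pipeline and the monitoring job run only a subset via `rspec --tag answer_quality`.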