Evaluation and testing framework

We need a way to verify that responses are valid and to collect data to evaluate how well we are answering questions.

Some ideas:

  • Add 👍/👎 buttons so that the user can indicate whether their question was answered
  • Create a list of N questions with expected answers. In an automated job (e.g. a QA or CI pipeline step), ask the system each question, convert both the actual answer and the expected answer to embeddings, and check that the distance between the two embeddings is below a chosen threshold.
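
The automated check could be sketched like this. Note this is only a sketch: `embed` is a stand-in for whatever embedding model the system actually uses (here a toy bag-of-letters vector so the example runs end to end), and `SIMILARITY_THRESHOLD` is an assumed value that would need tuning against a labelled sample.

```python
import math

def embed(text: str) -> list[float]:
    # Placeholder for the real embedding model: a toy bag-of-letters
    # vector so this sketch is runnable without external dependencies.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 means identical direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

SIMILARITY_THRESHOLD = 0.8  # assumption: tune against labelled examples

def evaluate(cases, answer_fn):
    """Ask the system each question and score the answer against the
    expected answer by embedding similarity."""
    results = []
    for question, expected in cases:
        answer = answer_fn(question)
        score = cosine_similarity(embed(answer), embed(expected))
        results.append((question, score, score >= SIMILARITY_THRESHOLD))
    return results
```

In a CI job, `answer_fn` would call the question-answering system, and the pipeline would fail if any case scores below the threshold.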
Edited by Madelein van Niekerk