RAG evaluation framework
Problem to solve
Retrieval Augmented Generation (RAG) at GitLab (!145343 - merged) proposes a framework for RAG solutions at GitLab. A simple RAG pipeline consists of two steps: retrieval (getting context-specific information) and generation (sending the context and the question to an LLM to generate a response).
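To make the two steps concrete, here is a minimal sketch of such a pipeline. The `search_docs` and `llm_complete` placeholders are stand-ins for whichever retriever and LLM client a given solution uses; they are not part of the framework proposed in the MR.

```python
def search_docs(question: str, limit: int = 5) -> list[str]:
    # Placeholder retriever: a real implementation would query a vector store,
    # Elasticsearch, the database, etc.
    return ["GitLab is a DevSecOps platform."][:limit]


def llm_complete(prompt: str) -> str:
    # Placeholder generator: a real implementation would call an LLM API.
    return f"(answer generated from a {len(prompt)}-character prompt)"


def retrieve(question: str) -> list[str]:
    """Retrieval: gather context-specific information for the question."""
    return search_docs(question)


def generate(question: str, context: list[str]) -> str:
    """Generation: send the context and the question to an LLM."""
    joined = "\n".join(context)
    prompt = f"Context:\n{joined}\n\nQuestion: {question}"
    return llm_complete(prompt)


def rag_answer(question: str) -> str:
    return generate(question, retrieve(question))
```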
Because each step of RAG can be customized (for example, by using different retrievers), we would like a way to measure whether the overall RAG performs well compared to the expected output and compared to other solutions.
Proposal
Develop an evaluation framework, loosely based on Ragas, for measuring RAG metrics.
Iteration 1
The simplest and quickest metric to implement is Answer Similarity:
1. Take a sample question and establish a ground truth answer.
2. Use the RAG to generate an answer to the same question.
3. Use an embedding model to generate embeddings for the ground truth answer and the RAG answer, then compute the similarity between the two (for example, cosine similarity).
4. Repeat for N questions and aggregate the scores. Higher similarity means a better RAG (see the sketch below).
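A rough sketch of how the metric could be computed. The `embed` helper is a placeholder for whichever embedding model we end up using; the toy character-frequency vector is there only so the snippet runs as-is.

```python
import math


def embed(text: str) -> list[float]:
    # Placeholder: a real implementation would call the chosen embedding model.
    # Toy embedding (character-frequency vector) so the sketch is runnable.
    vector = [0.0] * 26
    for char in text.lower():
        if char.isalpha():
            vector[ord(char) - ord("a")] += 1.0
    return vector


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_similarity(ground_truth: str, rag_answer: str) -> float:
    return cosine_similarity(embed(ground_truth), embed(rag_answer))


def average_similarity(samples: list[tuple[str, str]]) -> float:
    # `samples` is a list of (ground_truth, rag_answer) pairs for N questions.
    scores = [answer_similarity(gt, answer) for gt, answer in samples]
    return sum(scores) / len(scores)
```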
I'd suggest adding a script or Rake task that outputs the similarity score, so that it can be run on master and on a branch and the two results compared.
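For illustration only (the JSON report format and the exit-code behaviour are assumptions, not part of this proposal), the comparison step could look something like this once each run writes its average similarity to a file:

```python
import json
import sys


def compare(master_report: str, branch_report: str) -> None:
    # Each report is assumed to be a JSON file like {"average_similarity": 0.87}
    # produced by the evaluation run on the corresponding ref.
    with open(master_report) as f:
        master = json.load(f)["average_similarity"]
    with open(branch_report) as f:
        branch = json.load(f)["average_similarity"]

    delta = branch - master
    print(f"master: {master:.4f}  branch: {branch:.4f}  delta: {delta:+.4f}")
    # Treat a drop in similarity as a regression signal for the branch.
    sys.exit(1 if delta < 0 else 0)


if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])
```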