Evaluate using AI to grade search responses

Background

We want to change our Elasticsearch queries to improve the relevance of results, but we have no automated way to compare results before and after a change.

Determine whether an AI platform could be used to judge how accurately search responses match their queries. A few options are listed below (feel free to add):

Use an LLM judge to grade search response accuracy
- An LLM judge is used in the qa_evaluation spec
AI Framework group - Eval like I'm 5
Take inspiration from Duo Chat prompt change evaluations
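
To make the LLM judge option more concrete, here is a minimal sketch of a before/after comparison. Everything in it is an assumption for illustration: the prompt wording, the 1 to 5 scale, and the names `build_prompt`, `parse_score`, `mean_score`, and `compare` are hypothetical and not taken from the qa_evaluation spec or any existing tooling. The judge is passed in as a plain callable so that no particular LLM client is implied.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical prompt; the wording and the 1-5 scale are assumptions.
JUDGE_PROMPT = (
    "You are grading search relevance.\n"
    "Query: {query}\n"
    "Results:\n{results}\n"
    "Reply with a single integer from 1 (irrelevant) to 5 (highly relevant)."
)
VALID_SCORES = {"1", "2", "3", "4", "5"}

Judge = Callable[[str], str]              # prompt -> raw LLM reply
Search = Callable[[str], Sequence[str]]   # query -> result titles/snippets


def build_prompt(query: str, results: Sequence[str]) -> str:
    """Render the query and its numbered results into the judge prompt."""
    numbered = "\n".join(f"{i}. {r}" for i, r in enumerate(results, start=1))
    return JUDGE_PROMPT.format(query=query, results=numbered)


def parse_score(reply: str) -> int:
    """Return the first token in the reply that is a valid 1-5 score."""
    for token in reply.split():
        cleaned = token.strip(".,:;")
        if cleaned in VALID_SCORES:
            return int(cleaned)
    raise ValueError(f"no score between 1 and 5 in reply: {reply!r}")


def mean_score(judge: Judge, search: Search, queries: Sequence[str]) -> float:
    """Average judge score for one search implementation over all queries."""
    return mean(parse_score(judge(build_prompt(q, search(q)))) for q in queries)


def compare(judge: Judge, before: Search, after: Search,
            queries: Sequence[str]) -> dict:
    """Score the old and new search on the same queries and report the delta."""
    old = mean_score(judge, before, queries)
    new = mean_score(judge, after, queries)
    return {"before": old, "after": new, "delta": new - old}
```

In practice, `judge` would wrap whichever LLM client is chosen, and `before` and `after` would run the current and the modified Elasticsearch query against the same index. A fixed query set keeps the two runs comparable.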

Edited Jun 05, 2024 by Terri Chu