[Experimental]: Implement QA Duo Chat evaluators with their API endpoints
## What does this merge request do and why?
This MR implements two QA Duo Chat evaluators: `qa` and `context_qa`. Both evaluators rely on an LLM, which in this iteration is Anthropic's `claude-2`.
- Evaluator `qa`: instructs `claude-2` to directly grade an answer from Duo Chat as "correct" or "incorrect" against the reference answer.
- Evaluator `context_qa`: instructs `claude-2` to use reference "context" when determining correctness. This is useful when we have a corpus of questions but no ground-truth answer for each question.

A rough sketch of both evaluators follows below.
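To make the grading flow concrete, here is a minimal sketch assuming the evaluators wrap LangChain's `QAEvalChain` and `ContextQAEvalChain` with `claude-2` as the grading model; the exact classes, prompts, and parameters used in this MR may differ:

```python
# A minimal sketch, assuming the evaluators build on LangChain's off-the-shelf
# QA evaluation chains; the actual classes and prompts in this MR may differ.
from langchain.chat_models import ChatAnthropic
from langchain.evaluation.qa import ContextQAEvalChain, QAEvalChain

llm = ChatAnthropic(model="claude-2", temperature=0.0)

# `qa`: grades a Duo Chat answer against a reference (ground-truth) answer.
qa_evaluator = QAEvalChain.from_llm(llm)

# `context_qa`: grades the answer against reference context instead of a
# ground-truth answer.
context_qa_evaluator = ContextQAEvalChain.from_llm(llm)

grade = qa_evaluator.evaluate_strings(
    input="How do I create a merge request?",
    prediction="Click 'New merge request' on the 'Merge requests' page.",
    reference="Use the 'New merge request' button on the 'Merge requests' page.",
)
print(grade)  # e.g. {'reasoning': ..., 'value': 'CORRECT', 'score': 1}
```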
Please note that this MR also provides several API endpoints for each evaluator, implemented with `langserve` (see the sketch after the list):
- `/invoke` - for invoking an evaluator with a single input
- `/batch` - for invoking an evaluator with multiple inputs
- `/stream` - for streaming the output of an evaluator
- `/stream_log` - for streaming intermediate outputs of an evaluator
- `/input_schema` - for returning the input schema of the evaluator
- `/output_schema` - for returning the output schema of the evaluator
- `/config_schema` - for returning the config schema of the evaluator
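For context on where these endpoints come from, the sketch below shows how `langserve`'s `add_routes()` generates them for a mounted chain; the app title and the `/qa` and `/context-qa` mount paths are illustrative assumptions, not the MR's actual routes:

```python
# A minimal sketch, assuming the evaluators are exposed via langserve's
# add_routes(); the mount paths (/qa, /context-qa) are illustrative.
from fastapi import FastAPI
from langchain.chat_models import ChatAnthropic
from langchain.evaluation.qa import ContextQAEvalChain, QAEvalChain
from langserve import add_routes

llm = ChatAnthropic(model="claude-2", temperature=0.0)
app = FastAPI(title="Duo Chat QA evaluators")

# Each add_routes() call generates the /invoke, /batch, /stream, /stream_log
# and schema endpoints listed above under the given path prefix.
add_routes(app, QAEvalChain.from_llm(llm), path="/qa")
add_routes(app, ContextQAEvalChain.from_llm(llm), path="/context-qa")
```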
## How to set up and validate locally
No additional setup is needed; the Gateway already knows how to work with the Anthropic models.
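As an optional smoke test, something like the following should work against a locally running Gateway; the host/port and the input keys are assumptions (based on the Gateway's default dev address and `QAEvalChain`'s prompt variables), not the exact schema shipped in this MR:

```python
# A hypothetical local smoke test; the URL and input keys are assumptions,
# not the MR's exact routes or schema.
import requests

response = requests.post(
    "http://localhost:5052/qa/invoke",
    json={
        "input": {
            "query": "How do I create a merge request?",
            "answer": "Use the 'New merge request' button on the 'Merge requests' page.",
            "result": "Click 'New merge request' on the 'Merge requests' page.",
        }
    },
)
print(response.json())  # langserve returns the chain output under an "output" key
```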


## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Blocked by !430 (merged)
Demo - https://youtu.be/e3zW6hNVKsc
Ref: gitlab-org/gitlab#427251 (closed)
Ref: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/91#metric-2-consensus-filtering-with-llm-based-evaluation