[Experimental]: Implement QA Duo Chat evaluators with their API endpoints
## What does this merge request do and why?
This MR implements two QA Duo Chat evaluators: `qa` and `context_qa`. Both evaluators rely on an LLM, which in this iteration is Anthropic's `claude-2`.
- Evaluator `qa`: instructs `claude-2` to directly grade an answer from Duo Chat as "correct" or "incorrect" against the reference answer.
- Evaluator `context_qa`: instructs `claude-2` to use reference "context" when determining correctness. This is useful when we have a corpus of questions but no ground-truth answer for each question.

A rough sketch of both evaluators follows below.
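To make the grading flow concrete, here is a minimal sketch assuming the evaluators wrap LangChain's `QAEvalChain` and `ContextQAEvalChain` with `claude-2` as the grading model; the exact classes, prompts, and parameters used in this MR may differ:

```python
# A minimal sketch, assuming the evaluators build on LangChain's off-the-shelf
# QA evaluation chains; the actual classes and prompts in this MR may differ.
from langchain.chat_models import ChatAnthropic
from langchain.evaluation.qa import ContextQAEvalChain, QAEvalChain

llm = ChatAnthropic(model="claude-2", temperature=0.0)

# `qa`: grades a Duo Chat answer against a reference (ground-truth) answer.
qa_evaluator = QAEvalChain.from_llm(llm)

# `context_qa`: grades the answer against reference context instead of a
# ground-truth answer.
context_qa_evaluator = ContextQAEvalChain.from_llm(llm)

grade = qa_evaluator.evaluate_strings(
    input="How do I create a merge request?",
    prediction="Click 'New merge request' on the 'Merge requests' page.",
    reference="Use the 'New merge request' button on the 'Merge requests' page.",
)
print(grade)  # e.g. {'reasoning': ..., 'value': 'CORRECT', 'score': 1}
```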
Please note that this MR also provides several API endpoints for each evaluator, implemented with `langserve` (see the sketch after the list):
- `/invoke` - for invoking an evaluator with a single input
- `/batch` - for invoking an evaluator with multiple inputs
- `/stream` - for streaming the output of an evaluator
- `/stream_log` - for streaming intermediate outputs of an evaluator
- `/input_schema` - for returning the input schema of the evaluator
- `/output_schema` - for returning the output schema of the evaluator
- `/config_schema` - for returning the config schema of the evaluator
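For context on where these endpoints come from, the sketch below shows how `langserve`'s `add_routes()` generates them for a mounted chain; the app title and the `/qa` and `/context-qa` mount paths are illustrative assumptions, not the MR's actual routes:

```python
# A minimal sketch, assuming the evaluators are exposed via langserve's
# add_routes(); the mount paths (/qa, /context-qa) are illustrative.
from fastapi import FastAPI
from langchain.chat_models import ChatAnthropic
from langchain.evaluation.qa import ContextQAEvalChain, QAEvalChain
from langserve import add_routes

llm = ChatAnthropic(model="claude-2", temperature=0.0)
app = FastAPI(title="Duo Chat QA evaluators")

# Each add_routes() call generates the /invoke, /batch, /stream, /stream_log
# and schema endpoints listed above under the given path prefix.
add_routes(app, QAEvalChain.from_llm(llm), path="/qa")
add_routes(app, ContextQAEvalChain.from_llm(llm), path="/context-qa")
```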
## How to set up and validate locally
No additional setup is needed; the Gateway already knows how to work with the Anthropic models.
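As an optional smoke test, something like the following should work against a locally running Gateway; the host/port and the input keys are assumptions (based on the Gateway's default dev address and `QAEvalChain`'s prompt variables), not the exact schema shipped in this MR:

```python
# A hypothetical local smoke test; the URL and input keys are assumptions,
# not the MR's exact routes or schema.
import requests

response = requests.post(
    "http://localhost:5052/qa/invoke",
    json={
        "input": {
            "query": "How do I create a merge request?",
            "answer": "Use the 'New merge request' button on the 'Merge requests' page.",
            "result": "Click 'New merge request' on the 'Merge requests' page.",
        }
    },
)
print(response.json())  # langserve returns the chain output under an "output" key
```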


## Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Blocked by !430 (merged)
Demo - https://youtu.be/e3zW6hNVKsc
Ref: gitlab-org/gitlab#427251 (closed)
Ref: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/91#metric-2-consensus-filtering-with-llm-based-evaluation