
[Experimental]: Implement QA Duo Chat evaluators with their API endpoints

Alexander Chueshev requested to merge ac/chat-eval-qa-chains into main

What does this merge request do and why?

This MR implements two QA Duo Chat evaluators: qa and context_qa. Both evaluators rely on an LLM, which is Anthropic claude-2 in this iteration. A usage sketch follows the list below.

  1. Evaluator qa: instructs claude-2 to directly grade an answer from Duo Chat as "correct" or "incorrect" against a reference answer.
  2. Evaluator context_qa: instructs claude-2 to use a reference "context" when determining correctness. This is useful when we have a corpus of questions but no ground-truth answer for each question.
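
For illustration, here is a minimal sketch of how such evaluators can be assembled from LangChain's built-in QA evaluation chains. This is not the exact code from this MR; the model settings and the example question/answer strings are assumptions.

```python
# Hypothetical sketch only: wiring qa and context_qa evaluators with
# LangChain's built-in evaluation chains and Anthropic claude-2.
from langchain.chat_models import ChatAnthropic
from langchain.evaluation import EvaluatorType, load_evaluator

llm = ChatAnthropic(model="claude-2", temperature=0.0)

# "qa": grade the Duo Chat answer directly against a reference answer.
qa_evaluator = load_evaluator(EvaluatorType.QA, llm=llm)
result = qa_evaluator.evaluate_strings(
    input="How do I create a new Git branch?",          # question
    prediction="Run `git checkout -b <branch-name>`.",  # Duo Chat answer
    reference="Use `git branch <name>` or `git checkout -b <name>`.",
)
print(result)  # e.g. {"reasoning": "...", "value": "CORRECT", "score": 1}

# "context_qa": grade the answer against reference context instead of a
# ground-truth answer.
context_qa_evaluator = load_evaluator(EvaluatorType.CONTEXT_QA, llm=llm)
result = context_qa_evaluator.evaluate_strings(
    input="Which Git command deletes a local branch?",
    prediction="git branch -d <branch-name>",
    reference="Docs excerpt: `git branch -d <name>` deletes a merged branch; "
              "use `-D` to force deletion.",  # reference context
)
```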

Please note that this MR also provides several API endpoints for each evaluator, implemented with langserve (see the sketch after this list):

  • /invoke - for invoking an evaluator with a single input
  • /batch - for invoking an evaluator with multiple inputs
  • /stream - for streaming the output of an evaluator
  • /stream_log - for streaming intermediate outputs of an evaluator
  • /input_schema - for returning the input schema of the evaluator
  • /output_schema - for returning the output schema of the evaluator
  • /config_schema - for returning the config schema of the evaluator
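
For a rough picture of how these routes come about, here is a hypothetical sketch using langserve's add_routes, which auto-generates all of the endpoints above for a runnable. The mount path and the simplified grading prompt are assumptions, not the code from this MR.

```python
# Hypothetical sketch only: serving an evaluator chain with langserve.
from fastapi import FastAPI
from langchain.chat_models import ChatAnthropic
from langchain.prompts import ChatPromptTemplate
from langserve import add_routes

llm = ChatAnthropic(model="claude-2", temperature=0.0)

# A simplified grading chain; the real evaluator prompts live in this MR.
qa_chain = (
    ChatPromptTemplate.from_template(
        "Question: {query}\n"
        "Reference answer: {answer}\n"
        "Candidate answer: {result}\n"
        "Grade the candidate answer as CORRECT or INCORRECT."
    )
    | llm
)

app = FastAPI()
# Registers /qa/invoke, /qa/batch, /qa/stream, /qa/stream_log,
# /qa/input_schema, /qa/output_schema, and /qa/config_schema.
add_routes(app, qa_chain, path="/qa")

# Run locally with: uvicorn app:app --reload
```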

How to set up and validate locally

No additional setup steps are required; the Gateway already knows how to work with the Anthropic models.
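
If you want a quick smoke test against a locally running gateway, something like the following should work. The port and the /qa mount path are assumptions here and depend on how the evaluator routes are mounted; the input keys follow the simplified chain sketched above.

```python
# Hypothetical smoke test: call the /invoke endpoint of the qa evaluator.
import requests

resp = requests.post(
    "http://localhost:5052/qa/invoke",  # port and path are assumptions
    json={
        "input": {
            "query": "How do I create a new Git branch?",
            "answer": "Use `git checkout -b <name>`.",
            "result": "Run `git checkout -b <branch-name>`.",
        }
    },
)
print(resp.json())
```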

Merge request checklist

  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.

Blocked by !430 (merged)
Demo - https://youtu.be/e3zW6hNVKsc

Ref: gitlab-org/gitlab#427251 (closed)
Ref: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/prompt-library/-/issues/91#metric-2-consensus-filtering-with-llm-based-evaluation

cc @oregand @jessieay @bcardoso- @stanhu
