Implement a generic LLMJudgeEvaluator to assess prompt correctness

Problem to solve

In gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2225 (merged), we introduced a CI job that provides a skeleton for evaluating prompts. At the moment, however, no evaluators are shipped, so the reported metrics are empty.

Proposal

Consider implementing an LLMJudgeEvaluator in ELI5 that uses an LLM to compare expected prompt outputs (coming from the dataset) with actual outputs (produced by running the model on the given prompt). For this iteration, we can focus on using Anthropic Claude 3.7/3.5 as the LLM judge. The PoC in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2292 (diffs) can be used as a reference for implementing the LLM judge that assesses prompt correctness.

An idea:

PROMPT_SYSTEM = """
You are an AI assistant ...
Compare expected input vs actual.
Expected:
{expected}

Actual:
{actual}
"""

class EvaluatorInput(TypedDict):
    # Define your input parameters for the prompt
    expected: str
    actual: str

class ModelOutput(BaseModel):
    # Define the expected schema for the model responses
    reasoning: str = Field(description="Provide your reasoning process")
    score: int = Field(description="1 if correct and 0 if incorrect")


class Evaluator(BaseLLMEvaluator[EvaluatorInput]):
   .....
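
For the elided Evaluator body, a minimal standalone sketch of the judge call could look like the following. It bypasses BaseLLMEvaluator (whose interface is defined by ELI5, see the docs linked below) and calls the Anthropic Python SDK directly; the function name evaluate_prompt and the model id are assumptions, not part of the ELI5 API.

import anthropic


def evaluate_prompt(inputs: EvaluatorInput) -> ModelOutput:
    """Ask Claude to compare the expected and actual outputs and return a verdict."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model id; adjust as needed
        max_tokens=1024,
        system=PROMPT_SYSTEM.format(**inputs),
        messages=[
            {
                "role": "user",
                "content": "Reply with a JSON object containing 'reasoning' (string) "
                "and 'score' (1 if correct, 0 if incorrect).",
            }
        ],
    )

    # Assumes the judge replies with bare JSON matching the ModelOutput schema above.
    return ModelOutput.model_validate_json(response.content[0].text)

Asking for JSON and validating it with the ModelOutput schema keeps the judge's reasoning and score machine-readable, so the CI job can aggregate scores into metrics.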

Please check the docs about implementing evaluators in ELI5 here: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/tree/main/doc/eli5/evaluators?ref_type=heads

Further details

We put all generic evaluators in eli5.core.evaluators.*.
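
For example, if the new evaluator lands in a module such as llm_judge (the module name here is hypothetical), it would be importable like this:

# Hypothetical placement; the actual module name is up to the implementer.
from eli5.core.evaluators.llm_judge import Evaluator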