Implement a generic LLMJudgeEvaluator to assess prompt correctness
Problem to solve
In gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2225 (merged), we introduced a CI job that provides a skeleton for evaluating prompts. At the moment, we don't ship any evaluators, so our metrics are empty.
Proposal
Consider implementing an LLMJudgeEvaluator in ELI5 that compares expected prompt outputs (coming from the dataset) with actual outputs (coming from running the model with the given prompt) using LLM capabilities. In this iteration, we can focus on using Anthropic Claude 3.7/3.5 as the LLM judge. A PoC that can be used as a reference for implementing the LLM judge to assess prompt correctness: gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2292 (diffs).
An idea:
from typing import TypedDict

from pydantic import BaseModel, Field

# BaseLLMEvaluator is provided by ELI5; import it from wherever the library exposes it.

PROMPT_SYSTEM = """
You are an AI assistant ...
Compare the expected output against the actual output.

Expected:
{expected}

Actual:
{actual}
"""

class EvaluatorInput(TypedDict):
    # Input parameters for the judge prompt
    expected: str
    actual: str

class ModelOutput(BaseModel):
    # Expected schema for the judge model's responses
    reasoning: str = Field(description="Provide your reasoning process")
    score: int = Field(description="1 if correct and 0 if incorrect")

class Evaluator(BaseLLMEvaluator[EvaluatorInput]):
    ...
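For reference, here is a minimal sketch of what the judge call itself could look like, using the Anthropic Python SDK directly together with the PROMPT_SYSTEM and ModelOutput definitions above. The model alias, the plain-JSON response parsing, and the judge function name are assumptions for illustration only; the real logic should live in whatever hook BaseLLMEvaluator expects.

import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(expected: str, actual: str) -> ModelOutput:
    # Ask the judge model to compare expected vs actual and return a JSON verdict.
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any recent Claude model can act as the judge
        max_tokens=1024,
        system="Respond only with a JSON object containing 'reasoning' (str) and 'score' (0 or 1).",
        messages=[
            {
                "role": "user",
                "content": PROMPT_SYSTEM.format(expected=expected, actual=actual),
            }
        ],
    )
    # Validate the raw JSON against the ModelOutput schema defined above.
    return ModelOutput.model_validate(json.loads(response.content[0].text))

In the real evaluator, the base class will likely own the client and response handling; the sketch only shows the prompt wiring and the structured 0/1 score that our metrics can aggregate.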
Please check the docs about implementing evaluators in ELI5 here: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/tree/main/doc/eli5/evaluators?ref_type=heads
Further details
We put all generic evaluators in eli5.core.evaluators.*.
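For example, once implemented, the judge could be imported from that package; the module and class names below are hypothetical placeholders.

# Hypothetical import path; the actual module name is up to the implementation.
from eli5.core.evaluators.llm_judge import LLMJudgeEvaluator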