Implement a generic ExactMatchEvaluator to evaluate prompts
Problem to solve
In gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2225 (merged), we introduced a CI job that provides a skeleton for evaluating prompts. At the moment, we don't ship any evaluators, so our metrics are empty.
Proposal
Consider implementing an ExactMatchEvaluator in ELI5 that compares expected prompt outputs (coming from the dataset) against actual outputs (produced by running the model on the given prompt). This evaluator performs a comparison similar to a Python `assert` statement on equality.
An idea:

```python
from typing import TypedDict


class EvaluationInput(TypedDict):
    # Define the input parameters for the evaluator; they can be
    # generic types here - list, dict, str, int
    expected_answer: str
    actual_answer: str


class ExactMatchEvaluator(BaseEvaluator[EvaluationInput]):
    # BaseEvaluator is a generic class parameterized by the custom
    # EvaluationInput that defines its input data
    def _run(self, inputs: EvaluationInput) -> EvaluatorResults | EvaluatorResult:
        # Case-insensitive comparison; 1 on an exact match, 0 otherwise
        score = int(inputs["expected_answer"].lower() == inputs["actual_answer"].lower())
        return {"key": "exact_match", "score": score}
```
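The ELI5 `BaseEvaluator` plumbing isn't shown in this issue, but the scoring logic itself is framework-independent. As a self-contained sketch (plain function instead of the evaluator class, which is an assumption for illustration), the core comparison reduces to:

```python
from typing import TypedDict


class EvaluationInput(TypedDict):
    expected_answer: str
    actual_answer: str


def exact_match_score(inputs: EvaluationInput) -> dict:
    # Case-insensitive exact match, equivalent to:
    #   assert inputs["expected_answer"].lower() == inputs["actual_answer"].lower()
    # but returning a 0/1 score instead of raising.
    score = int(inputs["expected_answer"].lower() == inputs["actual_answer"].lower())
    return {"key": "exact_match", "score": score}


# Differing case still counts as a match; any other difference scores 0.
print(exact_match_score({"expected_answer": "Paris", "actual_answer": "paris"}))
print(exact_match_score({"expected_answer": "Paris", "actual_answer": "Lyon"}))
```

Note that `_run` in the evaluator would return this same `{"key": ..., "score": ...}` dict so the CI job can aggregate the metric across the dataset.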
Further details
We put all generic evaluators in eli5.core.evaluators.*.
Please check the docs about implementing evaluators in ELI5 here: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/tree/main/doc/eli5/evaluators?ref_type=heads
Links / references
For comparing outputs with an LLM instead of an exact match, see #665 (closed).