Implement a generic ExactMatchEvaluator to evaluate prompts

Problem to solve

In gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2225 (merged), we introduced a CI job that provides a skeleton for evaluating prompts. However, we don't ship any evaluators yet, so our metrics are currently empty.

Proposal

Consider implementing an ExactMatchEvaluator in ELI5 that compares expected prompt outputs (coming from the dataset) with actual outputs (coming from running the model with the given prompt). The evaluator performs a strict equality check, similar to a Python assert statement.

An idea:

from typing import TypedDict

class EvaluationInput(TypedDict):
    # Define the input parameters for the evaluator; they can be generic types - list, dict, str, int
    expected_answer: str
    actual_answer: str

class ExactMatchEvaluator(BaseEvaluator[EvaluationInput]):
    # BaseEvaluator is a generic class parameterized by the custom EvaluationInput type
    def _run(self, inputs: EvaluationInput) -> EvaluatorResults | EvaluatorResult:
        # Case-insensitive exact match: 1 if the answers agree, 0 otherwise
        score = int(inputs["expected_answer"].lower() == inputs["actual_answer"].lower())
        return {"key": "exact_match", "score": score}
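The scoring logic itself is independent of ELI5 and can be verified in isolation. A minimal standalone sketch (the `exact_match` helper name and the plain-dict result shape are illustrative assumptions, not ELI5 API):

```python
from typing import TypedDict


class EvaluationInput(TypedDict):
    expected_answer: str
    actual_answer: str


def exact_match(inputs: EvaluationInput) -> dict:
    # Case-insensitive exact match: 1 when the answers agree, 0 otherwise.
    score = int(inputs["expected_answer"].lower() == inputs["actual_answer"].lower())
    return {"key": "exact_match", "score": score}


print(exact_match({"expected_answer": "Yes", "actual_answer": "yes"}))
print(exact_match({"expected_answer": "Yes", "actual_answer": "No"}))
```

Lowercasing both sides makes the match tolerant of casing differences; drop the `.lower()` calls if a byte-for-byte comparison is preferred.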

Further details

We put all generic evaluators in eli5.core.evaluators.*.

Please check the docs about implementing evaluators in ELI5 here: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/tree/main/doc/eli5/evaluators?ref_type=heads

Links / references

For comparing outputs with an LLM instead of exact matching, see this related issue: #665 (closed)

Edited by Alexander Chueshev