Add mechanisms to code prompt evaluations with generic LLMJudge evaluators
Problem to solve
Every evaluator created in ELI5 has a strictly defined input schema. The same applies to prompt outputs: every prompt defines an output schema that the application expects. To run prompt evaluations in the AIGW, we need to transform the prompt output parameters into a schema that the evaluator supports.
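To make the mismatch concrete, here is a minimal sketch; the `ReviewSummary` model and the evaluator's field names are illustrative assumptions, not the actual ELI5 or AIGW schemas:

```python
from pydantic import BaseModel


# Hypothetical structured output of a prompt (illustrative names only).
class ReviewSummary(BaseModel):
    summary: str
    severity: str


# Hypothetical evaluator input: a flat mapping of strings. A prompt that
# returns a ReviewSummary cannot be scored directly; its output first has to
# be transformed into the schema the evaluator supports, e.g. by serializing
# it to JSON.
prompt_output = ReviewSummary(summary="Adds retry logic", severity="low")
evaluator_input = {"actual_output": prompt_output.model_dump_json()}
```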
Proposal
To evaluate prompts in the AIGW, we can use the following command: `poetry run eval [prompt-id] [prompt-version] [dataset-name]`, or a Makefile target.
When this command runs, Poetry invokes the eval script (https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/blob/main/eval/main.py?ref_type=heads), which finds the prompt and runs the evaluation for the specified datasets and evaluators. At the moment, we don't pass any evaluators, so our metrics are empty.
Consider updating the eval script following the PoC implemented in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2292 (diffs).
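As a rough sketch of what the updated eval script could do, the snippet below wires a single LLM-based correctness evaluator into the run, assuming the evaluation is driven through LangSmith's `evaluate` helper. The `eli5.evaluators` import path, the `correctness_evaluator` name, and the `run_prompt` target are assumptions for illustration; the real evaluator comes from #665 and ELI5 installed as a dependency.

```python
from langsmith import evaluate

# Hypothetical import: the actual module path depends on how ELI5 exposes
# its LLM-based correctness evaluator (see #665).
from eli5.evaluators import correctness_evaluator


def run_prompt(inputs: dict) -> dict:
    """Hypothetical target: resolve the prompt from the Prompt Registry,
    invoke its LangChain chain with the dataset inputs, and return the result."""
    ...


def run_eval(prompt_id: str, prompt_version: str, dataset_name: str) -> None:
    # Score every dataset example with the LLM-based evaluator instead of
    # running the experiment with an empty evaluator list.
    evaluate(
        run_prompt,
        data=dataset_name,
        evaluators=[correctness_evaluator],
        experiment_prefix=f"{prompt_id}@{prompt_version}",
    )
```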
Please note the following key points:
- The first iteration will focus on implementing a single LLM-based evaluator to assess correctness.
- Implementing the LLM-based evaluator is covered in #665 (closed). The eval script should import the evaluator from ELI5, which is installed as a dependency.
- Invoked prompts can return either:
  - A string value, or
  - A structured output (depending on the parser associated with the LangChain chain in the Prompt Registry)
- For the initial iteration, it's enough to have an LLM-based evaluator that compares the expected output against the actual output.
- If the prompt generates a structured value (such as a Pydantic object or dictionary), convert the output to JSON before submitting it to the evaluator (see the sketch after this list).
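A minimal sketch of that normalization step, assuming Pydantic v2; the `to_evaluator_output` helper is ours, not something that already exists in the eval script:

```python
import json
from typing import Any

from pydantic import BaseModel


def to_evaluator_output(output: Any) -> str:
    """Normalize a prompt's output before handing it to the LLM-based evaluator.

    Strings pass through unchanged; Pydantic models and dictionaries are
    serialized to JSON so the evaluator always receives a plain string.
    """
    if isinstance(output, str):
        return output
    if isinstance(output, BaseModel):  # structured output from a parser
        return output.model_dump_json()
    if isinstance(output, dict):
        return json.dumps(output)
    # Fall back to the string representation for anything else.
    return str(output)
```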
Further details
This builds on the work Alejandro started to connect ELI5 to the AIGW. It doesn't require any changes on the CI side.