feat: make pairwise evaluation feature-specific with pre-defined evaluators
What does this merge request do and why?
This MR introduces a set of generic base classes for building feature-specific pairwise evaluation pipelines. Please note that we cannot keep a single generic pairwise evaluation because dataset schemas and DRIs differ between features; see https://gitlab.com/gitlab-com/content-sites/internal-handbook/-/merge_requests/5388 (internal only).
As an example, this MR shows how to build a pairwise evaluation for Duo Chat and its available datasets.
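For reviewers who want a mental model first, here is a minimal sketch of how a generic base class and a feature-specific subclass might fit together. Every name in it (`BasePairwiseEvaluator`, `DuoChatPairwiseEvaluator`, `PairwiseResult`, the `question`/`answer` row fields) is hypothetical and not taken from this MR's code, and the length-based `judge` is only a stand-in for whatever comparison the real evaluator performs.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Iterable, Mapping


@dataclass(frozen=True)
class PairwiseResult:
    """Outcome of comparing two answers to the same question (hypothetical)."""

    question: str
    winner: str  # "a", "b", or "tie"


class BasePairwiseEvaluator(ABC):
    """Hypothetical generic base: owns the comparison loop, not the schema."""

    @abstractmethod
    def extract(self, row: Mapping[str, Any]) -> tuple[str, str]:
        """Map a feature-specific dataset row to (question, answer)."""

    @abstractmethod
    def judge(self, question: str, answer_a: str, answer_b: str) -> str:
        """Return "a", "b", or "tie" for one pair of answers."""

    def evaluate(
        self,
        rows_a: Iterable[Mapping[str, Any]],
        rows_b: Iterable[Mapping[str, Any]],
    ) -> list[PairwiseResult]:
        # Rows are assumed to be aligned: the i-th row of each run
        # answers the same question.
        results = []
        for row_a, row_b in zip(rows_a, rows_b):
            question, answer_a = self.extract(row_a)
            _, answer_b = self.extract(row_b)
            winner = self.judge(question, answer_a, answer_b)
            results.append(PairwiseResult(question, winner))
        return results


class DuoChatPairwiseEvaluator(BasePairwiseEvaluator):
    """Hypothetical Duo Chat subclass: only the schema mapping and judge differ."""

    def extract(self, row: Mapping[str, Any]) -> tuple[str, str]:
        # Assumed field names; the real Duo Chat schema may differ.
        return row["question"], row["answer"]

    def judge(self, question: str, answer_a: str, answer_b: str) -> str:
        # Placeholder rule (longer answer wins), for illustration only.
        if len(answer_a) == len(answer_b):
            return "tie"
        return "a" if len(answer_a) > len(answer_b) else "b"
```

The point of the split is that a new feature only has to supply its own `extract` and `judge`, while the shared loop stays in one place.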
How to set up and validate locally
- Check out this merge request's branch.
- Update your .env file.
- Install dependencies.
poetry install
- Check the existing commands ELI5 provides:
poetry run eli5 --help
- Run pairwise evaluation for the Duo Chat documentation-related dataset:
poetry run eli5 duo-chat evaluate pairwise c1fe0d17-32eb-4697-a5c9-0d5dbb1eb20c b6af3206-9807-4754-ac31-2deb43a1a320 --dataset=duo_chat.cot_qa_docs.1
- Run pairwise evaluation for the Duo Chat issue/epic-related dataset:
poetry run eli5 duo-chat evaluate pairwise b26f592e-1398-4284-ad03-81c486e32bfc 5a67fedf-bd26-42fc-89f4-4c8875ad0f28 --dataset=duo_chat.cot_qa_resources.1
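If you prefer to validate both datasets in one go, the two commands above can be assembled from a small table. This is only a convenience sketch: the helper `pairwise_command` and the parameter names `first_id` and `second_id` are hypothetical, and the script prints the commands rather than running them.

```python
import shlex


def pairwise_command(first_id: str, second_id: str, dataset: str) -> list[str]:
    """Build the argv for one pairwise run, mirroring the commands above."""
    return [
        "poetry", "run", "eli5", "duo-chat", "evaluate", "pairwise",
        first_id, second_id, f"--dataset={dataset}",
    ]


# The two runs listed in the validation steps above.
RUNS = [
    (
        "c1fe0d17-32eb-4697-a5c9-0d5dbb1eb20c",
        "b6af3206-9807-4754-ac31-2deb43a1a320",
        "duo_chat.cot_qa_docs.1",
    ),
    (
        "b26f592e-1398-4284-ad03-81c486e32bfc",
        "5a67fedf-bd26-42fc-89f4-4c8875ad0f28",
        "duo_chat.cot_qa_resources.1",
    ),
]

for run in RUNS:
    # Print each command; pass the list to subprocess.run to execute it.
    print(shlex.join(pairwise_command(*run)))
```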
Merge request checklist
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.