List PL dataset creation pipelines not covered in ELI5
## Problem to solve
The Prompt Library contains logic for creating the datasets used to run evaluations. This logic relies on BigQuery and Apache Beam. Since we now rely on LangSmith, and given our evaluation consolidation efforts, this dataset logic is no longer well maintained and needs to be moved to ELI5.
## Proposal
Move the following Prompt Library dataset creation pipelines to ELI5:
| PL dataset creation pipeline name | Link to the code file | Dataset used by the eval runner |
|---|---|---|
| duo_chat code explanation dataset | promptlib/duo_chat/make_dataset_code_explanation.py | duo_chat.explain_code.1 |
| vulnerability resolution dataset | promptlib/etv/extract/extract_data.py | will be created by #669 (closed) |
| code_suggestions testcase generation | promptlib/code_suggestions/generate_testcases.py | code-suggestions-input-testcases-v1 |
| root_cause_analysis dataset | promptlib/root_cause_analysis/extract_data.py | duo_chat.slash_troubleshoot.1 |
## Further details
A recent example of a dataset creation pipeline that has not been moved to ELI5 but is already used by the eval runner can be found in this merge request: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/merge_requests/109
For comparison, here is an example of a dataset creation pipeline that is used by the Eval Runner and has already been moved to ELI5: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/main/eli5/eli5/duochat/data/collectors/qa_docs.py?ref_type=heads
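To make the migration target concrete, here is a minimal sketch of the kind of collector the migrated pipelines would become: a function that turns raw extracted rows into LangSmith-style examples with `inputs`/`outputs` dicts, the shape accepted by the LangSmith client when populating a dataset. The `SourceRow` shape and the function name are hypothetical illustrations, not the actual ELI5 API; the real pipelines read their rows from BigQuery or a Beam job rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical record shape: the real pipelines extract rows
# from BigQuery or an Apache Beam job.
@dataclass
class SourceRow:
    question: str
    expected_answer: str

def to_langsmith_examples(rows: Iterable[SourceRow]) -> list[dict]:
    """Convert raw rows into LangSmith-style examples, i.e. dicts with
    separate `inputs` and `outputs` keys, ready to be uploaded to a
    dataset that the Eval Runner can reference by name."""
    return [
        {
            "inputs": {"question": row.question},
            "outputs": {"expected_answer": row.expected_answer},
        }
        for row in rows
    ]

if __name__ == "__main__":
    sample = [SourceRow("Explain this function", "It sorts a list in place.")]
    print(to_langsmith_examples(sample))
```

Keeping the extraction step separate from a small, pure conversion function like this makes each collector easy to unit-test before ownership is handed to feature teams.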
We need to ensure all dataset creation pipelines are in a manageable state before transferring ownership to feature teams. Based on the collected list, we can schedule follow-up issues to plan the migration work.
## Additional links
The Custom Models team tracks all validation datasets (both added and planned) used by the Evaluation runner in this epic: gitlab-org&16626.