Create sample dataset with control results for experimentation
Problems to solve
As a Duo Chat engineer, I would like to:
- Run the evaluation pipeline using subsets of datasets on local Duo Chat changes
- Get the evaluation metrics and compare with the control results
Running the full datasets (used by the Centralised Evaluation Framework for the daily runs) is time-consuming, costly, and subject to concurrency limits, and it adds no additional value for experimentation.
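To illustrate the "compare with control results" step, here is a minimal sketch of comparing locally computed metrics against stored control scores. The metric names, scores, and tolerance are hypothetical placeholders, not the actual Centralised Evaluation Framework schema.

```python
# Hypothetical control scores from a previous daily run (illustrative values only).
CONTROL_RESULTS = {"correctness": 0.82, "readability": 0.78, "comprehensiveness": 0.75}

def compare_with_control(local_metrics, control=CONTROL_RESULTS, tolerance=0.02):
    """Return per-metric deltas and flag regressions beyond the tolerance."""
    report = {}
    for name, control_score in control.items():
        local_score = local_metrics.get(name)
        if local_score is None:
            # Metric missing from the local run: surface it rather than guess.
            report[name] = {"status": "missing"}
            continue
        delta = local_score - control_score
        status = "regression" if delta < -tolerance else "ok"
        report[name] = {"delta": round(delta, 4), "status": status}
    return report
```

A local run could then be checked with, for example, `compare_with_control({"correctness": 0.79, "readability": 0.80, "comprehensiveness": 0.75})`, which flags `correctness` as a regression and the other two as ok.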
Proposal
Create a new set of sample datasets for each task that:
- Help speed up the evaluation process for small experiments.
- Include representative test cases that help drive improvements.
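One way to build such sample datasets is to draw a small, reproducible subset that keeps every category of test case represented. The sketch below assumes rows with a `category` field; the field name, sizes, and seed are illustrative assumptions, not the actual dataset format.

```python
import random
from collections import defaultdict

def sample_dataset(rows, per_category=5, seed=42):
    """Draw a small, seed-reproducible subset that keeps every category represented."""
    by_category = defaultdict(list)
    for row in rows:
        # "category" is an assumed field; real datasets may group differently.
        by_category[row.get("category", "uncategorised")].append(row)
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same subset
    subset = []
    for category, items in sorted(by_category.items()):
        k = min(per_category, len(items))
        subset.extend(rng.sample(items, k))
    return subset
```

Fixing the seed means the same subset is selected every time, so metric differences between runs come from the local Duo Chat changes rather than from the sampling.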
Iteration
In future iterations we plan to automate this as part of the daily runs and support dynamic subsets for experimentation. We also plan to add more sophistication to the dataset so it serves as a better proxy for production.
Edited by Mon Ray