Create sample dataset with control results for experimentation
Problems to solve
As a Duo Chat engineer, I would like to:
- Run the evaluation pipeline using subsets of datasets on local Duo Chat changes
- Get the evaluation metrics and compare with the control results
Running the full datasets (used by the Centralised Evaluation Framework for the daily runs) is time-consuming, costly, and subject to concurrency limits, and it adds no additional value for experimentation.
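To illustrate the "compare with control results" step, here is a minimal sketch of comparing locally computed metrics against stored control scores. The metric names, scores, and tolerance are hypothetical placeholders, not the actual Centralised Evaluation Framework schema.

```python
# Hypothetical control scores from a previous daily run (illustrative values only).
CONTROL_RESULTS = {"correctness": 0.82, "readability": 0.78, "comprehensiveness": 0.75}

def compare_with_control(local_metrics, control=CONTROL_RESULTS, tolerance=0.02):
    """Return per-metric deltas and flag regressions beyond the tolerance."""
    report = {}
    for name, control_score in control.items():
        local_score = local_metrics.get(name)
        if local_score is None:
            # Metric missing from the local run: surface it rather than guess.
            report[name] = {"status": "missing"}
            continue
        delta = local_score - control_score
        status = "regression" if delta < -tolerance else "ok"
        report[name] = {"delta": round(delta, 4), "status": status}
    return report
```

A local run could then be checked with, for example, `compare_with_control({"correctness": 0.79, "readability": 0.80, "comprehensiveness": 0.75})`, which flags `correctness` as a regression and the other two as ok.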
Proposal
Create a new set of sample datasets for each task that:
- Help speed up the evaluation process for small experiments.
- Include representative test cases that help drive improvements.
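One way to build such sample datasets is to draw a small, reproducible subset that keeps every category of test case represented. The sketch below assumes rows with a `category` field; the field name, sizes, and seed are illustrative assumptions, not the actual dataset format.

```python
import random
from collections import defaultdict

def sample_dataset(rows, per_category=5, seed=42):
    """Draw a small, seed-reproducible subset that keeps every category represented."""
    by_category = defaultdict(list)
    for row in rows:
        # "category" is an assumed field; real datasets may group differently.
        by_category[row.get("category", "uncategorised")].append(row)
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same subset
    subset = []
    for category, items in sorted(by_category.items()):
        k = min(per_category, len(items))
        subset.extend(rng.sample(items, k))
    return subset
```

Fixing the seed means the same subset is selected every time, so metric differences between runs come from the local Duo Chat changes rather than from the sampling.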
Iteration
In future iterations we plan to automate this as part of the daily runs and support dynamic subsets for experimentation. We also plan to add more sophistication to the dataset so it serves as a better proxy for production.
Edited by Mon Ray