Automatically generate initial datasets for prompt evaluations

Problem to Solve

In #561 (closed), we aim to implement logic for evaluating prompts against a given dataset. While we can build the evaluation logic, we still lack relevant datasets. Prompt evaluation datasets generally differ from the existing E2E feature evaluation datasets, making those hard to reuse without significant manual effort. To better support feature teams in their evaluations, we need an approach for generating a basic dataset that can be used for initial prompt evaluation runs. Such a generated dataset gives teams a first overview from which to iterate and improve their testing.

Proposal

We can leverage an LLM to generate a synthetic dataset for a given prompt. This dataset can serve as an initial version to be refined later through manual effort or alternative approaches. The LLM accepts the prompt and its output schema (either simple strings or more structured formats like those used in Duo Chat) and generates appropriate dataset rows. After the dataset is uploaded to LangSmith, the author can manually clean it and then run prompt evaluations.
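As a rough illustration of the flow above, the sketch below builds the instruction sent to a dataset-generating LLM and parses its JSON reply into rows. All names here (`build_generation_request`, `parse_rows`, the `"input"`/`"expected_output"` row shape) are hypothetical, not an agreed design; the actual LLM call and the LangSmith upload (e.g. via the `langsmith` client) are left out since they depend on credentials and the chosen model.

```python
import json

# Hypothetical template: the real version would likely be tuned per prompt
# schema (plain strings vs. structured formats like Duo Chat's).
GENERATION_TEMPLATE = """You are generating an evaluation dataset.
Prompt under test:
{prompt}

Expected output schema (JSON): {schema}

Return a JSON array of {n} objects, each with "input" and
"expected_output" keys matching the schema above."""


def build_generation_request(prompt: str, schema: dict, n: int = 5) -> str:
    """Render the instruction to send to the dataset-generating LLM."""
    return GENERATION_TEMPLATE.format(
        prompt=prompt, schema=json.dumps(schema), n=n
    )


def parse_rows(llm_reply: str) -> list[dict]:
    """Parse the LLM's JSON reply, keeping only well-formed rows.

    Malformed rows are dropped rather than failing the whole batch,
    since the author will manually clean the dataset afterwards anyway.
    """
    rows = json.loads(llm_reply)
    return [r for r in rows if {"input", "expected_output"} <= r.keys()]


# Example: a reply with one valid row and one malformed row.
reply = (
    '[{"input": {"question": "hi"}, "expected_output": {"answer": "hello"}},'
    ' {"unexpected": true}]'
)
rows = parse_rows(reply)  # only the first, well-formed row survives
```

The surviving `rows` could then be uploaded to LangSmith as dataset examples, after which the cleanup-and-evaluate steps described above apply unchanged.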

  • Open feedback issue
Edited by Mark Lapierre