Synthetic Prompt Eval Dataset Generator Feedback

Context

This issue is used to collect internal feedback for the Synthetic Prompt Eval Dataset Generator tool.

The tool uses an LLM (Claude) to automatically generate synthetic datasets for prompt evaluations. It enables teams to quickly generate an initial dataset for prompt evaluation, which can serve as a foundation for more refined testing approaches later. This helps bridge the gap between implementing prompt evaluation logic and having relevant datasets to test with.

How to get started

Follow these steps to use the Synthetic Prompt Eval Dataset Generator:

  1. Make sure you have valid ANTHROPIC_API_KEY and LANGCHAIN_API_KEY values in your .env file

  2. Install eval dependencies: poetry install --with eval

  3. Run the generate-dataset command with appropriate parameters:

    poetry run generate-dataset <prompt_id> <version> <dataset_name> --upload

    Where:

    • prompt_id: The ID of the AIGW prompt (e.g., chat/explain_code)
    • version: The version of the AIGW prompt
    • dataset_name: Name for the output dataset (used as the local filename when saving the dataset, and as the dataset name in LangSmith when the --upload option is used)

    For example:

    poetry run generate-dataset chat/explain_code 1.0.2 duo_chat.explain_code.2 --upload
  4. The tool will:

    • Analyze the prompt definition to understand its purpose
    • Create diverse input examples covering varied cases
    • Generate expected outputs for each input
    • Save the resulting dataset as a JSONL file
    • Optionally upload the dataset to LangSmith (if the --upload flag is used)
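The saved file follows the JSONL convention: one independent JSON object per line. As a minimal sketch of how such a file can be inspected (the field names "input" and "expected_output" here are assumptions for illustration, not the tool's documented schema):

```python
import json

# Hypothetical JSONL content resembling a generated dataset; the
# field names are assumptions, not the generator's actual schema.
sample_jsonl = """\
{"input": "Explain this code: print('hi')", "expected_output": "Prints the string 'hi'."}
{"input": "Explain this code: x = [i * i for i in range(3)]", "expected_output": "Builds the list [0, 1, 4]."}
"""

# Each non-empty line is parsed independently as a JSON object.
examples = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
print(len(examples))        # 2
print(sorted(examples[0]))  # ['expected_output', 'input']
```

Parsing line by line like this (rather than as one JSON document) is what makes JSONL files easy to split, concatenate, and stream.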

See the documentation for more information including CLI options.

How to leave feedback

  • Please post a comment on this issue to leave your feedback
  • Include as much information as possible, e.g., the command you used, the quality of the generated dataset, any issues encountered, etc.
  • Screenshots of problems and examples of generated datasets are greatly appreciated!
  • Share how you used the generated dataset in your evaluation process
  • Positive feedback is also welcome 😸

Known limitations

  • The --upload option will show an error if a dataset with the same name already exists in LangSmith.
    • If you want to replace an existing dataset, delete it first.
    • If you want to add examples to an existing dataset, download the old dataset as JSONL, combine it with the newly generated examples, and upload the result to LangSmith as a new dataset (see also the instructions in the datasets project).
  • The current implementation generates a maximum of 8,192 tokens per run, which limits the number of examples that can be generated in a single execution. To build a larger dataset, run the tool multiple times without the --upload option, combine the resulting JSONL files, and upload the combined file as a new dataset.
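Because JSONL files are line-oriented, combining the output of several runs is a simple concatenation. A small helper sketch along these lines (the function name and file names are illustrative, not part of the tool) also validates each line and drops exact duplicates across runs:

```python
import json
from pathlib import Path

def combine_jsonl(parts, output_path):
    """Concatenate several JSONL dataset files into one.

    Validates that every line is well-formed JSON and skips lines
    that are byte-for-byte duplicates across the input files.
    """
    seen = set()
    with open(output_path, "w") as out:
        for part in parts:
            for line in Path(part).read_text().splitlines():
                if not line.strip():
                    continue          # ignore blank lines
                json.loads(line)      # fail fast on malformed lines
                if line not in seen:  # drop exact duplicates
                    seen.add(line)
                    out.write(line + "\n")

# Example usage (file names are placeholders for your own runs):
# combine_jsonl(["run1.jsonl", "run2.jsonl"], "combined.jsonl")
```

The combined file can then be uploaded to LangSmith as a new dataset, as described above.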
Edited by Mark Lapierre