Revisit documentation for existing dataset creation pipelines

Problem to solve

We lack comprehensive information about the available dataset creation pipelines in the CEF repo, how they function, and what outputs to expect.

Proposal

Update the documentation for existing dataset creation pipelines and create missing documentation following this structure:

  • Dataset creation pipeline name
  • How to run it using the CEF CLI
  • Link to the existing LangSmith dataset created with this command
  • Description of the pipeline
  • Requirements for successful dataset generation:
      - Specify whether GitLab instance data seeding is required (e.g., creating issues or epics before running the command). If required, explain how to perform the seeding.
      - Specify whether an input LangSmith dataset is required (e.g., as in `code-suggestions generate-testcases`). If required, explain how the input dataset was created.
      - Any other prerequisites specific to the pipeline.

Make all changes and create missing documentation in `doc/datasets`.

Further details

CEF currently contains the following dataset creation pipelines:

  • code-suggestions generate-testcases
  • duo-chat cot-qa-docs
  • duo-chat code-explanation
  • root-cause-analysis create-dataset
  • vulnerability-resolution create-dataset
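
To keep the new pages consistent, the proposed structure could be scaffolded once per pipeline. The following is a minimal sketch, not part of CEF: the file-naming scheme, the Markdown layout, and the helper names (`stub_filename`, `render_stub`, `write_stubs`) are all assumptions made for illustration. Only the pipeline names and section titles come from this issue.

```python
from pathlib import Path

# Pipeline names exactly as listed in this issue.
PIPELINES = [
    "code-suggestions generate-testcases",
    "duo-chat cot-qa-docs",
    "duo-chat code-explanation",
    "root-cause-analysis create-dataset",
    "vulnerability-resolution create-dataset",
]

# Section titles taken from the proposed documentation structure.
SECTIONS = [
    "How to run it using the CEF CLI",
    "Link to the existing LangSmith dataset",
    "Description of the pipeline",
    "Requirements for successful dataset generation",
]


def stub_filename(pipeline: str) -> str:
    """Hypothetical naming scheme: spaces and hyphens become underscores."""
    return pipeline.replace(" ", "_").replace("-", "_") + ".md"


def render_stub(pipeline: str) -> str:
    """Render an empty page (Markdown assumed) with one heading per section."""
    lines = [f"# {pipeline}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def write_stubs(target_dir: Path) -> list[Path]:
    """Create a stub for each pipeline that has no page yet; return new paths."""
    target_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pipeline in PIPELINES:
        path = target_dir / stub_filename(pipeline)
        if not path.exists():  # never overwrite existing documentation
            path.write_text(render_stub(pipeline), encoding="utf-8")
            written.append(path)
    return written
```

Calling `write_stubs(Path("doc/datasets"))` from the repository root would create only the missing pages, leaving existing documentation to be updated by hand.
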
Edited by Fabrizio J. Piva