Re-visit documentation about existing dataset creation pipelines
Problem to solve
We lack comprehensive information about the available dataset creation pipelines in the CEF repo, how they function, and what outputs to expect.
Proposal
Update the documentation for existing dataset creation pipelines and create missing documentation following this structure:
- Dataset creation pipeline name
- How to run it using the CEF CLI
- Link to the existing LangSmith dataset created with this command
- Description of the pipeline
- Requirements for successful dataset generation:
  - Specify whether GitLab instance data seeding is required (e.g., creating issues or epics before running the command). If required, explain how to perform the seeding.
  - Specify whether an input LangSmith dataset is required (e.g., as in `code-suggestions generate-testcases`). If required, explain how the input dataset was created.
  - etc.
Make all changes and create missing documentation in `doc/datasets`.
Further details
CEF currently contains the following dataset creation pipelines:
- `code-suggestions generate-testcases`
- `duo-chat cot-qa-docs`
- `duo-chat code-explanation`
- `root-cause-analysis create-dataset`
- `vulnerability-resolution create-dataset`
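As a possible starting point, the proposed structure could be scaffolded with a small script. The sketch below is only an illustration, not part of CEF: it assumes the docs are Markdown files, and the file naming scheme (`stub_filename`) and the helper names are hypothetical. It writes one stub per pipeline listed above into a target directory and skips any file that already exists.

```python
from pathlib import Path

# Pipeline names exactly as listed in this issue.
PIPELINES = [
    "code-suggestions generate-testcases",
    "duo-chat cot-qa-docs",
    "duo-chat code-explanation",
    "root-cause-analysis create-dataset",
    "vulnerability-resolution create-dataset",
]

# Section headings follow the structure proposed in this issue.
SECTIONS = [
    "How to run it using the CEF CLI",
    "Link to the existing LangSmith dataset",
    "Description of the pipeline",
    "Requirements for successful dataset generation",
]


def stub_filename(pipeline: str) -> str:
    """Hypothetical naming scheme: spaces become underscores, '.md' is appended."""
    return pipeline.replace(" ", "_") + ".md"


def render_stub(pipeline: str) -> str:
    """Return a Markdown skeleton with a TODO under each proposed section."""
    lines = [f"# {pipeline}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def scaffold(doc_dir: Path) -> list[Path]:
    """Create stubs for pipelines without a page; existing files are never overwritten."""
    doc_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for pipeline in PIPELINES:
        path = doc_dir / stub_filename(pipeline)
        if not path.exists():
            path.write_text(render_stub(pipeline), encoding="utf-8")
            created.append(path)
    return created


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for path in scaffold(Path("doc/datasets")):
        print(f"created {path}")
```

Because existing files are left untouched, the script would only cover the "create missing documentation" half of the proposal; pages that already exist still need to be reviewed by hand against the same section list.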
Edited by Fabrizio J. Piva