New pipeline to create the code-generation dataset.
What does this merge request do and why?
Reference: #133
This MR introduces a new pipeline that generates a code-generation dataset. The dataset is built from GitLab codebases so that it reflects real-world code-generation complexity.
The process of generating this dataset:
- Chunk GitLab codebases into blocks of functions and classes.
- Have an LLM read each block and ask it to generate the prompt that would produce that code.
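The two steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual implementation: it assumes Python source and uses the standard-library `ast` module for chunking, whereas the real pipeline may chunk other languages differently. The names `chunk_source` and `build_reverse_prompt` and the prompt wording are hypothetical, and the LLM call itself is left out.

```python
import ast


def chunk_source(source: str) -> list[dict]:
    """Split Python source into top-level function and class blocks."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                    # Recover the exact source text of this block.
                    "code": ast.get_source_segment(source, node),
                }
            )
    return chunks


def build_reverse_prompt(chunk: dict) -> str:
    """Build the (hypothetical) instruction sent to the LLM for one block."""
    return (
        "Read the following code and write the instruction a developer "
        "would give to have it generated.\n\n"
        f"{chunk['code']}"
    )


# Small made-up input, used only to exercise the sketch.
EXAMPLE_SOURCE = '''
def add(a, b):
    return a + b


class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''

chunks = chunk_source(EXAMPLE_SOURCE)
prompts = [build_reverse_prompt(chunk) for chunk in chunks]
```

Each resulting dataset row would then pair the LLM's generated prompt with the original block as the reference answer.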
How to set up and validate locally
poetry run promptlib duo-chat make-dataset-code-generation --config-file data/config/duochat_make_dataset_code_generation.json --test-run
Merge request checklist
- I've run the affected pipeline(s) to validate that nothing is broken. https://console.cloud.google.com/bigquery?authuser=0&project=dev-ai-research-0e2f8974&ws=!1m5!1m4!4m3!1sdev-ai-research-0e2f8974!2sduo_chat_experiments!3shyang_code_generation_dataset_delme
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Hongtao Yang