
New pipeline to create the code-generation dataset.

Hongtao Yang requested to merge hyang/code-generation-data into main

What does this merge request do and why?

Reference: #133

This MR introduces a new pipeline that generates a code-generation dataset. The dataset is built from GitLab codebases so that it reflects real-world code-generation complexity.

The process of generating this dataset:

  1. Chunk GitLab codebases into blocks of functions and classes.
  2. Have an LLM read each block and generate the prompt that would produce that code.
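
The two steps above can be illustrated with a minimal sketch. It is not the pipeline's actual implementation: it assumes Python source files and uses the standard-library `ast` module for chunking, whereas the real pipeline may handle other languages and use a different parser. The function names `chunk_functions_and_classes` and `build_reverse_prompt_request`, and the wording of the instruction sent to the LLM, are hypothetical.

```python
import ast


def chunk_functions_and_classes(source: str) -> list[dict]:
    """Split Python source into its top-level function and class blocks.

    Assumption: the input is valid Python. Other statements (imports,
    constants) are skipped, since only functions and classes are kept.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                    # Exact source text of the block (decorators are not included).
                    "code": ast.get_source_segment(source, node),
                }
            )
    return chunks


def build_reverse_prompt_request(chunk: dict) -> str:
    """Build the text sent to an LLM asking it to write the prompt for a chunk.

    The instruction wording is a hypothetical placeholder.
    """
    return (
        "Read the following code and write the instruction a developer "
        "would give to have it generated.\n\n"
        f"{chunk['code']}"
    )


if __name__ == "__main__":
    sample = (
        "import os\n\n"
        "def add(a, b):\n"
        "    return a + b\n\n"
        "class Greeter:\n"
        "    def hello(self):\n"
        "        return 'hi'\n"
    )
    for chunk in chunk_functions_and_classes(sample):
        print(chunk["kind"], chunk["name"])
        print(build_reverse_prompt_request(chunk))
```

Each resulting (generated prompt, original code) pair would then form one dataset example; the call to the LLM itself is left out because its client and model are not specified here.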

How to set up and validate locally

poetry run promptlib duo-chat make-dataset-code-generation --config-file data/config/duochat_make_dataset_code_generation.json --test-run

Merge request checklist

Edited by Hongtao Yang
