Move Duo Chat code explanation dataset pipeline to ELI5
Move Duo Chat code explanation from Prompt Library to ELI5
This MR migrates the code explanation dataset pipeline from Prompt Library to the ELI5 codebase. The most important changes are:
- Created the necessary structure in eli5/duochat for the code explanation dataset pipeline
- Implemented the pipeline so that it reads its input from BigQuery
- Modified the code to write output to local JSONL files instead of BigQuery tables
- Improved robustness by using a pandas DataFrame to write the JSONL file
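The last two points can be illustrated with a minimal sketch, assuming the query results are already held in a pandas DataFrame. The function name `write_jsonl` and the column names `question` and `context` are hypothetical and are not taken from the ELI5 codebase; only `DataFrame.to_json` is a real pandas call.

```python
import pandas as pd


def write_jsonl(df: pd.DataFrame, output_jsonl_path: str) -> None:
    """Write one JSON object per DataFrame row (JSON Lines format)."""
    # orient="records" together with lines=True emits one JSON object per line.
    # force_ascii=False keeps non-ASCII characters readable in the output.
    df.to_json(output_jsonl_path, orient="records", lines=True, force_ascii=False)


# Hypothetical rows standing in for the BigQuery query result.
rows = pd.DataFrame(
    {
        "question": ["Explain this code", "What does this function do?"],
        "context": ["print('hi')", "def f(x):\n    return x + 1"],
    }
)
write_jsonl(rows, "test_code_explanation.jsonl")
```

Letting pandas serialize each row means quotes and newlines inside code snippets are escaped by the JSON encoder, so every record stays on a single line.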
How to set up and validate locally
- Ensure you have access to the GCP project dev-ai-research-0e2f8974.
- Check out this merge request's branch.
- Run the following command to build the code explanation dataset:
poetry run eli5 duo-chat collect build-dataset-code-explanation \
--output-jsonl-path="test_code_explanation.jsonl" \
--test-run \
--sample-size=20
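One way to sanity-check the file produced by the command above is to parse it line by line and count the records. This is a sketch using only the standard library. It assumes each non-empty line is a standalone JSON object and makes no assumption about the field names; the helper `load_jsonl` is hypothetical and not part of ELI5.

```python
import json
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Parse a JSON Lines file, failing loudly on the first malformed line."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
    return records


# Only inspect the file if the command above has already produced it.
output_path = Path("test_code_explanation.jsonl")
if output_path.exists():
    records = load_jsonl(str(output_path))
    print(f"{len(records)} records parsed from {output_path}")
```

A successful parse of every line confirms that the file is well-formed JSONL; comparing the record count against --sample-size is left to the reader, since the exact semantics of that flag are not described here.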
References
Parent issue: #710 (closed)
Merge request checklist
- I've run the affected pipeline(s) to validate that nothing is broken.
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Closes #710 (closed)
Edited by Tan Le