Move Duo Chat code explanation dataset pipeline to ELI5

Move Duo Chat code explanation from Prompt Library to ELI5

This MR migrates the code explanation dataset pipeline from Prompt Library to the ELI5 codebase. The most important changes are:

  • Created the necessary structure in eli5/duochat for the code explanation dataset pipeline
  • Implemented the solution reading from BigQuery
  • Modified the code to write output to local JSONL files instead of BigQuery tables
  • Improved robustness by using a pandas DataFrame to write the JSONL output file
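
As a rough illustration of the last two points, writing JSONL through a pandas DataFrame could look something like the sketch below. This is not the code from this MR: the function name `write_jsonl` and the record fields (`id`, `code`, `language`) are hypothetical. The only library behaviour it relies on is `DataFrame.to_json` with `orient="records"` and `lines=True`, which emits one JSON object per line and escapes embedded newlines and quotes.

```python
import json

import pandas as pd


def write_jsonl(records: list[dict], output_path: str) -> int:
    """Write records to a JSONL file via pandas and return the row count.

    Hypothetical helper for illustration; not the implementation in this MR.
    """
    df = pd.DataFrame.from_records(records)
    # lines=True requires orient="records"; each row becomes one JSON object
    # on its own line, with newlines inside string values escaped.
    df.to_json(output_path, orient="records", lines=True)
    return len(df)


def read_jsonl(path: str) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Letting pandas serialize each row avoids hand-built string concatenation, which tends to break on code snippets that contain quotes or line breaks.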

How to set up and validate locally

  1. Ensure you have access to the GCP project dev-ai-research-0e2f8974.
  2. Check out this merge request's branch.
  3. Run the following command to build the code explanation dataset:
poetry run eli5 duo-chat collect build-dataset-code-explanation \
  --output-jsonl-path="test_code_explanation.jsonl" \
  --test-run \
  --sample-size=20
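
To sanity-check the resulting file, a small script along these lines may help. It is a sketch under assumptions, not part of the MR: `check_jsonl` is a hypothetical helper, and it assumes that `--sample-size` caps the number of records written, which the description does not state explicitly.

```python
import json


def check_jsonl(path: str, max_records: int) -> int:
    """Validate that every non-blank line is a JSON object; return the count.

    Hypothetical helper. The max_records check assumes --sample-size is an
    upper bound on the number of output records.
    """
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)  # raises ValueError on malformed JSON
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            count += 1
    if count > max_records:
        raise ValueError(f"expected at most {max_records} records, found {count}")
    return count
```

For the command above, this would be called as `check_jsonl("test_code_explanation.jsonl", max_records=20)`.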

References

Parent issue: #710 (closed)

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.

Closes #710 (closed)

Edited by Tan Le
