
Implement DF pipeline to export dataset from BQ

Alexander Chueshev requested to merge export-bq into main

This MR implements the DF pipeline that exports the training, test, or validation dataset from BigQuery for the specified languages.

How to run:

Using the local direct runner:

export GOOGLE_APPLICATION_CREDENTIALS=<path to json key>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow

./venv/bin/python ./data/df/export-bq.py \
  --runner=DirectRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --language=c \
  --language=python \
  --split="test" \
  --output_path="data/export/" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session

Using the dataflow runner:

export GOOGLE_APPLICATION_CREDENTIALS=<path to json file>
export GCP_PROJECT=unreview-poc-390200e5
export GCP_REGION=us-central1
export GCP_BUCKET_TEMP=unreview-dataflow
export GCP_BUCKET_EXPORT=code-suggestions

./venv/bin/python ./data/df/export-bq.py \
  --runner=DataflowRunner \
  --project=$GCP_PROJECT \
  --region=$GCP_REGION \
  --input_bq_table="unreview-poc-390200e5.gl_code_suggestions.sample_preprocessed_dataset_v1" \
  --language=c \
  --language=python \
  --language=ruby \
  --language=rust \
  --split="test" \
  --output_path="gs://${GCP_BUCKET_EXPORT}/data/export/sample/20230314/" \
  --temp_location="gs://${GCP_BUCKET_TEMP}/tmp/" \
  --save_main_session
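
Both commands pass `--language` several times and mix export-specific flags with runner flags. A minimal sketch of how such arguments could be parsed, assuming plain `argparse`; the function name `parse_export_args`, the `languages` attribute and the `validation` split value are assumptions for illustration, not taken from `export-bq.py`:

```python
import argparse


def parse_export_args(argv):
    """Parse the export flags and hand back everything else unparsed.

    Hypothetical helper: the real export-bq.py may define its flags differently.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_bq_table", required=True)
    # action="append" collects every repeated --language into one list.
    parser.add_argument("--language", action="append", dest="languages", required=True)
    # "test" appears in the commands above; "train" and "validation" are assumed names.
    parser.add_argument("--split", choices=["train", "test", "validation"], required=True)
    parser.add_argument("--output_path", required=True)
    # Flags not declared here (--runner, --project, --region, ...) are returned
    # as the second element so they can be forwarded to the pipeline options.
    return parser.parse_known_args(argv)
```

Calling it with the flags from the direct runner example would yield `languages == ["c", "python"]`, with the runner flags left in the second return value.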

Example of the DF pipeline used to export the following sample to GCS.

How to use HF datasets with the data locally

The snippet below shows how to load the exported dataset with HF datasets. This example assumes the following directory structure:

data/export
   -- train
      -- c
      -- ruby
      -- rust
   -- test
      -- c
      -- ruby
      -- rust

from datasets import load_dataset

if __name__ == "__main__":
    dataset = load_dataset("data/export/")
    print(dataset["train"][0]["content"])
    print(dataset["test"][0]["content"])
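
Before loading, it can help to confirm that the export directory matches the assumed layout of one directory per split, each holding one directory per language. A minimal sketch using only the standard library; `list_export_layout` is a hypothetical helper and not part of this MR:

```python
from pathlib import Path


def list_export_layout(root):
    """Map each split directory under root to the sorted language directories inside it."""
    layout = {}
    for split_dir in sorted(Path(root).iterdir()):
        if split_dir.is_dir():
            # Only subdirectories count as languages; stray files are ignored.
            layout[split_dir.name] = sorted(
                p.name for p in split_dir.iterdir() if p.is_dir()
            )
    return layout
```

For the structure shown above, `list_export_layout("data/export/")` should report the `train` and `test` splits, each with `c`, `ruby` and `rust`.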

Ref: ai-assist#22 (closed)

Edited by Alexander Chueshev
