
Resolve "Create a Method to Load Stranded Data from GCS or Local After Pipeline Failure Due to Schema Irregularity"

What does this merge request do and why?

The code-suggestions eval pipeline takes a while to run and occasionally fails on the last step when there is a schema mismatch. Fortunately, Apache Beam stores the data in a temporary GCS bucket, so it can still be loaded from there once the proper schema is applied.

This solution must be generic enough to work with all pipelines!!! Simple ETL!
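
As a rough illustration of what "generic" could mean here, the sketch below conforms newline-delimited JSON records to a caller-supplied schema before they are reloaded. It is not the code in this merge request. The names `conform_row` and `conform_jsonl`, the schema representation (field name mapped to a casting callable), and the example fields are all hypothetical, and it assumes the stranded file holds one JSON object per line.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical schema representation: field name -> callable that casts a raw value.
Schema = Dict[str, Callable[[Any], Any]]


def conform_row(row: Dict[str, Any], schema: Schema) -> Dict[str, Any]:
    """Return a new row containing exactly the fields named in `schema`."""
    conformed = {}
    for field, cast in schema.items():
        value = row.get(field)
        # Missing or null values stay None; everything else is cast.
        conformed[field] = None if value is None else cast(value)
    # Fields not listed in the schema are dropped.
    return conformed


def conform_jsonl(lines: Iterable[str], schema: Schema) -> Iterator[str]:
    """Yield schema-conformant JSON lines, skipping blank input lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(conform_row(json.loads(line), schema))


# Example with made-up field names: "extra" is dropped, "score" is filled
# with null, and the string "7" is cast to an integer.
example_schema = {"id": int, "completion": str, "score": float}
example_rows = ['{"id": "7", "completion": "x = 1", "extra": true}']
example_output = list(conform_jsonl(example_rows, example_schema))
```

Because the schema is passed in rather than hard-coded, the same two functions would apply to any pipeline's output; only the schema mapping changes.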

How to set up and validate locally

Request some test data from @srayner or run the command below to fetch the data:

gsutil cp gs://prompt-library/tmp/bq_load/379cf2a5ad1e4ea2b15f78a033d3c41b/dev-ai-research-0e2f8974.code_suggestion_external_results.output_full_v5-anthropic/c90bfae0-15ee-4e13-8847-5d78736a93c4 ./anthropic-run.jsonl

This is in line with the help message that you get when you run:

(Screenshot: Screenshot_2024-03-10_at_14.36.59)
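
When validating locally, it can help to see which fields in the fetched file carry inconsistent types, since those are likely candidates for the schema mismatch. The following is a minimal sketch, assuming the downloaded file is newline-delimited JSON with one object per line. `summarize_fields` is a hypothetical helper and not part of this merge request.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Set


def summarize_fields(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each top-level field to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        for field, value in json.loads(line).items():
            seen[field].add(type(value).__name__)
    return dict(seen)
```

To use it on the file fetched above, open `./anthropic-run.jsonl` and pass the file handle to `summarize_fields`. Any field that maps to more than one type name (other than `NoneType`) is worth comparing against the target table's schema.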

Closes #188 (closed)

Edited by Stephan Rayner
