Support local output of Chat evaluation results
What does this merge request do and why?
Support local output destinations.
"output_sinks": [
{
"type": "bigquery",
"path": "dev-ai-research-0e2f8974.duo_chat_external_results.your_result_table_name_here",
"prefix": "tl_chat_eval_code_generation"
},
{
"type": "local",
"path": "data/output",
"prefix": "experiment"
}
]
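To illustrate how a pipeline could fan results out to each configured sink, here is a minimal sketch that dispatches on the `type` field. The function and its behavior are hypothetical for illustration only, not the actual promptlib implementation:

```python
import json

def describe_sink(sink: dict) -> str:
    """Hypothetical helper: dispatch on the sink `type` field.

    Not the actual promptlib code; it only shows how a config entry
    maps to an output destination.
    """
    if sink["type"] == "bigquery":
        # BigQuery sinks address a dataset/table; the prefix namespaces result tables.
        return f"BigQuery: {sink['path']} (prefix: {sink['prefix']})"
    if sink["type"] == "local":
        # Local sinks write CSV shards under the given directory.
        return f"Local CSV shards under {sink['path']}/{sink['prefix']}_*.csv"
    raise ValueError(f"Unknown sink type: {sink['type']}")

sinks = json.loads("""
[
  {"type": "bigquery",
   "path": "dev-ai-research-0e2f8974.duo_chat_external_results.your_result_table_name_here",
   "prefix": "tl_chat_eval_code_generation"},
  {"type": "local", "path": "data/output", "prefix": "experiment"}
]
""")
for sink in sinks:
    print(describe_sink(sink))
```

Unknown `type` values fail fast with a `ValueError`, which is generally preferable to silently dropping a sink from the config.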
Resolves #162 (closed)
How to set up and validate locally
- Ensure GCP environment variables are set up.
- Check out this merge request's branch.
- Create a run config with the following content (use the MBPP config so we do not need to set up Duo Chat context):
{ "beam_config": { "pipeline_options": { "runner": "DirectRunner", "project": "dev-ai-research-0e2f8974", "region": "us-central1", "temp_location": "gs://prompt-library/tmp/", "save_main_session": false } }, "input_bq_table": "dev-ai-research-0e2f8974.code_generation.mbpp_sanitized_validation", "output_bq_table": "dev-ai-research-0e2f8974.duo_chat_experiments.tl_output_sink", "output_sinks": [ { "type": "bigquery", "path": "dev-ai-research-0e2f8974.duo_chat_experiments", "prefix": "tl_output_sink" }, { "type": "local", "path": "data/output", "prefix": "experiment" } ], "throttle_sec": 1, "batch_size": 10, "input_adapter": "mbpp", "eval_setup": { "answering_models": [ { "name": "claude-2", "prompt_template_config": { "templates": [ { "name": "empty", "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example" } ] } }, { "name": "text-bison@latest", "prompt_template_config": { "templates": [ { "name": "claude-2", "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example" } ] } } ], "metrics": [ { "metric": "independent_llm_judge", "evaluating_models": [ { "name": "claude-2", "prompt_template_config": { "templates": [ { "name": "claude-2", "template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example" } ] } } ] }, { "metric": "similarity_score" } ] } }
- Run the following command to kick off the pipeline:

```shell
poetry run promptlib duo-chat eval --test-run --sample-size=1 --config-file=data/config/duochat_eval_mbpp_config.json
```

Example console output (progress bars and log lines interleave):

```
Requesting answers from claude-2: 1it [00:20, 20.20s/it]
INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_output_sink__independent_llm_judge
Requesting answers from text-bison@latest: 1it [00:04, 4.51s/it]
INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_output_sink__similarity_score
INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240228_162129__independent_llm_judge-00000-of-00001.csv
INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240228_162129__similarity_score-00000-of-00001.csv
Requesting answers from text-bison@latest: 1it [01:17, 77.11s/it]
Requesting answers from claude-2: 1it [01:17, 77.12s/it]
Getting evaluation from claude-2: 2it [00:56, 28.40s/it]
```
- Check the local files under `data/output`.
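For the last step, a quick sanity pass can count rows in the CSV shards the local sink produced. This is an illustrative snippet, not part of the MR; the glob pattern follows the shard names shown in the log above:

```python
import csv
import glob

def count_data_rows(path: str) -> int:
    """Count the rows beneath the header in one CSV shard."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

# Local-sink shards are named like:
#   data/output/<prefix>_<timestamp>__<metric>-00000-of-00001.csv
for path in sorted(glob.glob("data/output/experiment_*__*.csv")):
    print(f"{path}: {count_data_rows(path)} data rows")
```

With `--sample-size=1`, each metric's shard should contain one data row per answering model.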
Merge request checklist
- I've run the affected pipeline(s) to validate that nothing is broken.
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Tan Le