# Add support to evaluate with Claude 3 Haiku
## What does this merge request do and why?
Add support to evaluate Duo Chat with [Claude 3 Haiku](https://www.anthropic.com/news/claude-3-haiku).
## How to set up and validate locally
- Ensure GCP environment variables are set up.
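A quick way to verify that step before launching anything is a small check like the one below. This is a hypothetical helper, not part of `promptlib`; `GOOGLE_APPLICATION_CREDENTIALS` is the standard Application Default Credentials variable, but your setup may require different ones.

```python
import os

def missing_gcp_env(required=("GOOGLE_APPLICATION_CREDENTIALS",)):
    """Return the names of required GCP variables that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

# Fail fast with a clear message if anything is missing, e.g.:
# if missing_gcp_env():
#     raise SystemExit(f"Missing GCP env vars: {missing_gcp_env()}")
```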
- Check out this merge request's branch.
- Update the eval config with the following content:

  ```json
  {
    "beam_config": {
      "pipeline_options": {
        "runner": "DirectRunner",
        "project": "dev-ai-research-0e2f8974",
        "region": "us-central1",
        "temp_location": "gs://prompt-library/tmp/",
        "save_main_session": false
      }
    },
    "input_bq_table": "dev-ai-research-0e2f8974.duo_chat_external.experiment_code_generation__input_v1",
    "output_sinks": [
      {
        "type": "bigquery",
        "path": "dev-ai-research-0e2f8974.duo_chat_experiments",
        "prefix": "tl_claude_3_haiku"
      },
      {
        "type": "local",
        "path": "data/output",
        "prefix": "experiment"
      }
    ],
    "throttle_sec": 0.1,
    "batch_size": 10,
    "input_adapter": "mbpp",
    "eval_setup": {
      "answering_models": [
        {
          "name": "duo-chat",
          "parameters": {
            "base_url": "http://gdk.test:8080"
          },
          "prompt_template_config": {
            "templates": [
              {
                "name": "empty",
                "template_path": "data/prompts/duo_chat/answering/empty.txt.example"
              }
            ]
          }
        },
        {
          "name": "claude-3-haiku",
          "prompt_template_config": {
            "templates": [
              {
                "name": "empty",
                "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example"
              }
            ]
          }
        }
      ],
      "metrics": [
        { "metric": "similarity_score" },
        {
          "metric": "independent_llm_judge",
          "evaluating_models": [
            {
              "name": "text-bison@latest",
              "prompt_template_config": {
                "templates": [
                  {
                    "name": "claude-2",
                    "template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example"
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  }
  ```
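Before kicking off a run, it can be worth confirming the config parses and actually lists the new model. The sketch below is a hypothetical sanity-check script, not part of `promptlib`; the key names mirror the config above.

```python
import json

REQUIRED_KEYS = ("beam_config", "input_bq_table", "output_sinks", "eval_setup")

def check_eval_config(path):
    """Parse the eval config and return the configured answering model names."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return [model["name"] for model in config["eval_setup"]["answering_models"]]

# Expect 'claude-3-haiku' to appear alongside 'duo-chat':
# print(check_eval_config("data/config/duochat_eval_config.json"))
```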
- Run the following command to kick off the pipeline:

  ```shell
  ❯ poetry run promptlib duo-chat eval --test-run --sample-size 1 --config-file=data/config/duochat_eval_config.json
  Requesting answers from claude-3-haiku: 1it [00:37, 37.60s/it]
  Requesting answers from duo-chat: 1it [00:37, 37.61s/it]
  Getting evaluation from text-bison@latest: 2it [00:08, 4.17s/it]
  Calculating similarity scores: 2it [00:04, 2.14s/it]
  INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_claude_3_haiku_20240314_115243__independent_llm_judge
  INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_claude_3_haiku_20240314_115242__similarity_score
  INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240314_115243__independent_llm_judge-00000-of-00001.csv
  INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240314_115242__similarity_score-00000-of-00001.csv
  ```
- Inspect the results via BQ or the local CSV files.
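To skim the local CSV output without opening BigQuery, something like the following works. It is a sketch using only the standard library; the glob pattern matches the `experiment` prefix and timestamped suffix shown in the pipeline log above, which may differ in your run.

```python
import csv
import glob

def load_latest_results(pattern="data/output/experiment_*__similarity_score-*.csv"):
    """Read rows from the most recently written similarity-score CSV."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        return []
    with open(paths[-1], newline="") as f:
        return list(csv.DictReader(f))
```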
## Merge request checklist
- I've run the affected pipeline(s) to validate that nothing is broken.
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Tan Le