Add support to evaluate with Claude 3 Haiku

Tan Le requested to merge add-claude-3-haiku into main

What does this merge request do and why?

Add support to evaluate Duo Chat with Claude 3 Haiku.

https://www.anthropic.com/news/claude-3-haiku

How to set up and validate locally

  1. Ensure the GCP environment variables are set up.

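     For a quick sanity check that the credentials are in place, a minimal sketch follows. The variable names are assumptions (the common GCP defaults), not promptlib requirements; adjust them to whatever your environment actually uses.

    # Check that the GCP-related environment variables are exported before
    # running the pipeline. GOOGLE_APPLICATION_CREDENTIALS and
    # GOOGLE_CLOUD_PROJECT are assumed names; substitute your own.
    import os

    for var in ("GOOGLE_APPLICATION_CREDENTIALS", "GOOGLE_CLOUD_PROJECT"):
        if not os.environ.get(var):
            raise SystemExit(f"{var} is not set")
        print(f"{var}={os.environ[var]}")
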
  2. Check out this merge request's branch.

  3. Update the eval config with the following content. A quick validation sketch follows the config.

    {
      "beam_config": {
        "pipeline_options": {
          "runner": "DirectRunner",
          "project": "dev-ai-research-0e2f8974",
          "region": "us-central1",
          "temp_location": "gs://prompt-library/tmp/",
          "save_main_session": false
        }
      },
      "input_bq_table": "dev-ai-research-0e2f8974.duo_chat_external.experiment_code_generation__input_v1",
      "output_sinks": [
        {
          "type": "bigquery",
          "path": "dev-ai-research-0e2f8974.duo_chat_experiments",
          "prefix": "tl_claude_3_haiku"
        },
        {
          "type": "local",
          "path": "data/output",
          "prefix": "experiment"
        }
      ],
      "throttle_sec": 0.1,
      "batch_size": 10,
      "input_adapter": "mbpp",
      "eval_setup": {
        "answering_models": [
          {
            "name": "duo-chat",
            "parameters": {
              "base_url": "http://gdk.test:8080"
            },
            "prompt_template_config": {
              "templates": [
                {
                  "name": "empty",
                  "template_path": "data/prompts/duo_chat/answering/empty.txt.example"
                }
              ]
            }
          },
          {
            "name": "claude-3-haiku",
            "prompt_template_config": {
              "templates": [
                {
                  "name": "empty",
                  "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example"
                }
              ]
            }
          }
        ],
        "metrics": [
          {
            "metric": "similarity_score"
          },
          {
            "metric": "independent_llm_judge",
            "evaluating_models": [
              {
                "name": "text-bison@latest",
                "prompt_template_config": {
                  "templates": [
                    {
                      "name": "claude-2",
                      "template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example"
                    }
                  ]
                }
              }
            ]
          }
        ]
      }
    }
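
     After saving, it's worth confirming the file parses and actually targets the new model (the sketch referenced in step 3). This uses only the standard library; none of it is promptlib API.

    # Load the edited config and sanity-check the models and output sinks.
    import json

    with open("data/config/duochat_eval_config.json") as f:
        config = json.load(f)

    models = [m["name"] for m in config["eval_setup"]["answering_models"]]
    assert "claude-3-haiku" in models, "claude-3-haiku missing from answering_models"
    print("answering models:", models)
    print("output sinks:", [sink["type"] for sink in config["output_sinks"]])
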
  4. Run the following command to kick off the pipeline.

    ❯ poetry run promptlib duo-chat eval --test-run --sample-size 1 --config-file=data/config/duochat_eval_config.json
    Requesting answers from claude-3-haiku: 1it [00:37, 37.60s/it]
    Requesting answers from duo-chat: 1it [00:37, 37.61s/it]
    Getting evaluation from text-bison@latest: 2it [00:08,  4.17s/it]
    Calculating similarity scores: 2it [00:04,  2.14s/it]
    INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_claude_3_haiku_20240314_115243__independent_llm_judge
    INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_claude_3_haiku_20240314_115242__similarity_score
    INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240314_115243__independent_llm_judge-00000-of-00001.csv
    INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240314_115242__similarity_score-00000-of-00001.csv
  5. Inspect the results in BigQuery or the local CSV files.
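
     For the local CSVs, something like the following is enough to eyeball the results. It assumes pandas is available; the file name carries the run timestamp, so yours will differ.

    # Peek at the LLM-judge output written by the run above.
    import pandas as pd

    df = pd.read_csv(
        "data/output/experiment_20240314_115243__independent_llm_judge-00000-of-00001.csv"
    )
    print(df.shape)
    print(df.head())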

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.

Relates to https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/issues/180

