Support local output of Chat evaluation results
What does this merge request do and why?
Support local output destinations.
"output_sinks": [
{
"type": "bigquery",
"path": "dev-ai-research-0e2f8974.duo_chat_external_results.your_result_table_name_here",
"prefix": "tl_chat_eval_code_generation"
},
{
"type": "local",
"path": "data/output",
"prefix": "experiment"
}
]
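To illustrate how a pipeline could fan results out to each configured sink, here is a minimal sketch that dispatches on the `type` field. The function and its behavior are hypothetical for illustration only, not the actual promptlib implementation:

```python
import json

def describe_sink(sink: dict) -> str:
    """Hypothetical helper: dispatch on the sink `type` field.

    Not the actual promptlib code; it only shows how a config entry
    maps to an output destination.
    """
    if sink["type"] == "bigquery":
        # BigQuery sinks address a dataset/table; the prefix namespaces result tables.
        return f"BigQuery: {sink['path']} (prefix: {sink['prefix']})"
    if sink["type"] == "local":
        # Local sinks write CSV shards under the given directory.
        return f"Local CSV shards under {sink['path']}/{sink['prefix']}_*.csv"
    raise ValueError(f"Unknown sink type: {sink['type']}")

sinks = json.loads("""
[
  {"type": "bigquery",
   "path": "dev-ai-research-0e2f8974.duo_chat_external_results.your_result_table_name_here",
   "prefix": "tl_chat_eval_code_generation"},
  {"type": "local", "path": "data/output", "prefix": "experiment"}
]
""")
for sink in sinks:
    print(describe_sink(sink))
```

Unknown `type` values fail fast with a `ValueError`, which is generally preferable to silently dropping a sink from the config.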
Resolves #162 (closed)
How to set up and validate locally
- Ensure GCP environment variables are set up.
- Check out this merge request's branch.
- Create a run config with the following content (use the MBPP config so we do not need to set up Duo Chat context):
{ "beam_config": { "pipeline_options": { "runner": "DirectRunner", "project": "dev-ai-research-0e2f8974", "region": "us-central1", "temp_location": "gs://prompt-library/tmp/", "save_main_session": false } }, "input_bq_table": "dev-ai-research-0e2f8974.code_generation.mbpp_sanitized_validation", "output_bq_table": "dev-ai-research-0e2f8974.duo_chat_experiments.tl_output_sink", "output_sinks": [ { "type": "bigquery", "path": "dev-ai-research-0e2f8974.duo_chat_experiments", "prefix": "tl_output_sink" }, { "type": "local", "path": "data/output", "prefix": "experiment" } ], "throttle_sec": 1, "batch_size": 10, "input_adapter": "mbpp", "eval_setup": { "answering_models": [ { "name": "claude-2", "prompt_template_config": { "templates": [ { "name": "empty", "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example" } ] } }, { "name": "text-bison@latest", "prompt_template_config": { "templates": [ { "name": "claude-2", "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example" } ] } } ], "metrics": [ { "metric": "independent_llm_judge", "evaluating_models": [ { "name": "claude-2", "prompt_template_config": { "templates": [ { "name": "claude-2", "template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example" } ] } } ] }, { "metric": "similarity_score" } ] } }
- Run the following command to kick off the pipeline:

```shell
poetry run promptlib duo-chat eval --test-run --sample-size=1 --config-file=data/config/duochat_eval_mbpp_config.json
```

Example console output (progress bars and log lines interleave):

```
Requesting answers from claude-2: 1it [00:20, 20.20s/it]
INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_output_sink__independent_llm_judge
Requesting answers from text-bison@latest: 1it [00:04, 4.51s/it]
INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_output_sink__similarity_score
INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240228_162129__independent_llm_judge-00000-of-00001.csv
INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240228_162129__similarity_score-00000-of-00001.csv
Requesting answers from text-bison@latest: 1it [01:17, 77.11s/it]
Requesting answers from claude-2: 1it [01:17, 77.12s/it]
Getting evaluation from claude-2: 2it [00:56, 28.40s/it]
```
- Check the local files under `data/output`.
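For the last step, a quick sanity pass can count rows in the CSV shards the local sink produced. This is an illustrative snippet, not part of the MR; the glob pattern follows the shard names shown in the log above:

```python
import csv
import glob

def count_data_rows(path: str) -> int:
    """Count the rows beneath the header in one CSV shard."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1

# Local-sink shards are named like:
#   data/output/<prefix>_<timestamp>__<metric>-00000-of-00001.csv
for path in sorted(glob.glob("data/output/experiment_*__*.csv")):
    print(f"{path}: {count_data_rows(path)} data rows")
```

With `--sample-size=1`, each metric's shard should contain one data row per answering model.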
Merge request checklist
- I've run the affected pipeline(s) to validate that nothing is broken.
- Tests added for new functionality. If not, please raise an issue to follow up.
- Documentation added/updated, if needed.
Edited by Tan Le