
Support local output of Chat evaluation results

Tan Le requested to merge support-local-input-output into main

What does this merge request do and why?

Support local output destinations. Evaluation results can now be written to local CSV files alongside BigQuery, configured via the output_sinks list:

"output_sinks": [
  {
    "type": "bigquery",
    "path": "dev-ai-research-0e2f8974.duo_chat_external_results.your_result_table_name_here",
    "prefix": "tl_chat_eval_code_generation"
  },
  {
    "type": "local",
    "path": "data/output",
    "prefix": "experiment"
  }
]
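For context, a local sink of this shape could be handled by a writer along the following lines. This is a hypothetical sketch, not promptlib's actual implementation; the file naming mirrors the experiment_<timestamp>__<metric> pattern visible in the validation logs further down.

```python
# Hypothetical sketch of a local CSV sink; not promptlib's actual code.
# File naming mirrors the <prefix>_<timestamp>__<metric>.csv pattern
# seen in the pipeline log output.
import csv
import os
from datetime import datetime


def write_local_csv(rows, path, prefix, metric):
    """Write result rows to a timestamped CSV file under `path`."""
    os.makedirs(path, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_path = os.path.join(path, f"{prefix}_{timestamp}__{metric}.csv")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```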

Resolves #162 (closed)

How to set up and validate locally

  1. Ensure GCP environment variables are set up.

  2. Check out this merge request's branch.

  3. Create a run config with the following content (use the MBPP config so we do not need to set up Duo Chat context):

    {
      "beam_config": {
        "pipeline_options": {
          "runner": "DirectRunner",
          "project": "dev-ai-research-0e2f8974",
          "region": "us-central1",
          "temp_location": "gs://prompt-library/tmp/",
          "save_main_session": false
        }
      },
      "input_bq_table": "dev-ai-research-0e2f8974.code_generation.mbpp_sanitized_validation",
      "output_bq_table": "dev-ai-research-0e2f8974.duo_chat_experiments.tl_output_sink",
      "output_sinks": [
        {
          "type": "bigquery",
          "path": "dev-ai-research-0e2f8974.duo_chat_experiments",
          "prefix": "tl_output_sink"
        },
        {
          "type": "local",
          "path": "data/output",
          "prefix": "experiment"
        }
      ],
      "throttle_sec": 1,
      "batch_size": 10,
      "input_adapter": "mbpp",
      "eval_setup": {
        "answering_models": [
          {
            "name": "claude-2",
            "prompt_template_config": {
              "templates": [
                {
                  "name": "empty",
                  "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example"
                }
              ]
            }
          },
          {
            "name": "text-bison@latest",
            "prompt_template_config": {
              "templates": [
                {
                  "name": "claude-2",
                  "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example"
                }
              ]
            }
          }
        ],
        "metrics": [
          {
            "metric": "independent_llm_judge",
            "evaluating_models": [
              {
                "name": "claude-2",
                "prompt_template_config": {
                  "templates": [
                    {
                      "name": "claude-2",
                      "template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example"
                    }
                  ]
                }
              }
            ]
          },
          {
            "metric": "similarity_score"
          }
        ]
      }
    }
  4. Run the following command to kick off the pipeline.

    poetry run promptlib duo-chat eval --test-run --sample-size=1 --config-file=data/config/duochat_eval_mbpp_config.json
    Requesting answers from claude-2: 1it [00:20, 20.20s/it]
    Requesting answers from text-bison@latest: 1it [00:04,  4.51s/it]
    INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_output_sink__independent_llm_judge
    INFO:promptlib.common.beam.io:Output written to BigQuery: dev-ai-research-0e2f8974:duo_chat_experiments.tl_output_sink__similarity_score
    INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240228_162129__independent_llm_judge-00000-of-00001.csv
    INFO:promptlib.common.beam.io:Output written to CSV: data/output/experiment_20240228_162129__similarity_score-00000-of-00001.csv
    Requesting answers from text-bison@latest: 1it [01:17, 77.11s/it]
    Requesting answers from claude-2: 1it [01:17, 77.12s/it]
    Getting evaluation from claude-2: 2it [00:56, 28.40s/it]
  5. Check the local files.
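To make step 5 concrete, a small helper like the one below could confirm that the local sink wrote one CSV per metric. This is illustrative only and not part of promptlib; the glob pattern assumes the naming convention shown in the log output above.

```python
# Illustrative helper for step 5; not part of promptlib.
import glob
import os


def find_result_files(output_dir="data/output", prefix="experiment"):
    """Return local result CSVs matching the sink's naming convention,
    e.g. experiment_20240228_162129__similarity_score-00000-of-00001.csv."""
    pattern = os.path.join(output_dir, f"{prefix}_*__*.csv")
    return sorted(glob.glob(pattern))
```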

Merge request checklist

  • I've run the affected pipeline(s) to validate that nothing is broken.
  • Tests added for new functionality. If not, please raise an issue to follow up.
  • Documentation added/updated, if needed.
Edited by Tan Le
