Tool example improvements (!146634) · Merge requests · GitLab.org / GitLab

Tim Zallmann requested to merge tz-tool-prompt-improvements into master Mar 05, 2024

What does this MR do and why?

Used the tool improvements MR, !146589 (merged), as a base.

- Changed the order of the tools.
- Changed how examples are written, so that they use the same step format as the actual execution.
- Added more precise definitions of "current", "this", and "that" to the issue reader.

This merge request updates the codebase of a large language model (LLM) tool. The changes introduce a new tool called "CiEditorAssistant" and enhance the existing tools, such as "IssueReader", "GitlabDocumentation", and "EpicReader". These tools help users interact with the LLM more effectively by providing more accurate and informative responses to their queries. The updates also improve the overall user experience by making the tool descriptions more comprehensive and providing better examples of how to use each tool.

Prompt Library configuration

Input dataset: duo_chat_external.sm_chat_dataset_2_v1_copy_v3
- This dataset contains only the rows from dev-ai-research-0e2f8974.duo_chat.chat_dataset_2_v1 that produced the problematic response "I am sorry, I am unable to find what you are looking for". See this comment for the extraction process.
Output dataset: duo_chat_external_results.sm_chat_dataset_2_v1_copy_v3_mr_146634_latest.

full configuration

{
  "beam_config": {
    "pipeline_options": {
      "runner": "DirectRunner",
      "project": "dev-ai-research-0e2f8974",
      "region": "us-central1",
      "temp_location": "gs://prompt-library/tmp/",
      "save_main_session": false
    }
  },
  "input_bq_table": "dev-ai-research-0e2f8974.duo_chat.sm_chat_dataset_2_v1_copy_v3",
  "output_sinks": [
    {
      "type": "bigquery",
      "path": "dev-ai-research-0e2f8974.duo_chat_external_results",
      "prefix": "sm_chat_dataset_2_v1_copy_v3_mr_146634_latest"
    }
  ],
  "throttle_sec": 0.1,
  "batch_size": 10,
  "eval_setup": {
    "answering_models": [
      {
        "name": "claude-2",
        "prompt_template_config": {
          "templates": [
            {
              "name": "empty",
              "template_path": "data/prompts/duo_chat/answering/claude-2.txt.example"
            }
          ]
        }
      },
      {
        "name": "duo-chat",
        "parameters": {
          "base_url": "http://gdk.test:3000"
        },
        "prompt_template_config": {
          "templates": [
            {
              "name": "empty",
              "template_path": "data/prompts/duo_chat/answering/empty.txt.example"
            }
          ]
        }
      }
    ],
    "metrics": [
      {
        "metric": "similarity_score"
      },
      {
        "metric": "independent_llm_judge",
        "evaluating_models": [
          {
            "name": "claude-2",
            "prompt_template_config": {
              "templates": [
                {
                  "name": "claude-2",
                  "template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example"
                }
              ]
            }
          }
        ]
      }
    ]
  }
}
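
The result tables queried in the sections below appear to be named after the sink `path` and `prefix` from this configuration, followed by a run timestamp and the metric name. The following is a minimal sketch of that naming, inferred only from the table names used in this description. The helper `result_tables` and its `run_suffix` parameter are hypothetical and are not part of the Prompt Library.

```python
import json

# Trimmed copy of the configuration above: only the keys this sketch reads.
config = json.loads("""
{
  "output_sinks": [
    {
      "type": "bigquery",
      "path": "dev-ai-research-0e2f8974.duo_chat_external_results",
      "prefix": "sm_chat_dataset_2_v1_copy_v3_mr_146634_latest"
    }
  ],
  "eval_setup": {
    "metrics": [
      {"metric": "similarity_score"},
      {"metric": "independent_llm_judge"}
    ]
  }
}
""")


def result_tables(config: dict, run_suffix: str) -> list[str]:
    """Hypothetical helper: build `<path>.<prefix>_<run_suffix>__<metric>` names.

    The pattern is an assumption inferred from the table names in the queries.
    """
    tables = []
    for sink in config["output_sinks"]:
        if sink["type"] != "bigquery":
            continue  # this sketch only covers BigQuery sinks
        for metric in config["eval_setup"]["metrics"]:
            tables.append(
                f'{sink["path"]}.{sink["prefix"]}_{run_suffix}__{metric["metric"]}'
            )
    return tables


# The run suffix is taken from the table names used in the queries below.
tables = result_tables(config, "20240307_161253")
for name in tables:
    print(name)
```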

Evaluation results - Independent LLM Judge - Correctness

Before: Latest result from daily production evaluation (master)
After: This MR (tz-tool-prompt-improvements - SHA: 53c29e508e379abe57acf9ead2bd24f1d77e3bbb)

grade	before_percentage	after_percentage
4	35.6	50.0
3	13.9	28.1
2	2.3	9.4
1	28.7	9.4

query

WITH grades as (
  SELECT 4 as grade union all
  SELECT 3 as grade union all
  SELECT 2 as grade union all
  SELECT 1 as grade
), before_base_table AS (
  SELECT *
  FROM `dev-ai-research-0e2f8974.duo_chat_daily_runs.chat_dataset_2_v1__independent_llm_judge`
  WHERE answering_model = 'duo-chat'
    AND EXTRACT(DATE FROM created_at) = EXTRACT(DATE FROM CURRENT_TIMESTAMP())
), after_base_table AS (
  SELECT *
  FROM `dev-ai-research-0e2f8974.duo_chat_external_results.sm_chat_dataset_2_v1_copy_v3_mr_146634_latest_20240307_161253__independent_llm_judge`
  WHERE answering_model = 'duo-chat'
), before_correctness_grade AS (
  SELECT correctness as grade, COUNT(*) as count
  FROM before_base_table
  GROUP BY correctness
), after_correctness_grade AS (
  SELECT correctness as grade, COUNT(*) as count
  FROM after_base_table
  GROUP BY correctness
)

SELECT grades.grade AS grade,
       ROUND((COALESCE(before_correctness_grade.count, 0) / (SELECT COUNT(*) FROM before_base_table)) * 100.0, 1) AS before_percentage,
       ROUND((COALESCE(after_correctness_grade.count, 0) / (SELECT COUNT(*) FROM after_base_table)) * 100.0, 1) AS after_percentage,
FROM grades
LEFT OUTER JOIN before_correctness_grade ON before_correctness_grade.grade = grades.grade
LEFT OUTER JOIN after_correctness_grade ON after_correctness_grade.grade = grades.grade;
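
For readers who want to check the arithmetic without BigQuery, here is a rough Python sketch of the same aggregation over an in-memory list of correctness grades. The function name and the sample data are illustrative assumptions, not real evaluation rows. As in the query, the denominator counts every row, so rows whose grade is missing or outside 1 to 4 still count toward the total. That is one possible reason a column can sum to less than 100.

```python
from collections import Counter

GRADES = (4, 3, 2, 1)  # same grade list as the `grades` CTE


def grade_percentages(correctness):
    """Percentage of rows per grade, rounded to one decimal place.

    The denominator is the total row count, mirroring
    `SELECT COUNT(*) FROM ..._base_table` in the query.
    """
    total = len(correctness)
    if total == 0:
        raise ValueError("no rows to aggregate")
    counts = Counter(correctness)
    # Grades with no rows fall back to 0, like COALESCE(count, 0).
    return {g: round(counts.get(g, 0) / total * 100.0, 1) for g in GRADES}


# Illustrative data only: one row has no grade (None).
sample = [4, 4, 3, 1, None, 2, 4, 1]
percentages = grade_percentages(sample)
print(percentages)
```

Python's `round` and BigQuery's `ROUND` can differ on exact ties, so treat this as a cross-check rather than a bit-for-bit reimplementation.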

Evaluation results - Similarity score

similarity_score_range	before_percentage	after_percentage
1.0	2.8	3.1
0.9	33.8	59.4
0.8	20.8	18.8
0.7	9.7	12.5
0.6	8.8	0.0
0.5	24.1	6.3
0.4	0.0	0.0
0.3	0.0	0.0
0.2	0.0	0.0
0.1	0.0	0.0

query

WITH buckets as (
  SELECT 1.0 as bucket union all
  SELECT 0.9 as bucket union all
  SELECT 0.8 as bucket union all
  SELECT 0.7 as bucket union all
  SELECT 0.6 as bucket union all
  SELECT 0.5 as bucket union all
  SELECT 0.4 as bucket union all
  SELECT 0.3 as bucket union all
  SELECT 0.2 as bucket union all
  SELECT 0.1 as bucket
), before_similarity_score AS (
  SELECT *
  FROM `dev-ai-research-0e2f8974.duo_chat_daily_runs.chat_dataset_2_v1__similarity_score`
  WHERE answering_model = 'duo-chat'
    AND comparison_model = 'claude-2'
    AND EXTRACT(DATE FROM created_at) = EXTRACT(DATE FROM CURRENT_TIMESTAMP())
), after_similarity_score AS (
  SELECT *
  FROM `dev-ai-research-0e2f8974.duo_chat_external_results.sm_chat_dataset_2_v1_copy_v3_mr_146634_latest_20240307_161253__similarity_score`
  WHERE answering_model = 'duo-chat'
)

SELECT buckets.bucket AS similarity_score_range,
    (
        SELECT ROUND((COUNT(*) / (SELECT COUNT(*) FROM before_similarity_score)) * 100.0, 1)
        FROM before_similarity_score
        WHERE buckets.bucket = ROUND(before_similarity_score.comparison_similarity, 1)
    ) AS before_percentage,
    (
        SELECT ROUND((COUNT(*) / (SELECT COUNT(*) FROM after_similarity_score)) * 100.0, 1)
        FROM after_similarity_score
        WHERE buckets.bucket = ROUND(after_similarity_score.comparison_similarity, 1)
    ) AS after_percentage,
FROM buckets
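
The bucketing above assigns each row to the bucket equal to its similarity rounded to one decimal place. A small Python sketch of that logic follows, under the assumption that scores are non-negative floats. The names `bucket_of` and `bucket_percentages` and the sample scores are hypothetical. `Decimal` is used so that bucket comparison does not depend on floating-point equality.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# 1.0, 0.9, ..., 0.1, matching the `buckets` CTE.
BUCKETS = [Decimal(f"{n / 10:.1f}") for n in range(10, 0, -1)]


def bucket_of(score: float) -> Decimal:
    """Round a score to one decimal place, with ties going away from zero."""
    return Decimal(str(score)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def bucket_percentages(scores):
    """Percentage of rows per bucket, relative to all rows."""
    total = len(scores)
    if total == 0:
        raise ValueError("no rows to aggregate")
    counts = Counter(bucket_of(s) for s in scores)
    # Scores that round to 0.0 match no bucket but still count in the total.
    return {str(b): round(counts.get(b, 0) / total * 100.0, 1) for b in BUCKETS}


# Illustrative scores only, not real evaluation data.
sample = [0.93, 0.87, 0.91, 0.52, 0.86, 1.0, 0.74, 0.88]
distribution = bucket_percentages(sample)
print(distribution)
```

Rounding through `Decimal(str(score))` may not match BigQuery's `ROUND` on `FLOAT64` values that sit exactly between two buckets, so results for such edge cases could differ.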

Edited Mar 07, 2024 by Shinya Maeda
