Improve prompt
What does this MR do and why?
I was testing the hypothesis that removing certain guardrails from the prompt would improve responses about writing functions.
It worked in some cases, and the chat is still protected against questions not related to coding. Please see the screenshots.
Closes gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments#1 (closed)
Prompt Library configuration
- Input dataset: `duo_chat_external.experiment_code_generation__input_v1`
- Output dataset: `duo_chat_external_results.lulalala_mr145959_58624778`
Full configuration:

```json
{
"beam_config": {
"pipeline_options": {
"runner": "DirectRunner",
"project": "dev-ai-research-0e2f8974",
"region": "us-central1",
"temp_location": "gs://prompt-library/tmp/",
"save_main_session": false
}
},
"input_bq_table": "dev-ai-research-0e2f8974.duo_chat_external.experiment_code_generation__input_v1",
"output_bq_table": "dev-ai-research-0e2f8974.duo_chat_external_results.lulalala_mr145959_58624778",
"throttle_sec": 0.1,
"batch_size": 10,
"input_adapter": "mbpp",
"eval_setup": {
"answering_models": [
{
"name": "claude-2",
"prompt_template_config": {
"templates": [
{
"name": "claude-2",
"template_path": "data/prompts/duo_chat/answering/claude-2.txt.example"
}
]
}
},
{
"name": "duo-chat",
"parameters": {
"base_url": "http://gdk.test:3000"
},
"prompt_template_config": {
"templates": [
{
"name": "empty",
"template_path": "data/prompts/duo_chat/answering/empty.txt.example"
}
]
}
}
],
"metrics": [
{
"metric": "similarity_score"
},
{
"metric": "independent_llm_judge",
"evaluating_models": [
{
"name": "claude-2",
"prompt_template_config": {
"templates": [
{
"name": "claude-2",
"template_path": "data/prompts/duo_chat/evaluating/claude-2.txt.example"
}
]
}
}
]
}
]
}
}
```
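As a quick sanity check before launching a run, the configuration can be loaded and summarized locally. This is a minimal sketch (not part of the Prompt Library itself), assuming the JSON above is saved as `eval_config.json`, a hypothetical filename:

```python
import json

# Load the evaluation configuration shown above.
# "eval_config.json" is a hypothetical filename used only for illustration.
with open("eval_config.json") as f:
    config = json.load(f)

print("Input table: ", config["input_bq_table"])
print("Output table:", config["output_bq_table"])

for model in config["eval_setup"]["answering_models"]:
    templates = [t["name"] for t in model["prompt_template_config"]["templates"]]
    print(f"Answering model: {model['name']} (templates: {', '.join(templates)})")

for metric in config["eval_setup"]["metrics"]:
    judges = [m["name"] for m in metric.get("evaluating_models", [])]
    suffix = f" (judged by: {', '.join(judges)})" if judges else ""
    print(f"Metric: {metric['metric']}{suffix}")
```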
Evaluation results - Independent LLM Judge - Correctness
- Before: Production (`master` - SHA: 446571bef621b5e99732cb5b245782d0d9a51355)
- After: This MR (`mk-prompt-improvement` - SHA: 58624778)
| grade | before_percentage | after_percentage |
|---|---|---|
| 4 | 65.0 | 100.0 |
| 3 | 0.0 | 0.0 |
| 2 | 5.0 | 0.0 |
| 1 | 30.0 | 0.0 |
Query:

```sql
WITH grades as (
SELECT 4 as grade union all
SELECT 3 as grade union all
SELECT 2 as grade union all
SELECT 1 as grade
), before_base_table AS (
SELECT *
FROM `dev-ai-research-0e2f8974.duo_chat_external_results.sm_experiment_code_generation__input_v1_legacy__independent_llm_judge`
WHERE answering_model = 'duo-chat'
), after_base_table AS (
SELECT *
FROM `dev-ai-research-0e2f8974.duo_chat_external_results.lulalala_mr145959_58624778__independent_llm_judge`
WHERE answering_model = 'duo-chat'
), before_correctness_grade AS (
SELECT correctness as grade, COUNT(*) as count
FROM before_base_table
GROUP BY correctness
), after_correctness_grade AS (
SELECT correctness as grade, COUNT(*) as count
FROM after_base_table
GROUP BY correctness
)
SELECT grades.grade AS grade,
ROUND((COALESCE(before_correctness_grade.count, 0) / (SELECT COUNT(*) FROM before_base_table)) * 100.0, 1) AS before_percentage,
ROUND((COALESCE(after_correctness_grade.count, 0) / (SELECT COUNT(*) FROM after_base_table)) * 100.0, 1) AS after_percentage,
FROM grades
LEFT OUTER JOIN before_correctness_grade ON before_correctness_grade.grade = grades.grade
LEFT OUTER JOIN after_correctness_grade ON after_correctness_grade.grade = grades.grade;
```
Evaluation results - Similarity score
| similarity_score_range | before_percentage | after_percentage |
|---|---|---|
| 1.0 | 15.0 | 40.0 |
| 0.9 | 30.0 | 45.0 |
| 0.8 | 20.0 | 5.0 |
| 0.7 | 5.0 | 0.0 |
| 0.6 | 20.0 | 5.0 |
| 0.5 | 10.0 | 5.0 |
| 0.4 | 0.0 | 0.0 |
| 0.3 | 0.0 | 0.0 |
| 0.2 | 0.0 | 0.0 |
| 0.1 | 0.0 | 0.0 |
Query:

```sql
WITH buckets as (
SELECT 1.0 as bucket union all
SELECT 0.9 as bucket union all
SELECT 0.8 as bucket union all
SELECT 0.7 as bucket union all
SELECT 0.6 as bucket union all
SELECT 0.5 as bucket union all
SELECT 0.4 as bucket union all
SELECT 0.3 as bucket union all
SELECT 0.2 as bucket union all
SELECT 0.1 as bucket
), before_similarity_score AS (
SELECT *
FROM `dev-ai-research-0e2f8974.duo_chat_external_results.sm_experiment_code_generation__input_v1_legacy__similarity_score`
WHERE answering_model = 'duo-chat'
), after_similarity_score AS (
SELECT *
FROM `dev-ai-research-0e2f8974.duo_chat_external_results.lulalala_mr145959_58624778__similarity_score`
WHERE answering_model = 'duo-chat'
)
SELECT buckets.bucket AS similarity_score_range,
(
SELECT ROUND((COUNT(*) / (SELECT COUNT(*) FROM before_similarity_score)) * 100.0, 1)
FROM before_similarity_score
WHERE buckets.bucket = ROUND(before_similarity_score.comparison_similarity, 1)
) AS before_percentage,
(
SELECT ROUND((COUNT(*) / (SELECT COUNT(*) FROM after_similarity_score)) * 100.0, 1)
FROM after_similarity_score
WHERE buckets.bucket = ROUND(after_similarity_score.comparison_similarity, 1)
) AS after_percentage,
FROM buckets;
```
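The same bucketing can be reproduced locally from an export of the `__similarity_score` table. A minimal sketch, assuming a hypothetical `similarity_scores.csv` export with the `answering_model` and `comparison_similarity` columns used above (note that Python's `round` uses banker's rounding, which can differ from BigQuery's `ROUND` on exact halves):

```python
import csv
from collections import Counter

# "similarity_scores.csv" is a hypothetical export of the __similarity_score table.
with open("similarity_scores.csv", newline="") as f:
    rows = [r for r in csv.DictReader(f) if r["answering_model"] == "duo-chat"]

# Round each similarity to one decimal place and compute the share per bucket,
# mirroring the SQL query above.
buckets = Counter(round(float(r["comparison_similarity"]), 1) for r in rows)
total = len(rows)

for bucket in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
    print(f"{bucket:.1f}: {100.0 * buckets.get(bucket, 0) / total:.1f}%")
```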
Regression tests - gitlab-duo-chat-qa
- Before: Production (`master` - SHA: 446571bef621b5e99732cb5b245782d0d9a51355)
- After: This MR (`mk-prompt-improvement` - SHA: 586247788adabc6082eb66f69296de637889119f)
| item | before | after |
|---|---|---|
| CORRECT | 74.6% | 67.2% |
| INCORRECT | 14.3% | 18.8% |
Comparison details
Before (master - SHA: 446571bef621b5e99732cb5b245782d0d9a51355):
Summary
- The total number of evaluations: 63
- The number of evaluations in which all LLMs graded `CORRECT`: 47 (74.6%)
  - Note: if an evaluation request failed or its response was not parsable, it was ignored. For example, ✅ ⚠ would count as `CORRECT`.
- The number of evaluations in which all LLMs graded `INCORRECT`: 9 (14.3%)
  - Note: if an evaluation request failed or its response was not parsable, it was ignored. For example, ❌ ⚠ would count as `INCORRECT`.
- The number of evaluations in which LLMs disagreed: 7 (11.1%)
- Report: !146151 (comment 1794574818)
- Test job: https://gitlab.com/gitlab-org/gitlab/-/jobs/6284140140
After (mk-prompt-improvement - SHA: 586247788adabc6082eb66f69296de637889119f):
Summary
- The total number of evaluations: 64
- The number of evaluations in which all LLMs graded `CORRECT`: 43 (67.2%)
  - Note: if an evaluation request failed or its response was not parsable, it was ignored. For example, ✅ ⚠ would count as `CORRECT`.
- The number of evaluations in which all LLMs graded `INCORRECT`: 12 (18.8%)
  - Note: if an evaluation request failed or its response was not parsable, it was ignored. For example, ❌ ⚠ would count as `INCORRECT` (see the aggregation sketch after this list).
- The number of evaluations in which LLMs disagreed: 9 (14.1%)
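The aggregation rule described in the notes above (failed or unparsable judge responses are ignored, an evaluation counts as `CORRECT` or `INCORRECT` only when all remaining judges agree, everything else is a disagreement) can be illustrated with a short sketch. This is a hypothetical reimplementation for clarity, not the actual gitlab-duo-chat-qa code:

```python
from collections import Counter

def classify(verdicts):
    """Classify one evaluation from per-judge verdicts.

    Each verdict is "CORRECT", "INCORRECT", or None when the evaluation
    request failed or its response was not parsable (ignored, per the
    notes above).
    """
    counted = [v for v in verdicts if v is not None]
    if counted and all(v == "CORRECT" for v in counted):
        return "CORRECT"
    if counted and all(v == "INCORRECT" for v in counted):
        return "INCORRECT"
    return "DISAGREED"

# ✅ ⚠ counts as CORRECT, ❌ ⚠ counts as INCORRECT, ✅ ❌ is a disagreement.
evaluations = [
    ["CORRECT", None],
    ["INCORRECT", None],
    ["CORRECT", "INCORRECT"],
]
print(Counter(classify(v) for v in evaluations))
# Counter({'CORRECT': 1, 'INCORRECT': 1, 'DISAGREED': 1})
```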
Regression tests - gitlab-duo-chat-zeroshot
| item | before | after |
|---|---|---|
| error_rate | 4.8 % (5 / 104) | 6.7 % (7 / 104) |
Summary: Most of the errors are caused by ambiguous questions for tool selection, e.g. asking a GitLab CI-related question and choosing GitlabDocumentation over CiEditorAssistant. Therefore, the increase in error rate is negligible.
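The error rate is simply failures divided by the total number of examples; a quick arithmetic check of the figures shown in the table (denominators taken from the table above):

```python
# Verify the error-rate arithmetic from the table above.
for label, failures, total in [("before", 5, 104), ("after", 7, 104)]:
    print(f"{label}: {100.0 * failures / total:.1f} % ({failures} / {total})")
# before: 4.8 % (5 / 104)
# after: 6.7 % (7 / 104)
```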
Logs
Before (`master` - SHA: 446571bef621b5e99732cb5b245782d0d9a51355):

```
8 examples, 5 failures
Failed examples:
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:1:5:1:1]' # Gitlab::Llm::Completions::Chat real requests with blob as resource with blob for code containing gitlab references behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:4:8:1:1]' # Gitlab::Llm::Completions::Chat real requests when asking to explain code input_template: "Write documentation for \"\"def hello_world\\nputs(\\\"\"Hello, world!\\n\\\"\");\\nend\"\"?", tools: [] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:6:1:2:1:1:1]' # Gitlab::Llm::Completions::Chat real requests with predefined epic with predefined tools with `this epic` input_template: "Can you list all labels on this epic?", tools: ["EpicReader"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:7:4:1:1]' # Gitlab::Llm::Completions::Chat real requests when asked about CI/CD input_template: "How do I optimize my pipelines so that they do not cost so much money?", tools: ["CiEditorAssistant"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:7:5:1:1]' # Gitlab::Llm::Completions::Chat real requests when asked about CI/CD input_template: "How can I migrate from GitHub Actions to GitLab CI?", tools: ["CiEditorAssistant"] behaves like successful prompt processing answers query using expected tools
```
https://gitlab.com/gitlab-org/gitlab/-/jobs/6284140134
Error rate: 4.8 % (5 / 104)
After (`mk-prompt-improvement` - SHA: 586247788adabc6082eb66f69296de637889119f):

```
10 examples, 7 failures
Failed examples:
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:1:5:1:1]' # Gitlab::Llm::Completions::Chat real requests with blob as resource with blob for code containing gitlab references behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:5:5:1:1]' # Gitlab::Llm::Completions::Chat real requests when asking about how to use GitLab input_template: "What is DevOps? What is DevSecOps?", tools: ["GitlabDocumentation"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:5:9:1:1]' # Gitlab::Llm::Completions::Chat real requests when asking about how to use GitLab input_template: "Is it possible to add stages to pipelines?", tools: ["GitlabDocumentation"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:5:14:1:1]' # Gitlab::Llm::Completions::Chat real requests when asking about how to use GitLab input_template: "How can I migrate from Jenkins to GitLab CI/CD?", tools: ["GitlabDocumentation"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:5:16:1:1]' # Gitlab::Llm::Completions::Chat real requests when asking about how to use GitLab input_template: "How do I run unit tests for my Next JS application in a GitLab pipeline?", tools: ["GitlabDocumentation"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:5:24:1:1]' # Gitlab::Llm::Completions::Chat real requests when asking about how to use GitLab input_template: "How do I create a secure connection from a ci job to AWS?", tools: ["GitlabDocumentation"] behaves like successful prompt processing answers query using expected tools
rspec './ee/spec/lib/gitlab/llm/completions/chat_real_requests_spec.rb[1:1:7:4:1:1]' # Gitlab::Llm::Completions::Chat real requests when asked about CI/CD input_template: "How do I optimize my pipelines so that they do not cost so much money?", tools: ["CiEditorAssistant"] behaves like successful prompt processing answers query using expected tools
```
Screenshots or screen recordings
How to set up and validate locally
Ask the chat the questions from gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments#1 (comment 1783850127).
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.