Document how to evaluate prompt improvements and A/B testing

Problem

The group::ai framework and group::duo chat teams have started using Prompt Library (provided by group::ai model validation) to evaluate the performance scores of prompt improvement MRs and A/B tests.

Example:

Prompt improvements: Improve prompt (!145959 - merged)
A/B testing: !142787 (diffs)

However, these processes are not documented yet, so other engineers have no way to follow them.

Proposal

Document how to evaluate prompt improvements and A/B testing with Prompt Library.

Document how to get the new scores with local GDK, AI Gateway and Prompt Library. @tle_gitlab
- Prompt library docs: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/main/doc/how-to/run_duo_chat_eval.md#seeding-the-local-gdk-instance-with-issue-and-epics-data
Document how to get the baseline scores from daily production evaluation. @tle_gitlab
- We can fetch the baseline scores with promptlib duo-chat fetch-sample.
Document how to compare the scores in merge requests. @tle_gitlab
- We don't have this functionality in Prompt Library yet. Support for a summary at the end of the run is being worked on in gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#216, but this might be automated with #454406 (cc @shinya.maeda).
~~[ ] Document how to use a subset of the dataset for a fast feedback loop.~~ @tle_gitlab
- already done by gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#201 (closed)
(If a Rake task and a local input dataset are used) Document how to re-synchronize the input dataset with the daily production evaluation (see gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#179).
- We don't have this yet. A proposal is being worked on at gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#179
Document how to seed data for Epics and Issues.
- Prompt library docs https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/blob/main/doc/how-to/run_duo_chat_eval.md#seeding-the-local-gdk-instance-with-issue-and-epics-data
Demo of running Prompt Library locally: https://drive.google.com/file/d/1X6CARf0gebFYX4Rc9ULhcfq9LLLnJ_O-/view?usp=sharing
Document how to use LangSmith for identifying where exactly the chain failed.
- GitLab docs on how to use LangSmith: https://docs.gitlab.com/ee/development/ai_features/duo_chat.html#tracing-with-langsmith
- Add a section to Prompt library docs linking to the above.
Best practices for A/B testing with CEF: https://docs.google.com/document/d/18c3GYxelFbVJOIjibshBIxT0orxDdNxqQM3RbZ3j34A/edit
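
Until Prompt Library provides a run summary, the score comparison item in the list above could be done by hand with a small script. The following is a minimal sketch only, not part of Prompt Library: it assumes the baseline and candidate scores have already been exported as mappings from a question identifier to a numeric score. The function names, the dictionary layout, and the sample values are all hypothetical.

```python
from statistics import mean


def summarize(scores):
    """Return the mean of a list of scores, or None for an empty list."""
    return mean(scores) if scores else None


def compare_runs(baseline, candidate):
    """Compare two runs given as {question_id: score} mappings.

    Only questions present in both runs are compared, so a candidate run
    on a subset of the dataset is measured against the same questions.
    """
    shared = sorted(set(baseline) & set(candidate))
    base_mean = summarize([baseline[q] for q in shared])
    cand_mean = summarize([candidate[q] for q in shared])
    # Questions whose score dropped in the candidate run.
    regressions = [q for q in shared if candidate[q] < baseline[q]]
    return {
        "compared": len(shared),
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "delta": None if base_mean is None else cand_mean - base_mean,
        "regressions": regressions,
    }


# Hypothetical scores; real values would come from the daily production
# evaluation (baseline) and a local run against the MR branch (candidate).
baseline = {"q1": 4.0, "q2": 3.0, "q3": 5.0}
candidate = {"q1": 4.0, "q2": 4.0}

report = compare_runs(baseline, candidate)
print(report)
```

Restricting the comparison to shared questions keeps the delta meaningful when the candidate run uses a reduced dataset for faster feedback. The resulting summary could be pasted into the merge request description.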

Related to

Tool example improvements (!146634 - merged)
Prompt development lifecycle by Anthropic https://docs.anthropic.com/claude/docs/prompt-engineering

Edited Apr 09, 2024 by Bruno Cardoso