Document how to evaluate prompt improvements and A/B testing
Problem
The group::ai framework and group::duo chat teams have started using Prompt Library (provided by group::ai model validation) to evaluate the performance scores of prompt improvement MRs and A/B tests.
Examples:
- Prompt improvements: Improve prompt (!145959 - merged)
- A/B testing: !142787 (diffs)
But these processes are not documented yet, so engineers have no clear way to use them.
Proposal
Document how to evaluate prompt improvements and A/B testing with Prompt Library.
- Document how to get the new scores with local GDK, AI Gateway and Prompt Library. @tle_gitlab
- Document how to get the baseline scores from daily production evaluation. @tle_gitlab
  - We can fetch the baseline scores with promptlib duo-chat fetch-sample.
- Document how to compare the scores in merge requests. @tle_gitlab
  - We don't have this functionality in Prompt Library yet. Support for a summary at the end of the run is being worked on in gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#216. This might also be automated with #454406 (cc @shinya.maeda).
- Document how to subset the dataset for a fast feedback loop. @tle_gitlab
- (If the rake task and a local input dataset are used) Document how to re-synchronize the input dataset with daily production evaluation.
  - We don't have this yet. A proposal is being worked on at gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#179.
- Document how to seed data for Epics and Issues.
- Demo of running Prompt Library locally: https://drive.google.com/file/d/1X6CARf0gebFYX4Rc9ULhcfq9LLLnJ_O-/view?usp=sharing.
- Document how to use LangSmith for identifying where exactly the chain failed.
  - GitLab docs on how to use LangSmith: https://docs.gitlab.com/ee/development/ai_features/duo_chat.html#tracing-with-langsmith
- Add a section to the Prompt Library docs linking to the above.
- Best practice for A/B testing with CEF: https://docs.google.com/document/d/18c3GYxelFbVJOIjibshBIxT0orxDdNxqQM3RbZ3j34A/edit
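
To make the "compare the scores" and "fast feedback loop" items above more concrete, here is a rough sketch of what the documented workflow could look like until Prompt Library supports it natively. This is not Prompt Library functionality. The function names (subset_dataset, compare_scores), the per-question score dictionaries and the question ids are all assumptions made for illustration, since the actual output format of a run is not described in this issue.

```python
import random
from statistics import mean


def subset_dataset(rows, size, seed=0):
    """Return a reproducible random subset of `rows` with at most `size` items.

    A fixed seed keeps the subset stable between runs, so the baseline and
    the candidate prompt are evaluated on the same questions.
    """
    rows = list(rows)
    if size >= len(rows):
        return rows
    return random.Random(seed).sample(rows, size)


def compare_scores(baseline, candidate):
    """Compare two runs given as {question_id: score} dictionaries.

    Only questions present in both runs are compared, so a subset run can
    be checked against a full baseline run.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no overlapping questions between the two runs")
    baseline_mean = mean(baseline[q] for q in shared)
    candidate_mean = mean(candidate[q] for q in shared)
    return {
        "questions": len(shared),
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "delta": candidate_mean - baseline_mean,
    }


if __name__ == "__main__":
    # Hypothetical scores; real values would come from evaluation runs.
    baseline = {"q1": 0.8, "q2": 0.6, "q3": 0.9}
    candidate = {"q1": 0.9, "q2": 0.7}
    print(compare_scores(baseline, candidate))
```

A positive delta would suggest the candidate prompt scored higher on the shared questions. With a small subset the difference can easily be noise, so a full run is still worth doing before merging.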
Related to
- Tool example improvements (!146634 - merged)
- Prompt development lifecycle by Anthropic https://docs.anthropic.com/claude/docs/prompt-engineering
Edited by Bruno Cardoso