Experiment with System Prompts as instructed by Anthropic
🔦 Objective:
The objective is to experiment with Duo Chat to reach parity with the foundational model Claude in terms of quality, as measured by the similarity score. While investigating, we noticed in the snippet that we are not using a system prompt. It was mentioned that a system prompt was used before and caused a degradation in performance, but there is no data to support that claim. We recommend running a quick experiment that moves the prompt content from the user role to the system role and measures the impact.
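As a hedged illustration of the proposed change (not Duo Chat's actual prompt plumbing), the sketch below issues the same request both ways with the Anthropic Python SDK; the model name and instruction text are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

instructions = "You are GitLab Duo Chat, a coding assistant."  # placeholder
question = "How do I revert a merge commit?"                   # placeholder

# Control: instructions folded into the user turn (the current approach).
control = client.messages.create(
    model="claude-3-sonnet-20240229",  # placeholder model name
    max_tokens=1024,
    messages=[{"role": "user", "content": f"{instructions}\n\n{question}"}],
)

# Experiment: the same instructions moved to the dedicated system parameter.
experiment = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    system=instructions,
    messages=[{"role": "user", "content": question}],
)
```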
Below is a screenshot from the current dashboard overview that shows where Duo Chat stands relative to Claude.
#⃣ Primary Metric for Success:
The primary metric for success in this iteration of experimentation is the Comparison Similarity Score. This score compares the output generated by the Answering Model (Duo Chat) against the output of the Comparison Model (Claude).
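The exact scoring implementation belongs to the Centralized Evaluation Framework and is not reproduced here; purely as an illustration of the idea (one 0–1 score per pair of model outputs), a minimal stand-in using TF-IDF cosine similarity might look like:

```python
# Illustrative stand-in only: the framework's real metric may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def comparison_similarity(answering_output: str, comparison_output: str) -> float:
    """Return a 0-1 similarity score between the two model outputs."""
    vectors = TfidfVectorizer().fit_transform([answering_output, comparison_output])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

score = comparison_similarity(
    "def reverse(s): return s[::-1]",           # Duo Chat (Answering Model)
    "def reverse_string(s): return s[::-1]",    # Claude (Comparison Model)
)
print(f"Comparison Similarity: {score:.2f}")
```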
📚 Dataset for Diagnostic Testing/Experimentation:
For this iteration, we will use the datasets below for experimentation. The data is a subset of the Centralized Evaluation Framework and comprises 119 rows derived from both the Code Generation and Issue/Epic datasets, capturing similarity scores ranging from 0.1 to 0.71. The subset is drawn from areas where Chat is not performing well according to the Similarity Score, allowing developers to focus and iterate on the areas where Duo Chat is weakest. The Diagnostic Test is intended to be a rapid, low-cost experiment that gives developers confidence in the changes they make to tools and prompts as they iterate on code. Diagnostic Tests are not meant to assess how Chat behaves at scale for every code change; the Centralized Evaluation Framework serves that purpose with its daily runs.
- The Experiment input dataset: `duo_chat_external.experiment_code_generation__input_v1` (requires GCP access to the `dev-ai-research-0e2f8974` project). This dataset contains the input question data.
- The Experiment control dataset: `duo_chat_external_results.experiment_code_generation__control__comparison_v1` (requires GCP access to the `dev-ai-research-0e2f8974` project). This dataset contains the input question data with the metrics, as a subset of the Centralized Evaluation Framework. A query sketch for both tables follows this list.
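Assuming the standard `google-cloud-bigquery` client, and that the fully qualified table names live under the `dev-ai-research-0e2f8974` project (inferred from the identifiers above, not confirmed), a minimal query sketch:

```python
from google.cloud import bigquery

client = bigquery.Client(project="dev-ai-research-0e2f8974")

# Input questions for the diagnostic run.
input_rows = client.query(
    "SELECT * FROM "
    "`dev-ai-research-0e2f8974.duo_chat_external.experiment_code_generation__input_v1`"
).result()

# Control rows carrying the baseline similarity metrics.
control_rows = client.query(
    "SELECT * FROM "
    "`dev-ai-research-0e2f8974.duo_chat_external_results"
    ".experiment_code_generation__control__comparison_v1`"
).result()
```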
The diagnostic experiments consist of two phases:
- Phase 1: Experimentation with the Code Generation dataset
- Phase 2: Experimentation with the Issue/Epic dataset once the Rake task is ready
Please see the walkthrough video showing how to run the Duo Chat Diagnostic Test using the aforementioned dataset.
🔍 Metrics:
- Control Metric Score (Comparison Similarity, average): 0.57
- Experiment Metric Score: TBD post-experiment
- Variance: TBD post-experiment (see the sketch below)
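Once the experiment run produces per-row scores, the variance row can be filled in. A hypothetical sketch, assuming "Variance" here means the difference between the experiment and control averages and that the score lists come from the datasets above:

```python
from statistics import mean

def report_variance(control_scores: list[float], experiment_scores: list[float]) -> None:
    """Print control/experiment averages and their difference ("Variance")."""
    control_avg = mean(control_scores)        # expected ~0.57 per the control dataset
    experiment_avg = mean(experiment_scores)  # produced by the system-prompt run
    variance = experiment_avg - control_avg   # positive => the system prompt helped
    print(f"Control: {control_avg:.2f}  "
          f"Experiment: {experiment_avg:.2f}  "
          f"Variance: {variance:+.2f}")
```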
📶 : TBD post-experiment
✍🏼 Experiment Details:
Recommendation: Consider using a system prompt instead of a user prompt, as sketched in the Objective section above.