Context Placement Shift
🔦 Objective

The objective is to experiment with Duo Chat to reach parity with the foundational model Claude in terms of quality, as measured by the similarity score. We have noticed that LLM context placement is key, so we recommend an experiment with a subset of data in which the context in the final request to Claude-2 is placed differently, trying various techniques.
Below is a screenshot from the current dashboard overview, which shows where Duo Chat stands relative to Claude.
#⃣ Primary Metric for Success

The primary metric for success in this iteration of experimentation is the Comparison Similarity Score. This score compares the output generated by the Answering Model (Duo Chat) with the output of the Comparison Model (Claude).
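The doc does not spell out how the Comparison Similarity Score is computed; as an illustration only, here is a minimal sketch assuming it is a cosine similarity between embeddings of the two models' outputs (the embedding values below are made up):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors: dot product
    # divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the two outputs being compared.
answer_embedding = [0.2, 0.7, 0.1]        # Answering Model (Duo Chat)
comparison_embedding = [0.25, 0.65, 0.2]  # Comparison Model (Claude)
score = cosine_similarity(answer_embedding, comparison_embedding)
```

A score near 1.0 means the two outputs are close; the actual metric implementation lives in the Centralized Evaluation Framework.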
📚 Dataset for Diagnostic Testing/Experimentation

For this iteration, we will utilize the datasets below. This data is a subset of the Centralized Evaluation Framework and comprises 119 rows derived from both the Code Generation and Issue/Epic datasets, capturing similarity scores ranging from 0.1 to 0.71. The subset is based on areas where Chat is not performing well according to the Similarity Score, allowing developers to focus and iterate on areas where Duo Chat is weakest. The Diagnostic Test is intended to be a rapid, low-cost experiment that gives developers confidence in the changes they make to tools and prompts as they iterate on code. Diagnostic Tests are not meant for understanding how Chat works at scale for every code change; the Centralized Evaluation Framework serves that purpose with its daily runs.
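The subset selection described above can be sketched as a simple filter on the similarity score; the row shape and field names here are assumptions for illustration, not the actual schema:

```python
# Hypothetical rows mimicking the diagnostic subset; field names are assumed.
rows = [
    {"question": "Write a Ruby method ...", "similarity_score": 0.45},
    {"question": "Summarize this epic ...", "similarity_score": 0.85},
    {"question": "Generate a SQL query ...", "similarity_score": 0.10},
]

def diagnostic_subset(rows, low=0.1, high=0.71):
    # Keep only rows where Duo Chat underperforms, i.e. the
    # similarity score falls in the [low, high] band from the doc.
    return [r for r in rows if low <= r["similarity_score"] <= high]

weak_rows = diagnostic_subset(rows)
```

Only the first and third rows survive the filter, since 0.85 lies above the 0.71 ceiling.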
- The Experiment input dataset: `duo_chat_external.experiment_code_generation__input_v1` (requires GCP access to the `dev-ai-research-0e2f8974` project). This dataset contains the input question data.
- The Experiment control dataset: `duo_chat_external_results.experiment_code_generation__control__comparison_v1` (requires GCP access to the `dev-ai-research-0e2f8974` project). This dataset contains the input question data with the metrics, as a subset of the Centralized Evaluation Framework.
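For developers with access to the project, inspecting the control dataset is a one-query task. The sketch below only builds the SQL string; it assumes you would hand that string to a BigQuery client authenticated against the project:

```python
PROJECT = "dev-ai-research-0e2f8974"
CONTROL_TABLE = (
    "duo_chat_external_results."
    "experiment_code_generation__control__comparison_v1"
)

def build_control_query(limit=10):
    # SQL to peek at the first few rows of the control dataset.
    # Pass the returned string to a BigQuery client, e.g.
    # google.cloud.bigquery.Client(project=PROJECT).query(sql).
    return (
        f"SELECT *\n"
        f"FROM `{PROJECT}.{CONTROL_TABLE}`\n"
        f"LIMIT {limit}"
    )

sql = build_control_query()
```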
The diagnostic experiments can proceed in two phases:
- Phase 1: Experimentation with the Code Generation dataset
- Phase 2: Experimentation with the Issue/Epic dataset once the Rake task is completed
Please see the walkthrough video showing how to run a Duo Chat Diagnostic Test using the aforementioned dataset.
🔍 Metrics

- Control Metric Score: Comparison Similarity (avg similarity score): 0.57
- Experiment Metric Score: TBD post experiment
- Variance: 📶 TBD post experiment
✍🏼 Experiment Details

Recommendation: Consider placing the context as a key-value pair, where the key is the task and the value is the context, and then asking the question.
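The recommendation above can be sketched as a small prompt builder. The function name, labels, and example strings are assumptions for illustration, not the actual Duo Chat prompt format:

```python
def build_prompt(task, context, question):
    # Place context as a key/value pair: the key names the task,
    # the value carries the context, and the question follows.
    return (
        f"{task}: {context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    task="Code generation",
    context="The user is working in a Ruby on Rails repository.",
    question="Write a model validation for unique email addresses.",
)
```

Variants of this layout (context before vs. after the question, different key labels) are exactly the placement techniques the experiment would compare.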