Chat Diagnostic Experiment Recommendation: 02-16: Investigation of Incomplete Responses from Duo Chat Related to Functions
🔦
Objective
The objective is to experiment with Duo Chat until it reaches parity with the foundational model, Claude, in terms of quality, as measured by the similarity score. The screenshot below, taken from the current dashboard overview, shows where Duo Chat stands relative to Claude.
#⃣
Primary Metric for Success
The primary metric for success in this iteration of experimentation is the Comparison Similarity Score. This score compares the output generated by the Answering Model (Duo Chat) with the output of the Comparison Model (Claude).
📚
Dataset for Diagnostic Testing/Experimentation
For this iteration, we will use the datasets listed below. This dataset is a subset of the Centralized Evaluation Framework and represents 119 rows of data derived from both the Code Generation and Issue/Epic datasets, with similarity scores ranging from 0.1 to 0.71. It targets areas where Chat is not performing well according to the Similarity Score, allowing developers to focus and iterate on the areas where Duo Chat is weakest.
The Diagnostic Test is intended to be a rapid, low-cost experiment that gives developers confidence in the changes they make to tools and prompts as they iterate on code. Diagnostic Tests are not meant to show how Chat performs at scale for every code change; the Centralized Evaluation Framework serves that purpose with its daily runs.
- The Experiment input dataset: duo_chat_external.experiment_code_generation__input_v1 (requires GCP access to the dev-ai-research-0e2f8974 project). This dataset contains the input question data.
- The Experiment control dataset: duo_chat_external_results.experiment_code_generation__control__comparison_v1 (requires GCP access to the dev-ai-research-0e2f8974 project). This dataset contains the input question data with the metrics, as a subset of the Centralized Evaluation Framework.
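As a rough illustration of how a low-scoring subset like this could be selected, the sketch below filters evaluation rows by similarity score. The row fields (`question`, `similarity_score`) and the function name `select_diagnostic_rows` are hypothetical and are not taken from the actual dataset schema; only the 0.1 to 0.71 bounds come from the description above. The sample rows are made up for illustration.

```python
# Hypothetical row shape: each row is a dict with a "similarity_score" float.
# The real dataset schema may differ.
def select_diagnostic_rows(rows, low=0.1, high=0.71):
    """Keep rows whose similarity score falls within [low, high]."""
    return [r for r in rows if low <= r["similarity_score"] <= high]


# Made-up sample rows, for illustration only.
rows = [
    {"question": "q1", "similarity_score": 0.05},
    {"question": "q2", "similarity_score": 0.42},
    {"question": "q3", "similarity_score": 0.71},
    {"question": "q4", "similarity_score": 0.93},
]
subset = select_diagnostic_rows(rows)
print([r["question"] for r in subset])
```

Keeping the bounds as parameters makes it easy to tighten or widen the diagnostic subset as the scores shift between iterations.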
The diagnostic experiments run in two phases:
- Phase 1: Experimentation with the Code Generation dataset (20 rows)
- Phase 2: Experimentation with the Issue/Epic dataset once the Rake task is worked on (79 rows; not included above)
We have created three videos that walk through the idea behind this change and show an example of how to incorporate it into Duo Chat development:
- How to run end-to-end experiments, using trimmed prompts as an example (https://www.youtube.com/watch?v=H2oykA5THac)
- How the datasets are built and how the experiments are run (https://www.youtube.com/watch?v=swN2EtAzdWA)
- A generic walkthrough of the end-to-end pipeline, to get familiar with accessing the Prompt Library (https://youtu.be/U2CW95yylMs)
🔍
Metrics
- Control Metric Score: Comparison Similarity (average similarity score): 0.57
- Experiment Metric Score: TBD post-experiment
- Variance: 📶 TBD post-experiment
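Once the experiment has run, the comparison against the control could be summarized along the lines of the sketch below. This is an assumption about how the numbers might be aggregated, not the framework's actual implementation: "variance" is read here as the change in mean similarity between control and experiment, and the statistical variance of each run is reported alongside it in case that is what is meant. The function names and the sample scores are hypothetical.

```python
from statistics import mean, pvariance


def summarize(scores):
    """Return (mean, population variance) for a list of similarity scores."""
    return mean(scores), pvariance(scores)


def compare(control_scores, experiment_scores):
    """Compare experiment similarity scores against the control run."""
    control_mean, control_var = summarize(control_scores)
    experiment_mean, experiment_var = summarize(experiment_scores)
    return {
        "control_mean": control_mean,
        "experiment_mean": experiment_mean,
        # Positive delta means the experiment moved closer to the comparison model.
        "delta": experiment_mean - control_mean,
        "control_variance": control_var,
        "experiment_variance": experiment_var,
    }


# Made-up scores, for illustration only.
result = compare([0.5, 0.6, 0.7], [0.6, 0.7, 0.8])
print(result)
```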
✍🏼
Experiment Details
Recommendation: investigate a pattern in the input/response pairs where Duo Chat appears to 'think' that it has already replied 'above' but has not. These instances all appear to be related to function creation, so consider investigating the tools in that tool chain.
code_generation_non_gitlab | Write a function that takes in a sorted array, its length (n), and an element and returns whether the element is the majority element in the given sorted array. (The majority element is the element that occurs more than n/2 times.) | duo-chat | See above function implementation and explanation. | claude-2 | 0.6093105078 |
Since no context is provided, here is a function that solves the problem stated in the question: |
code_generation_non_gitlab | Write a function to remove all elements from a given list present in another list. | duo-chat | See above code snippet for function to remove elements from one list present in another list. | claude-2 | 0.7738357782 |
Since there is no context provided, here is an example function to remove all elements from a list A that are present in another list B: |
code_generation_non_gitlab | Write a function to concatenate each element of tuple by the delimiter. | duo-chat | The above function concatenates the elements of a tuple with a given delimiter. | claude-2 | 0.7842617035 |
Since there is no context provided, I will attempt to answer the question on concatenating tuple elements by delimiter based on my own understanding. Here is a function to do that: |
code_generation_non_gitlab | Write a python function to check whether the given list contains consecutive numbers or not. | duo-chat | The above Python function checks if the given list contains consecutive numbers. | claude-2 | 0.8169140816 |
Here is a python function to check if a list contains consecutive numbers: |
code_generation_non_gitlab | Write a python function to check whether all the bits are unset in the given range or not. | duo-chat | The above Python function checks if all bits in the given range are unset in the input number. | claude-2 | 0.8191574812 |
Here is a python function to check whether all the bits are unset in a given range: |
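To find further instances of this pattern in a results dataset, responses could be screened with a simple heuristic like the sketch below. It flags a response that refers to something "above" while containing no code itself. The function name and the code-detection heuristic (a Markdown code fence or a Python `def` line) are assumptions made for illustration; they are not part of Duo Chat or the evaluation framework, and other languages would need additional checks.

```python
import re

# Matches the word "above", as in "See above function implementation".
ABOVE_REFERENCE = re.compile(r"\babove\b", re.IGNORECASE)
# Heuristic for Python code: a line that starts a function definition.
PYTHON_DEF = re.compile(r"^\s*def \w+\(", re.MULTILINE)
# Markdown code fence marker, built this way to avoid a literal fence here.
FENCE = "`" * 3


def refers_above_without_code(response: str) -> bool:
    """Flag responses that point at 'above' content but contain no code."""
    has_above = ABOVE_REFERENCE.search(response) is not None
    has_code = FENCE in response or PYTHON_DEF.search(response) is not None
    return has_above and not has_code


# Two Duo Chat responses taken from the rows above.
incomplete = [
    "See above function implementation and explanation.",
    "The above Python function checks if the given list contains consecutive numbers.",
]
# A hypothetical complete response that includes the code it refers to.
complete = (
    "Here is a function:\n"
    + FENCE + "python\ndef f(x):\n    return x\n" + FENCE
    + "\nThe above function returns x."
)

print([refers_above_without_code(r) for r in incomplete])
print(refers_above_without_code(complete))
```

A screen like this only narrows down candidates; flagged rows would still need a manual look to confirm that the function body was actually dropped.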