Investigation of Chat output where Duo Chat appears to think it was asked to explain code when it was actually asked to generate code
🔦 Objective

The objective is to increase the overall correctness score for Duo Chat.
#⃣ Primary Metric for Success

The primary metric of success for this experiment is correctness, as scored by the LLM Evaluator Judge.
📚 Dataset for Diagnostic Testing/Experimentation

Here is a small subset of data for experimentation: https://docs.google.com/spreadsheets/d/1Y8rFkXMPbYpOr0Q4yBFPHwqurKI3Kl_ulCJO80CBUm4/edit#gid=0
We have created three videos that walk through the idea behind this change and show an example of how to incorporate it into Duo Chat development:

- How to run end-to-end experiments, using trimmed prompts as an example (https://www.youtube.com/watch?v=H2oykA5THac)
- How the datasets are built and how to run the experiments (https://www.youtube.com/watch?v=swN2EtAzdWA)
- A generic walkthrough of the end-to-end pipeline to get familiar with accessing the Prompt Library (https://youtu.be/U2CW95yylMs)
🔍 Metrics

- Control Metric Score: Correctness: Avg 2
- Experiment Metric Score: TBD post experiment
- Variance 📶: TBD post experiment
✍🏼 Experiment Details

Recommendation: consider investigating a pattern in the input/response where Duo Chat appears to 'think' that it is supposed to explain code from above rather than generate code. One example from the subset is here:
| Chat task type | Question | Answer | Explanation | Correctness | Comprehensiveness | Readability |
|---|---|---|---|---|---|---|
| code_generation_non_gitlab | Write a python function to check whether all the bits are unset in the given range or not. | The above Python function checks if all bits in the given range are unset in the input number. | Since there is no actual function code provided in the answer, I cannot properly evaluate its correctness or completeness. However, the explanation text provided does attempt to describe the intended functionality of checking if bits in a range are unset. | 2 | 2 | 3 |
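For context on what a correct response to this prompt would contain, a minimal sketch of the kind of function the question asks for is shown below. The function name and the 1-indexed, LSB-first bit-position convention are assumptions, not part of the dataset.

```python
def all_bits_unset_in_range(n, left, right):
    """Return True if all bits of n in positions left..right
    (1-indexed from the least significant bit) are unset (0)."""
    # Build a mask with ones in positions left..right.
    mask = ((1 << (right - left + 1)) - 1) << (left - 1)
    # If n has no set bits inside the masked range, the AND is zero.
    return (n & mask) == 0


# Example: 17 is 0b10001, so bits 2..4 are all unset.
print(all_bits_unset_in_range(17, 2, 4))  # True
# Example: 39 is 0b100111, and bit 6 is set within 4..6.
print(all_bits_unset_in_range(39, 4, 6))  # False
```

An answer like this, followed by the one-line explanation Chat actually produced, is what the evaluator would score as correct for a `code_generation_non_gitlab` task.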
Edited by Mon Ray