Investigation of Chat output where Duo Chat appears to think it was asked to explain code when it was actually asked to generate code
🔦 Objective

The objective is to increase the overall correctness score for Duo Chat.
#⃣ Primary Metric for Success

The primary metric of success for this experiment is correctness, as scored by the LLM Evaluator Judge.
📚 Dataset for Diagnostic Testing/Experimentation

Here is a small subset of data for experimentation: https://docs.google.com/spreadsheets/d/1Y8rFkXMPbYpOr0Q4yBFPHwqurKI3Kl_ulCJO80CBUm4/edit#gid=0
We have created three videos that walk through the idea behind this change and show an example of how to incorporate it into Duo Chat development:

- How to run end-to-end experiments, using trimmed prompts as an example (https://www.youtube.com/watch?v=H2oykA5THac)
- How the datasets are built and how to run the experiments (https://www.youtube.com/watch?v=swN2EtAzdWA)
- A generic walkthrough of the end-to-end pipeline to get familiar with accessing the Prompt Library (https://youtu.be/U2CW95yylMs)
🔍 Metrics

- Control Metric Score: Correctness: Avg 2
- Experiment Metric Score: TBD post experiment
- Variance 📶: TBD post experiment
✍🏼 Experiment Details

Recommendation: consider investigating a pattern in the input/response where Duo Chat appears to 'think' that it is supposed to explain code from above rather than generate code. One example from the subset is here:
| Chat task type | Question | Answer | Explanation | Correctness | Comprehensiveness | Readability |
|---|---|---|---|---|---|---|
| code_generation_non_gitlab | Write a python function to check whether all the bits are unset in the given range or not. | The above Python function checks if all bits in the given range are unset in the input number. | Since there is no actual function code provided in the answer, I cannot properly evaluate its correctness or completeness. However, the explanation text provided does attempt to describe the intended functionality of checking if bits in a range are unset. | 2 | 2 | 3 |
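For context on what a correct response to this prompt would contain, a minimal sketch of the kind of function the question asks for is shown below. The function name and the 1-indexed, LSB-first bit-position convention are assumptions, not part of the dataset.

```python
def all_bits_unset_in_range(n, left, right):
    """Return True if all bits of n in positions left..right
    (1-indexed from the least significant bit) are unset (0)."""
    # Build a mask with ones in positions left..right.
    mask = ((1 << (right - left + 1)) - 1) << (left - 1)
    # If n has no set bits inside the masked range, the AND is zero.
    return (n & mask) == 0


# Example: 17 is 0b10001, so bits 2..4 are all unset.
print(all_bits_unset_in_range(17, 2, 4))  # True
# Example: 39 is 0b100111, and bit 6 is set within 4..6.
print(all_bits_unset_in_range(39, 4, 6))  # False
```

An answer like this, followed by the one-line explanation Chat actually produced, is what the evaluator would score as correct for a `code_generation_non_gitlab` task.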
Edited by Mon Ray